Analysis of PostgreSQL checkpoint logic
Source version: PostgreSQL 13.3
1. Checkpoint trigger conditions
- Timed: controlled by the checkpoint_timeout parameter, default 5 minutes, valid range 30 s to 1 day
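- WAL volume: a checkpoint is triggered when the distance between the newest WAL position and the WAL position of the previous checkpoint exceeds a threshold. Related parameters: max_wal_size, together with checkpoint_completion_target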
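- Manual: run the CHECKPOINT command (superuser only)
- In addition, a checkpoint is performed on normal database shutdown, when a backup starts, and after crash recovery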
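The trigger conditions above can be modelled with a small decision function. This is only a sketch: the struct, its field names and the demo_ functions are hypothetical, and the formula max_wal_size / (1 + checkpoint_completion_target) is my reading of how PostgreSQL 13 derives the WAL distance, stated here as an assumption rather than quoted from the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical settings holder; the field names are illustrative only. */
typedef struct DemoCkptSettings
{
    int      checkpoint_timeout_s;   /* e.g. 300 for the 5 minute default */
    double   completion_target;      /* checkpoint_completion_target, 0..1 */
    uint64_t max_wal_size_bytes;     /* max_wal_size expressed in bytes */
} DemoCkptSettings;

/* WAL distance (in bytes) after which a checkpoint would be requested.
 * Assumption: distance = max_wal_size / (1 + completion_target). */
static uint64_t
demo_wal_trigger_distance(const DemoCkptSettings *s)
{
    assert(s->completion_target >= 0.0 && s->completion_target <= 1.0);
    return (uint64_t) ((double) s->max_wal_size_bytes /
                       (1.0 + s->completion_target));
}

/* True if any of the three trigger conditions holds: timeout elapsed,
 * enough WAL written since the last checkpoint, or a manual request. */
static bool
demo_checkpoint_due(const DemoCkptSettings *s, int elapsed_s,
                    uint64_t wal_bytes_since_last, bool manual_request)
{
    if (manual_request)
        return true;
    if (elapsed_s >= s->checkpoint_timeout_s)
        return true;
    return wal_bytes_since_last >= demo_wal_trigger_distance(s);
}
```

With a 1 GB max_wal_size and a completion target of 0.5, this model requests a checkpoint after roughly 683 MB of WAL, well before the timeout on a write-heavy system.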
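2. Main checkpoint flow
PostgreSQL has a dedicated background process for checkpoints, the checkpointer, whose entry point is CheckpointerMain(). That function runs an infinite loop and performs a checkpoint automatically whenever one of the trigger conditions is met.
A checkpoint can also be requested manually: entering the CHECKPOINT command in a psql session makes the backend call RequestCheckpoint(), which sends SIGINT to the checkpointer process to ask it to perform a checkpoint.
On receiving SIGINT, the checkpointer is woken up inside its for loop and starts the checkpoint; the function that does the actual work is CreateCheckPoint().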
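3. Main logic of CreateCheckPoint()
- Determine the checkpoint's redo point, i.e. the current WAL insert position at the moment the checkpoint begins. If a crash occurs after this checkpoint has completed, recovery starts from that position
- Call CheckPointGuts() to flush dirty pages:
  - Scan the buffer pool and set the BM_CHECKPOINT_NEEDED flag on every buffer that is currently dirty (BM_DIRTY). This step happens entirely in memory and involves no disk I/O
  - Write the buffers flagged BM_CHECKPOINT_NEEDED out to their data files and fsync those files. This is the step that involves disk I/O
- Record the checkpoint itself as a WAL record whose payload is the CheckPoint struct
- Update the control file pg_control
- Remove or recycle (rename) WAL files that are no longer needed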
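The two-phase buffer handling in the list above (mark first, write later) is easier to see in a toy model. The sketch below is not PostgreSQL code: ToyBuffer, the TOY_ flags and both functions are made up for illustration, and the "write" is just a counter. The point it tries to show is that a buffer dirtied after the marking pass is left for the next checkpoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DIRTY             0x01u   /* stand-in for a "dirty" flag */
#define TOY_CHECKPOINT_NEEDED 0x02u   /* stand-in for "checkpoint needed" */

typedef struct ToyBuffer
{
    uint32_t flags;
    int      writes;   /* how many times this buffer was "written" */
} ToyBuffer;

/* Phase 1 (memory only): tag every buffer that is dirty right now. */
static size_t
toy_mark_buffers(ToyBuffer *bufs, size_t n)
{
    size_t marked = 0;

    assert(bufs != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (bufs[i].flags & TOY_DIRTY)
        {
            bufs[i].flags |= TOY_CHECKPOINT_NEEDED;
            marked++;
        }
    }
    return marked;
}

/* Phase 2 ("disk"): write only the tagged buffers and clear their flags.
 * Buffers dirtied after phase 1 carry no tag and are skipped. */
static size_t
toy_write_marked(ToyBuffer *bufs, size_t n)
{
    size_t written = 0;

    assert(bufs != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (bufs[i].flags & TOY_CHECKPOINT_NEEDED)
        {
            bufs[i].writes++;   /* a real system would write the page here */
            bufs[i].flags &= ~(TOY_DIRTY | TOY_CHECKPOINT_NEEDED);
            written++;
        }
    }
    return written;
}
```

Separating the passes bounds the work of one checkpoint to the set of pages that were dirty when it started, no matter how much new write activity arrives while it runs.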
4. Source excerpts
The CHECKPOINT utility command calls RequestCheckpoint() to request a checkpoint.
# Source path: src/backend/tcop/utility.c
case T_CheckPointStmt:
    if (!superuser())
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("must be superuser to do CHECKPOINT")));
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
                      (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));
    break;
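The flag expression in that excerpt is an ordinary bitmask. The following sketch reproduces only its shape: the DEMO_ constants use illustrative bit values (the real definitions live in PostgreSQL's xlog.h), and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; see xlog.h for the real definitions. */
#define DEMO_CHECKPOINT_IMMEDIATE 0x0004   /* do it without delays */
#define DEMO_CHECKPOINT_FORCE     0x0008   /* do it even if nothing changed */
#define DEMO_CHECKPOINT_WAIT      0x0020   /* caller waits for completion */

/* Mirrors the shape of the expression used for a manual CHECKPOINT:
 * FORCE is left out while the server is still in recovery. */
static int
demo_manual_checkpoint_flags(bool recovery_in_progress)
{
    return DEMO_CHECKPOINT_IMMEDIATE | DEMO_CHECKPOINT_WAIT |
        (recovery_in_progress ? 0 : DEMO_CHECKPOINT_FORCE);
}

/* True if the given single-bit flag is set in flags. */
static bool
demo_has_flag(int flags, int bit)
{
    assert(bit != 0 && (bit & (bit - 1)) == 0);   /* exactly one bit set */
    return (flags & bit) != 0;
}
```

So a manual request always asks for an immediate checkpoint and waits for it, while the force bit depends on whether RecoveryInProgress() reports that the server is still replaying WAL.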
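In RequestCheckpoint(), if the checkpointer process exists, the backend sends it SIGINT to request a checkpoint. If the process is not running or the signal cannot be delivered, the attempt is retried until MAX_SIGNAL_TRIES is reached, or given up immediately when CHECKPOINT_WAIT is not set. The code is as follows:
# Source path: src/backend/postmaster/checkpointer.c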
if (CheckpointerShmem->checkpointer_pid == 0)
{
    if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
    {
        elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
             "could not signal for checkpoint: checkpointer is not running");
        break;
    }
}
else if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
{
    if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
    {
        elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
             "could not signal for checkpoint: %m");
        break;
    }
}
else
    break;              /* signal sent successfully */
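The retry behaviour of that excerpt can be isolated in a few lines. This is a simplified model, not the PostgreSQL loop: DEMO_MAX_TRIES is deliberately tiny, the callback type and the stub that fails a fixed number of times are hypothetical, and the short sleep between attempts is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_TRIES 5   /* small value for the demo, not PostgreSQL's */

/* Hypothetical callback: returns 0 if the signal was delivered. */
typedef int (*demo_signal_fn) (void *ctx);

/* Returns the number of attempts made on success, or -1 after giving up.
 * Without "wait" the first failure is final, as in the excerpt. */
static int
demo_signal_with_retry(demo_signal_fn try_signal, void *ctx, bool wait)
{
    assert(try_signal != NULL);
    for (int ntries = 0;; ntries++)
    {
        if (try_signal(ctx) == 0)
            return ntries + 1;          /* signal sent successfully */
        if (ntries >= DEMO_MAX_TRIES || !wait)
            return -1;                  /* give up and report the failure */
        /* a real implementation would sleep briefly before retrying */
    }
}

/* Hypothetical stub: fails until the counter behind ctx reaches zero. */
static int
demo_fail_n_times(void *ctx)
{
    int *remaining = ctx;

    assert(remaining != NULL);
    if (*remaining > 0)
    {
        (*remaining)--;
        return -1;
    }
    return 0;
}
```

The retry path matters mostly right after startup or after a checkpointer restart, when the process ID in shared memory may briefly be zero.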
In CheckpointerMain(), the checkpointer is woken up by SIGINT and calls CreateCheckPoint() to perform the checkpoint. During recovery it calls CreateRestartPoint() instead.
# Source path: src/backend/postmaster/checkpointer.c
/*
 * Do the checkpoint.
 */
if (!do_restartpoint)
{
    CreateCheckPoint(flags);
    ckpt_performed = true;
}
else
    ckpt_performed = CreateRestartPoint(flags);
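CreateCheckPoint() is the main implementation of a checkpoint; see its source for the details.
Once the dirty data has been flushed, CreateCheckPoint() writes a record to the WAL whose main content is the CheckPoint struct, as follows:
# Source path: src/backend/access/transam/xlog.c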
/*
 * Now insert the checkpoint record into XLOG.
 */
XLogBeginInsert();
XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));
recptr = XLogInsert(RM_XLOG_ID,
                    shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
                    XLOG_CHECKPOINT_ONLINE);
XLogFlush(recptr);
The CheckPoint struct (declared in src/include/catalog/pg_control.h):
/*
 * Body of CheckPoint XLOG records. This is declared here because we keep
 * a copy of the latest one in pg_control for possible disaster recovery.
 * Changing this struct requires a PG_CONTROL_VERSION bump.
 */
typedef struct CheckPoint
{
    XLogRecPtr  redo;               /* next RecPtr available when we began to
                                     * create CheckPoint (i.e. REDO start point) */
    TimeLineID  ThisTimeLineID;     /* current TLI */
    TimeLineID  PrevTimeLineID;     /* previous TLI, if this record begins a new
                                     * timeline (equals ThisTimeLineID otherwise) */
    bool        fullPageWrites;     /* current full_page_writes */
    FullTransactionId nextFullXid;  /* next free full transaction ID */
    Oid         nextOid;            /* next free OID */
    MultiXactId nextMulti;          /* next free MultiXactId */
    MultiXactOffset nextMultiOffset;    /* next free MultiXact offset */
    TransactionId oldestXid;        /* cluster-wide minimum datfrozenxid */
    Oid         oldestXidDB;        /* database with minimum datfrozenxid */
    MultiXactId oldestMulti;        /* cluster-wide minimum datminmxid */
    Oid         oldestMultiDB;      /* database with minimum datminmxid */
    pg_time_t   time;               /* time stamp of checkpoint */
    TransactionId oldestCommitTsXid;    /* oldest Xid with valid commit
                                         * timestamp */
    TransactionId newestCommitTsXid;    /* newest Xid with valid commit
                                         * timestamp */

    /*
     * Oldest XID still running. This is only needed to initialize hot standby
     * mode from an online checkpoint, so we only bother calculating this for
     * online checkpoints and only when wal_level is replica. Otherwise it's
     * set to InvalidTransactionId.
     */
    TransactionId oldestActiveXid;
} CheckPoint;
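The redo field is an XLogRecPtr, a 64-bit WAL position that PostgreSQL prints as two hexadecimal halves separated by a slash (for example in pg_controldata output). As a small sketch of that convention, the helpers below format and parse such a value; DemoXLogRecPtr and both function names are hypothetical stand-ins, not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t DemoXLogRecPtr;   /* stand-in for XLogRecPtr */

/* Format a WAL position as "high/low" in upper-case hex.
 * Returns the number of characters snprintf would have written. */
static int
demo_format_lsn(DemoXLogRecPtr lsn, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%X/%X",
                    (unsigned int) (lsn >> 32),
                    (unsigned int) (lsn & 0xFFFFFFFFu));
}

/* Parse "high/low" back into a 64-bit position. Returns 0 on success. */
static int
demo_parse_lsn(const char *text, DemoXLogRecPtr *out)
{
    unsigned int hi;
    unsigned int lo;

    assert(text != NULL && out != NULL);
    if (sscanf(text, "%X/%X", &hi, &lo) != 2)
        return -1;
    *out = ((DemoXLogRecPtr) hi << 32) | lo;
    return 0;
}
```

Reading the redo location from pg_control in this form makes it easy to compare against WAL file names and against the positions reported by the server at runtime.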