An Analysis of PostgreSQL Checkpoint Logic

Source version: PostgreSQL 13.3

1. Checkpoint trigger conditions

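  • Time-based: controlled by the checkpoint_timeout parameter (default 5 minutes, range 30 s to 1 day).
  • WAL-based: triggered when the distance between the latest WAL position and the WAL position of the previous checkpoint exceeds a threshold. Related parameters: max_wal_size and checkpoint_completion_target.
  • Manual: run the CHECKPOINT command (superuser only).
  • In addition, a checkpoint is performed at normal database shutdown, at the start of a backup, and at the end of crash recovery.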
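
The WAL-based threshold is not set directly; the server derives a segment count from max_wal_size and checkpoint_completion_target. The sketch below models that derivation as a standalone function. The function and parameter names are hypothetical, and the formula is an approximation of what CalculateCheckpointSegments() in xlog.c appears to do, not a copy of the server code.

```c
#include <assert.h>

/*
 * Hypothetical helper: estimate how many WAL segments may accumulate
 * after a checkpoint before a WAL-triggered checkpoint is requested.
 */
static int
checkpoint_segments(int max_wal_size_mb, int wal_segment_size_mb,
					double completion_target)
{
	double		target;
	int			segs;

	assert(max_wal_size_mb > 0 && wal_segment_size_mb > 0);
	assert(completion_target >= 0.0 && completion_target <= 1.0);

	/* max_wal_size in whole segments, divided by (1 + completion target) */
	target = (double) (max_wal_size_mb / wal_segment_size_mb) /
		(1.0 + completion_target);

	segs = (int) target;		/* round down */
	return segs < 1 ? 1 : segs;	/* never less than one segment */
}
```

With max_wal_size = 1GB, 16MB segments and checkpoint_completion_target = 0.5, this gives 42 segments, or roughly 672MB of WAL between checkpoints.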

2. Main checkpoint flow

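PostgreSQL has a dedicated background process for checkpoints, the checkpointer. Its entry function is CheckpointerMain(), which runs an infinite loop and performs a checkpoint automatically whenever one of the trigger conditions is met.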

A checkpoint can also be run manually. Entering the CHECKPOINT command in psql calls RequestCheckpoint(), which sends SIGINT to the checkpointer process to ask it to perform a checkpoint.

On receiving SIGINT, the checkpointer wakes up inside its for loop and starts the checkpoint; the function that does the actual work is CreateCheckPoint().
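
The decision made on each pass of that loop can be sketched as a pure function: run a checkpoint if a request is pending or if checkpoint_timeout has elapsed since the last one. This is a simplified model, not the server's code; the flag names, the function name and its parameters are all hypothetical, and pending requests are represented as a plain flags word.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical flag bits, for illustration only. */
#define REQ_NONE		0x00
#define REQ_EXPLICIT	0x01	/* a backend asked for a checkpoint */
#define CAUSE_TIME		0x02	/* checkpoint_timeout has elapsed */

/*
 * Decide whether this loop iteration should run a checkpoint.
 * Returns the flags to run with, or REQ_NONE to go back to sleep.
 */
static int
checkpoint_due(time_t now, time_t last_checkpoint, int timeout_secs,
			   int pending_flags)
{
	int			flags = pending_flags;

	assert(timeout_secs > 0);

	/* a timed checkpoint is due once the timeout has fully elapsed */
	if (now - last_checkpoint >= timeout_secs)
		flags |= CAUSE_TIME;

	return flags;
}
```

A return value of REQ_NONE corresponds to the loop going back to sleep until the next signal or timeout.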

3. Main logic of CreateCheckPoint()

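  1. Determine the checkpoint's redo point, i.e. the current WAL insert position at the moment the checkpoint starts. If the server crashes after the checkpoint completes, recovery replays WAL from this position.
  2. Call CheckPointGuts() to flush dirty pages:
    1. Scan the buffer pool and flag every dirty (BM_DIRTY) buffer as BM_CHECKPOINT_NEEDED. This step happens entirely in memory and involves no disk I/O.
    2. Flush the data files: write out the buffers flagged BM_CHECKPOINT_NEEDED and fsync them to disk. This is the step that does disk I/O.
  3. Record the checkpoint itself as a WAL record whose body is the CheckPoint struct.
  4. Update the control file pg_control.
  5. Remove or recycle (rename) WAL files that are no longer needed.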
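
The two-pass buffer handling in step 2 can be illustrated with a toy model: the first pass only tags buffers that are dirty at that moment, and the second pass writes out exactly the tagged ones, so a buffer dirtied in between is left for the next checkpoint. Everything below (flag names, function names, the array of flag words) is a hypothetical stand-in, not the server's buffer manager.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the server's BM_DIRTY / BM_CHECKPOINT_NEEDED. */
#define TOY_DIRTY				0x1u
#define TOY_CHECKPOINT_NEEDED	0x2u

/* Pass 1: tag every buffer that is dirty right now (memory only). */
static size_t
toy_mark_buffers(unsigned *flags, size_t nbuffers)
{
	size_t		marked = 0;

	assert(flags != NULL);
	for (size_t i = 0; i < nbuffers; i++)
	{
		if (flags[i] & TOY_DIRTY)
		{
			flags[i] |= TOY_CHECKPOINT_NEEDED;
			marked++;
		}
	}
	return marked;
}

/* Pass 2: "write out" only the tagged buffers and clear their flags. */
static size_t
toy_write_buffers(unsigned *flags, size_t nbuffers)
{
	size_t		written = 0;

	assert(flags != NULL);
	for (size_t i = 0; i < nbuffers; i++)
	{
		if (flags[i] & TOY_CHECKPOINT_NEEDED)
		{
			/* a real system would write the page here and fsync later */
			flags[i] &= ~(TOY_DIRTY | TOY_CHECKPOINT_NEEDED);
			written++;
		}
	}
	return written;
}
```

Splitting the work this way bounds the checkpoint: pages dirtied after the first pass do not extend the amount of data the second pass has to write.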

4. Source excerpts

The CHECKPOINT utility statement calls RequestCheckpoint() to request a checkpoint.

# Source path: src/backend/tcop/utility.c
case T_CheckPointStmt:
	if (!superuser())
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("must be superuser to do CHECKPOINT")));

	RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT |
					  (RecoveryInProgress() ? 0 : CHECKPOINT_FORCE));
	break;
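
The flag combination built in this branch can be isolated into a small function, which makes the difference between a primary and a server in recovery easier to see. The function name is hypothetical, and the numeric flag values are assumed to match src/include/access/xlog.h in the 13.x branch; check them against the header before relying on them.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match src/include/access/xlog.h (13.x); verify before use. */
#define CHECKPOINT_IMMEDIATE	0x0004	/* do it without delays */
#define CHECKPOINT_FORCE		0x0008	/* force even if no activity */
#define CHECKPOINT_WAIT			0x0020	/* wait for completion */

/* Hypothetical helper: flags requested by the CHECKPOINT command. */
static int
checkpoint_stmt_flags(bool recovery_in_progress)
{
	int			flags = CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT;

	/* during recovery the request is made without FORCE */
	if (!recovery_in_progress)
		flags |= CHECKPOINT_FORCE;

	assert(flags & CHECKPOINT_WAIT);	/* the command always waits */
	return flags;
}
```

Because CHECKPOINT_WAIT is always set, the session that issues the command blocks until the checkpointer reports completion.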

In RequestCheckpoint(), if the checkpointer process exists, it is sent SIGINT to ask it to perform a checkpoint. The code is as follows:

# Source path: src/backend/postmaster/checkpointer.c
if (CheckpointerShmem->checkpointer_pid == 0)
{
	if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
	{
		elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
			 "could not signal for checkpoint: checkpointer is not running");
		break;
	}
}
else if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
{
	if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
	{
		elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
			 "could not signal for checkpoint: %m");
		break;
	}
}
else
	break;				/* signal sent successfully */
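
The excerpt above is the body of a retry loop: a caller that does not wait gives up after the first failed attempt, while a waiting caller keeps trying up to a limit. A minimal sketch of that control flow follows, with the signal delivery abstracted behind a callback so it can be exercised without a real process. The callback type, function names and the flaky_send test double are hypothetical, and the pause between attempts is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true once the signal was delivered. */
typedef bool (*send_fn) (void *ctx);

/*
 * Try to deliver a checkpoint request.
 * Returns the number of attempts used on success, or -1 on giving up.
 */
static int
signal_with_retry(send_fn send, void *ctx, int max_tries, bool wait)
{
	assert(send != NULL);

	for (int ntries = 0;; ntries++)
	{
		if (send(ctx))
			return ntries + 1;	/* signal sent successfully */

		/* non-waiting callers, or callers out of retries, give up */
		if (ntries >= max_tries || !wait)
			return -1;

		/* a real implementation would sleep briefly before retrying */
	}
}

/* Test double: fails until the counter behind ctx reaches zero. */
static bool
flaky_send(void *ctx)
{
	int		   *remaining_failures = ctx;

	if (*remaining_failures > 0)
	{
		(*remaining_failures)--;
		return false;
	}
	return true;
}
```

For instance, with three initial failures a waiting caller succeeds on the fourth attempt, whereas a non-waiting caller gives up immediately.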

In CheckpointerMain(), the checkpointer is woken by SIGINT and calls CreateCheckPoint() to perform the checkpoint.

# Source path: src/backend/postmaster/checkpointer.c
/*
 * Do the checkpoint.
 */
if (!do_restartpoint)
{
	CreateCheckPoint(flags);
	ckpt_performed = true;
}
else
	ckpt_performed = CreateRestartPoint(flags);

CreateCheckPoint() is the main implementation of the checkpoint; see its source for the details.

Once the dirty pages have been flushed, CreateCheckPoint() writes a record to WAL whose body is the CheckPoint struct, as shown here:

# Source path: src/backend/access/transam/xlog.c
/*
 * Now insert the checkpoint record into XLOG.
 */
XLogBeginInsert();
XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));
recptr = XLogInsert(RM_XLOG_ID,
					shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
					XLOG_CHECKPOINT_ONLINE);

XLogFlush(recptr);

The CheckPoint struct (src/include/catalog/pg_control.h):

/*
 * Body of CheckPoint XLOG records.  This is declared here because we keep
 * a copy of the latest one in pg_control for possible disaster recovery.
 * Changing this struct requires a PG_CONTROL_VERSION bump.
 */
typedef struct CheckPoint
{
	XLogRecPtr	redo;			/* next RecPtr available when we began to
								 * create CheckPoint (i.e. REDO start point) */
	TimeLineID	ThisTimeLineID; /* current TLI */
	TimeLineID	PrevTimeLineID; /* previous TLI, if this record begins a new
								 * timeline (equals ThisTimeLineID otherwise) */
	bool		fullPageWrites; /* current full_page_writes */
	FullTransactionId nextFullXid;	/* next free full transaction ID */
	Oid			nextOid;		/* next free OID */
	MultiXactId nextMulti;		/* next free MultiXactId */
	MultiXactOffset nextMultiOffset;	/* next free MultiXact offset */
	TransactionId oldestXid;	/* cluster-wide minimum datfrozenxid */
	Oid			oldestXidDB;	/* database with minimum datfrozenxid */
	MultiXactId oldestMulti;	/* cluster-wide minimum datminmxid */
	Oid			oldestMultiDB;	/* database with minimum datminmxid */
	pg_time_t	time;			/* time stamp of checkpoint */
	TransactionId oldestCommitTsXid;	/* oldest Xid with valid commit
										 * timestamp */
	TransactionId newestCommitTsXid;	/* newest Xid with valid commit
										 * timestamp */

	/*
	 * Oldest XID still running. This is only needed to initialize hot standby
	 * mode from an online checkpoint, so we only bother calculating this for
	 * online checkpoints and only when wal_level is replica. Otherwise it's
	 * set to InvalidTransactionId.
	 */
	TransactionId oldestActiveXid;
} CheckPoint;
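
The redo field is a 64-bit WAL position, conventionally displayed as two hexadecimal halves separated by a slash. The sketch below shows that formatting and how a position maps to a WAL segment number. ToyRecPtr and both function names are hypothetical stand-ins, and the segment size is passed in explicitly rather than read from the server.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t ToyRecPtr;		/* stand-in for XLogRecPtr */

/* Format a WAL position as "hi/lo" in hex; returns the string length. */
static int
format_lsn(ToyRecPtr ptr, char *buf, size_t buflen)
{
	assert(buf != NULL && buflen > 0);
	return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
					(uint32_t) (ptr >> 32), (uint32_t) ptr);
}

/* Which WAL segment contains this position, for a given segment size. */
static uint64_t
wal_segment_no(ToyRecPtr ptr, uint64_t wal_segment_size)
{
	assert(wal_segment_size > 0);
	return ptr / wal_segment_size;
}
```

For example, a redo value of 0x16B3748 is displayed as 0/16B3748 and, with 16MB segments, falls in segment 1.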
