本文介绍percona修复的两个binlog cache相关的bug,percona bug#1162085 (mysql bug#66237) 和percona bug#1070856,其中percona bug#1070856是修复percona bug#1162085时引入的
percona bug#1162085
这个bug会导致tmp目录被写满,对于一个写入较大的操作,如LOAD DATA / INSERT large_dataset, 全表UPDATE / DELETE, 事务对应的binlog cache会使用到临时文件(ML开头,类似:/tmp/MLjw4ecJ,用lsof查看,状态为deleted)作为缓冲(IO_CACHE),事务在提交后,binlog_cache_data.truncate函数被被调用,但是此函数只会初始化IO_CACHE中的内存缓冲,临时缓冲文件不做清理,只有在session被关闭时,close_cached_file被调用,IO_CACHE对象被销毁,这种策略设计初衷应该是:session退出之前,临时缓冲文件能够一直被使用;然而会出现一个问题:临时缓冲文件会一直存在,且只会越变越大,如果有很多session一直不退出且执行过很大的操作,就会有tmp目录被写满的风险
percona bug#1162085对应的patch的解决办法是:binlog_cache_data.truncate中增加truncate临时缓冲文件的操作,即对于大操作(超过binlog_cache_size),每次完成后对临时缓冲文件做一次truncate操作: binlog_cache_mngr.reset –> binlog_cache_data.reset –> binlog_cache_data.truncate,部分patch:
--- Percona-Server/mysys/mf_cache.c 2011-06-30 15:46:53 +0000 +++ Percona-Server/mysys/mf_cache.c 2012-10-26 05:11:24 +0000 @@ -119,3 +119,21 @@ } DBUG_VOID_RETURN; } + +/* + Truncate the cached file to a given offset. The cache must be reinitialized + with reinit_io_cache() after this call. +*/ + +my_bool truncate_cached_file(IO_CACHE *cache, my_off_t pos) +{ + DBUG_ENTER("truncate_cached_file"); + + if (my_b_inited(cache) && cache->file > -1) + { + if (my_chsize(cache->file, pos, 0, MYF(MY_WME))) + DBUG_RETURN(TRUE); + } + + DBUG_RETURN(FALSE); +} --- Percona-Server/sql/log.cc 2012-10-17 03:47:45 +0000 +++ Percona-Server/sql/log.cc 2012-10-26 05:11:24 +0000 @@ -395,6 +395,11 @@ delete pending(); set_pending(0); } + /* + Truncate the temporary file to reclaim disk space occupied by cached + transactions on COMMIT/ROLLBACK. + */ + truncate_cached_file(&cache_log, pos); reinit_io_cache(&cache_log, WRITE_CACHE, pos, 0, 0); cache_log.end_of_file= saved_max_binlog_cache_size; } |
percona bug#1070856
然而,percona bug#1162085的patch却引入了percona bug#1070856,binlog_cache_data.truncate除了被binlog_cache_data.reset调用之外,还会被binlog_cache_data.restore_savepoint/restore_prev_posiiton调用,这两个函数都是处理savepoint/rollback相关的,新建一个savepoint的调用函数栈:
#0 binlog_savepoint_set (hton=0x2002d60, thd=0x62637c0, sv=0x7fb6e0006d50) at sql/log.cc:2043 #1 0x0000000000768bde in ha_savepoint (thd=0x62637c0, sv=0x7fb6e0006d20) at sql/handler.cc:1876 #2 0x00000000006ed477 in trans_savepoint (thd=0x62637c0, name=...) at sql/transaction.cc:422 #3 0x00000000005fdd10 in mysql_execute_command (thd=0x62637c0) at sql/sql_parse.cc:3961 #4 0x0000000000602478 in mysql_parse (thd=0x62637c0, rawbuf=0x7fb6e0004c90 "SAVEPOINT savepoint_1", length=21, parser_state=0x7fb6ec5d0ae0) at sql/sql_parse.cc:5805 #5 0x00000000005f5fa1 in dispatch_command (command=COM_QUERY, thd=0x62637c0, packet=0x62fefe1 "SAVEPOINT savepoint_1", packet_length=21) at sql_parse.cc:1060 #6 0x00000000005f5240 in do_command (thd=0x62637c0) at sql/sql_parse.cc:788 #7 0x00000000006dbfc0 in do_handle_one_connection (thd_arg=0x62637c0) at sql/sql_connect.cc:1409 #8 0x00000000006dba77 in handle_one_connection (arg=0x62637c0) at sql/sql_connect.cc:1315 #9 0x0000003659e077e1 in start_thread () from /lib64/libpthread.so.0 #10 0x0000003659ae68ed in clone () from /lib64/libc.so.6 |
binlog_savepoint_set 函数中my_off_t *sv是传入参数,记录了当前binlog cache写入位置,事务中所有savepoint通过thd->transaction.savepoints链表管理,当rollback时,先在链表中找到对应的savepoint节点,然后得到其在binlog cache中的位置,对该位置之后的binlog做truncate操作,ROLLBACK TO SAVEPOINT sv_name对应的函数栈:
#0 0x000000000081b4f3 in binlog_cache_data::truncate (this=0x7fb6e000a0c0, pos=49558) at sql/log.cc:400 #1 0x000000000081b323 in binlog_cache_data::restore_savepoint (this=0x7fb6e000a0c0, pos=49558) at sql/log.cc:286 #2 0x000000000080c2e9 in binlog_trans_log_truncate (thd=0x62637c0, pos=49558) at sql/log.cc:1582 #3 0x000000000080d56f in binlog_savepoint_rollback (hton=0x2002d60, thd=0x62637c0, sv=0x7fb6e0006e30) at sql/log.cc:2083 #4 0x0000000000768935 in ha_rollback_to_savepoint (thd=0x62637c0, sv=0x7fb6e0006e00) at sql/handler.cc:1819 #5 0x00000000006ed605 in trans_rollback_to_savepoint (thd=0x62637c0, name=...) at sql/transaction.cc:479 #6 0x00000000005fdcbe in mysql_execute_command (thd=0x62637c0) at sql/sql_parse.cc:3956 #7 0x0000000000602478 in mysql_parse (thd=0x62637c0, rawbuf=0x7fb6e0004c90 "ROLLBACK TO SAVEPOINT savepoint_1", length=33, parser_state=0x7fb6ec5d0ae0) at sql/sql_parse.cc:5805 #8 0x00000000005f5fa1 in dispatch_command (command=COM_QUERY, thd=0x62637c0, packet=0x62fefe1 "ROLLBACK TO SAVEPOINT savepoint_1", packet_length=33) at sql/sql_parse.cc:1060 #9 0x00000000005f5240 in do_command (thd=0x62637c0) at sql/sql_parse.cc:788 #10 0x00000000006dbfc0 in do_handle_one_connection (thd_arg=0x62637c0) at sql/sql_connect.cc:1409 #11 0x00000000006dba77 in handle_one_connection (arg=0x62637c0) at sql/sql_connect.cc:1315 #12 0x0000003659e077e1 in start_thread () from /lib64/libpthread.so.0 #13 0x0000003659ae68ed in clone () from /lib64/libc.so.6 |
问题出现在patch之后的binlog_cache_data::truncate函数,由于增加了truncate_cached_file(在linux环境下最终会调用ftruncate),如果传入参数pos大于IO_CACHE中的临时缓冲文件大小,则文件末尾会被填充0,填充后的文件大小为pos,此时事务对应的binlog cache就被写脏数据了,原因是IO_CACHE的写入策略是:如果内存缓冲区有空闲,写到内存,内存缓冲区写满后写回临时缓冲文件,因此实际上整个binlog cache的大小为:临时缓冲文件+内存缓冲区,从获取IO_CACHE中当前数据大小的宏my_b_tell的计算方法中可以看出来,因此在内存缓冲中的数据刷到临时缓冲文件之前,整个缓冲其实是两部分,只不过对外部调用是透明的
而实际上,只要IO_CACHE中的内存缓冲区有数据没有刷到临时缓冲文件中,此时truncate_cached_file的传入参数pos就会大于临时缓冲文件大小,truncate_cached_file调用结束后,binlog cache中临时缓冲文件和内存缓冲文件之间被插入了很多0,最终会导致mysqld hang,程序进入了死循环:
MYSQL_BIN_LOG::write_cache ... while (hdr_offs < length) { /* partial header only? save what we can get, process once we get the rest. */ if (hdr_offs + LOG_EVENT_HEADER_LEN > length) { carry= length - hdr_offs; memcpy(header, (char *)cache->read_pos + hdr_offs, carry); length= hdr_offs; } else { /* we've got a full event-header, and it came in one piece */ uchar *log_pos= (uchar *)cache->read_pos + hdr_offs + LOG_POS_OFFSET; /* fix end_log_pos */ val= uint4korr(log_pos) + group; int4store(log_pos, val); /* next event header at ... */ log_pos= (uchar *)cache->read_pos + hdr_offs + EVENT_LEN_OFFSET; hdr_offs += uint4korr(log_pos); } } ... |
MySQL hang原因
将IO_CACHE中的binlog cache写入到binlog中时,会通过每个binlog event的头部来获取binlog event的长度的,由于binlog cache临时缓冲文件末尾被填充了很多0,导致while循环无法退出,通过perf top可以看出MYSQL_BIN_LOG::write_cache占用了96%的cpu
samples pcnt function DSO _______ _____ ___________________________________________________ _________________________________________________________________________________ 14095.00 96.4% _ZN13MYSQL_BIN_LOG11write_cacheEP3THDP11st_io_cache /u01/mysql/bin/mysqld 61.00 0.4% find_busiest_group [kernel.kallsyms] 42.00 0.3% _spin_lock [kernel.kallsyms] 41.00 0.3% port_inb /lib/modules/2.6.32-220.17.1.tb619.el6.x86_64/kernel/drivers/char/ipmi/ipmi_si.ko 36.00 0.2% sync_array_print_long_waits /u01/mysql/bin/mysqld 33.00 0.2% srv_lock_timeout_thread /u01/mysql/bin/mysqld |
解决方案
truncate_cached_file中,在对cache file进行truncate之前,判断传入参数pos与临时缓冲文件大小的关系,只有当pos小于文件大小时才做ftruncate操作,完整patch