percona bug#1162085, bug#1070856

本文介绍percona修复的两个binlog cache相关的bug，percona bug#1162085 (mysql bug#66237) 和percona bug#1070856，其中percona bug#1070856是修复percona bug#1162085时引入的

percona bug#1162085

这个bug会导致tmp目录被写满，对于一个写入较大的操作，如LOAD DATA / INSERT large_dataset, 全表UPDATE / DELETE, 事务对应的binlog cache会使用到临时文件（ML开头，类似：/tmp/MLjw4ecJ，用lsof查看，状态为deleted）作为缓冲（IO_CACHE），事务在提交后，binlog_cache_data.truncate函数被被调用，但是此函数只会初始化IO_CACHE中的内存缓冲，临时缓冲文件不做清理，只有在session被关闭时，close_cached_file被调用，IO_CACHE对象被销毁，这种策略设计初衷应该是：session退出之前，临时缓冲文件能够一直被使用；然而会出现一个问题：临时缓冲文件会一直存在，且只会越变越大，如果有很多session一直不退出且执行过很大的操作，就会有tmp目录被写满的风险

percona bug#1162085对应的patch的解决办法是：binlog_cache_data.truncate中增加truncate临时缓冲文件的操作，即对于大操作（超过binlog_cache_size），每次完成后对临时缓冲文件做一次truncate操作: binlog_cache_mngr.reset –> binlog_cache_data.reset –> binlog_cache_data.truncate，部分patch：

--- Percona-Server/mysys/mf_cache.c	2011-06-30 15:46:53 +0000
+++ Percona-Server/mysys/mf_cache.c	2012-10-26 05:11:24 +0000
@@ -119,3 +119,21 @@
   }
   DBUG_VOID_RETURN;
 }
+
+/*
+  Truncate the cached file to a given offset. The cache must be reinitialized
+  with reinit_io_cache() after this call.
+*/
+
+my_bool truncate_cached_file(IO_CACHE *cache, my_off_t pos)
+{
+  DBUG_ENTER("truncate_cached_file");
+
+  if (my_b_inited(cache) && cache->file > -1)
+  {
+    if (my_chsize(cache->file, pos, 0, MYF(MY_WME)))
+      DBUG_RETURN(TRUE);
+  }
+
+  DBUG_RETURN(FALSE);
+}
 
--- Percona-Server/sql/log.cc	2012-10-17 03:47:45 +0000
+++ Percona-Server/sql/log.cc	2012-10-26 05:11:24 +0000
@@ -395,6 +395,11 @@
       delete pending();
       set_pending(0);
     }
+    /*
+      Truncate the temporary file to reclaim disk space occupied by cached
+      transactions on COMMIT/ROLLBACK.
+    */
+    truncate_cached_file(&cache_log, pos);
     reinit_io_cache(&cache_log, WRITE_CACHE, pos, 0, 0);
     cache_log.end_of_file= saved_max_binlog_cache_size;
   }

percona bug#1070856

然而，percona bug#1162085的patch却引入了percona bug#1070856，binlog_cache_data.truncate除了被binlog_cache_data.reset调用之外，还会被binlog_cache_data.restore_savepoint/restore_prev_posiiton调用，这两个函数都是处理savepoint/rollback相关的，新建一个savepoint的调用函数栈：

#0  binlog_savepoint_set (hton=0x2002d60, thd=0x62637c0, sv=0x7fb6e0006d50) at sql/log.cc:2043
#1  0x0000000000768bde in ha_savepoint (thd=0x62637c0, sv=0x7fb6e0006d20) at sql/handler.cc:1876
#2  0x00000000006ed477 in trans_savepoint (thd=0x62637c0, name=...) at sql/transaction.cc:422
#3  0x00000000005fdd10 in mysql_execute_command (thd=0x62637c0) at sql/sql_parse.cc:3961
#4  0x0000000000602478 in mysql_parse (thd=0x62637c0, rawbuf=0x7fb6e0004c90 "SAVEPOINT savepoint_1", length=21, parser_state=0x7fb6ec5d0ae0) at sql/sql_parse.cc:5805
#5  0x00000000005f5fa1 in dispatch_command (command=COM_QUERY, thd=0x62637c0, packet=0x62fefe1 "SAVEPOINT savepoint_1", packet_length=21) at sql_parse.cc:1060
#6  0x00000000005f5240 in do_command (thd=0x62637c0) at sql/sql_parse.cc:788
#7  0x00000000006dbfc0 in do_handle_one_connection (thd_arg=0x62637c0) at sql/sql_connect.cc:1409
#8  0x00000000006dba77 in handle_one_connection (arg=0x62637c0) at sql/sql_connect.cc:1315
#9  0x0000003659e077e1 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003659ae68ed in clone () from /lib64/libc.so.6

binlog_savepoint_set 函数中my_off_t *sv是传入参数，记录了当前binlog cache写入位置，事务中所有savepoint通过thd->transaction.savepoints链表管理，当rollback时，先在链表中找到对应的savepoint节点，然后得到其在binlog cache中的位置，对该位置之后的binlog做truncate操作，ROLLBACK TO SAVEPOINT sv_name对应的函数栈：

#0  0x000000000081b4f3 in binlog_cache_data::truncate (this=0x7fb6e000a0c0, pos=49558) at sql/log.cc:400
#1  0x000000000081b323 in binlog_cache_data::restore_savepoint (this=0x7fb6e000a0c0, pos=49558) at sql/log.cc:286
#2  0x000000000080c2e9 in binlog_trans_log_truncate (thd=0x62637c0, pos=49558) at sql/log.cc:1582
#3  0x000000000080d56f in binlog_savepoint_rollback (hton=0x2002d60, thd=0x62637c0, sv=0x7fb6e0006e30) at sql/log.cc:2083
#4  0x0000000000768935 in ha_rollback_to_savepoint (thd=0x62637c0, sv=0x7fb6e0006e00) at sql/handler.cc:1819
#5  0x00000000006ed605 in trans_rollback_to_savepoint (thd=0x62637c0, name=...) at sql/transaction.cc:479
#6  0x00000000005fdcbe in mysql_execute_command (thd=0x62637c0) at sql/sql_parse.cc:3956
#7  0x0000000000602478 in mysql_parse (thd=0x62637c0, rawbuf=0x7fb6e0004c90 "ROLLBACK TO SAVEPOINT savepoint_1", length=33, parser_state=0x7fb6ec5d0ae0) at sql/sql_parse.cc:5805
#8  0x00000000005f5fa1 in dispatch_command (command=COM_QUERY, thd=0x62637c0, packet=0x62fefe1 "ROLLBACK TO SAVEPOINT savepoint_1", packet_length=33) at sql/sql_parse.cc:1060
#9  0x00000000005f5240 in do_command (thd=0x62637c0) at sql/sql_parse.cc:788
#10 0x00000000006dbfc0 in do_handle_one_connection (thd_arg=0x62637c0) at sql/sql_connect.cc:1409
#11 0x00000000006dba77 in handle_one_connection (arg=0x62637c0) at sql/sql_connect.cc:1315
#12 0x0000003659e077e1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003659ae68ed in clone () from /lib64/libc.so.6

问题出现在patch之后的binlog_cache_data::truncate函数，由于增加了truncate_cached_file（在linux环境下最终会调用ftruncate），如果传入参数pos大于IO_CACHE中的临时缓冲文件大小，则文件末尾会被填充0，填充后的文件大小为pos，此时事务对应的binlog cache就被写脏数据了，原因是IO_CACHE的写入策略是：如果内存缓冲区有空闲，写到内存，内存缓冲区写满后写回临时缓冲文件，因此实际上整个binlog cache的大小为：临时缓冲文件+内存缓冲区，从获取IO_CACHE中当前数据大小的宏my_b_tell的计算方法中可以看出来，因此在内存缓冲中的数据刷到临时缓冲文件之前，整个缓冲其实是两部分，只不过对外部调用是透明的

而实际上，只要IO_CACHE中的内存缓冲区有数据没有刷到临时缓冲文件中，此时truncate_cached_file的传入参数pos就会大于临时缓冲文件大小，truncate_cached_file调用结束后，binlog cache中临时缓冲文件和内存缓冲文件之间被插入了很多0，最终会导致mysqld hang，程序进入了死循环：

MYSQL_BIN_LOG::write_cache
...
      while (hdr_offs < length)
      {
        /*
          partial header only? save what we can get, process once
          we get the rest.
        */
 
        if (hdr_offs + LOG_EVENT_HEADER_LEN > length)
        {
          carry= length - hdr_offs;
          memcpy(header, (char *)cache->read_pos + hdr_offs, carry);
          length= hdr_offs;
        }
        else
        {
          /* we've got a full event-header, and it came in one piece */
          uchar *log_pos= (uchar *)cache->read_pos + hdr_offs + LOG_POS_OFFSET;
 
          /* fix end_log_pos */
          val= uint4korr(log_pos) + group;
          int4store(log_pos, val);
 
          /* next event header at ... */
          log_pos= (uchar *)cache->read_pos + hdr_offs + EVENT_LEN_OFFSET;
          hdr_offs += uint4korr(log_pos);
        }
      }
...

MySQL hang原因

将IO_CACHE中的binlog cache写入到binlog中时，会通过每个binlog event的头部来获取binlog event的长度的，由于binlog cache临时缓冲文件末尾被填充了很多0，导致while循环无法退出，通过perf top可以看出MYSQL_BIN_LOG::write_cache占用了96%的cpu

   samples  pcnt function                                            DSO
   _______ _____ ___________________________________________________ _________________________________________________________________________________
 
  14095.00 96.4% _ZN13MYSQL_BIN_LOG11write_cacheEP3THDP11st_io_cache /u01/mysql/bin/mysqld                                                            
     61.00  0.4% find_busiest_group                                  [kernel.kallsyms]                                                                
     42.00  0.3% _spin_lock                                          [kernel.kallsyms]                                                                
     41.00  0.3% port_inb                                            /lib/modules/2.6.32-220.17.1.tb619.el6.x86_64/kernel/drivers/char/ipmi/ipmi_si.ko
     36.00  0.2% sync_array_print_long_waits                         /u01/mysql/bin/mysqld                                                            
     33.00  0.2% srv_lock_timeout_thread                             /u01/mysql/bin/mysqld

解决方案

truncate_cached_file中，在对cache file进行truncate之前，判断传入参数pos与临时缓冲文件大小的关系，只有当pos小于文件大小时才做ftruncate操作，完整patch