Source Code Analysis: How InnoDB Uses Linux Native AIO

By | December 22, 2013

The palest ink beats the best memory. This post is a set of study notes and a summary, covering the Linux native AIO API and how InnoDB uses native AIO.

Asynchronous I/O

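Linux offers two asynchronous I/O APIs, posix aio and native aio. Native aio performs better, but to be truly asynchronous it needs the file to be opened with O_DIRECT; on a buffered file the submit call may block. A quick list of the two interfaces: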

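posix aio
function      desc
aio_read      submit an asynchronous read
aio_write     submit an asynchronous write
aio_return    get the return value of a completed io request
aio_error     get the error status of an io request
aio_suspend   wait for io requests to complete
aio_cancel    cancel an io request
lio_listio    submit a list of io requests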

native aio
function      desc
io_setup      create an io context
io_submit     submit io requests
io_getevents  collect completed io requests
io_cancel     cancel an io request
io_destroy    destroy an io context

This post does not cover posix aio any further and concentrates on native aio. The function prototypes are:

1. int io_setup(int maxevents, io_context_t *ctxp);
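|- Creates an asynchronous io context. io_context_t is an opaque handle (the raw system call interface represents it as the integer type aio_context_t) and must be zeroed before the call; maxevents caps the number of asynchronous io requests that can be in flight at once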

2. long io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp);
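|- Submits asynchronous io requests, which may be reads or writes. nr is the length of the iocbpp array; the call returns as soon as the requests have been queued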

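3. long io_getevents(aio_context_t ctx_id, long min_nr, long nr, struct io_event *events, struct timespec *timeout);
|- Hands the kernel an io_event array into which it copies completed io requests. The array must have room for nr entries (InnoDB sizes it to match the maxevents given to io_setup). The call blocks until at least min_nr requests have completed; if timeout is non-NULL it returns once the timeout expires. The return value is the number of completed io requests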

4. int io_destroy(io_context_t ctx);
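|- Destroys the asynchronous io context. If io requests are still outstanding it tries to cancel them, and waits for those that cannot be cancelled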

5. long io_cancel(aio_context_t ctx_id, struct iocb *iocb, struct io_event *result);
|- Cancels a single asynchronous io request and stores the outcome in result

Calls 1 to 4 make up one complete asynchronous io cycle: initialize the context, submit io requests, /*do something else… */, wait for the io to complete, destroy the context
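The cycle above can be sketched end to end. This is a minimal sketch, not InnoDB code: it calls the raw system calls through syscall() and the kernel header linux/aio_abi.h (whose iocb field names differ from the libaio struct shown later), so it builds without libaio. The sys_* wrappers and aio_read_once are names made up for this example, and the file is opened without O_DIRECT, so the kernel may service the read synchronously inside io_submit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Thin wrappers (hypothetical names): glibc does not export these calls. */
static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
    return syscall(SYS_io_setup, nr, ctx);
}

static long sys_io_submit(aio_context_t ctx, long nr, struct iocb **list)
{
    return syscall(SYS_io_submit, ctx, nr, list);
}

static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *events, struct timespec *timeout)
{
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, timeout);
}

static long sys_io_destroy(aio_context_t ctx)
{
    return syscall(SYS_io_destroy, ctx);
}

/* Reads up to len bytes from offset 0 of path through native aio.
   Returns the number of bytes read, or a negative errno value. */
long aio_read_once(const char *path, void *buf, size_t len)
{
    aio_context_t ctx = 0;              /* must be zero before io_setup */
    struct iocb cb;
    struct iocb *list[1] = { &cb };
    struct io_event ev;
    long ret;
    int fd;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -errno;
    }
    if (sys_io_setup(8, &ctx) < 0) {    /* step 1: create the context */
        ret = -errno;
        close(fd);
        return ret;
    }

    memset(&cb, 0, sizeof(cb));         /* describe the request */
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes = (uint32_t)fd;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    if (sys_io_submit(ctx, 1, list) != 1) {             /* step 2: submit */
        ret = -errno;
        goto out;
    }

    /* ... do something else while the request is in flight ... */

    if (sys_io_getevents(ctx, 1, 1, &ev, NULL) != 1) {  /* step 3: wait */
        ret = -errno;
        goto out;
    }
    ret = (long)ev.res;                 /* bytes read, or negative errno */
out:
    sys_io_destroy(ctx);                /* step 4: destroy the context */
    close(fd);
    return ret;
}
```

Switching the sketch to O_DIRECT would also require a buffer, length and offset aligned to the device block size, which is omitted here to keep the example short.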

iocb/io_event

io_context_t is essentially just a handle. The two important structs in asynchronous io are iocb and io_event, so let's look at them more closely:

iocb describes one asynchronous io request

struct iocb {
  void *data; /* return in the io completion event */
  unsigned key; /* use in identifying io requests */
  short aio_lio_opcode;
  short aio_reqprio;
  int aio_fildes;
  union {
    struct io_iocb_common c;
    struct io_iocb_vector v;
    struct io_iocb_poll poll;
    struct io_iocb_sockaddr saddr;
  } u;
};
 
struct io_iocb_common {
  void *buf;
  unsigned long nbytes;
  long long offset;
  unsigned flags;
  unsigned resfd;
};

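Here data is user-defined and can, for example, point to a callback function. aio_lio_opcode is most commonly IO_CMD_PREAD or IO_CMD_PWRITE (the kernel header calls them IOCB_CMD_PREAD/IOCB_CMD_PWRITE), and aio_fildes is the file descriptor of the target file. The members of io_iocb_common describe the details of the io request being submitted: [buf,nbytes] --> [aio_fildes,offset]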

io_event describes the result returned for an asynchronous io

struct io_event {
  void *data;
  struct iocb *obj;
  unsigned long res;
  unsigned long res2;
};

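Here obj points to the iocb of the asynchronous io submitted earlier, data is the value taken from that iocb, and res and res2 report the completion status of the io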
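One detail worth spelling out: as far as I know, res carries the same value a synchronous read or write would have returned, so a failure shows up as a negated errno stored in an unsigned field. The sketch below declares a local copy of the io_event layout shown above so that it builds without libaio; io_event_status is a hypothetical helper name, not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct iocb;                        /* only used through a pointer here */

/* Local copy of the io_event layout listed above. */
struct io_event {
    void *data;
    struct iocb *obj;
    unsigned long res;
    unsigned long res2;
};

/* Returns 0 and stores the transferred byte count on success;
   returns the positive errno value when the request failed. */
int io_event_status(const struct io_event *ev, size_t *nbytes)
{
    long res = (long)ev->res;       /* reinterpret: errors are negative */

    assert(nbytes != NULL);
    if (res < 0) {
        return (int)-res;
    }
    *nbytes = (size_t)res;
    return 0;
}
```

A caller would typically also compare the byte count with the requested length, since a short read or write is reported as success.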

Initializing the iocb struct (inline helper functions in libaio.h)

void io_prep_pwrite(struct iocb *iocb, int fd, void *buf, size_t count, long long offset);
void io_prep_pread(struct iocb *iocb, int fd, void *buf, size_t count, long long offset);

The code of io_prep_pread is listed here; io_prep_pwrite is the same except that it sets iocb->aio_lio_opcode = IO_CMD_PWRITE;

void io_prep_pread(struct iocb *iocb, int fd, void *buf, size_t count, long long offset)
{
  memset(iocb, 0, sizeof(*iocb));
  iocb->aio_fildes = fd;
  iocb->aio_lio_opcode = IO_CMD_PREAD;
  iocb->aio_reqprio = 0;
  iocb->u.c.buf = buf;
  iocb->u.c.nbytes = count;
  iocb->u.c.offset = offset;
}

InnoDB native AIO

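Before Linux native aio was available, InnoDB used simulated aio to emulate asynchronous io, submitting several io requests at a time to improve efficiency. Its performance still falls short of native aio, and the rest of this post focuses on how InnoDB uses Linux native aio.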

Two structs: os_aio_slot_struct/os_aio_array_struct

os_aio_array_struct

Manages one class of asynchronous io. InnoDB has four such arrays, serving data page reads, data page writes, the insert buffer and the redo log

static os_aio_array_t*     os_aio_read_array     = NULL;     /*!< Reads */
static os_aio_array_t*     os_aio_write_array    = NULL;     /*!< Writes */
static os_aio_array_t*     os_aio_ibuf_array     = NULL;     /*!< Insert buffer */
static os_aio_array_t*     os_aio_log_array      = NULL;     /*!< Redo log */

Some of the members of os_aio_array_struct:

/** The asynchronous i/o array structure */
struct os_aio_array_struct{
     os_mutex_t     mutex;     /*!< the mutex protecting the aio array */
     ulint          n_slots;/*!< Total number of slots in the aio
                    array.  This must be divisible by
                    n_threads. */
     ulint          n_segments;
                    /*!< Number of segments in the aio
                    array of pending aio requests. A
                    thread can wait separately for any one
                    of the segments. */
     ulint          n_reserved;
                    /*!< Number of reserved slots in the
                    aio array outside the ibuf segment */
     os_aio_slot_t*     slots;     /*!< Pointer to the slots in the array */
 
#if defined(LINUX_NATIVE_AIO)
     io_context_t*          aio_ctx;
                    /* completion queue for IO. There is
                    one such queue per segment. Each thread
                    will work on one ctx exclusively. */
     struct io_event*     aio_events;
                    /* The array to collect completed IOs.
                    There is one such event for each
                    possible pending IO. The size of the
                    array is equal to n_slots. */
#endif
};

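Key members:
slots: array of os_aio_slot_t; each asynchronous io occupies one slot while it is in progress and releases it when done
n_slots: length of the slots array, i.e. the maximum number of pending io requests
n_reserved: number of slots currently in use; once n_reserved reaches n_slots, a new asynchronous io has to wait before it can be submitted
aio_events: array of io_event with n_slots entries, holding the results returned for the corresponding asynchronous io (io_getevents)
n_segments: the slots array is divided into n_segments segments, and the asynchronous io in one segment (n_slots/n_segments slots) is accessed and handled by a single thread
aio_ctx: array of io_context_t with n_segments entries; each io_context covers (n_slots/n_segments) slots of the slots array and is used by exactly one thread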
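The partitioning described by n_slots, n_segments and aio_ctx boils down to integer division. A minimal sketch of that arithmetic, using a cut-down struct and helper names invented for this example rather than the InnoDB definitions:

```c
#include <assert.h>

/* Hypothetical, trimmed-down view of os_aio_array_struct. */
struct aio_array_view {
    unsigned long n_slots;
    unsigned long n_segments;
};

/* Number of slots owned by each segment (and by each io_context). */
unsigned long slots_per_segment(const struct aio_array_view *a)
{
    assert(a->n_segments > 0 && a->n_slots % a->n_segments == 0);
    return a->n_slots / a->n_segments;
}

/* Index into aio_ctx[] of the context that serves slot number pos. */
unsigned long segment_of_slot(const struct aio_array_view *a,
                              unsigned long pos)
{
    assert(pos < a->n_slots);
    return pos / slots_per_segment(a);
}

/* First slot a handler thread scans when it owns the given segment. */
unsigned long first_slot_of_segment(const struct aio_array_view *a,
                                    unsigned long segment)
{
    assert(segment < a->n_segments);
    return segment * slots_per_segment(a);
}
```

With 1024 slots in 4 segments, slots 0 to 255 belong to segment 0, slots 256 to 511 to segment 1, and so on.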

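os_aio_ibuf_array and os_aio_log_array each contain a single segment (hard-coded), while the number of segments in os_aio_read_array and os_aio_write_array is set by innodb_read_io_threads and innodb_write_io_threads (both default to 4 and cannot be changed after startup). One background thread is created per segment to handle its asynchronous io, and with Linux native aio each such thread handles at most 32*8 = 256 pending io requests; see innodb_start_or_create_for_mysql: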

    io_limit = 8 * SRV_N_PENDING_IOS_PER_THREAD;
    os_aio_init(io_limit,
                srv_n_read_io_threads,
                srv_n_write_io_threads,
                SRV_MAX_N_PENDING_SYNC_IOS);
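To make the numbers concrete, here is the same arithmetic as a small sketch. It assumes SRV_N_PENDING_IOS_PER_THREAD is 32, as the text states, and that each array gets io_limit slots per segment, which is my reading of os_aio_init rather than something shown in the snippet. The function names are invented for the example.

```c
#include <assert.h>

/* Value the text gives for SRV_N_PENDING_IOS_PER_THREAD. */
enum { PENDING_IOS_PER_THREAD = 32 };

/* io_limit as computed in the native aio case of the snippet above. */
unsigned long native_io_limit(void)
{
    return 8UL * PENDING_IOS_PER_THREAD;
}

/* Total slots of an array made of n_segments segments with io_limit
   slots each (assumed sizing rule, see the lead-in). */
unsigned long array_n_slots(unsigned long n_segments, unsigned long io_limit)
{
    assert(n_segments > 0);
    return n_segments * io_limit;
}
```

Under these assumptions the default of 4 read threads gives os_aio_read_array 1024 slots, while the single-segment ibuf and log arrays get 256 each.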

os_aio_slot_t

Used to submit a single asynchronous io operation. Some of its members:

/** The asynchronous i/o array slot structure */
struct os_aio_slot_struct{
     ibool          is_read;     /*!< TRUE if a read operation */
     ulint          pos;          /*!< index of the slot in the aio array */
     ibool          reserved;     /*!< TRUE if this slot is reserved */
     ulint          len;          /*!< length of the block to read or write */
     byte*          buf;          /*!< buffer used in i/o */
     ulint          type;          /*!< OS_FILE_READ or OS_FILE_WRITE */
     ulint          offset;          /*!< 32 low bits of file offset in bytes */
     ulint          offset_high;     /*!< 32 high bits of file offset */
     os_file_t     file;          /*!< file where to read or write */
     ulint          space_id;
     fil_node_t*     message1;     /*!< message which is given by the */
 
#if defined(LINUX_NATIVE_AIO) /* preceding platform branch omitted in this excerpt */
     struct iocb     control;     /* Linux control block for aio */
     int          n_bytes;     /* bytes written/read. */
     int          ret;          /* AIO return code */
#endif
};

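Key members:
pos: the slot's index within its aio array; slot->pos / (array->n_slots / array->n_segments) gives the index of the io_context that serves it
control: an iocb struct embedded in the slot; when the asynchronous io is submitted, control.data is set to point to the slot, so after io_getevents the slot can be reached through event->obj->data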
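The round trip through control can be sketched without touching the kernel. The structs below are cut-down local stand-ins (the _view names and both helper functions are hypothetical), and copying res and res2 into n_bytes and ret is my understanding of what InnoDB does when collecting events, not something taken from the excerpt above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cut-down stand-ins for iocb, io_event and os_aio_slot_struct. */
struct iocb_view {
    void *data;
    int aio_fildes;
};

struct io_event_view {
    void *data;
    struct iocb_view *obj;
    unsigned long res;
    unsigned long res2;
};

struct slot_view {
    unsigned long pos;
    struct iocb_view control;   /* embedded, not a pointer */
    int n_bytes;
    int ret;
};

/* At submit time: make the control block remember its owning slot. */
void slot_prepare(struct slot_view *slot, int fd)
{
    memset(&slot->control, 0, sizeof(slot->control));
    slot->control.aio_fildes = fd;
    slot->control.data = slot;
}

/* At completion time: recover the slot and record the result in it. */
struct slot_view *slot_from_event(const struct io_event_view *ev)
{
    struct slot_view *slot = ev->obj->data;

    slot->n_bytes = (int)ev->res;
    slot->ret = (int)ev->res2;
    return slot;
}
```

Because the kernel also copies the iocb's data field into the event, event->data would lead to the same slot.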

Asynchronous io initialization and call relationships

innodb_start_or_create_for_mysql
 
   srv_n_file_io_threads = 2 + srv_n_read_io_threads
                    + srv_n_write_io_threads;
 
   os_aio_init()
 
   for (i = 0; i < srv_n_file_io_threads; i++) {
        n[i] = i;
        os_thread_create(io_handler_thread, n + i, thread_ids + i);
   }
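Each handler thread receives a global segment number i, which has to be translated into one of the four arrays plus a segment inside it. The sketch below shows one way to do that translation. The ordering it uses (insert buffer first, then log, then reads, then writes) is how I remember os_aio_get_array_and_local_segment behaving and should be treated as an assumption; the enum, struct and function names are made up for the example.

```c
#include <assert.h>

enum aio_kind { AIO_IBUF, AIO_LOG, AIO_READ, AIO_WRITE };

struct segment_ref {
    enum aio_kind kind;             /* which aio array */
    unsigned long local_segment;    /* segment index inside that array */
};

/* Thread count from the snippet above: ibuf + log + readers + writers. */
unsigned long n_file_io_threads(unsigned long n_read, unsigned long n_write)
{
    return 2 + n_read + n_write;
}

/* Maps a global segment number to its array and local segment,
   assuming the order ibuf, log, read segments, write segments. */
struct segment_ref locate_segment(unsigned long global,
                                  unsigned long n_read,
                                  unsigned long n_write)
{
    struct segment_ref ref;

    assert(global < n_file_io_threads(n_read, n_write));
    if (global == 0) {
        ref.kind = AIO_IBUF;
        ref.local_segment = 0;
    } else if (global == 1) {
        ref.kind = AIO_LOG;
        ref.local_segment = 0;
    } else if (global < 2 + n_read) {
        ref.kind = AIO_READ;
        ref.local_segment = global - 2;
    } else {
        ref.kind = AIO_WRITE;
        ref.local_segment = global - 2 - n_read;
    }
    return ref;
}
```

With the defaults of 4 read and 4 write threads this yields 10 handler threads, and global segment 6 is the first write segment.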

Asynchronous io background threads

io_handler_thread
   while (srv_shutdown_state != SRV_SHUTDOWN_EXIT_THREADS) {
        fil_aio_wait(segment);
   }
 
fil_aio_wait
   os_aio_linux_handle
   fil_node_complete_io

native aio API call stacks

os_aio_init --> os_aio_array_create --> os_aio_linux_create_io_ctx --> io_setup
(_fil_io | fil_extend_space_to_desired_size) --> os_aio --> os_aio_func --> os_aio_linux_dispatch --> io_submit
io_handler_thread --> fil_aio_wait --> os_aio_linux_handle --> os_aio_linux_collect --> io_getevents
