As the saying goes, the palest ink beats the best memory. This article is a set of study notes covering the Linux native AIO API and how InnoDB uses native AIO.
Asynchronous IO
Linux offers two asynchronous IO APIs: POSIX AIO and native AIO. Native AIO performs better, but requires the file to be opened with O_DIRECT. A quick list of both interfaces:
POSIX AIO:

function | desc |
---|---|
aio_read | issue an async read request |
aio_write | issue an async write request |
aio_return | retrieve the return status of a completed async IO request |
aio_error | check the error status of an async IO request |
aio_suspend | suspend the caller until one or more async requests have completed (or failed) |
aio_cancel | cancel an async IO |
lio_listio | issue a batch of async IO requests |
Native AIO:

function | desc |
---|---|
io_setup | create an async IO context |
io_submit | submit async IO requests |
io_getevents | block waiting for async IO completions |
io_cancel | cancel an async IO |
io_destroy | destroy an async IO context |
This article will not go further into POSIX AIO; the focus from here on is native AIO. The function prototypes:
1. int io_setup(int maxevents, io_context_t *ctxp);
|- Creates an async IO context. io_context_t is a handle (internally it boils down to an integer); maxevents caps the number of async IOs that can be pending in the context at once.
2. long io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp);
|- Submits async IO requests, which can be reads or writes; nr is the size of the iocbpp array; returns as soon as the requests are queued.
3. long io_getevents(aio_context_t ctx_id, long min_nr, long nr, struct io_event *events, struct timespec *timeout);
|- Hands the kernel an io_event array to copy completed IOs into (here sized to the maxevents given at io_setup); blocks until at least min_nr requests have completed; if timeout is non-NULL, returns once it expires; the return value is the number of completed requests.
4. int io_destroy(io_context_t ctx);
|- Destroys an async IO context, cancelling any IOs still outstanding.
5. long io_cancel(aio_context_t ctx_id, struct iocb *iocb, struct io_event *result);
|- Cancels an async IO request; the outcome is stored in result.
Steps 1-4 make up one complete async IO cycle: initialize the context, submit IO requests, /* do something else… */, wait for the IOs to complete, destroy the context. A minimal sketch follows.
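The sketch below walks through steps 1-4 for a single read. It assumes libaio is installed (compile with -laio); the file name "data.bin" and the 4 KiB block size are made up for illustration, and the file is assumed to exist and be at least 4 KiB:

```c
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx;
    memset(&ctx, 0, sizeof(ctx));            /* ctx must start zeroed */
    int ret = io_setup(128, &ctx);           /* 1. create the context */
    if (ret < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-ret)); return 1; }

    /* native aio wants O_DIRECT, which needs aligned buffer/offset/length */
    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);    /* read 4 KiB at offset 0 */

    ret = io_submit(ctx, 1, cbs);            /* 2. queue the request */
    if (ret != 1) { fprintf(stderr, "io_submit failed (%d)\n", ret); return 1; }

    /* do something else… */

    struct io_event events[128];
    ret = io_getevents(ctx, 1, 128, events, NULL);  /* 3. wait for completion */
    printf("%d io(s) completed, res = %ld\n", ret, (long) events[0].res);

    free(buf);
    close(fd);
    io_destroy(ctx);                         /* 4. tear down the context */
    return 0;
}
```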
iocb/io_event
io_context_t is essentially just a handle. The two important structures in async IO are iocb and io_event; they deserve a closer look.
iocb describes one async IO request:
```c
struct iocb {
    void    *data;           /* return in the io completion event */
    unsigned key;            /* use in identifying io requests */
    short    aio_lio_opcode;
    short    aio_reqprio;
    int      aio_fildes;
    union {
        struct io_iocb_common   c;
        struct io_iocb_vector   v;
        struct io_iocb_poll     poll;
        struct io_iocb_sockaddr saddr;
    } u;
};

struct io_iocb_common {
    void         *buf;
    unsigned long nbytes;
    long long     offset;
    unsigned      flags;
    unsigned      resfd;
};
```
Here data is user-defined and can hold, say, a callback (a sketch of that pattern follows io_prep_pread below). For reads and writes, aio_lio_opcode is IO_CMD_PREAD or IO_CMD_PWRITE, and aio_fildes is the fd of the target file. The members of io_iocb_common spell out the details of the request: [buf,nbytes] –> [aio_fildes,offset]
io_event describes the completion of an async IO:
```c
struct io_event {
    void         *data;
    struct iocb  *obj;
    unsigned long res;
    unsigned long res2;
};
```
Here obj points back to the iocb submitted earlier, and res/res2 carry the completion status: with libaio, res is the number of bytes transferred on success or a negative errno on failure.
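As a sketch of how these fields are typically consumed, assuming events[] and n come from a preceding io_getevents call:

```c
/* Walk the events returned by io_getevents. With libaio, res is
   the byte count on success or a negative errno on failure. */
for (int i = 0; i < n; i++) {
    struct io_event *ev = &events[i];
    struct iocb     *cb = ev->obj;               /* the iocb we submitted */

    if ((long) ev->res < 0)
        fprintf(stderr, "io error: %s\n", strerror(-(long) ev->res));
    else if (ev->res != cb->u.c.nbytes)
        fprintf(stderr, "short io: %lu of %lu bytes\n",
                (unsigned long) ev->res, (unsigned long) cb->u.c.nbytes);
    /* ev->data carries whatever we stored in cb->data at submit time */
}
```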
Initializing the iocb (inline helpers in libaio.h):
```c
void io_prep_pwrite(struct iocb *iocb, int fd, void *buf, size_t count, long long offset);
void io_prep_pread(struct iocb *iocb, int fd, void *buf, size_t count, long long offset);
```
Here is the code of io_prep_pread; io_prep_pwrite is identical except for iocb->aio_lio_opcode = IO_CMD_PWRITE;
```c
void io_prep_pread(struct iocb *iocb, int fd, void *buf, size_t count, long long offset)
{
    memset(iocb, 0, sizeof(*iocb));
    iocb->aio_fildes = fd;
    iocb->aio_lio_opcode = IO_CMD_PREAD;
    iocb->aio_reqprio = 0;
    iocb->u.c.buf = buf;
    iocb->u.c.nbytes = count;
    iocb->u.c.offset = offset;
}
```
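To make the "data can be a callback" remark above concrete, here is a sketch (the callback type and on_read_done are invented names; ctx, fd, buf, cbs and events are assumed to be set up as in the earlier example). Note that data must be assigned after io_prep_pread, since the helper memsets the whole iocb:

```c
typedef void (*io_callback_t)(struct iocb *cb, long res);

static void on_read_done(struct iocb *cb, long res)
{
    printf("read of %lu bytes finished, res = %ld\n",
           (unsigned long) cb->u.c.nbytes, res);
}

/* submit: stash the callback in the iocb */
io_prep_pread(&cb, fd, buf, 4096, 0);
cb.data = (void *) on_read_done;      /* must come after io_prep_pread */
io_submit(ctx, 1, cbs);

/* harvest: the kernel echoes data back in the event */
io_getevents(ctx, 1, 1, events, NULL);
((io_callback_t) events[0].data)(events[0].obj, (long) events[0].res);
```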
InnoDB native AIO
Before Linux native AIO was available, InnoDB emulated asynchronous IO with simulated AIO, batching multiple IO requests per submission for efficiency, but its performance still trails native AIO. The rest of this article focuses on how InnoDB uses Linux native AIO.
Two structures: os_aio_slot_struct / os_aio_array_struct
os_aio_array_struct
An os_aio_array_t manages one class of async IO. InnoDB keeps four of them, covering data page reads, data page writes, the insert buffer, and the redo log:
```c
static os_aio_array_t* os_aio_read_array  = NULL; /*!< Reads */
static os_aio_array_t* os_aio_write_array = NULL; /*!< Writes */
static os_aio_array_t* os_aio_ibuf_array  = NULL; /*!< Insert buffer */
static os_aio_array_t* os_aio_log_array   = NULL; /*!< Redo log */
```
Selected members of os_aio_array_struct:
```c
/** The asynchronous i/o array structure */
struct os_aio_array_struct{
    os_mutex_t  mutex;      /*!< the mutex protecting the aio array */
    ulint       n_slots;    /*!< Total number of slots in the aio array.
                            This must be divisible by n_threads. */
    ulint       n_segments; /*!< Number of segments in the aio array of
                            pending aio requests. A thread can wait
                            separately for any one of the segments. */
    ulint       n_reserved; /*!< Number of reserved slots in the aio
                            array outside the ibuf segment */
    os_aio_slot_t* slots;   /*!< Pointer to the slots in the array */
#if defined(LINUX_NATIVE_AIO)
    io_context_t*    aio_ctx;
                            /* completion queue for IO. There is one
                            such queue per segment. Each thread will
                            work on one ctx exclusively. */
    struct io_event* aio_events;
                            /* The array to collect completed IOs.
                            There is one such event for each possible
                            pending IO. The size of the array is equal
                            to n_slots. */
#endif
};
```
Key members:
slots: an array of os_aio_slot_t; each async IO in flight uses one slot and releases it on completion
n_slots: the length of slots, i.e. the maximum number of pending IOs
n_reserved: the number of slots currently occupied; once n_reserved reaches n_slots, submitting a new async IO must wait
aio_events: an io_event array of length n_slots that receives the results of the corresponding async IOs (filled by io_getevents)
n_segments: the slots array is divided into n_segments segments; the async IOs in a single segment (n_slots/n_segments of them) are accessed and handled by one thread
aio_ctx: an io_context_t array of length n_segments; each io_context covers (n_slots/n_segments) slots and is used by exactly one thread (see the worked example below)
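A small worked example of this layout, using the default numbers from the next paragraph (4 read threads, 256 slots per segment); the slot index 300 is arbitrary:

```c
#include <stdio.h>

int main(void)
{
    unsigned long n_slots    = 1024;  /* 4 segments * 256 slots each */
    unsigned long n_segments = 4;     /* innodb_read_io_threads = 4 */
    unsigned long pos        = 300;   /* some slot index */

    unsigned long per_seg = n_slots / n_segments;        /* 256 */
    unsigned long segment = pos * n_segments / n_slots;  /* 300*4/1024 = 1 */

    printf("slot %lu -> segment %lu (each segment owns %lu slots)\n",
           pos, segment, per_seg);
    return 0;
}
```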
os_aio_ibuf_array and os_aio_log_array each contain exactly one segment (hard-coded), while the segment counts of os_aio_read_array and os_aio_write_array are given by innodb_read_io_threads and innodb_write_io_threads (both default to 4 and cannot be changed after startup). The server creates one background thread per segment to process async IO, and each background (linux native aio) thread handles at most 32 * 8 = 256 pending IOs; see innodb_start_or_create_for_mysql:
```c
io_limit = 8 * SRV_N_PENDING_IOS_PER_THREAD;
os_aio_init(io_limit,
            srv_n_read_io_threads,
            srv_n_write_io_threads,
            SRV_MAX_N_PENDING_SYNC_IOS);
```
os_aio_slot_t
A slot represents one async IO operation being submitted; selected members:
```c
/** The asynchronous i/o array slot structure */
struct os_aio_slot_struct{
    ibool       is_read;     /*!< TRUE if a read operation */
    ulint       pos;         /*!< index of the slot in the aio array */
    ibool       reserved;    /*!< TRUE if this slot is reserved */
    ulint       len;         /*!< length of the block to read or write */
    byte*       buf;         /*!< buffer used in i/o */
    ulint       type;        /*!< OS_FILE_READ or OS_FILE_WRITE */
    ulint       offset;      /*!< 32 low bits of file offset in bytes */
    ulint       offset_high; /*!< 32 high bits of file offset */
    os_file_t   file;        /*!< file where to read or write */
    ulint       space_id;
    fil_node_t* message1;    /*!< message which is given by the user of aio */
#if defined(LINUX_NATIVE_AIO)
    struct iocb control;     /* Linux control block for aio */
    int         n_bytes;     /* bytes written/read. */
    int         ret;         /* AIO return code */
#endif
};
```

(The original excerpt elides the Windows branch of the #if, so the #elif is shown here as a plain #if.)
Key members:
pos: the slot's index within its aio array; its io_context is found as (slot->pos * array->n_segments) / array->n_slots (the io_ctx_index computation in os_aio_linux_dispatch)
control: the iocb embedded in the slot; when the async IO is submitted, control.data is set to point back at the slot, so after io_getevents the slot is reachable through event->obj->data (equivalently event->data), as sketched below
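Putting the two members together, a condensed sketch of the round trip (loosely modeled on os_aio_linux_dispatch and os_aio_linux_collect; declarations and error handling omitted):

```c
/* dispatch: the slot's embedded iocb carries a back-pointer to the slot */
iocb = &slot->control;
io_prep_pread(iocb, slot->file, slot->buf, slot->len, offset);
iocb->data = (void*) slot;
io_submit(array->aio_ctx[io_ctx_index], 1, &iocb);

/* collect: recover the slot from each completed event */
n = io_getevents(array->aio_ctx[segment], 1, seg_size, events, &timeout);
for (i = 0; i < n; i++) {
    slot          = (os_aio_slot_t*) events[i].data;
    slot->n_bytes = events[i].res;
    slot->ret     = events[i].res2;
}
```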
Async IO initialization and the call graph
```c
innodb_start_or_create_for_mysql
    srv_n_file_io_threads = 2 + srv_n_read_io_threads + srv_n_write_io_threads;
    os_aio_init()
    for (i = 0; i < srv_n_file_io_threads; i++) {
        n[i] = i;
        os_thread_create(io_handler_thread, n + i, thread_ids + i);
    }
```
Async IO background threads
```c
io_handler_thread
    while (srv_shutdown_state != SRV_SHUTDOWN_EXIT_THREADS) {
        fil_aio_wait(segment);
    }

fil_aio_wait
    os_aio_linux_handle
    fil_node_complete_io
```
Native AIO API call stacks
```
os_aio_init --> os_aio_array_create --> os_aio_linux_create_io_ctx --> io_setup

(_fil_io | fil_extend_space_to_desired_size) --> os_aio --> os_aio_func
    --> os_aio_linux_dispatch --> io_submit

io_handler_thread --> fil_aio_wait --> os_aio_linux_handle
    --> os_aio_linux_collect --> io_getevents
```