Analysis of the Linux DirectIO Mechanism
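DirectIO is an option for file writes, selected with the O_DIRECT flag when the file is opened. It asks for data to be written straight to the disk rather than into the cache, so that critical data still reaches the disk if the system later fails. For the general write path, see my earlier post <Linux内核写文件流程> (the Linux kernel file write flow); the DirectIO path picks up where that write path leaves off.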
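From user space, the request for DirectIO looks roughly like the sketch below. It is only an illustration: the function name direct_write_block and the 4096-byte alignment are assumptions made for this example, and the real alignment requirement depends on the device and file system.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0           /* not exposed here: the sketch degrades to a buffered write */
#endif

#define DIO_BLOCK 4096       /* assumed alignment; real devices may differ */

/* Hypothetical helper: writes one aligned block to a file opened with O_DIRECT.
 * Returns the number of bytes written, or a negative errno value on failure. */
ssize_t direct_write_block(const char *path, unsigned char fill)
{
    void *buf = NULL;
    int rc = posix_memalign(&buf, DIO_BLOCK, DIO_BLOCK);
    if (rc != 0)
        return -rc;
    memset(buf, fill, DIO_BLOCK);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        int err = errno;     /* e.g. EINVAL if the file system rejects O_DIRECT */
        free(buf);
        return -err;
    }

    ssize_t n = write(fd, buf, DIO_BLOCK);
    int err = (n < 0) ? errno : 0;
    close(fd);
    free(buf);
    return (n < 0) ? -err : n;
}
```

The buffer address, the length, and the file offset are all multiples of the assumed block size here; unaligned requests are commonly rejected with EINVAL.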
When the kernel reaches __generic_file_aio_write, it tests file->f_flags & O_DIRECT to decide whether to take the DirectIO branch:
if (unlikely(file->f_flags & O_DIRECT)) {
    loff_t endbyte;
    ssize_t written_buffered;

    written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                                        ppos, count, ocount);
    if (written < 0 || written == count)
        goto out;
    /*
     * direct-io write to a hole: fall through to buffered I/O
     * for completing the rest of the request.
     */
    pos += written;
    count -= written;
    written_buffered = generic_file_buffered_write(iocb, iov,
                                                   nr_segs, pos, ppos, count,
                                                   written);
    /*
     * If generic_file_buffered_write() returned a synchronous error
     * then we want to return the number of bytes which were
     * direct-written, or the error code if that was zero.  Note
     * that this differs from normal direct-io semantics, which
     * will return -EFOO even if some bytes were written.
     */
    if (written_buffered < 0) {
        err = written_buffered;
        goto out;
    }

    /*
     * We need to ensure that the page cache pages are written to
     * disk and invalidated to preserve the expected O_DIRECT
     * semantics.
     */
    endbyte = pos + written_buffered - written - 1;
    err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
    if (err == 0) {
        written = written_buffered;
        invalidate_mapping_pages(mapping,
                                 pos >> PAGE_CACHE_SHIFT,
                                 endbyte >> PAGE_CACHE_SHIFT);
    } else {
        /*
         * We don't know how much we wrote, so just return
         * the number of bytes which were direct-written
         */
    }
}
Start with generic_file_direct_write. Three calls do the main work there: filemap_write_and_wait_range, invalidate_inode_pages2_range, and mapping->a_ops->direct_IO.
filemap_write_and_wait_range flushes the dirty pages under the mapping. Beneath __filemap_fdatawrite_range it calls do_writepages to do the work:
int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
    int ret;

    if (wbc->nr_to_write <= 0)
        return 0;
    if (mapping->a_ops->writepages)
        ret = mapping->a_ops->writepages(mapping, wbc);
    else
        ret = generic_writepages(mapping, wbc);
    return ret;
}
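If filemap_write_and_wait_range returns a non-zero value, generic_file_direct_write returns right away and the two functions that follow are not executed. Note that this return value is a status (0 on success, a negative error code on failure), not a byte count, so this exit is taken only when the flush fails. My understanding of the flush itself is that the data already cached for the range has to reach the disk together with the direct write. Otherwise the direct_IO data would already be on disk while the earlier cached data would not, and after a system failure the file system would be broken.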
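If the flush succeeds, invalidate_inode_pages2_range is entered when mapping->nrpages is non-zero. It checks whether memory currently holds cached pages covering the range that direct_IO is about to write and, if so, marks them invalid. The reason is that data written by direct_IO is not cached: if a clean cached copy existed before the write, the cache and the disk would disagree once direct_IO completes, and an unprotected read from the cache would return data that is no longer what is on disk. If a cached page cannot be invalidated, the function returns without executing the step that follows; in the -EBUSY case it returns 0, which makes the caller fall back to a buffered write.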
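Only now do we reach the real subject, mapping->a_ops->direct_IO. In struct address_space_operations ext3_ordered_aops it is defined as ext3_direct_IO, whose core is __blockdev_direct_IO. direct_io_worker assembles the dio structure and then calls dio_bio_submit, which in essence submits the request to the I/O layer through submit_bio(dio->rw, bio). Compared with other reads and writes, direct_io skips the buffer layer and does not rely on the intermediate threads pdflush and kjournald to flush data to the I/O layer periodically. Even at this point the data is not necessarily on the disk: direct_IO simply assumes that the I/O device driver adds no significant delay.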
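Because a completed write does not by itself mean the data is on stable storage, a program that needs that guarantee usually adds an explicit sync. The sketch below shows the pattern with fdatasync; the helper name write_and_sync is made up for this example, and O_DIRECT is left out on purpose so the sketch runs on any file system (the same fdatasync call applies to a descriptor opened with O_DIRECT).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: writes len bytes, then asks the kernel to push them
 * to stable storage. Returns 0 on success or a negative errno value. */
int write_and_sync(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    const char *p = data;
    size_t left = len;
    int err = 0;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted before writing: retry */
            err = errno;
            break;
        }
        p += n;
        left -= (size_t)n;
    }

    /* Flush file data (and the metadata needed to read it back). */
    if (err == 0 && fdatasync(fd) < 0)
        err = errno;
    if (close(fd) < 0 && err == 0)
        err = errno;
    return -err;
}
```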
After mapping->a_ops->direct_IO completes, invalidate_inode_pages2_range is run once more, for the following reason:
/*
 * Finally, try again to invalidate clean pages which might have been
 * cached by non-direct readahead, or faulted in by get_user_pages()
 * if the source of the write was an mmap'ed region of the file
 * we're writing.  Either one is a pretty crazy thing to do,
 * so we don't support it 100%.  If this invalidation
 * fails, tough, the write still worked...
 */
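When a system becomes this complex, it is hard to find a fully rigorous, step-by-step guarantee; sometimes a crude homemade method is simple and effective.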
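The order of the three steps in generic_file_direct_write, and what each outcome hands back to the caller, can be summarized in a small user-space model. This is a sketch based on my reading of 2.6.3x-era mm/filemap.c, not kernel code: struct dio_steps and model_direct_write are invented for illustration, and the -EBUSY handling in particular should be checked against the tree you are reading.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the three kernel steps: each field holds the
 * value that step would return in this simulation. */
struct dio_steps {
    int  flush_status;       /* filemap_write_and_wait_range: 0 or -errno */
    int  cached_pages;       /* mapping->nrpages */
    int  invalidate_status;  /* invalidate_inode_pages2_range: 0 or -errno */
    long direct_io_result;   /* ->direct_IO: bytes written or -errno */
};

/* Returns what the modeled generic_file_direct_write would hand back. */
long model_direct_write(const struct dio_steps *s)
{
    /* Step 1: flush dirty pages in the range; a failure ends the call. */
    if (s->flush_status != 0)
        return s->flush_status;

    /* Step 2: invalidate cached pages, only if the mapping has any. */
    if (s->cached_pages > 0 && s->invalidate_status != 0) {
        if (s->invalidate_status == -EBUSY)
            return 0;        /* nothing written: caller falls back to buffered I/O */
        return s->invalidate_status;
    }

    /* Step 3: the direct write itself. The second invalidation afterwards
     * is best effort and does not change the result. */
    return s->direct_io_result;
}
```

With a clean run the model returns the byte count from direct_IO; with a page that cannot be invalidated it returns 0, which is exactly the value that sends __generic_file_aio_write into its buffered fallback.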
Now back to __generic_file_aio_write:
written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                                    ppos, count, ocount);
if (written < 0 || written == count)
    goto out;
/*
 * direct-io write to a hole: fall through to buffered I/O
 * for completing the rest of the request.
 */
pos += written;
count -= written;
written_buffered = generic_file_buffered_write(iocb, iov,
                                               nr_segs, pos, ppos, count,
                                               written);
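If generic_file_direct_write returns a value that is neither an error nor count, the rest of the request is written through the cache with generic_file_buffered_write. The code comment names a direct write into a hole as one such case; another, as analyzed above, is a cached page in the range that cannot be invalidated. In these cases the amount written is not the expected count, and a buffered write has to be done for the remainder.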
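The endbyte arithmetic in the fallback branch is easy to misread, so here is a worked version in plain C. It assumes 4 KiB pages and assumes that generic_file_buffered_write returns the running total (direct bytes plus buffered bytes), which is how I read the kernel source of that era; struct flush_range and fallback_flush_range are names invented for this sketch.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12   /* assumption: 4 KiB pages (PAGE_CACHE_SHIFT) */

struct flush_range {
    long long start;        /* first byte that went through the page cache */
    long long end;          /* last byte that went through the page cache */
    long long first_page;   /* page indexes handed to invalidate_mapping_pages */
    long long last_page;
};

/* pos:            file offset where the whole write began
 * direct_written: bytes completed by the direct path ("written")
 * total_written:  value returned by the buffered path ("written_buffered"),
 *                 assumed to include direct_written */
struct flush_range fallback_flush_range(long long pos,
                                        long long direct_written,
                                        long long total_written)
{
    struct flush_range r;

    assert(direct_written >= 0 && total_written > direct_written);

    pos += direct_written;                              /* pos += written */
    r.start = pos;
    r.end = pos + total_written - direct_written - 1;   /* endbyte */
    r.first_page = r.start >> SKETCH_PAGE_SHIFT;
    r.last_page = r.end >> SKETCH_PAGE_SHIFT;
    return r;
}
```

For a 12288-byte write at offset 0 where the direct path finished 4096 bytes, the buffered part covers bytes 4096 through 12287, that is pages 1 and 2. Only that range is flushed and then invalidated.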
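So we can see that the so-called direct_IO does not fully guarantee that the buffer is bypassed; under certain conditions the data is still written through the buffer. Where DirectIO is strictly required, these cases have to be avoided by controlling what is mapped in the cache.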
The small tool vmtouch is a simple and effective way to control the cache.
Analysis of the Linux DirectIO Mechanism comes from OenHan
Link: https://oenhan.com/ext3-fs-directio
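Hi, I have a question: does Direct IO write into a page first and then go to disk through the address_space Direct IO method?
@PEDIA0992 The buffer contents are certainly in a page; the write is carried out through mapping->a_ops->direct_IO.
@OENHAN Suppose the file the user opened already carries the Direct IO flag. When write() is called, does the data still pass through the buffer? From my reading of the code, in that case the data should be written straight to the block device without passing through the buffer, so it would not be written into a page either?
@PEDIA0992 Now I finally see what was bothering you. In the latest code Direct IO writes directly to the block device, not to a cached page, but if generic_file_direct_write does not complete the request, a buffer write is done as well:
@OENHAN Thanks!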