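DirectIO is requested with the O_DIRECT flag when a file is opened. It makes write() send the data straight to the disk instead of into the cache, so that critical data reaches the disk even if the system fails afterwards. The general write path is described in my earlier post <Linux Kernel File Write Flow>; the DirectIO path picks up where that one leaves off.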
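For readers coming from user space, here is a minimal sketch of how a write ends up with O_DIRECT set in file->f_flags. The names dio_is_aligned, dio_write_block and the 4096-byte DIO_ALIGN are my own assumptions for illustration; the real alignment requirement depends on the device and filesystem, and some filesystems reject O_DIRECT outright.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_ALIGN 4096       /* assumed alignment, not queried from the device */

/* Returns 1 if buffer address, length and file offset are all DIO_ALIGN-aligned. */
int dio_is_aligned(const void *buf, size_t len, off_t off)
{
    return ((uintptr_t)buf % DIO_ALIGN == 0) &&
           (len % DIO_ALIGN == 0) &&
           ((uint64_t)off % DIO_ALIGN == 0);
}

/* Writes one aligned block filled with `fill` at offset 0 using O_DIRECT.
 * Returns the number of bytes written, or -errno (for example -EINVAL
 * where the filesystem does not support O_DIRECT). */
ssize_t dio_write_block(const char *path, unsigned char fill)
{
    void *buf = NULL;
    ssize_t n;
    int fd, err;

    assert(path != NULL);
    err = posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN);
    if (err != 0)
        return -err;
    memset(buf, fill, DIO_ALIGN);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        err = errno;
        free(buf);
        return -err;
    }
    n = pwrite(fd, buf, DIO_ALIGN, 0);
    if (n < 0)
        n = -errno;
    close(fd);
    free(buf);
    return n;
}
```

Calling dio_write_block("/some/file", 0xAB) on a filesystem that supports O_DIRECT should take the branch analysed in the rest of this post; the path is a placeholder.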

When the kernel reaches __generic_file_aio_write, it checks file->f_flags & O_DIRECT to decide whether to take the DirectIO branch:

if (unlikely(file->f_flags & O_DIRECT)) {
    loff_t endbyte;
    ssize_t written_buffered;

    written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                                        ppos, count, ocount);
    if (written < 0 || written == count)
        goto out;
    /*
     * direct-io write to a hole: fall through to buffered I/O
     * for completing the rest of the request.
     */
    pos += written;
    count -= written;
    written_buffered = generic_file_buffered_write(iocb, iov,
                                                   nr_segs, pos, ppos, count,
                                                   written);
    /*
     * If generic_file_buffered_write() returned a synchronous error
     * then we want to return the number of bytes which were
     * direct-written, or the error code if that was zero.  Note
     * that this differs from normal direct-io semantics, which
     * will return -EFOO even if some bytes were written.
     */
    if (written_buffered < 0) {
        err = written_buffered;
        goto out;
    }

    /*
     * We need to ensure that the page cache pages are written to
     * disk and invalidated to preserve the expected O_DIRECT
     * semantics.
     */
    endbyte = pos + written_buffered - written - 1;
    err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
    if (err == 0) {
        written = written_buffered;
        invalidate_mapping_pages(mapping,
                                 pos >> PAGE_CACHE_SHIFT,
                                 endbyte >> PAGE_CACHE_SHIFT);
    } else {
        /*
         * We don't know how much we wrote, so just return
         * the number of bytes which were direct-written
         */
    }
}
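The pos, count and endbyte bookkeeping in that branch is easy to misread, so here is a small user-space sketch of the arithmetic. The struct and the function plan_fallback are hypothetical names of mine, and the sketch assumes the buffered path writes all remaining bytes and reports the running total (direct bytes plus buffered bytes), which is how I read the written argument passed into generic_file_buffered_write.

```c
#include <assert.h>
#include <stdint.h>

/* Range handed to the buffered path after a short direct write. */
struct fallback_range {
    int64_t pos;      /* file offset where the buffered write starts */
    int64_t count;    /* bytes still to be written */
    int64_t endbyte;  /* last byte offset to flush and invalidate */
};

/* pos/count describe the original request; written is what the direct
 * path completed, with 0 <= written < count. */
struct fallback_range plan_fallback(int64_t pos, int64_t count, int64_t written)
{
    struct fallback_range r;
    int64_t written_buffered;

    assert(written >= 0 && written < count);

    r.pos = pos + written;                 /* pos += written   */
    r.count = count - written;             /* count -= written */
    /* Assumption: the buffered path finishes the rest and returns the total. */
    written_buffered = written + r.count;
    r.endbyte = r.pos + written_buffered - written - 1;
    return r;
}
```

For an 8192-byte request at offset 0 where the direct path completed 4096 bytes, this gives pos = 4096, count = 4096 and endbyte = 8191, so only the second half of the request is flushed and invalidated.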

Start with generic_file_direct_write. Three calls do the real work there: filemap_write_and_wait_range, invalidate_inode_pages2_range and mapping->a_ops->direct_IO.

filemap_write_and_wait_range flushes the dirty pages under the mapping. It goes through __filemap_fdatawrite_range, which calls do_writepages to do the work:

int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
    int ret;

    if (wbc->nr_to_write <= 0)
        return 0;
    if (mapping->a_ops->writepages)
        ret = mapping->a_ops->writepages(mapping, wbc);
    else
        ret = generic_writepages(mapping, wbc);
    return ret;
}
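The pattern in do_writepages is an operations table with an optional callback and a generic fallback. A toy user-space model of that dispatch, with made-up types (wb_control, mapping, mapping_ops) standing in for the kernel ones, might look like this:

```c
#include <assert.h>
#include <stddef.h>

struct wb_control { long nr_to_write; };       /* stand-in for writeback_control */
struct mapping;
struct mapping_ops {
    int (*writepages)(struct mapping *, struct wb_control *);  /* may be NULL */
};
struct mapping {
    const struct mapping_ops *ops;
    int generic_calls;   /* how often the generic path ran */
    int fs_calls;        /* how often the filesystem hook ran */
};

/* Generic fallback: pretend everything requested was written. */
int generic_writepages_model(struct mapping *m, struct wb_control *wbc)
{
    m->generic_calls++;
    wbc->nr_to_write = 0;
    return 0;
}

/* Filesystem-specific hook, as a filesystem might register it. */
int fs_writepages_model(struct mapping *m, struct wb_control *wbc)
{
    m->fs_calls++;
    wbc->nr_to_write = 0;
    return 0;
}

/* Same shape as do_writepages: nothing to do, hook, or generic path. */
int do_writepages_model(struct mapping *m, struct wb_control *wbc)
{
    assert(m != NULL && m->ops != NULL && wbc != NULL);
    if (wbc->nr_to_write <= 0)
        return 0;
    if (m->ops->writepages)
        return m->ops->writepages(m, wbc);
    return generic_writepages_model(m, wbc);
}
```

A mapping whose ops table leaves writepages as NULL ends up in the generic path, and a request with nr_to_write of 0 returns without calling either.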

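Note that filemap_write_and_wait_range returns 0 on success and a negative error code on failure, not a byte count. If it returns nonzero, generic_file_direct_write returns right away and the next two functions do not run. My understanding of why the flush comes first: the cached data related to a direct write has to reach the disk together with it. Otherwise the direct_IO data would already be on disk while the data cached earlier was not, and after a system failure the file system would be in a broken state.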

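If the flush succeeds and mapping->nrpages is nonzero, invalidate_inode_pages2_range runs next. It checks whether memory holds cached pages for the range that direct_IO is about to write and, if so, invalidates them. The reason is that direct_IO does not cache what it writes. If a clean cached page for the same range survived the write, cache and disk would disagree once direct_IO finished, and an unprotected read from the cache would return data that is no longer what is on disk. If a page cannot be invalidated, the function returns without running the next step; in the -EBUSY case it returns 0, which sends the caller back to a buffered write.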
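Which cached pages count as "corresponding" to a write is plain index arithmetic on the byte range. A minimal sketch, assuming 4 KiB pages (PAGE_SHIFT_ASSUMED and the names below are mine, not kernel identifiers):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_ASSUMED 12   /* 4 KiB pages; an assumption for illustration */

/* Inclusive range of page-cache indices covered by a write. */
struct page_range {
    uint64_t first;
    uint64_t last;
};

/* Maps the byte range [pos, pos + write_len) to page indices. */
struct page_range pages_for_write(uint64_t pos, uint64_t write_len)
{
    struct page_range r;

    assert(write_len > 0);
    r.first = pos >> PAGE_SHIFT_ASSUMED;
    r.last = (pos + write_len - 1) >> PAGE_SHIFT_ASSUMED;
    return r;
}
```

A 2-byte write at offset 4095 straddles a page boundary and therefore covers pages 0 and 1, so both would have to be invalidated.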

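Only now do we reach the real subject, mapping->a_ops->direct_IO. For ext3 it is defined in struct address_space_operations ext3_ordered_aops as ext3_direct_IO, whose core is __blockdev_direct_IO. direct_io_worker assembles the dio structure and then goes through dio_bio_submit, which essentially calls submit_bio(dio->rw, bio) to hand the request to the IO layer. Compared with other reads and writes, direct_io skips the buffer layer and does not rely on the pdflush and kjournald threads to flush data to the IO layer periodically. Even at this point the data is not necessarily on disk; direct_IO simply assumes that the device driver below adds no significant delay.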
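Since a completed write does not by itself prove the data is on stable storage, user code that cares usually adds an explicit fdatasync() (or opens with O_DSYNC). A minimal sketch follows; append_durably is a hypothetical helper of mine, and it uses an ordinary buffered descriptor so it runs on any filesystem, though the same fdatasync() call applies to a descriptor opened with O_DIRECT.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Appends len bytes to path and waits until the data has been pushed to
 * the storage device. Returns 0 on success or -errno on failure. */
int append_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd, err = 0;

    assert(path != NULL && (buf != NULL || len == 0));
    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -errno;

    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* may write fewer bytes than asked */
        if (n < 0) {
            if (errno == EINTR)
                continue;                /* interrupted before writing: retry */
            err = errno;
            break;
        }
        p += n;
        len -= (size_t)n;
    }
    /* Flush the file data (and the metadata needed to read it back). */
    if (err == 0 && fdatasync(fd) < 0)
        err = errno;
    close(fd);
    return -err;
}
```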

Once mapping->a_ops->direct_IO has finished, invalidate_inode_pages2_range runs a second time. The kernel comment gives the reason:

/*
 * Finally, try again to invalidate clean pages which might have been
 * cached by non-direct readahead, or faulted in by get_user_pages()
 * if the source of the write was an mmap'ed region of the file
 * we're writing.  Either one is a pretty crazy thing to do,
 * so we don't support it 100%.  If this invalidation
 * fails, tough, the write still worked...
 */

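When a system gets this complex, a fully rigorous guarantee for every step is hard to come by, and sometimes a crude, brute-force approach is simple and effective.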

Now back to __generic_file_aio_write:

written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                                    ppos, count, ocount);
if (written < 0 || written == count)
    goto out;
/*
 * direct-io write to a hole: fall through to buffered I/O
 * for completing the rest of the request.
 */
pos += written;
count -= written;
written_buffered = generic_file_buffered_write(iocb, iov,
                                               nr_segs, pos, ppos, count,
                                               written);

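If generic_file_direct_write returns a value other than count (and not an error), the rest of the request is written through the cache by generic_file_buffered_write. From the analysis above, that happens when the direct write stops short, for example because it lands in a hole as the comment says, or when generic_file_direct_write returns 0 because a cached page in the range could not be invalidated.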

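So what we see is that direct_IO does not fully guarantee that the buffer is bypassed; under certain conditions the data still goes through a buffered write. Where directIO is strictly required, both of these cases have to be avoided by controlling what is cached and mapped for the file.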
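One way to lower the odds of a fallback from user space is to prepare the file before the direct writes start: allocate its blocks so writes do not land in a hole, then flush and drop whatever the page cache holds for it. The sketch below is only that, a sketch: prepare_for_direct_io is a hypothetical helper, POSIX_FADV_DONTNEED is advisory, and whether preallocated blocks avoid the fallback depends on the filesystem.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Prepares the first len bytes of path for direct writes.
 * Returns 0 on success or -errno on failure. */
int prepare_for_direct_io(const char *path, off_t len)
{
    int fd, err;

    assert(path != NULL && len > 0);
    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -errno;

    /* 1. Allocate blocks for [0, len) so later writes do not hit a hole.
     *    posix_fallocate() returns an errno value directly, not -1. */
    err = posix_fallocate(fd, 0, len);

    /* 2. Write back dirty pages; only clean pages can be dropped below. */
    if (err == 0 && fdatasync(fd) < 0)
        err = errno;

    /* 3. Ask the kernel to drop the cached pages for the range (a hint). */
    if (err == 0)
        err = posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);

    close(fd);
    return -err;
}
```

Pages that other processes keep mapped or locked can still stay resident after this, so it reduces the chance of a buffered fallback rather than ruling it out.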

The small tool vmtouch is a simple and effective way to control the cache.
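To check from a program whether a file still has pages in the cache, the usual technique is mmap() plus mincore(), which as far as I know is also what vmtouch relies on. A minimal sketch, with resident_pages as a hypothetical helper name:

```c
#define _GNU_SOURCE            /* mincore() is not part of plain POSIX */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns how many pages of path are resident in memory, or -errno.
 * The file's total page count is stored in *total. */
long resident_pages(const char *path, long *total)
{
    long page = sysconf(_SC_PAGESIZE);
    long npages, i, resident = 0;
    unsigned char *vec;
    struct stat st;
    void *map;
    int fd, err;

    assert(path != NULL && total != NULL);
    *total = 0;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;
    if (fstat(fd, &st) < 0) {
        err = errno;
        close(fd);
        return -err;
    }
    npages = (long)((st.st_size + page - 1) / page);
    *total = npages;
    if (npages == 0) {
        close(fd);
        return 0;
    }
    /* Mapping alone does not read the file in, so it does not skew the count. */
    map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    err = errno;
    close(fd);                           /* the mapping outlives the descriptor */
    if (map == MAP_FAILED)
        return -err;
    vec = malloc((size_t)npages);
    if (vec == NULL) {
        munmap(map, (size_t)st.st_size);
        return -ENOMEM;
    }
    if (mincore(map, (size_t)st.st_size, vec) < 0) {
        resident = -errno;
    } else {
        for (i = 0; i < npages; i++)
            resident += vec[i] & 1;      /* bit 0 set: page is in memory */
    }
    free(vec);
    munmap(map, (size_t)st.st_size);
    return resident;
}
```

A result of 0 for the range you are about to write is the state you want before a strict direct write.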


"Linux DirectIO Mechanism Analysis" is from OenHan

Link: http://oenhan.com/ext3-fs-directio

5 thoughts on “Linux DirectIO Mechanism Analysis”

  1. Hi, I have a question: does Direct IO write into a page first and then go to disk through the address_space's Direct IO method?

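      1. @OENHAN Suppose the file the user opened already carries the Direct IO flag. When write() is called, does the data still go through the buffer? From my reading of the code, in that case the data should be written straight to the block device without going through the buffer, so it is not written into a page either?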

        1. @PEDIA0992 I finally see what you were stuck on. In the latest code Direct IO writes directly to the block device, not to a cached page, but if generic_file_direct_write does not complete, a buffer write follows:

          /*
           * If the write stopped short of completing, fall back to
           * buffered writes.  Some filesystems do this for writes to
           * holes, for example.  For DAX files, a buffered write will
           * not succeed (even if it did, DAX does not handle dirty
           * page-cache pages correctly).
           */
          if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
              goto out;

          status = generic_perform_write(file, from, pos = iocb->ki_pos);
          
