The causes of iowait and a kernel analysis
We often run into this kind of problem: a process is writing a file very slowly, the total amount of IO at the time (bi, bo) is also very low, yet the iowait reported by vmstat is very high, i.e. the wa column in vmstat's output.
The vmstat man page explains it as follows:
wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.
This is the time the CPU spends waiting on IO; in other words, a process blocked on IO is scheduled off the CPU and does not run immediately, which stretches its total execution time and degrades the program's performance.
So how is wa accounted?
vmstat is implemented in procps:
oenhan@oenhan ~ $ dpkg-query -S /usr/bin/vmstat
procps: /usr/bin/vmstat
Inside procps, the wa value is obtained by getstat(), which reads /proc/stat:
void getstat(jiff *restrict cuse, jiff *restrict cice, jiff *restrict csys, jiff *restrict cide,
             jiff *restrict ciow, jiff *restrict cxxx, jiff *restrict cyyy, jiff *restrict czzz,
             unsigned long *restrict pin, unsigned long *restrict pout,
             unsigned long *restrict s_in, unsigned long *restrict sout,
             unsigned *restrict intr, unsigned *restrict ctxt,
             unsigned int *restrict running, unsigned int *restrict blocked,
             unsigned int *restrict btime, unsigned int *restrict processes)
{
    static int fd;
    unsigned long long llbuf = 0;
    int need_vmstat_file = 0;
    int need_proc_scan = 0;
    const char* b;
    buff[BUFFSIZE-1] = 0;  /* ensure null termination in buffer */

    if(fd){
        lseek(fd, 0L, SEEK_SET);
    }else{
        fd = open("/proc/stat", O_RDONLY, 0);
        if(fd == -1) crash("/proc/stat");
    }
    read(fd,buff,BUFFSIZE-1);

    *intr = 0;
    *ciow = 0;  /* not separated out until the 2.5.41 kernel */
    *cxxx = 0;  /* not separated out until the 2.6.0-test4 kernel */
    *cyyy = 0;  /* not separated out until the 2.6.0-test4 kernel */
    *czzz = 0;  /* not separated out until the 2.6.11 kernel */

    b = strstr(buff, "cpu ");
    if(b) sscanf(b, "cpu %Lu %Lu %Lu %Lu %Lu %Lu %Lu %Lu",
                 cuse, cice, csys, cide, ciow, cxxx, cyyy, czzz);
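To make the arithmetic behind the wa column concrete, here is a minimal userspace sketch, not the procps code itself, assuming the usual field order of the aggregate cpu line (user nice system idle iowait irq softirq steal). It samples /proc/stat twice and reports iowait as a percentage of the total jiffy delta, which is essentially what vmstat prints:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read the aggregate "cpu " line of /proc/stat into an array of jiffy counters. */
static int read_cpu_jiffies(unsigned long long j[8])
{
    char buf[4096];
    FILE *fp = fopen("/proc/stat", "r");
    if (!fp)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    char *b = strstr(buf, "cpu ");
    if (!b)
        return -1;
    /* user nice system idle iowait irq softirq steal */
    return sscanf(b, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                  &j[0], &j[1], &j[2], &j[3], &j[4], &j[5], &j[6], &j[7]) >= 5 ? 0 : -1;
}

int main(void)
{
    unsigned long long a[8] = {0}, b[8] = {0};
    if (read_cpu_jiffies(a))
        return 1;
    sleep(1);                                  /* sampling interval, like vmstat's delay */
    if (read_cpu_jiffies(b))
        return 1;

    unsigned long long total = 0;
    for (int i = 0; i < 8; i++)
        total += b[i] - a[i];
    unsigned long long iowait = b[4] - a[4];   /* 5th field is iowait */
    printf("wa: %.1f%%\n", total ? 100.0 * iowait / total : 0.0);
    return 0;
}

Running it side by side with vmstat over the same interval should give a value close to the wa column.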
/proc/stat, in turn, is produced in the kernel by show_stat(), which accumulates cpustat.iowait to obtain the final value:
iowait = cputime64_add(iowait, get_iowait_time(i));

static cputime64_t get_iowait_time(int cpu)
{
    u64 iowait_time = -1ULL;
    cputime64_t iowait;

    if (cpu_online(cpu))
        iowait_time = get_cpu_iowait_time_us(cpu, NULL);

    if (iowait_time == -1ULL)
        /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
        iowait = kstat_cpu(cpu).cpustat.iowait;
    else
        iowait = usecs_to_cputime64(iowait_time);

    return iowait;
}
The cpustat.iowait counter itself is accumulated in account_system_time(), which decides whether to increment it by checking rq->nr_iowait.
void account_system_time(struct task_struct *p, int hardirq_offset, cputime_t cputime)
{
    struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
    runqueue_t *rq = this_rq();
    cputime64_t tmp;

    p->stime = cputime_add(p->stime, cputime);

    /* Add system time to cpustat. */
    tmp = cputime_to_cputime64(cputime);
    if (hardirq_count() - hardirq_offset)
        cpustat->irq = cputime64_add(cpustat->irq, tmp);
    else if (softirq_count())
        cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
    else if (p != rq->idle)
        cpustat->system = cputime64_add(cpustat->system, tmp);
    else if (atomic_read(&rq->nr_iowait) > 0)
        cpustat->iowait = cputime64_add(cpustat->iowait, tmp);
    else
        cpustat->idle = cputime64_add(cpustat->idle, tmp);

    /* Account for system time used */
    acct_update_integrals(p);
}
One extra remark: account_system_time() is also where finer-grained CPU usage accounting can be observed. Note the order of the else branches: iowait is charged only when the current task is the idle task (p == rq->idle) while at least one task on this runqueue is blocked in IO (nr_iowait > 0); otherwise the tick is charged to irq, softirq, system, or idle.
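From user space, the per-CPU breakdown that account_system_time() maintains is visible in the cpuN lines of /proc/stat. A small sketch (same assumed field order as above) that prints the raw iowait counter of each CPU, which can show whether iowait is concentrated on a single CPU:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Print the per-CPU iowait counters from /proc/stat (5th field of each "cpuN" line). */
int main(void)
{
    char line[512];
    FILE *fp = fopen("/proc/stat", "r");
    if (!fp)
        return 1;
    while (fgets(line, sizeof(line), fp)) {
        unsigned long long user, nice, sys, idle, iowait;
        int cpu;
        /* skip the aggregate "cpu " line, keep only "cpu0", "cpu1", ... */
        if (strncmp(line, "cpu", 3) != 0 || !isdigit((unsigned char)line[3]))
            continue;
        if (sscanf(line, "cpu%d %llu %llu %llu %llu %llu",
                   &cpu, &user, &nice, &sys, &idle, &iowait) == 6)
            printf("cpu%d iowait=%llu jiffies\n", cpu, iowait);
    }
    fclose(fp);
    return 0;
}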
The functions that actually manipulate rq->nr_iowait are io_schedule() and io_schedule_timeout():
void __sched io_schedule(void)
{
    struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());

    delayacct_blkio_start();
    atomic_inc(&rq->nr_iowait);
    schedule();
    atomic_dec(&rq->nr_iowait);
    delayacct_blkio_end();
}

long __sched io_schedule_timeout(long timeout)
{
    struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());
    long ret;

    delayacct_blkio_start();
    atomic_inc(&rq->nr_iowait);
    ret = schedule_timeout(timeout);
    atomic_dec(&rq->nr_iowait);
    delayacct_blkio_end();
    return ret;
}
Since there is no concrete application scenario at hand, the analysis below goes through the calling code itself; concrete application scenarios can be consulted separately.
Looking at the call sites:
Look at io_schedule() first. Its callers include sync_buffer, sync_io, dio_await_one, direct_io_worker, get_request_wait and others: any write to disk eventually ends up in io_schedule(), most visibly with direct IO (DIO), where every write calls it. A buffered write only accumulates data in the page cache, and io_schedule() is called once when the cache is flushed. Even without any explicit write IO, iowait can still be produced: sync_page() is invoked whenever the page cache is synchronized with the disk, for example when mlock() locks memory and __lock_page() triggers a cache sync, which raises iowait just the same; some disk reads trigger it as well.
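As an illustration of the DIO path described above, here is a hedged userspace sketch (the file path and the 4 KB alignment are assumptions; O_DIRECT also fails on filesystems such as tmpfs). Every write() bypasses the page cache, so the caller sleeps in the kernel's direct-IO wait path until the device finishes, and that sleep shows up as iowait:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    /* O_DIRECT needs an aligned buffer; 4096 covers typical block sizes */
    if (posix_memalign(&buf, 4096, 4096))
        return 1;
    memset(buf, 'A', 4096);

    /* /tmp/dio-test is only an example path; use a disk-backed filesystem */
    int fd = open("/tmp/dio-test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* each write goes straight to the device; the caller sleeps in the
       direct-IO wait path and that sleep is accounted as iowait */
    for (int i = 0; i < 1024; i++) {
        if (write(fd, buf, 4096) != 4096) {
            perror("write");
            break;
        }
    }
    close(fd);
    free(buf);
    return 0;
}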
Although io_schedule() has many callers, the real heavyweight is io_schedule_timeout(). io_schedule() is invoked by the kernel and the time spent per call is fairly short, whereas io_schedule_timeout() is almost always given a timeout on the order of HZ/10. For example, balance_dirty_pages() rebalances dirty pages on every file write, wb_kupdate() flushes the cache periodically from a timer, and try_to_free_pages() flushes the cache when reclaiming memory. Not every one of these paths calls io_schedule_timeout() each time, but judging from their conditions, once the amount of dirty cache in memory is large enough, io_schedule_timeout() is triggered heavily; the watermarks are generally dirty_background_ratio and dirty_ratio.
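To see how close a system is to those watermarks, here is a small sketch, assuming the standard /proc interfaces, that prints dirty_background_ratio, dirty_ratio, and the current Dirty/Writeback figures from /proc/meminfo:

#include <stdio.h>
#include <string.h>

/* Print the first line of a file (used for the vm.dirty_* sysctls). */
static void cat_first_line(const char *path)
{
    char buf[128] = "";
    FILE *fp = fopen(path, "r");
    if (fp) {
        if (fgets(buf, sizeof(buf), fp))
            printf("%-40s %s", path, buf);
        fclose(fp);
    }
}

int main(void)
{
    /* the watermarks consulted by balance_dirty_pages() and the flusher path */
    cat_first_line("/proc/sys/vm/dirty_background_ratio");
    cat_first_line("/proc/sys/vm/dirty_ratio");

    /* current amount of dirty page cache and pages under writeback */
    char line[256];
    FILE *fp = fopen("/proc/meminfo", "r");
    if (!fp)
        return 1;
    while (fgets(line, sizeof(line), fp))
        if (strncmp(line, "Dirty:", 6) == 0 || strncmp(line, "Writeback:", 10) == 0)
            printf("%s", line);
    fclose(fp);
    return 0;
}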
Based on the analysis above, to avoid iowait, pay attention to the following:
1. A process with high performance requirements should ideally read all of its data into memory in one pass and mlock() the whole thing, so no further disk reads are needed (see the sketch after this list).
2. Avoid writing to or flushing the disk in that process (this is usually hard to guarantee; if it cannot be avoided, hand the work to a separate, dedicated process).
3. Control overall page-cache usage. This is not easy to handle from an engineering point of view; virtualization is probably more effective, and compared with system-level virtualization (KVM), process-level virtualization (cgroups) should be simpler and more effective.
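For item 1 above, a minimal sketch (the file name data.bin and the MAP_POPULATE plus mlock combination are illustrative choices, not the only way to do this): the data is faulted into memory once up front and the pages are locked, so later accesses never have to wait on the disk:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* "data.bin" is only a placeholder for the process's working set */
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        return 1;
    }

    /* map the whole file and fault it in up front */
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* pin the pages so later accesses never block on the disk */
    if (mlock(p, st.st_size) < 0) {
        perror("mlock");
        return 1;
    }

    /* ... use the locked data here ... */

    munlock(p, st.st_size);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}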
"The causes of iowait and a kernel analysis" is from OenHan; the link is: https://oenhan.com/iowait-wa-vmstat