KVM steal_time Source Code Analysis
Code version: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git, branch v4.3
Someone happened to ask about the steal_time mechanism in the comments under another article, so I took a look and summarize it here.
steal_time originally means the time the hypervisor steals from a VM in a virtualized environment; strictly speaking, it is the time during which a VCPU was not running.
Run the top command inside the guest and you can see an st field:

oenhan@oenhan.com ~$ top
top - 21:04:12 up 1:24, 2 users, load average: 0.45, 0.31, 0.22
Tasks: 268 total, 1 running, 267 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.5 us, 0.2 sy, 0.3 ni, 98.0 id, 0.9 wa, 0.0 hi, 0.0 si, 0.0 st


The point of the st value is to show the guest what share of the CPU it is really getting, so that the guest can adjust its own behavior based on st and avoid hurting its workload. If st is high, the host is granting the VM too small a CPU share and the hypervisor as a whole is heavily loaded, so compute-heavy tasks may want to throttle themselves accordingly.
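Incidentally, top does not compute st itself; it reads the 8th field (steal) of the cpu lines in /proc/stat, which the kernel fills from kcpustat as we will see below. A sample line (the numbers here are purely illustrative):

$ head -1 /proc/stat
cpu  10532 154 3972 834839 1327 0 33 642 0 0

The fields are user, nice, system, idle, iowait, irq, softirq, steal, guest and guest_nice, in USER_HZ ticks, so the 642 above would be the accumulated steal time.
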
The KVM documentation describes it as follows:

MSR_KVM_STEAL_TIME: 0x4b564d03
data: 64-byte alignment physical address of a memory area which must be in guest RAM, plus an enable bit in bit 0. This memory is expected to hold a copy of the following structure:
struct kvm_steal_time {
	__u64 steal;
	__u32 version;
	__u32 flags;
	__u32 pad[12];
}
whose data will be filled in by the hypervisor periodically. Only one write, or registration, is needed for each VCPU. The interval between updates of this structure is arbitrary and implementation-dependent.
The hypervisor may update this structure at any time it sees fit until anything with bit0 == 0 is written to it. Guest is required to make sure this structure is initialized to zero.
Fields have the following meanings:
version: a sequence counter. In other words, guest has to check this field before and after grabbing time information and make sure they are both equal and even. An odd version indicates an in-progress update.
flags: At this point, always zero. May be used to indicate changes in this structure in the future.
steal: the amount of time in which this vCPU did not run, in nanoseconds. Time during which the vcpu is idle, will not be reported as steal time.
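The version field thus acts as a sequence counter. Below is a minimal sketch of how a guest-side reader could honor that protocol; this helper is illustrative only (it is not code from the kernel), and rmb() is the kernel's read memory barrier:

static u64 read_steal(struct kvm_steal_time *st)
{
	u32 version;
	u64 steal;

	do {
		version = READ_ONCE(st->version);
		rmb();		/* read version before the payload */
		steal = READ_ONCE(st->steal);
		rmb();		/* read the payload before re-checking */
	} while ((version & 1) || version != READ_ONCE(st->version));

	return steal;
}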

Now let's look at the actual source code, starting with the guest side.
steal_time is itself a PV (paravirtual) feature; it was presumably developed first for Xen (as used by AWS) and later ported to KVM. Since the stock kernel had no such capability, adding it meant modifying the kernel, so it is grouped under PV; when building a kernel you generally need CONFIG_PARAVIRT=y:

cat /boot/config-4.2.6 | grep CONFIG_PARAVIRT
CONFIG_PARAVIRT=y

During guest kernel boot, initialization calls setup_arch and then kvm_guest_init. It first calls kvm_para_available to decide whether this is a KVM virtualized environment, which boils down to checking whether the CPUID query returns the signature string "KVMKVMKVM"; it then registers kvm_steal_clock as the steal_clock hook:

if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
	has_steal_clock = 1;
	pv_time_ops.steal_clock = kvm_steal_clock;
}
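For reference, the KVM detection mentioned above works roughly as in the sketch below; this is a simplified rendering of kvm_para_available, so treat the details as approximate (KVM_CPUID_SIGNATURE is the hypervisor CPUID leaf 0x40000000):

static bool kvm_detected(void)
{
	unsigned int eax, ebx, ecx, edx;
	char signature[13];

	/* ebx/ecx/edx of the hypervisor signature leaf spell out the vendor */
	cpuid(KVM_CPUID_SIGNATURE, &eax, &ebx, &ecx, &edx);
	memcpy(signature + 0, &ebx, 4);
	memcpy(signature + 4, &ecx, 4);
	memcpy(signature + 8, &edx, 4);
	signature[12] = 0;

	return strcmp(signature, "KVMKVMKVM") == 0;
}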

The code then forks on CONFIG_SMP, but both paths end up calling kvm_guest_cpu_init, which enters kvm_register_steal_time. kvm_register_steal_time does one thing: it registers the physical address of the per-CPU steal_time variable in MSR_KVM_STEAL_TIME, as follows:

static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);

static void kvm_register_steal_time(void)
{
	int cpu = smp_processor_id();
	struct kvm_steal_time *st = &per_cpu(steal_time, cpu);

	if (!has_steal_clock)
		return;

	memset(st, 0, sizeof(*st));
	/* physical address of the per-CPU area, with bit 0 as the enable bit */
	wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
}

At this point the physical address of steal_time has been written into MSR_KVM_STEAL_TIME and the guest-side initialization is complete.
Moving on: kvm_steal_clock, registered as steal_clock above, simply computes the steal time of the given CPU. The code makes it clear that the value comes from the per-CPU steal_time variable; assume for now that this per-CPU variable is kept up to date, and we will come back to who updates it.

The registered steal_clock hook is invoked through paravirt_steal_clock. Going up the call chain we reach steal_account_process_tick, which is what computes the st value top displays. It is called from both account_process_tick and irqtime_account_process_tick, meaning the steal_time update is integrated into the kernel's regular timer tick accounting.

In steal_account_process_tick, paravirt_steal_clock returns a cumulative steal clock; subtracting this_rq()->prev_steal_time yields the steal time of the current interval, which is accumulated into kcpustat_this_cpu->cpustat[CPUTIME_STEAL], and this_rq()->prev_steal_time is refreshed. kcpustat_this_cpu is exactly where the st data shown by top comes from.

static __always_inline bool steal_account_process_tick(void)
{
#ifdef CONFIG_PARAVIRT
	if (static_key_false(&paravirt_steal_enabled)) {
		u64 steal;
		cputime_t steal_ct;

		/* fetch the cumulative steal time */
		steal = paravirt_steal_clock(smp_processor_id());
		/* subtract the value at the last update to get this tick's steal time */
		steal -= this_rq()->prev_steal_time;

		/*
		 * cputime_t may be less precise than nsecs (eg: if it's
		 * based on jiffies). Lets cast the result to cputime
		 * granularity and account the rest on the next rounds.
		 */
		steal_ct = nsecs_to_cputime(steal);
		/* refresh prev_steal_time (assigning the raw steal directly would arguably be faster) */
		this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct);

		/* fold the result into kcpustat_this_cpu */
		account_steal_time(steal_ct);
		return steal_ct;
	}
#endif
	return false;
}
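For completeness, account_steal_time (in kernel/sched/cputime.c; quoted from memory for this version, so verify against your tree) is just the fold into kcpustat that /proc/stat, and hence top, reads:

void account_steal_time(cputime_t cputime)
{
	u64 *cpustat = kcpustat_this_cpu->cpustat;

	cpustat[CPUTIME_STEAL] += (__force u64) cputime;
}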

The other caller of paravirt_steal_clock is update_rq_clock_task, which updates clock_task in the runqueue. Look first at its caller update_rq_clock: the delta passed to update_rq_clock_task is sched_clock_cpu(cpu_of(rq)) - rq->clock. With CONFIG_PARAVIRT_TIME_ACCOUNTING=y, update_rq_clock_task subtracts the steal time from delta before adding it to clock_task.

void update_rq_clock(struct rq *rq)
{
	s64 delta;

	lockdep_assert_held(&rq->lock);
	if (rq->clock_skip_update & RQCF_ACT_SKIP)
		return;

	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
	if (delta < 0)
		return;
	rq->clock += delta;
	update_rq_clock_task(rq, delta);
}

/* irq time accounting is elided from this excerpt */
static void update_rq_clock_task(struct rq *rq, s64 delta)
{
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
	s64 steal = 0;

	if (static_key_false((&paravirt_steal_rq_enabled))) {
		steal = paravirt_steal_clock(cpu_of(rq));
		steal -= rq->prev_steal_time_rq;

		/* never subtract more than the elapsed delta */
		if (unlikely(steal > delta))
			steal = delta;

		rq->prev_steal_time_rq += steal;
		delta -= steal;
	}
#endif
	rq->clock_task += delta;
}

Looking at the users of clock_task, it is the key value the runqueue uses to account task runtime in the scheduler. In other words, steal time is time the guest is aware of, but it is not charged to the runtime accounted in the scheduling runqueues, so task scheduling inside the guest works normally under virtualization.
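
For instance, CFS accounts a task's runtime against clock_task; a trimmed excerpt of update_curr from kernel/sched/fair.c (simplified here, so consult the source for the full version) looks like:

static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	/* rq_clock_task() returns rq->clock_task, i.e. time minus steal */
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;

	curr->exec_start = now;
	curr->sum_exec_runtime += delta_exec;
	/* ... vruntime update etc. elided ... */
}
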
The remaining question is how the per-CPU steal_time variable gets refreshed. kvm_vcpu_arch contains the following structure:

struct {
	u64 msr_val;			/* value written to MSR_KVM_STEAL_TIME */
	u64 last_steal;			/* run_delay at the last guest entry */
	u64 accum_steal;		/* steal accumulated since then */
	struct gfn_to_hva_cache stime;	/* cached gpa->hva mapping of the guest area */
	struct kvm_steal_time steal;	/* host-side copy of the guest structure */
} st;

record_steal_time appears in vcpu_enter_guest, and elsewhere there is another function, accumulate_steal_time. Their call chains are:
kvm_vcpu_ioctl->vcpu_load->kvm_arch_vcpu_load->accumulate_steal_time
kvm_vcpu_ioctl->kvm_arch_vcpu_ioctl_run->vcpu_run->vcpu_enter_guest->record_steal_time
So accumulate_steal_time necessarily executes before record_steal_time; the newer 4.4 code simply moved accumulate_steal_time to the very top of record_steal_time.
accumulate_steal_time does what its name says, computing the steal time; only three lines matter:

delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
vcpu->arch.st.accum_steal = delta;
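In context, the whole function (from arch/x86/kvm/x86.c in this version; quoted from memory, so verify against your tree) is just those lines plus a guard that the MSR is enabled:

static void accumulate_steal_time(struct kvm_vcpu *vcpu)
{
	u64 delta;

	/* nothing to do unless the guest enabled MSR_KVM_STEAL_TIME */
	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
		return;

	delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
	vcpu->arch.st.last_steal = current->sched_info.run_delay;
	vcpu->arch.st.accum_steal = delta;
}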

First, what exactly is run_delay, this "time spent waiting on a runqueue"?

#ifdef CONFIG_SCHED_INFO
struct sched_info {
	/* cumulative counters */
	unsigned long pcount;		/* # of times run on this cpu */
	unsigned long long run_delay;	/* time spent waiting on a runqueue */

	/* timestamps */
	unsigned long long last_arrival,/* when we last ran on a cpu */
			   last_queued;	/* when we were last queued to run */
};
#endif /* CONFIG_SCHED_INFO */

current->sched_info.run_delay is then the run_delay of the vcpu thread (current at this point is the vcpu thread that enters the guest, not the qemu main thread), i.e. the time that thread spent waiting on a runqueue. Take a concrete timeline: suppose the vcpu thread waits on the runqueue during [t1,t2], enters the guest at t3, is scheduled out and waits again during [t3,t4] (treating the post-exit queueing as starting at t3 for simplicity), and enters the guest again at t5. Then at t5 (each guest entry), current->sched_info.run_delay = (t2-t1) + (t4-t3), while vcpu->arch.st.last_steal holds the run_delay recorded at the previous guest entry (t3), namely t2-t1. Hence the steal time for this interval is vcpu->arch.st.accum_steal = t4-t3, and the current interval's steal time is stored in accum_steal.
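
Pictured roughly (an illustrative timeline, with the simplifications just mentioned):

t1 ---wait--- t2 --run-- t3(enter guest) ---wait--- t4 --run-- t5(enter guest)

run_delay(t3) = t2 - t1                     -> saved as last_steal
run_delay(t5) = (t2 - t1) + (t4 - t3)
accum_steal   = run_delay(t5) - last_steal  = t4 - t3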

Now look at record_steal_time. It uses kvm_read_guest_cached and kvm_write_guest_cached, which essentially read from or write to a region of guest memory directly. The gfn_to_hva_cache structure comes in because the gfn-to-hva mapping of the written area generally does not change, so there is no need to redo the translation on every update and waste cycles. The gfn_to_hva_cache in vcpu->arch.st.stime is initialized under the MSR_KVM_STEAL_TIME case of kvm_set_msr_common:

case MSR_KVM_STEAL_TIME:

	if (unlikely(!sched_info_on()))
		return 1;

	if (data & KVM_STEAL_RESERVED_MASK)
		return 1;

	/*
	 * The third argument of kvm_gfn_to_hva_cache_init below is a gpa.
	 * As noted earlier, the guest registered the physical address of its
	 * per-CPU steal_time variable in MSR_KVM_STEAL_TIME, so data (the
	 * value written to that MSR) is that physical address.
	 */
	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.st.stime,
					data & KVM_STEAL_VALID_BITS,
					sizeof(struct kvm_steal_time)))
		return 1;

	vcpu->arch.st.msr_val = data;
	vcpu->arch.st.last_steal = current->sched_info.run_delay;

	break;

kvm_gfn_to_hva_cache_init is a pure address translation, so I won't go into it.
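For reference, the cache it fills in looks roughly like this (from include/linux/kvm_types.h; quoted from memory for this version, so treat it as approximate):

struct gfn_to_hva_cache {
	u64 generation;			/* memslot generation the cache is valid for */
	gpa_t gpa;			/* guest physical address being cached */
	unsigned long hva;		/* corresponding host virtual address */
	unsigned long len;
	struct kvm_memory_slot *memslot;
};
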
Back in record_steal_time: kvm_read_guest_cached is essentially a __copy_from_user that copies the guest structure described by vcpu->arch.st.stime into the host-side vcpu->arch.st.steal. vcpu->arch.st.accum_steal is then added to vcpu->arch.st.steal.steal, the version is bumped by 2 (keeping it even, per the protocol in the documentation quoted above), and kvm_write_guest_cached writes the structure back to the mapped guest address. That is how it lines up with the guest's per-CPU steal_time variable.

if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
	&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
	return;

vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
vcpu->arch.st.steal.version += 2;
vcpu->arch.st.accum_steal = 0;

kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
	&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));

"KVM steal_time Source Code Analysis" comes from OenHan.

Link: https://oenhan.com/kvm-steal-time
