The effect of swappiness on the swap partition

Posted on: 2013-10-25
Categories:FileSystem
Tags:IO, Memory

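The swap partition was introduced by OS designers back when hardware was limited, so that the disk could take over part of the work of memory. At the time, the speed gap between disk and memory had little visible effect on performance, but given the demands of today's high-performance programs, the read/write speed of the swap partition seriously hurts performance.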

swappiness was exposed as the interface that controls how heavily swap is used. Wikipedia describes /proc/sys/vm/swappiness as follows:

Swappiness is a property of the Linux kernel that changes the balance between swapping out runtime memory, as opposed to dropping pages from the system page cache. Swappiness can be set to values between 0 and 100 inclusive. A low value means the kernel will try to avoid swapping as much as possible where a higher value instead will make the kernel aggressively try to use swap space. The default value is 60, and for most desktop systems, setting it to 100 may affect the overall performance, whereas setting it lower (even 0) may improve interactivity (by decreasing response latency.) [1]

Value	Strategy
`vm.swappiness=0`	The kernel will swap only to avoid an out of memory condition.
`vm.swappiness=60`	The default value.
`vm.swappiness=100`	The kernel will swap aggressively which may affect over all performance.

So how does vm.swappiness actually work inside the kernel?

The relevant code lives in refill_inactive_zone, which the kernel uses to rank pages in memory by moving them between the active and inactive lists.

if (!reclaim_mapped ||
    (total_swap_pages == 0 && PageAnon(page)) ||
    page_referenced(page, 0)) {
	list_add(&page->lru, &l_active);
	continue;
}

if (zone_is_near_oom(zone))
	goto force_reclaim_mapped;
............
if (swap_tendency >= 100)
force_reclaim_mapped:
	reclaim_mapped = 1;

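The kernel uses the reclaim_mapped flag to decide whether a page goes back onto the active list. When reclaim_mapped=1, the page is no longer kept active (unless it was recently referenced, or it is anonymous and there is no swap space) and becomes a preferred candidate for being moved out to the swap partition. As for when reclaim_mapped is set to 1: first, when the kernel judges that the zone is about to hit OOM, it sets reclaim_mapped=1 directly and ignores swappiness.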

static inline int zone_is_near_oom(struct zone *zone)
{
	return zone->pages_scanned >= (zone->nr_active + zone->nr_inactive)*4;
}

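The near-OOM check is simple: the zone counts as near OOM once the number of pages scanned during reclaim reaches four times the total number of active and inactive pages. I am not sure how the factor of 4 was chosen; my own reading is that scanning four times over is already very thorough, and at that point the zone is close to the edge of OOM.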

swappiness only comes into play when the system is not near OOM.

/*
 * `distress' is a measure of how much trouble we're having
 * reclaiming pages.  0 -> no problems.  100 -> great trouble.
 */
distress = 100 >> zone->prev_priority;

/*
 * The point of this algorithm is to decide when to start
 * reclaiming mapped memory instead of just pagecache.  Work out
 * how much memory
 * is mapped.
 */
mapped_ratio = ((sc->nr_mapped+sc->nr_anon) * 100) / total_memory;

/*
 * Now decide how much we really want to unmap some pages.  The
 * mapped ratio is downgraded - just because there's a lot of
 * mapped memory doesn't necessarily mean that page reclaim
 * isn't succeeding.
 *
 * The distress ratio is important - we don't want to start
 * going oom.
 *
 * A 100% value of vm_swappiness overrides this algorithm
 * altogether.
 */
swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;

/*
 * If there's huge imbalance between active and inactive
 * (think active 100 times larger than inactive) we should
 * become more permissive, or the system will take too much
 * cpu before it start swapping during memory pressure.
 * Distress is about avoiding early-oom, this is about
 * making swappiness graceful despite setting it to low
 * values.
 *
 * Avoid div by zero with nr_inactive+1, and max resulting
 * value is vm_total_pages.
 */
imbalance  = zone->nr_active;
imbalance /= zone->nr_inactive + 1;

/*
 * Reduce the effect of imbalance if swappiness is low,
 * this means for a swappiness very low, the imbalance
 * must be much higher than 100 for this logic to make
 * the difference.
 *
 * Max temporary value is vm_total_pages*100.
 */
imbalance *= vm_swappiness + 1;
imbalance /= 100;

/*
 * If not much of the ram is mapped, makes the imbalance
 * less relevant, it's high priority we refill the inactive
 * list with mapped pages only in presence of high ratio of
 * mapped pages.
 *
 * Max temporary value is vm_total_pages*100.
 */
imbalance *= mapped_ratio;
imbalance /= 100;

/* apply imbalance feedback to swap_tendency */
swap_tendency += imbalance;

/*
 * Now use this metric to decide whether to start moving mapped
 * memory onto the inactive list.
 */
if (swap_tendency >= 100)
force_reclaim_mapped:
	reclaim_mapped = 1;

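The comments above make the logic easier to follow. With swappiness=100, swap_tendency is at least 100 regardless of the other terms, so nothing can stop reclaim_mapped from being set to 1, which matches the usual description. With swappiness=0, however, there is no guarantee that reclaim_mapped stays 0. Leaving the near-OOM case aside, mapped_ratio / 2 can contribute up to 50 of the required 100, and distress alone reaches 100 once prev_priority drops to 0. mapped_ratio is not the most dangerous term, though: the ratio of zone->nr_active to zone->nr_inactive can diverge much further and push imbalance up, although by that point the system is probably close to OOM anyway.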

Judging from the kernel code, with swappiness=100 swap works aggressively, and with swappiness=0 swap may still kick in and hurt program performance.

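In my personal opinion, swap no longer has any reason to exist for high-performance programs. It adds complexity to the system and hurts performance at unpredictable times. As for guarding against OOM, an Internet service would most likely be brought down by a cascading failure anyway; see 12306. In short: get rid of swap. Compared with the value of high performance, RAM is cheap; alternatively, consider flashcache.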

"The effect of swappiness on the swap partition" comes from OenHan.

Link: https://oenhan.com/swappiness-swap
