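The swap partition was introduced by OS designers back when hardware was limited, so that disk could take over part of the work of main memory. At the time, the speed gap between disk and RAM had no obvious effect on performance, but with the demands of today's high-performance programs, the read and write speed of swap seriously hurts performance.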

swappiness was in turn exposed as the interface for controlling how heavily swap is used. Wikipedia describes /proc/sys/vm/swappiness as follows:

Swappiness is a property of the Linux kernel that changes the balance between swapping out runtime memory, as opposed to dropping pages from the system page cache. Swappiness can be set to values between 0 and 100 inclusive. A low value means the kernel will try to avoid swapping as much as possible where a higher value instead will make the kernel aggressively try to use swap space. The default value is 60, and for most desktop systems, setting it to 100 may affect the overall performance, whereas setting it lower (even 0) may improve interactivity (by decreasing response latency.)[1]

Value               Strategy
vm.swappiness=0     The kernel will swap only to avoid an out of memory condition.
vm.swappiness=60    The default value.
vm.swappiness=100   The kernel will swap aggressively which may affect overall performance.
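
To check which value a running system is using, the file can simply be read. The following is a minimal sketch in C, assuming a Linux system with procfs mounted; the helper name read_swappiness is made up for this example and is not a kernel or libc API.

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns the integer stored in the given file,
 * or -1 if the file cannot be opened or does not start with a number.
 */
int read_swappiness(const char *path)
{
    FILE *fp = fopen(path, "r");
    int value;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%d", &value) != 1)
        value = -1;
    fclose(fp);
    return value;
}
```

Calling read_swappiness("/proc/sys/vm/swappiness") should return the current setting. Taking the path as a parameter keeps the helper usable on systems where the file is missing, in which case it returns -1.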

So how does vm.swappiness actually work inside the kernel?

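The actual work is done in refill_inactive_zone, which the kernel uses to rank in-memory pages by priority, deciding which stay on the active list and which move toward reclaim.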

if (!reclaim_mapped ||
    (total_swap_pages == 0 && PageAnon(page)) ||
    page_referenced(page, 0)) {
        list_add(&page->lru, &l_active);
        continue;
}

if (zone_is_near_oom(zone))
        goto force_reclaim_mapped;
............
if (swap_tendency >= 100)
force_reclaim_mapped:
        reclaim_mapped = 1;

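The kernel uses the reclaim_mapped flag to decide whether a page is put back on the active list. When reclaim_mapped=1, a page that has not been referenced recently is no longer kept as active and becomes a preferred candidate for being moved out to swap. As for how reclaim_mapped=1 gets set: first of all, when the kernel judges that the zone is about to go OOM, it sets reclaim_mapped=1 directly and ignores swappiness.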
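
The per-page test in the excerpt can be modelled in user space. This is only a sketch: struct page_model and stays_active are names invented here, and the two booleans stand in for what PageAnon(page) and page_referenced(page, 0) would report in the kernel.

```c
#include <stdbool.h>

/* Simplified stand-in for the page properties the excerpt inspects. */
struct page_model {
    bool is_anon;     /* models PageAnon(page) */
    bool referenced;  /* models page_referenced(page, 0) != 0 */
};

/*
 * Returns true when the page goes back on the active list (l_active),
 * false when it is allowed to move on toward reclaim.
 */
bool stays_active(bool reclaim_mapped, long total_swap_pages,
                  const struct page_model *page)
{
    return !reclaim_mapped ||
           (total_swap_pages == 0 && page->is_anon) ||
           page->referenced;
}
```

With reclaim_mapped unset every page stays active. Once it is set, only recently referenced pages stay, plus anonymous pages when no swap space is configured at all.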

static inline int zone_is_near_oom(struct zone *zone)
{
        return zone->pages_scanned >= (zone->nr_active + zone->nr_inactive)*4;
}

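The near-OOM test is fairly simple: the number of pages already scanned during reclaim has reached at least 4 times the total of active and inactive pages. I am not sure how this multiplier was chosen. My own reading is that scanning 4 times the list size is already very deep, so the zone is close to the edge of OOM.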
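
A small worked example of that threshold, as a user-space sketch. The struct zone_model and the function near_oom are hypothetical names that mirror only the three counters used by zone_is_near_oom.

```c
/* Only the counters the near-OOM test reads; not the kernel's struct zone. */
struct zone_model {
    unsigned long pages_scanned;
    unsigned long nr_active;
    unsigned long nr_inactive;
};

/* Same comparison as zone_is_near_oom, applied to the model. */
int near_oom(const struct zone_model *z)
{
    return z->pages_scanned >= (z->nr_active + z->nr_inactive) * 4;
}
```

For a zone with 1000 active and 500 inactive pages the threshold is (1000 + 500) * 4 = 6000 scanned pages: 5999 is not yet near OOM, 6000 is.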

Only when the system is not near OOM does the swappiness setting come into play.

/*
 * `distress' is a measure of how much trouble we're having
 * reclaiming pages.  0 -> no problems.  100 -> great trouble.
 */
distress = 100 >> zone->prev_priority;

/*
 * The point of this algorithm is to decide when to start
 * reclaiming mapped memory instead of just pagecache.  Work out
 * how much memory is mapped.
 */
mapped_ratio = ((sc->nr_mapped+sc->nr_anon) * 100) / total_memory;

/*
 * Now decide how much we really want to unmap some pages.  The
 * mapped ratio is downgraded - just because there's a lot of
 * mapped memory doesn't necessarily mean that page reclaim
 * isn't succeeding.
 *
 * The distress ratio is important - we don't want to start
 * going oom.
 *
 * A 100% value of vm_swappiness overrides this algorithm
 * altogether.
 */
swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;

/*
 * If there's huge imbalance between active and inactive
 * (think active 100 times larger than inactive) we should
 * become more permissive, or the system will take too much
 * cpu before it start swapping during memory pressure.
 * Distress is about avoiding early-oom, this is about
 * making swappiness graceful despite setting it to low
 * values.
 *
 * Avoid div by zero with nr_inactive+1, and max resulting
 * value is vm_total_pages.
 */
imbalance  = zone->nr_active;
imbalance /= zone->nr_inactive + 1;

/*
 * Reduce the effect of imbalance if swappiness is low,
 * this means for a swappiness very low, the imbalance
 * must be much higher than 100 for this logic to make
 * the difference.
 *
 * Max temporary value is vm_total_pages*100.
 */
imbalance *= vm_swappiness + 1;
imbalance /= 100;

/*
 * If not much of the ram is mapped, makes the imbalance
 * less relevant, it's high priority we refill the inactive
 * list with mapped pages only in presence of high ratio of
 * mapped pages.
 *
 * Max temporary value is vm_total_pages*100.
 */
imbalance *= mapped_ratio;
imbalance /= 100;

/* apply imbalance feedback to swap_tendency */
swap_tendency += imbalance;

/*
 * Now use this metric to decide whether to start moving mapped
 * memory onto the inactive list.
 */
if (swap_tendency >= 100)
force_reclaim_mapped:
        reclaim_mapped = 1;

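The comments above make the code easier to follow. With swappiness=100, swap_tendency is at least 100 whatever the other terms are, so nothing can prevent reclaim_mapped=1, which matches the usual description. With swappiness=0, however, reclaim_mapped=0 is not guaranteed. Leaving the near-OOM shortcut aside, mapped_ratio / 2 can supply part of the 100, and distress reaches 100 by itself once prev_priority falls to 0. mapped_ratio is not the most dangerous term, though: the ratio of zone->nr_active to zone->nr_inactive can diverge much further, although by that point the system is probably close to OOM anyway.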
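
To try concrete numbers, the arithmetic of the excerpt can be copied into a small user-space function. This is a sketch under stated assumptions: struct reclaim_inputs, swap_tendency_model and should_reclaim_mapped are invented names, the inputs are passed in directly instead of being read from the zone, and prev_priority is assumed to be a small non-negative value such as 0 to 12.

```c
/* Inputs the excerpt reads from the zone, the scan control and the sysctl. */
struct reclaim_inputs {
    int  prev_priority;  /* zone->prev_priority, 0 means most urgent */
    long mapped_ratio;   /* percentage of memory that is mapped, 0..100 */
    long nr_active;      /* zone->nr_active */
    long nr_inactive;    /* zone->nr_inactive */
    int  swappiness;     /* vm_swappiness, 0..100 */
};

/* Same sequence of integer operations as the kernel excerpt. */
long swap_tendency_model(const struct reclaim_inputs *in)
{
    long distress = 100 >> in->prev_priority;
    long tendency = in->mapped_ratio / 2 + distress + in->swappiness;
    long imbalance = in->nr_active / (in->nr_inactive + 1);

    imbalance = imbalance * (in->swappiness + 1) / 100;
    imbalance = imbalance * in->mapped_ratio / 100;
    return tendency + imbalance;
}

/* Non-zero when the excerpt would set reclaim_mapped = 1. */
int should_reclaim_mapped(const struct reclaim_inputs *in)
{
    return swap_tendency_model(in) >= 100;
}
```

With swappiness=100 the result is at least 100 even when every other input is 0. With swappiness=0, prev_priority=12, mapped_ratio=40 and equally sized lists the result is 20, so mapped pages are left alone. Keeping swappiness=0 but dropping prev_priority to 0 gives distress=100 and the threshold is reached. So does a very lopsided zone: nr_active=1000000, nr_inactive=99 and mapped_ratio=80 give 40 + 80 = 120.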

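Judging from the kernel code, then: with swappiness=100 swap works aggressively, and with swappiness=0 swap may still be used and can still affect program performance.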

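In my personal view, swap no longer has any reason to exist for high-performance programs. It adds complexity to the system and hurts program performance at unpredictable moments. As for guarding against OOM, an Internet service would probably be brought down by the resulting avalanche effect anyway; see 12306. In short: get rid of swap. Compared with the value of program performance, RAM is cheap. Alternatively, consider flashcache.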


"The effect of swappiness on the swap partition" is from OenHan

Link: http://oenhan.com/swappiness-swap
