1. Dirty Page

The "dirty" state refers to a situation where a value has been modified in memory but has not yet been written back to disk, i.e. memory and disk are out of sync.
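As a quick illustration from user space, the sketch below (a hypothetical demo, not kernel code) writes to a file: after write() returns, the data lives in the page cache as dirty pages, and only writeback, here forced with fsync(), brings disk back in sync with memory.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/tmp/dirty-demo", O_CREAT | O_WRONLY | O_TRUNC, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* write() only copies into the page cache; those pages are now dirty */
        if (write(fd, "hello", 5) != 5)
                perror("write");

        /* Until writeback runs, memory and disk disagree;
         * fsync() forces this file's dirty pages out to disk. */
        if (fsync(fd) < 0)
                perror("fsync");

        close(fd);
        return 0;
}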

 

mm/page-writeback.c

/**
 * node_dirty_ok - tells whether a node is within its dirty limits
 * @pgdat: the node to check
 *
 * Return: %true when the dirty pages in @pgdat are within the node's
 * dirty limit, %false if the limit is exceeded.
 */
bool node_dirty_ok(struct pglist_data *pgdat)
{
        unsigned long limit = node_dirty_limit(pgdat);
        unsigned long nr_pages = 0;

        nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
        nr_pages += node_page_state(pgdat, NR_WRITEBACK);

        return nr_pages <= limit;
}

This function takes a node (struct pglist_data) as input and checks whether additional dirty pages may be allocated on that node.

It looks up the node's dirty limit (the maximum number of pages allowed to be dirty), adds up the counts of currently dirty (NR_FILE_DIRTY) and writeback (NR_WRITEBACK) pages, compares the sum against the limit, and returns true or false.
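For reference, node_dirty_ok() is consulted from get_page_from_freelist() when dirty pages are being spread across nodes (__GFP_WRITE); an abridged excerpt of the zone iteration loop from the same kernel version:

mm/page_alloc.c (abridged)

        if (ac->spread_dirty_pages) {
                if (last_pgdat_dirty_limit == zone->zone_pgdat)
                        continue;

                if (!node_dirty_ok(zone->zone_pgdat)) {
                        last_pgdat_dirty_limit = zone->zone_pgdat;
                        continue;
                }
        }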


mm/page-writeback.c

/**
 * node_dirty_limit - maximum number of dirty pages allowed in a node
 * @pgdat: the node
 *
 * Return: the maximum number of dirty pages allowed in a node, based
 * on the node's dirtyable memory.
 */
static unsigned long node_dirty_limit(struct pglist_data *pgdat)
{
        unsigned long node_memory = node_dirtyable_memory(pgdat);
        struct task_struct *tsk = current;
        unsigned long dirty;

        if (vm_dirty_bytes)
                dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) *
                        node_memory / global_dirtyable_memory();
        else
                dirty = vm_dirty_ratio * node_memory / 100;

        if (rt_task(tsk))
                dirty += dirty / 4;

        return dirty;
}

The number of dirty pages a node may hold is set as a percentage of its dirtyable memory.

Looking at node_dirty_limit(), which reports this dirty limit, it first computes the node's dirtyable memory and then derives the limit: if vm_dirty_bytes is set, proportionally to the node's share of globally dirtyable memory; otherwise as vm_dirty_ratio percent of the node's dirtyable memory. Real-time tasks are granted an extra 25% on top.
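As a rough worked example (made-up numbers): with vm_dirty_bytes unset, vm_dirty_ratio = 20, and a node with 1,000,000 dirtyable pages, the limit is 20 * 1,000,000 / 100 = 200,000 pages; for a real-time task it becomes 200,000 + 200,000 / 4 = 250,000 pages.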


2. Slowpath

The slowpath runs when the fastpath memory allocation fails. In this case it makes room for the allocation by reclaiming pages, for example via the kswapd daemon.

 

  • kswapd

kswapd is the Linux daemon responsible for freeing pages; it runs in the background.

Freeing and then reallocating memory takes a lot of resources, so it easily introduces delays. Linux therefore defines a threshold, and when free memory drops below it, the kswapd daemon frees memory in the background ahead of time.

 

The value this threshold is based on is min_free_kbytes.
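From user space the current value can be read via procfs; a minimal sketch:

#include <stdio.h>

int main(void)
{
        /* min_free_kbytes is exported as a vm sysctl */
        FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "r");
        unsigned long kbytes;

        if (!f) {
                perror("fopen");
                return 1;
        }
        if (fscanf(f, "%lu", &kbytes) == 1)
                printf("min_free_kbytes = %lu\n", kbytes);
        fclose(f);
        return 0;
}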

 

  • GFP flags

__GFP_DIRECT_RECLAIM: the caller may enter direct reclaim and block; it can be cleared to avoid delays when a fallback option is available
__GFP_ATOMIC: no sleeping and no reclaim
__GFP_KSWAPD_RECLAIM: once the low watermark is reached, wake kswapd and have it free memory until the high watermark is reached
__GFP_NOMEMALLOC: forbids allocating pages from the emergency-only reserve memory
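For context, the familiar composite flags are built from these bits; quoting include/linux/gfp.h from the same kernel version (comments added):

/* include/linux/gfp.h (v5.14, excerpt) */
#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))

/* never sleeps: may dip into reserves and only wakes kswapd */
#define GFP_ATOMIC      (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
/* may sleep: direct reclaim, I/O and FS callbacks are all allowed */
#define GFP_KERNEL      (__GFP_RECLAIM | __GFP_IO | __GFP_FS)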


mm/page_alloc.c

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                       struct alloc_context *ac)
{
        bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
        const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
        struct page *page = NULL;
        unsigned int alloc_flags;
        unsigned long did_some_progress;
        enum compact_priority compact_priority;
        enum compact_result compact_result;
        int compaction_retries;
        int no_progress_loops;
        unsigned int cpuset_mems_cookie;
        int reserve_flags;

        /*
         * We also sanity check to catch abuse of atomic reserves being used by
         * callers that are not in atomic context.
         */
        if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
                         (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
                gfp_mask &= ~__GFP_ATOMIC;

retry_cpuset:
        compaction_retries = 0;
        no_progress_loops = 0;
        compact_priority = DEF_COMPACT_PRIORITY;
        cpuset_mems_cookie = read_mems_allowed_begin();

        /*
         * The fast path uses conservative alloc_flags to succeed only until
         * kswapd needs to be woken up, and to avoid the cost of setting up
         * alloc_flags precisely. So we do that now.
         */
        alloc_flags = gfp_to_alloc_flags(gfp_mask);

        /*
         * We need to recalculate the starting point for the zonelist iterator
         * because we might have used different nodemask in the fast path, or
         * there was a cpuset modification and we are retrying - otherwise we
         * could end up iterating over non-eligible zones endlessly.
         */
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                        ac->highest_zoneidx, ac->nodemask);
        if (!ac->preferred_zoneref->zone)
                goto nopage;

        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);

        /*
         * The adjusted alloc_flags might result in immediate success, so try
         * that first
         */
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
                goto got_pg;

        /*
         * For costly allocations, try direct compaction first, as it's likely
         * that we have enough base pages and don't need to reclaim. For non-
         * movable high-order allocations, do that as well, as compaction will
         * try prevent permanent fragmentation by migrating from blocks of the
         * same migratetype.
         * Don't try this for allocations that are allowed to ignore
         * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
         */
        if (can_direct_reclaim &&
                        (costly_order ||
                           (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
                        && !gfp_pfmemalloc_allowed(gfp_mask)) {
                page = __alloc_pages_direct_compact(gfp_mask, order,
                                                alloc_flags, ac,
                                                INIT_COMPACT_PRIORITY,
                                                &compact_result);
                if (page)
                        goto got_pg;

                /*
                 * Checks for costly allocations with __GFP_NORETRY, which
                 * includes some THP page fault allocations
                 */
                if (costly_order && (gfp_mask & __GFP_NORETRY)) {
                        /*
                         * If allocating entire pageblock(s) and compaction
                         * failed because all zones are below low watermarks
                         * or is prohibited because it recently failed at this
                         * order, fail immediately unless the allocator has
                         * requested compaction and reclaim retry.
                         *
                         * Reclaim is
                         *  - potentially very expensive because zones are far
                         *    below their low watermarks or this is part of very
                         *    bursty high order allocations,
                         *  - not guaranteed to help because isolate_freepages()
                         *    may not iterate over freed pages as part of its
                         *    linear scan, and
                         *  - unlikely to make entire pageblocks free on its
                         *    own.
                         */
                        if (compact_result == COMPACT_SKIPPED ||
                            compact_result == COMPACT_DEFERRED)
                                goto nopage;

                        /*
                         * Looks like reclaim/compaction is worth trying, but
                         * sync compaction could be very expensive, so keep
                         * using async compaction.
                         */
                        compact_priority = INIT_COMPACT_PRIORITY;
                }
        }

retry:
        /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);

        reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
        if (reserve_flags)
                alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags);

        /*
         * Reset the nodemask and zonelist iterators if memory policies can be
         * ignored. These allocations are high priority and system rather than
         * user oriented.
         */
        if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
                ac->nodemask = NULL;
                ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                        ac->highest_zoneidx, ac->nodemask);
        }

        /* Attempt with potentially adjusted zonelist and alloc_flags */
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
                goto got_pg;

        /* Caller is not willing to reclaim, we can't balance anything */
        if (!can_direct_reclaim)
                goto nopage;

        /* Avoid recursion of direct reclaim */
        if (current->flags & PF_MEMALLOC)
                goto nopage;

        /* Try direct reclaim and then allocating */
        page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                                        &did_some_progress);
        if (page)
                goto got_pg;

        /* Try direct compaction and then allocating */
        page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                                        compact_priority, &compact_result);
        if (page)
                goto got_pg;

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;

        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_RETRY_MAYFAIL
         */
        if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
                goto nopage;

        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;

        /*
         * It doesn't make any sense to retry for the compaction if the order-0
         * reclaim is not able to make any progress because the current
         * implementation of the compaction depends on the sufficient amount
         * of free memory (see __compaction_suitable)
         */
        if (did_some_progress > 0 &&
                        should_compact_retry(ac, order, alloc_flags,
                                compact_result, &compact_priority,
                                &compaction_retries))
                goto retry;

        /* Deal with possible cpuset update races before we start OOM killing */
        if (check_retry_cpuset(cpuset_mems_cookie, ac))
                goto retry_cpuset;

        /* Reclaim has failed us, start killing things */
        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
        if (page)
                goto got_pg;

        /* Avoid allocations with no watermarks from looping endlessly */
        if (tsk_is_oom_victim(current) &&
            (alloc_flags & ALLOC_OOM ||
             (gfp_mask & __GFP_NOMEMALLOC)))
                goto nopage;

        /* Retry as long as the OOM killer is making progress */
        if (did_some_progress) {
                no_progress_loops = 0;
                goto retry;
        }

nopage:
        /* Deal with possible cpuset update races before we fail */
        if (check_retry_cpuset(cpuset_mems_cookie, ac))
                goto retry_cpuset;

        /*
         * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
         * we always retry
         */
        if (gfp_mask & __GFP_NOFAIL) {
                /*
                 * All existing users of the __GFP_NOFAIL are blockable, so warn
                 * of any new users that actually require GFP_NOWAIT
                 */
                if (WARN_ON_ONCE(!can_direct_reclaim))
                        goto fail;

                /*
                 * PF_MEMALLOC request from this context is rather bizarre
                 * because we cannot reclaim anything and only can loop waiting
                 * for somebody to do a work for us
                 */
                WARN_ON_ONCE(current->flags & PF_MEMALLOC);

                /*
                 * non failing costly orders are a hard requirement which we
                 * are not prepared for much so let's warn about these users
                 * so that we can identify them and convert them to something
                 * else.
                 */
                WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);

                /*
                 * Help non-failing allocations by giving them access to memory
                 * reserves but do not use ALLOC_NO_WATERMARKS because this
                 * could deplete whole memory reserves which would just make
                 * the situation worse
                 */
                page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
                if (page)
                        goto got_pg;

                cond_resched();
                goto retry;
        }
fail:
        warn_alloc(gfp_mask, ac->nodemask,
                        "page allocation failure: order:%u", order);
got_pg:
        return page;
}

This is how the slowpath operates. Briefly, the process is as follows.

Based on the GFP flags, values are set up that determine things like whether direct reclaim is allowed and down to which watermark pages may be taken (gfp_to_alloc_flags()). According to these flags the kernel keeps attempting to free memory, checking after each attempt whether enough memory has been secured and, if more is needed, trying another round of reclaim.
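Condensed, the order of attempts in __alloc_pages_slowpath() above is roughly:

1. gfp_to_alloc_flags(): translate the GFP mask into alloc_flags.
2. wake_all_kswapds(): kick off background reclaim, then retry get_page_from_freelist().
3. __alloc_pages_direct_compact(): for costly orders, try compaction first.
4. __alloc_pages_direct_reclaim() and __alloc_pages_direct_compact(): synchronous reclaim/compaction, repeated as long as should_reclaim_retry()/should_compact_retry() allow.
5. __alloc_pages_may_oom(): as a last resort, invoke the OOM killer.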

 

Of course, along the way the code yields the CPU at certain points (cond_resched()) so that other work can run, and it caps the retry count (should_reclaim_retry() gives up after MAX_RECLAIM_RETRIES no-progress loops) so the allocation does not retry forever.

 

Likewise, for the reserve memory that may only be used in emergencies, flags (__GFP_MEMALLOC, __GFP_NOMEMALLOC) determine what to do when memory runs short, and the slowpath consults them through __gfp_pfmemalloc_flags().
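The helper that computes reserve_flags, abridged from the same file, looks roughly like this:

mm/page_alloc.c (abridged)

static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
{
        if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
                return 0;       /* reserves are explicitly forbidden */
        if (gfp_mask & __GFP_MEMALLOC)
                return ALLOC_NO_WATERMARKS;
        if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
                return ALLOC_NO_WATERMARKS;
        if (!in_interrupt()) {
                if (current->flags & PF_MEMALLOC)
                        return ALLOC_NO_WATERMARKS;
                else if (oom_reserves_allowed(current))
                        return ALLOC_OOM;
        }

        return 0;
}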

 

 

  • ref

https://brunch.co.kr/@alden/14

코드로 알아보는 ARM 리눅스 커널

https://elixir.bootlin.com/linux/v5.14.16/source/
