Ebru A.k.a.gündüz: collapse huge page

Geçtiğimiz yaz haziran sonunda mezun oldum. Mezuniyetten sonra bir işe girip çalışmak, her gün işe gidip gelmek, birkaç ay işim dışında bir şeye bakmamak, sonrasında iş temposuyla birlikte neler yapabileceğime karar vermek gibi bir düşüncem vardı. Ancak tabi ki böyle olmadı :).

Ağustos başında Google İrlanda ofisinden iş görüşmesi için e-posta aldım, benim ile başlangıç bir telefon görüşmesi yapmak istediklerini söylediler. Hemen kabul ettim. İlk üç aşamayı geçtim, son görüşme için Irlanda'ya gittim ancak son görüşmede başarılı olamadım. Sorular beklediğim gibi değildi, internetten çalıştığım gibi de değildi.

Ben evde son görüşmeye hazırlanırken Gnome OPW için başvurular da başlamıştı. Ben daha önce ilk Gnome'un araçlarına sonra da Linux Kernel'a başvurmuştum. Gnome'a katkı verirken katkı vermek için masaüstü bilgisayarınızda ortam oluşturmak zor. En son Fedora 19'un alfa sürümünü kullanmak zorunda kalmam ve alfanın hiç kullanışlı olmaması üzerine katkı vermeyi sonlandırdım :). Aslında jhubild'de kullanabiliriz ama onda ortamı hazırlaması .. bana bir tane geliştiricisi o zamanlar Fedora beta sürümü varken, beta kullanmamı önermişti, beta yine kullanılabilirdi ancak bir sonraki sürüme geçtiklerinde alfa kullanmak zorunda kalmam pek iyi olmadı. Bir de Gnome Continuous var, onu yeni gördüm ama henüz denemedim.

Linux Kernel'a katkı verdiğim sene aslında alınmayı beklemiştim gerçekten ama olmadı. Bu dönem başvuru süresi 2 ay gibi uzundu :), geçen sene 3 hafta gibi bir süreydi. Necdet hoca "Aslında sen başvursan çok şey yaparsın" dedi, ben de başvurdum ve alındım :). Alınmayı gerçekten beklemiyordum ve çok güzel bir sürpriz oldu.

Şimdi Rik van Riel ile birlikte bellek yönetimi üzerinde çalışıyorum. Bellek yönetimi katkı vermeye başlamanın en zor olduğu kısımlardan biri, çünkü çekirdeğin temel fonksiyonlarını içeriyor ve daha karmaşık. Başvurduğunuz projeye göre staj sürecinde ne kadar yama gönderebileceğinizin sayısı da değişiyor. Ben projem zor olduğundan, 4 ayda otuz satır yazabilir miyim derken şimdiden bir yamayı kabul ettiler bile :).

Linux kaynak kodunda bellek yönetimi ile ilgili dizin mm/. Eğer güncel mm dizinini takip etmek istiyorsak da Linus Torvalds'ın kullandığı dalı değil de linux-next'i takip etmek gerekiyor. Yamaları Andrew Morton kabul ediyor ve günlük olarak etiketliyor. Güncel linux-next'i nasıl takip edeceğinizi görmek için buraya bakabilirsiniz.

Anladığım kadarıyla, Linux'ta bellek yönetimi üzerine çoğunlukla Redhat ekibi bakıyor, çünkü birçok sunumu ve belgeyi o ekip hazırlamış. Ben yamaları gönderirken mm dizini bakıcılarına baktığımda genelde @redhat.com alan adlı hesaplar var. Yamaları vger.kernel.org ve akpm@linux-foundation.org listelerine gönderdiğim için de çok mutluyum. O kadar büyük listelerdeki insanların kodlara bakıyor olması oldukça heyecanlı :).

Üzerinde çalıştığım proje ise, bellek üzerinde 2kB/4kB kadar boyutlarda olabilen sayfaların dışında bir de büyük sayfalar (huge page) var. Onların boyutları ise 2MB/4MB. Peki neden büyük sayfalara ihtiyaç duyuyoruz? Çünkü sayfalar büyük olduğunda sayfa tabloları da büyük oluyor ve bir süreç için verileri bellekten atma/belleğe getirme miktarı azalıyor. Eğer sayfalar 4MB'tan büyük olursa verimsiz oluyor.

Sistemin swap kullanması gerektiğinde büyük sayfalar normal boyutlu sayfalara parçalanıyor (2kB/4kB) ve o şekilde swap alanına yerleştiriliyor. Sorun şu ki; swap alanından belleğe tekrar geri getirilmek istendiğinde büyük sayfa olarak değil, normal boyutlu sayfalar olarak getiriliyorlar bu durumda eski verim sağlanamıyor, aynı zamanda burada izlenmesi gereken bir algoritmaya da karar verilmeli. Çünkü bir süreç swap kullanıyor diyelim, ve swapte olan bir veriye ihtiyaç duyuldu, sadece tek bir verinin bulunduğu sayfa 2kB, ancak bunun yerine 2MB büyük sayfa getirmek her zaman yararlı olmaz. Burada karar verilmesi gereken noktalar var.

Geçtiğimiz dönem kod ve belge okumak üzerine geçti. Bir tane de yama gönderdim. Sadece okunabilir sayfaları büyük sayfalar şeklinde birleştirmek için. Yamayı burada görebilirsiniz.

Bundan sonraki bir süre ise; "daha önce hiç okuma/yazma isteği almamış, henüz fiziksel belleğe eşlenmemiş, sadece sanalda bulunan, bir süre sonra ilk kez okuma izni aldığında fiziksel belleğe eşlenen sayfalar" var, bunlara zero page deniliyor, bunlar üzerinde çalışacağım. Eğer sayfa içerisinde veri yoksa, ilk okuma isteği aldığında çekirdek bunu sıfırlarla dolu bir sayfa olarak üretiyor. Bunları da büyük sayfalara dahil etmek üzere çalışacağım. Aynı zamanda (emin değilim), Documentation dizininde de belgelendirme yapmam gerekecek. Muhtemelen birkaç işim daha var ancak henüz ben bilmiyorum, danışmanımla işleri bitirdikçe yenisini alma şeklinde ilerliyoruz. Henüz başlamadığımız işlerden de şuna bir ara bakarız şeklinde konuşuyoruz.

Staj sürecinde birkaç Türkçe yazı daha yazacağım (yazmadı) :).

Last week, I focused on collapsing huge pages and swap in/out. Rik have asked me some questions and I countinue to reply them. When we study, he puts some hints to questions and sometimes, I back to old questions and try to understand them better. I've started to follow linux-mm list and understand what they mention about at least :). Before I've tried to follow some staging lists but see emails a lot* everyday.

My project includes very small coding according to other kernel internship projects but the topic is too complex and its about core functionality of Linux Kernel. That's really good :). Before the internship, I was thinking "I should pick a small staging driver and study for todo issues". Because with small driver, learning and to be part of Linux Kernel can be easier and I was going to do just self study, also work at another company so have not enough time. Todays, I study on almost hardest topic with a Linux Kernel developer!

I can start coding for my project but need to discuss which policy we will follow to get swapped out pages to memory. We should discuss about trade off to collapsing pages in a 2 mb page when getting them from swap to memory. Keeping in mind the functions is a bit hard for me, I re-look a lot of time to which function what does. My memory is not good :(.

I've replied following questions which are asked me by my mentor. When I reply the questions, don't need to search just from books, look to codes and understand a bit of them :).

Following call chains for Linux Kernel - 3.18

1) How is a page swapped out?

add_to_swap()
get_swap_page()
scan_swap_map()
if(PageTransHuge(page))
if (split_huge_page_to_list(page, list))
swapcache_free()
add_to_swap_cache()

Swapped out is implemented by add_swap_page(). The function allocates swap space for a page. Swap areas includes slots, and their size are equals to page frames.

It checks whether page up to date and locked then gets swap page, get_swap_page() firstly checks is there a swap page with atomic_long_read(&nr_swap_pages) if so, decrements nr_swap_pages by one. In there used highhest_bit to decrease searching time of free slot. get_swap_page() returns type swap_entry_t. add_swap_page(), cheks whether the page is huge, if so; it splits the huge page to list with
split_huge_page_to_list(page, list). The function calls __split_huge_page() -> __split_huge_page_map().

If split_huge_page_to_list() returns 0, that means splitted pages succesfully. After that it adds the page entry to swap cache. If huge page can not splitted then called swapcache_free(entry), to release retrieved swap page.

Note: I did not understand why we set page as dirty? at line:202. The page is not changed. Rik's answer: After the page is added to the swap, it is removed from the process page tables.

Then we have a page that is only in the page cache, and not mapped in any process. The dirty bit indicates the data has to be saved to swap before the page can be freed.

scan_swap_map() is used to find free swap slot.

2) Where is the information of where a page is in swap, stored in the data structures used to track the memory of a process? In other words, how does swap in find where in swap memory is?

Using swap_info_struct swap_info[] array, it includes indexes of ptes and also offsets are stored in swap_map.

3) How is a page swapped in?
do_swap_page()
pte_to_swp_entry()
lookup_swap_cache()
if (!page)
swapin_readahead()

else if (PageHWPoison(page)) {...}

It calls pte_to_swp_entry(), pte is converted to swap entry. the function converts pte to architecture independent entry format. Then checks validation of entry with non_swap_entry(). Sets DELAYACCT_PF_SWAPIN flag to spcefiy that does in swap operation. It checks whether the page in swap cache. If so, checks HWPoison flag which is for recovering memory errors; if not runs swapin_readahead() to get probably needs to swap in pages. If HWPoison is setted, it goes to label out_release and release the page and swap cache page from page cache. I've a question here: if the page was huge, but right now it is not because splitted to locate on swap. The page have no compound flag anymore? Is that correct?

Rik's answer: Swap only ever stores single pages, indeed. That means swapped
in pages will not have the compound flag set page_cache_release() calls put_page(), and the function checks using PageCompound().

Note: do_swap_page() gets two pte argument. orig_pte is created by handle_mm_fault() using ACCESS_ONCE(). The macro is for volatile issues maybe but I didn't understand why second pte is created.

Answer: do_swap_page gets both the location of the pte (a pointer) and the value of the pte at handle_mm_fault time. This way it can verify whether the pte changed from its original value, while it is doing things (like retrieving data from swap)

4) How can you discover whether a page is present in the swap cache?

_PAGE_FILE detects whether the page in swap cache, it is into page table entry.

5) Under what conditions are 4kB pages collapsed into a 2MB THP?

In handle_pte_fault(), at line: 3219 do_swap_page() is called because system want to swapped in pages, so it wants to collapse the pages but it can't. This happens, if the cases occur: pte shouldn't be on memory and swap cache. Also pte should be belong to anonymous page, not to file.

Rik's question: Can you write down the call traces that contain the collapse functions in mm/huge_memory.c?
My answer, there are two cases for it:
do_huge_pmd_anonymous_page() -> alloc_hugepage_vma() -> .. __alloc_pages_nodemask()

khugepaged_scan_pmd() -> collapse_huge_page() -> khugepaged_alloc_page() -> alloc_pages_exact_node() -> .. -> __alloc_pages_nodemask()

Rik's answer: Firt call chain, you can leave alone during your project, and is important to understand what the data structures look like.

Second call chain: This is the one you will want to work with.

6) how many pages would be good to bring into swap when trying to collapse a huge page? how many is too much?

x86 systems use 4kB page size. It can be enlarged maximum 4mb. With 2mb page, it is enlarged 512 times, thats good. 4mb page means 1024 times enlarged, it is not so good, if size is more than 4mb huge page, it will be useless.

Rik's answer: Sure, but if all of them are swapped out, it is obviously a bad
idea to bring them all into memory just to collapse them into a 2MB THP.

Out of the 512 pages, what would be a reasonable number to not be in memory already, and bring in from swap? 1? 5? 10? 20? What would the number depend on?

My answer: do_swap_page() -> swapin_readahead() -> swapin_nr_pages()

In swapin_nr_pages(), using page_cluster it detects how many pages getting swap in.

page_cluster defined in swap.c. swap_setup() function assigns its default value.

7) what about pages that are already in the swap cache? how should we count those?

For this question, I'm a bit confused, thought swap cache is a swap slot .. and replied different thing :). Firstly, I've tried to count how many pages in swap, afterwards go ahead to correct intent.

If a page in the swap cache, that means the page up to date on memory. Also the page is on the one swap slot. Swap cache size is equal to a swap slot. struct swap_info_struct, it has array swap_map.

I've read from book, swap_duplicate() increases slot usagecounter, probably it mentions about the line:

count = p->swap_map[offset];

I've seen on comment line of the function "increment its swap map count." So I guess, swap_map count can be slot usage counter. But I know swap_info struct per swap area not per swap slot. A swap area can include set of slots, so I think, swap_map count doesn't mean swap cache usage counter. Did I misunderstand? Can you inform me about swap_map?

Rik's answer: The swap_map counter means the number of users of that swap slot, which may be different from the number of users of the swap cache page.

However, this line of research is tangential to your project, and your interpretation of question 2 differs from my intent.

My answer: I've thought, when a page will be added to swap cache, it increases count of swap cache page.

add_to_swap_cache() -> __add_to_swap_cache()

In __add_to_swap_cache(), after adding radix tree, it increases the counter at line 104:
address_space->nrpages++;

nrpages specify total number of pages in address_space which stores swap cache address that means how many pages are in swap cache.

8) what about the !pte_write test in __collapse_huge_page_isolate? how can that break collapsing of a THP? is there a simple way to improve that?

pte_present() checks whether the entry have write access. It needs to create a page using page table entry with vm_normal_page(). If write access is disallowed, can't create page.

My mentor said __collapse_huge_page_isolate() is important point your project, and I've reviewed it:

__collapse_huge_page_isolate()
for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++, address += PAGE_SIZE) {
pte_none()
!pte_present() || !pte_write()
isolate_lru_page(page)

For loop:
Firstly, it checks pte_none(), this function returns true if pte is not in Ram. There is a counter none, checks with none for allowed khugepaged_max_ptes_none. If exceeds the limit then goes to label out. pte_none() returns true if none of the bits in the pte are set, that is the pte is all zeroes. pte_present() also is used for page swapped out. pte_write() checks whether the kernel write to page. Then checks is the page compound, if so does system halt. If page is not anon nor swap backed then does system halt. The page should have only one reference, this property checked by _count variable. Then it locks the page.

I'm looking isolate_lru_page(), Lru list includes inactive file/page, active file/page. But I didn't understand why there is PG_lru flag and we check it with PageLRU(page) at line 1344, vmscan.c

Rik's answer: If the page is not currently on an LRU, a different task may have that page on a pagevec, and do something with it later. That could race with trying to collapse the hugepage, and cause all sorts of problems.

In isolate_lru_page(), it checks again VM_BUG_ONE(!page_count(page), page), but already it checked for page_count() before calling isolate_lru_page() line at: 2171. Why it does this again?

Rik's answer: The page_count check is debugging code, in case somebody changes the kernel and calls isolate_lru_page on a page that got freed. It checks PageLRU once outside the lock (it can be changed by another CPU), and once inside the lock (taking a lock is expensive, and only done conditionally).

And also it checks PageLRU two times, in isolate_lru_page(). It gets lruvec then runs page_lru(), which checks page unevictable, if so sets lru as unevictable; if not, calls page_lru_base_type(), this decides base type for page - file or anon after adds active property and returns lru. Then isolate function runs get_page(), if page head or normal, increments _count. Then clear page lru and calls del_page_from_lru_list(), this function deletes lru from page using list_del()

It checks for page young conditions, then set referenced as 1. If it goes to label out, it runs release_pte_pages().

Note: That function gets a pointer to the start of a page table page, and iterates over all the ptes inside that page table page. A 2MB THP contains 512 4kB pages, which is 1<<HPAGE_PMD_ORDER pages.

Ebru A.k.a.gündüz

31 Ocak 2015 Cumartesi

Linux Kernel Ekibiyle Staj

31 Aralık 2014 Çarşamba

Collapsing Huge Pages & Swap In/Out

Hakkımda