MM

Address Space
Page Table
Reclaim

Address Space


vm_area_struct

vm_area_struct describes a contiguous range of virtual addresses that share
the same backing object and the same page fault policy.

The page fault policy includes:
    whether the area is anonymous or file-backed (vma->vm_file),
    the access permissions (vma->vm_flags, vma->vm_page_prot),
    and the vm_operations_struct callbacks, especially vma->vm_ops->fault,
    which are used to populate the pages on demand.
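
A minimal, trimmed view of the relevant vm_area_struct fields (field names are
from the kernel; most members are omitted):

struct vm_area_struct {
    unsigned long vm_start;      /* start virtual address of the area */
    unsigned long vm_end;        /* first address beyond the area */
    unsigned long vm_flags;      /* VM_READ/VM_WRITE/VM_SHARED ... */
    pgprot_t vm_page_prot;       /* protection bits used when building the ptes */
    const struct vm_operations_struct *vm_ops;  /* ->fault() etc., the fault policy */
    struct file *vm_file;        /* backing file, NULL for an anonymous mapping */
    ...
};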

Reverse Mapping

Reverse mapping answers the question: given a physical page, which page table
entries map it?  It is established when the page is first mapped:

page_add_new_anon_rmap()    // link a newly allocated anonymous page to its anon_vma
page_add_file_rmap()        // bump the mapcount of a file page; its vmas are found later via address_space->i_mmap

During reclaim, shrink_page_list() relies on the reverse mapping to unmap a
page from every page table that still references it:

shrink_page_list()
---
        /*
         * The page is mapped into the page tables of one or more
         * processes. Try to unmap it here.
         */
        if (page_mapped(page)) {
            enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
            bool was_swapbacked = PageSwapBacked(page);

            if (unlikely(PageTransHuge(page)))
                flags |= TTU_SPLIT_HUGE_PMD;

            if (!try_to_unmap(page, flags)) {
                stat->nr_unmap_fail += nr_pages;
                if (!was_swapbacked && PageSwapBacked(page))
                    stat->nr_lazyfree_fail += nr_pages;
                goto activate_locked;
            }
        }
---


// check whether a page is mapped in a given vma at a given address
page_vma_mapped_walk()
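
To see how these pieces fit together, here is a simplified call-chain sketch of
try_to_unmap() (function names follow the kernel, but this is not the literal
code): rmap_walk() finds every vma that maps the page, and
page_vma_mapped_walk() locates the pte inside each vma.

try_to_unmap()
---
    rmap_walk(page, &rwc)
      -> rmap_walk_anon()    // walk the anon_vma tree for anonymous pages
      -> rmap_walk_file()    // walk address_space->i_mmap for file pages
        -> try_to_unmap_one(page, vma, address, ...)
          -> page_vma_mapped_walk(&pvmw)   // find the pte that maps this page in this vma
          -> ptep_clear_flush()            // zap the pte and flush the TLB
          // anonymous page: install a swap entry; file page: leave the pte empty
---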

To be continued

Summary

                     View of a Task

    /---------------/     Hole  /-------------------/   Virtual address space that a task can see

     pud/pmd/pte                    pud/pmd/pte         Page tables that translate virtual addresses to physical ones

     vm_area_struct               vm_area_struct        Page fault policy behind the addresses

     address_space      (what is the equivalent for an anonymous mapping?)
                                                        Maintains the pages and the methods to get their data


When a task accesses a virtual address:
(1) The MMU looks up the TLB; on a miss, it walks the page table.
    If the corresponding page table entry is empty, a hardware exception is
    raised which triggers the software page fault handler.
(2) The page fault handler finds and checks the vm_area_struct maintained in
    task_struct->mm and decides what to do next based on the information in
    the associated vm_area_struct.
(3) If it is a file mapping, the fault callback is invoked to get the page and its data.
(4) The page is installed into the page table.
invalidate_inode_pages2_range() is a good example to understand this: it zaps
the page table entries and deletes the pages from the page cache. When the task
accesses the address again, the page fault handler rebuilds the pages and page
table entries based on the vm_area_struct.
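
A hedged sketch of steps (2) to (4) as a kernel call chain (function names
follow the kernel, but the flow is heavily simplified; do_user_addr_fault() is
the x86 name, and error handling is dropped):

do_user_addr_fault()    // arch fault handler, takes mmap_read_lock()
---
    vma = find_vma(mm, address)            // (2) locate the vm_area_struct
    ...
    handle_mm_fault(vma, address, ...)
      -> __handle_mm_fault()               // walk/allocate pgd/p4d/pud/pmd
        -> handle_pte_fault()
          -> do_fault()                    // file-backed vma
            -> vma->vm_ops->fault()        // (3) e.g. filemap_fault(), bring in the page and its data
            -> finish_fault()              // (4) install the page into the page table
---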

Page Table


split page table lock

Comment from https://www.kernel.org/doc/html/v4.18/vm/split_page_table_lock.html

    Originally, mm->page_table_lock spinlock protected all page tables of the
    mm_struct. But this approach leads to poor page fault scalability of
    multi-threaded applications due high contention on the lock. To improve
    scalability, split page table lock was introduced.

    With split page table lock we have separate per-table lock to serialize
    access to the table. At the moment we use split lock for PTE and PMD tables.
    Access to higher level tables protected by mm->page_table_lock.
The lock helper interfaces in the kernel:
static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
{
    spinlock_t *ptl = pmd_lockptr(mm, pmd);
    ---
        -> ptlock_ptr(pmd_to_page(pmd))
          -> page->ptl
    ---
    spin_lock(ptl);
    return ptl;
}

/*
 * No scalability reason to split PUD locks yet, but follow the same pattern
 * as the PMD locks to make it easier if we decide to.  The VM should not be
 * considered ready to switch to split PUD locks yet; there may be places
 * which need to be converted from page_table_lock.
 */
static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
{
    return &mm->page_table_lock;
}
Note that handle_mm_fault() works under mmap_read_lock(), while
do_mmap_pgoff() works under mmap_write_lock().
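
Putting the two levels together, a minimal sketch of the locking pattern inside
a fault handler (roughly what do_anonymous_page() does; the construction of
'entry' and all error handling are omitted): the caller already holds
mmap_read_lock(), and the split PTE-table lock serializes installation of the
individual entry.

---
    pte_t *pte;
    spinlock_t *ptl;

    /* map the pte table and take its per-table lock (page->ptl) */
    pte = pte_offset_map_lock(mm, pmd, address, &ptl);
    if (!pte_none(*pte))
        goto unlock;            /* raced with another thread, entry already installed */

    set_pte_at(mm, address, pte, entry);
unlock:
    pte_unmap_unlock(pte, ptl);
---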

Reclaim


minor and major faults

There are two kinds of page faults, minor and major (also called soft and
hard). A minor fault only needs to install a page table entry for a page that
is already in memory; a major fault has to read the data from the backing
store first.

We can observe these two kinds of page faults through /proc/<pid>/stat (the
minflt and majflt fields) or /proc/vmstat (pgfault and pgmajfault). In the
code, it works as follows:
filemap_fault()
---
    page = find_get_page(mapping, offset);
    if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
        fpin = do_async_mmap_readahead(vmf, page);
    } else if (!page) {
        /* No page in the page cache at all */
        count_vm_event(PGMAJFAULT);
        count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
        ret = VM_FAULT_MAJOR;
        fpin = do_sync_mmap_readahead(vmf);
retry_find:
        page = pagecache_get_page(mapping, offset,
                      FGP_CREAT|FGP_FOR_MMAP,
                      vmf->gfp_mask);
        ...
    }
    if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
        goto out_retry;
---

mm_account_fault()
---
    major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);

    if (major)
        current->maj_flt++;
    else
        current->min_flt++;
---

However, there is a confusing case here:

Readahead may have already allocated the page cache page while its read IO is
still ongoing. When we fault on such a page, the fault is accounted as a minor
fault, yet we still have to wait for the read IO to complete, which makes it
behave much like a major fault.
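
Where that wait happens can be seen in the filemap_fault() excerpt above; a
simplified sketch of the path:

filemap_fault()
---
    page = find_get_page(mapping, offset);    /* hit: no VM_FAULT_MAJOR, counted as a minor fault */
    ...
    if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
        goto out_retry;
      -> __lock_page_killable() / __lock_page()
         /* the page is still locked by the readahead IO, so we sleep
          * here until the IO completes and unlocks it */
---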

Shift between active and inactive list

 
            CPU0             CPU1             CPU2             CPU3
  per-cpu   pvecs.lru_add    pvecs.lru_add    pvecs.lru_add    pvecs.lru_add
                 
  per-numa                              pglist_data

                            File/Anon Active/Inactive  LRU list


                    +-------------+               +------------+
                   /    active   /               /  inactive   /
                  +-------------+               +-------------+
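
The diagram above corresponds roughly to the following path: pages are first
put on a per-cpu batch and only moved onto the per-node LRU lists when the
batch is drained. A hedged sketch based on lru_cache_add() in recent kernels
(simplified):

lru_cache_add()
---
    pvec = &get_cpu_var(lru_pvecs.lru_add);   /* the per-cpu batch in the diagram */
    get_page(page);
    if (!pagevec_add(pvec, page) || PageCompound(page))
        __pagevec_lru_add(pvec);              /* batch is full, drain it */
          -> pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL)
             /* takes the per-node lru lock and links each page into the
              * matching file/anon active/inactive list of pglist_data */
    put_cpu_var(lru_pvecs.lru_add);
---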

Reclaim and Writeback