NVDIMM

Persistent Memory Mode
Userspace flushing to persistence
nvdimm in kernel

Persistent Memory Mode


There are two memory modes: Memory Mode, in which the NVDIMMs act as large
volatile memory with DRAM as a cache in front of them, and App Direct Mode,
in which the NVDIMMs are exposed as byte-addressable persistent memory.

We can switch to App Direct mode in the following way:
ipmctl show -topology
 DimmID | MemoryType                  | Capacity    | PhysicalID| DeviceLocator 
 ================================================================================
  0x0010 | Logical Non-Volatile Device | 252.438 GiB | 0x0045    | CPU0_B0
  0x0110 | Logical Non-Volatile Device | 252.438 GiB | 0x004b    | CPU0_E0
  0x1010 | Logical Non-Volatile Device | 252.438 GiB | 0x0051    | CPU1_B0
  0x1110 | Logical Non-Volatile Device | 252.438 GiB | 0x0057    | CPU1_E0
  N/A    | DDR4                        | 64.000 GiB  | 0x0043    | CPU0_A0
  N/A    | DDR4                        | 64.000 GiB  | 0x0049    | CPU0_D0
  N/A    | DDR4                        | 64.000 GiB  | 0x004f    | CPU1_A0
  N/A    | DDR4                        | 64.000 GiB  | 0x0055    | CPU1_D0
sudo ipmctl create -goal PersistentMemoryType=AppDirect
The following configuration will be applied:
 SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size 
 ==================================================================
  0x0000   | 0x0010 | 0.000 GiB  | 252.000 GiB    | 0.000 GiB
  0x0000   | 0x0110 | 0.000 GiB  | 252.000 GiB    | 0.000 GiB
  0x0001   | 0x1010 | 0.000 GiB  | 252.000 GiB    | 0.000 GiB
  0x0001   | 0x1110 | 0.000 GiB  | 252.000 GiB    | 0.000 GiB
  Do you want to continue? [y/n] y
  Created following region configuration goal
  SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size 
 ==================================================================
  0x0000   | 0x0010 | 0.000 GiB  | 252.000 GiB    | 0.000 GiB
  0x0000   | 0x0110 | 0.000 GiB  | 252.000 GiB    | 0.000 GiB
  0x0001   | 0x1010 | 0.000 GiB  | 252.000 GiB    | 0.000 GiB
  0x0001   | 0x1110 | 0.000 GiB  | 252.000 GiB    | 0.000 GiB
A reboot is required to process new memory allocation goals.
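
After the reboot, the new AppDirect regions still need a namespace before a
/dev/pmemX block device shows up. With the ndctl tool this is typically done
as follows (fsdax is the mode required for the DAX usage discussed later):
ndctl create-namespace --mode=fsdax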

Userspace flushing to persistence


PCOMMIT has been removed; see the message of this merge commit in the Linux kernel:
commit f0c98ebc57c2d5e535bc4f9167f35650d2ba3c90
Merge: d94ba9e7d 0606263
Author: Linus Torvalds 
Date:   Thu Jul 28 17:22:07 2016 -0700

Replace pcommit with ADR / directed-flushing.
         
The pcommit instruction, which has not shipped on any product, is deprecated.
Instead, the requirement is that platforms implement either ADR, or provide one
or more flush addresses per nvdimm.
ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers to the memory
controller on a power-fail event.
                                                    
Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT)
sub-structure: "Flush Hint Address Structure". A flush hint is an mmio address that
when written and fenced assures that all previous posted writes targeting a given dimm
have been flushed to media.
There is also an official document on the Intel website:
https://software.intel.com/blogs/2016/09/12/deprecate-pcommit-instruction

Originally, the set of new instructions included one called PCOMMIT, intended
for use on platforms where flushing from the CPU cache was not sufficient to
reach the persistence domain.  On those platforms, an additional step using
PCOMMIT was required to ensure that stores had passed from memory controller
write pending queues to the DIMM, which is the persistence domain on those
platforms.

      +-------------+ \
      |    Core     |  |    
      | +----+----+ |  |    
      | | L1 | L1 | |  |    clflush
      | +----+----+ |  |    clflushopt + sfence
      +-------------+   >   clwb + sfence
             v         |    nt stores + sfence
      +-------------+  |
      |     L3      |  |
      +-------------+ /
             v
      +-------------+                        \
      |     WPQ     |  Write Pending Queue    |                                   
      +-------------+  in Memory Controller   |                                   
             v                                 > ADR (Asynchronous Dram Refresh)  
      +-----------------------------------+   |                                   
      |           Intel DIMM              |   |                                   
      +-----------------------------------+  /                                    
                                               
When the persistent memory programming model was first designed, there was a
concern that ADR was a rarely-available platform feature so the PCOMMIT
instruction was added to ensure there was a way to achieve persistence on
machines without ADR (platforms where the persistence domain is only the
DIMM itself in the picture above).  However, it turns out that platforms planning
to support the Intel DIMM are also planning to support ADR, so the need for
PCOMMIT is now gone.
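
To make the flushing options in the diagram concrete, here is a minimal
userspace sketch of the clwb + sfence sequence. The helper name
flush_to_persistence is mine, and the stack buffer merely stands in for a
pmem mapping; PMDK's pmem_persist() wraps the same idea and picks the best
available flush instruction at runtime.

/* flush.c: build with gcc -O2 -mclwb flush.c */
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHELINE 64UL

/* Write back every cache line covering [addr, addr + len), then fence so
 * the write-backs are ordered before any later store. On real pmem this
 * pushes the data into the ADR-protected write pending queue, i.e. into
 * the persistence domain. */
static void flush_to_persistence(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);

    for (; p < (uintptr_t)addr + len; p += CACHELINE)
        _mm_clwb((void *)p);
    _mm_sfence();
}

int main(void)
{
    char buf[256]; /* stand-in for a real pmem mapping */

    strcpy(buf, "hello persistence");
    flush_to_persistence(buf, strlen(buf) + 1);
    printf("flushed: %s\n", buf);
    return 0;
}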

nvdimm in kernel


There are two methods to expose an NVDIMM in the kernel, namely PMEM and BLK.

Why BLK
While PMEM provides direct byte-addressable CPU-load/store access to
NVDIMM storage, it does not provide the best system RAS (recovery,
availability, and serviceability) model.  An access to a corrupted
system physical address causes a CPU exception, while an access
to a corrupted address through a BLK aperture causes that block window
to raise an error status in a register.  The latter is more aligned with
the standard error model that host-bus-adapter attached disks present.

PMEM

PMEM mode is mainly used by DAX (Direct Access).
It is set up in the following code:

nd_pmem_probe()
  -> pmem_attach_disk()
---
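    // bio-based queue: there is no I/O scheduler, pmem_make_request()
    // services every bio synchronously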
    q = blk_alloc_queue(pmem_make_request, dev_to_node(dev));
    ...

    blk_queue_write_cache(q, true, fua); // writeback cache is forced on
    blk_queue_physical_block_size(q, PAGE_SIZE);
    blk_queue_logical_block_size(q, pmem_sector_size(ndns));
    blk_queue_max_hw_sectors(q, UINT_MAX);
    blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
    if (pmem->pfn_flags & PFN_MAP)
        blk_queue_flag_set(QUEUE_FLAG_DAX, q);

    disk = alloc_disk_node(0, nid);
    pmem->disk = disk;

    disk->fops        = &pmem_fops;
    disk->queue        = q;
    disk->flags        = GENHD_FL_EXT_DEVT;
    disk->private_data    = pmem;
    disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNCHRONOUS_IO;
    nvdimm_namespace_disk_name(ndns, disk->disk_name); // /dev/pmemX
    set_capacity(disk, (pmem->size - pmem->pfn_pad - pmem->data_offset)
            / 512);

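    // on a synchronous region, flushing the CPU cache is enough to reach
    // the persistence domain; DAXDEV_F_SYNC is also what later allows
    // MAP_SYNC mappings on this device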
    if (is_nvdimm_sync(nd_region))
        flags = DAXDEV_F_SYNC;

    // DAX is supported in PMEM mode

    dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags);

    dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
    pmem->dax_dev = dax_dev;
    gendev = disk_to_dev(disk);
    gendev->groups = pmem_attribute_groups;

    device_add_disk(dev, disk, NULL);
---
pmem_make_request() handles the bios submitted to /dev/pmemX.
The following points deserve attention:

DAX

DAX (Direct Access) allows filesystems such as ext4 and xfs to work on
NVDIMM. As noted above, the kernel exposes the NVDIMM through two methods,
and DAX builds on the PMEM one.

The biggest benefit of a DAX-aware filesystem is that applications can
access the NVDIMM directly via load/store instructions, because the
physical NVDIMM pages backing a file's data are mapped straight into the
application's address space.
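
Here is a hedged sketch of that usage. The file path is hypothetical and
assumes a filesystem mounted with -o dax on /dev/pmemX; MAP_SYNC needs
glibc >= 2.28 (older systems can get it from <linux/mman.h>), and the file
is assumed to be at least 4 KiB long.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem/log", O_RDWR); /* hypothetical DAX file */
    if (fd < 0)
        return 1;

    /*
     * MAP_SHARED_VALIDATE | MAP_SYNC only succeeds on a real DAX
     * mapping. After that, stores land on the NVDIMM pages directly and
     * flushing the CPU cache (clwb + sfence, as in the sketch above) is
     * all that is needed for durability; no fsync() is required.
     */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    strcpy(p, "stored directly on the DIMM");

    munmap(p, 4096);
    close(fd);
    return 0;
}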

ext4_dax_fault()
  -> ext4_dax_huge_fault()
    -> dax_iomap_fault()
      -> dax_iomap_pte_fault()
---
    entry = grab_mapping_entry(&xas, mapping, 0);
      -> get_unlocked_entry()
        ---
            for (;;) {
                entry = xas_find_conflict(xas);
                ...

                if (!dax_is_locked(entry))
                    return entry;

                //a hash wait queue is used here

                wq = dax_entry_waitqueue(xas, entry, &ewait.key);
                prepare_to_wait_exclusive(wq, &ewait.wait,
                              TASK_UNINTERRUPTIBLE);
                xas_unlock_irq(xas);
                xas_reset(xas);
                schedule();
                finish_wait(wq, &ewait.wait);
                xas_lock_irq(xas);
            }
        ---
    ...

    //get the block offset associated with the file offset

    error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap, &srcmap);
    ...
    switch (iomap.type) {
    case IOMAP_MAPPED:
        ...

        // get the page frame number associated with the block offset of /dev/pmem

        error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
        ...

        // insert the pfn into page cache xarray

        entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
                         0, write && !sync);
        ---
            void *new_entry = dax_make_entry(pfn, flags);


            // hand this inode over to the writeback subsystem

            if (dirty)
                __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

            ...
            xas_reset(xas);
            xas_lock_irq(xas);
            if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
                void *old;

                // there are two sizes here, pte 4K, pmd 2M
                // update the involved pages' mapping and index field

                dax_disassociate_entry(entry, mapping, false);
                dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);


                // update the new_entry in xarray

                old = dax_lock_entry(xas, new_entry);
                entry = new_entry;
            } else {
                xas_load(xas);    /* Walk the xa_state */
            }


            // mark dirty tag on the entry in xarray, writeback subsystem could
            // use it.

            if (dirty)
                xas_set_mark(xas, PAGECACHE_TAG_DIRTY);

            xas_unlock_irq(xas);
        ---
        ...
        if (write)
            ret = vmf_insert_mixed_mkwrite(vma, vaddr, pfn);
            ---
              -> __vm_insert_mixed() //mkwrite = true
                -> insert_pfn()
                ---
                    pte = get_locked_pte(mm, addr, &ptl);
                    ...
                    if (!pte_none(*pte)) {
                        if (mkwrite) {
                            ...
                            entry = pte_mkyoung(*pte);

                            entry = maybe_mkwrite(pte_mkdirty(entry), vma);

                            if (ptep_set_access_flags(vma, addr, pte, entry, 1))
                                update_mmu_cache(vma, addr, pte);
                        }
                        goto out_unlock;
                    }

                    /* Ok, finally just insert the thing.. */
                    if (pfn_t_devmap(pfn))
                        entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
                    else
                        entry = pte_mkspecial(pfn_t_pte(pfn, prot));

                    if (mkwrite) {
                        entry = pte_mkyoung(entry);

                        entry = maybe_mkwrite(pte_mkdirty(entry), vma);

                    }

                    set_pte_at(mm, addr, pte, entry);
                    update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */

                out_unlock:
                    pte_unmap_unlock(pte, ptl);
                ---
            ---
---
In ext4_dax_huge_fault(), we mainly do the following: grab the DAX entry in the mapping's xarray, ask the filesystem for the block mapping, insert the pfn into the xarray, and install the pte into the page table. For writeback, ext4 has DAX-aware address_space_operations and provides ext4_dax_writepages():
ext4_dax_writepages()
  -> tag_pages_for_writeback(mapping, xas.xa_index, end_index);
  -> dax_writeback_mapping_range()
  ---
    xas_lock_irq(&xas);
    xas_for_each_marked(&xas, entry, end_index, PAGECACHE_TAG_TOWRITE) {
        ret = dax_writeback_one(&xas, dax_dev, mapping, entry);
        ...
        if (++scanned % XA_CHECK_SCHED)
            continue;

        xas_pause(&xas);
        xas_unlock_irq(&xas);
        cond_resched();
        xas_lock_irq(&xas);
    }
    xas_unlock_irq(&xas);
  ---
dax_writeback_one()
---
    /* Lock the entry to serialize with page faults */
    dax_lock_entry(xas, entry);
    xas_clear_mark(xas, PAGECACHE_TAG_TOWRITE);
    xas_unlock_irq(xas);

    pfn = dax_to_pfn(entry);
    count = 1UL << dax_entry_order(entry);
    index = xas->xa_index & ~(count - 1);

    dax_entry_mkclean(mapping, index, pfn);
    ---
    i_mmap_lock_read(mapping);
    vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
        address = pgoff_address(index, vma);
        if (follow_pte_pmd(vma->vm_mm, address, &range,
                   &ptep, &pmdp, &ptl))
            continue;

        if (pmdp) {
            ...
        } else {
            if (pfn != pte_pfn(*ptep))
                goto unlock_pte;
            if (!pte_dirty(*ptep) && !pte_write(*ptep))
                goto unlock_pte;

            flush_cache_page(vma, address, pfn);
            pte = ptep_clear_flush(vma, address, ptep);

            // the pte will be write-protected and cleaned below, so the
            // next write faults again and re-dirties the entry

            pte = pte_wrprotect(pte);
            pte = pte_mkclean(pte);
            set_pte_at(vma->vm_mm, address, ptep, pte);
unlock_pte:
            pte_unmap_unlock(ptep, ptl);
        }
        ...
    }
    i_mmap_unlock_read(mapping);

    ---
    dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
    ---

        arch_wb_cache_pmem(addr, size);

    ---
    xas_reset(xas);
    xas_lock_irq(xas);
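    // storing the unlocked value releases the entry lock taken by
    // dax_lock_entry() above, then dax_wake_entry() wakes anyone waiting
    // in get_unlocked_entry()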
    xas_store(xas, entry);
    xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
    dax_wake_entry(xas, entry, false);


---