XFS_V2

Concepts


link count

The link count of a file is the number of hard links the file has. This
count does not include symbolic links.

Note: A symbolic link does not contribute to the link count because it is
a separate inode that merely records the path of the original file.

When a directory is created, besides the link count taken by its dentry in
the parent directory, two extra link counts are added (see the small
userspace demo after this list):
(1) one on the parent directory, for the ".." dentry in the child directory
(2) one on the child directory, for the "." dentry in the child directory
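A small userspace demo of these rules, using stat(2) to read st_nlink (the
file and directory names are arbitrary examples):

/* Observe link counts with stat(2). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        struct stat st;

        mkdir("dir", 0755);
        stat("dir", &st);
        /* 2 = the dentry in the parent + "." inside "dir" */
        printf("dir nlink: %lu\n", (unsigned long)st.st_nlink);

        mkdir("dir/sub", 0755);          /* "dir/sub/.." points back at "dir" */
        stat("dir", &st);
        printf("dir nlink: %lu\n", (unsigned long)st.st_nlink);  /* 3 */

        close(creat("dir/f", 0644));
        link("dir/f", "dir/f.hard");     /* hard link bumps the count */
        symlink("f", "dir/f.soft");      /* soft link does not */
        stat("dir/f", &st);
        printf("file nlink: %lu\n", (unsigned long)st.st_nlink); /* 2 */
        return 0;
}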

unlink

Quote from https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html#tag_16_635

When the file's link count becomes 0 and no process has the file open, the space occupied by
the file shall be freed and the file shall no longer be accessible. If one or more processes
have the file open when the last link is removed, the link shall be removed before unlink()
                                                                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
returns, but the removal of the file contents shall be postponed until all references to the
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
file are closed.
^^^^^^^^^^^^^^
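Before looking at how a filesystem implements this, a small userspace demo
of the semantic itself (the file name is just an example):

/* The name disappears at unlink(), the data survives until the last
 * file descriptor is closed. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        char buf[16] = {0};
        struct stat st;
        int fd = open("tmpfile", O_CREAT | O_RDWR | O_TRUNC, 0644);

        write(fd, "still here", 10);
        unlink("tmpfile");               /* link count drops to 0 */

        fstat(fd, &st);
        printf("nlink after unlink: %lu\n", (unsigned long)st.st_nlink); /* 0 */

        pread(fd, buf, sizeof(buf) - 1, 0);
        printf("read back: %s\n", buf);  /* data is still accessible */

        close(fd);                       /* now the space can be freed */
        return 0;
}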
How does xfs implement this ?

asynchronous journal

Both xfs and ext4/jbd2 employ an asynchronous journal, which can batch
journal IO to improve performance. An asynchronous journal may lose data
(even a newly created inode can be lost) but still keeps the metadata
consistent. Let's look at how xfs and ext4/jbd2 implement the asynchronous
journal.

In contrast with an asynchronous journal, a synchronous journal guarantees
that the metadata is on disk before the syscall returns, but this hurts
performance. Batching synchronous journal IO mitigates this, and jbd2
supports it:
jbd2_journal_stop()
---

        /*
         * Implement synchronous transaction batching.  If the handle
         * was synchronous, don't force a commit immediately.  Let's
         * yield and let another thread piggyback onto this
         * transaction.  Keep doing that while new threads continue to
         * arrive.  It doesn't cost much - we're about to run a commit
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
         * and sleep on IO anyway.  Speeds up many-threaded, many-dir
           ^^^^^^^^^^^^^^^^^^^^^^
           (The IO here means the journal IO if we committed immediately.)
         * operations by 30x or more...
         *
         * We try and optimize the sleep time against what the
         * underlying disk can do, instead of having a static sleep
         * time.    This is useful for the case where our storage is so
         * fast that it is more optimal to go ahead and force a flush
         * and wait for the transaction to be committed than it is to
         * wait for an arbitrary amount of time for new writers to
         * join the transaction.    We achieve this by measuring how
         * long it takes to commit a transaction, and compare it with
         * how long this transaction has been running, and if run time
         * < commit time then we sleep for the delta and commit.    This
         * greatly helps super fast disks that would see slowdowns as
         * more threads started doing fsyncs.
         *
         * But don't do this if this process was the most recent one
         * to perform a synchronous write.    We do this to detect the
         * case where a single process is doing a stream of sync
         * writes.    No point in waiting for joiners in that case.
         *
         * Setting max_batch_time to 0 disables this completely.
         */

        pid = current->pid;
        if (handle->h_sync && journal->j_last_sync_writer != pid &&
                journal->j_max_batch_time) {
                u64 commit_time, trans_time;

                journal->j_last_sync_writer = pid;

                read_lock(&journal->j_state_lock);
                commit_time = journal->j_average_commit_time;
                read_unlock(&journal->j_state_lock);

                trans_time = ktime_to_ns(ktime_sub(ktime_get(),
                                                     transaction->t_start_time));

                commit_time = max_t(u64, commit_time,
                                        1000*journal->j_min_batch_time);
                commit_time = min_t(u64, commit_time,
                                        1000*journal->j_max_batch_time);

                if (trans_time < commit_time) {
                        ktime_t expires = ktime_add_ns(ktime_get(),
                                                             commit_time);
                        set_current_state(TASK_UNINTERRUPTIBLE);
                        schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
                }
        }

        if (handle->h_sync)
                transaction->t_synchronous_commit = 1;

        if (handle->h_sync ||
                time_after_eq(jiffies, transaction->t_expires)) {
                jbd2_log_start_commit(journal, tid);
                /*
                 * Special case: JBD2_SYNC synchronous updates require us
                 * to wait for the commit to complete.
                 */
                if (handle->h_sync && !(current->flags & PF_MEMALLOC))
                        wait_for_commit = 1;
        }

        stop_this_handle(handle);

        if (wait_for_commit)
                err = jbd2_log_wait_commit(journal, tid);
---
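The workload this batching targets looks roughly like the sketch below: many
threads each doing small synchronous writes, so that several fsync() calls
can share one journal commit (file names, write counts and the thread count
are arbitrary). On ext4, j_min_batch_time/j_max_batch_time correspond to the
min_batch_time=/max_batch_time= mount options.

/* Many-threaded sync-write workload, the case the comment above talks
 * about. Purely illustrative. */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8

static void *worker(void *arg)
{
        char name[32];
        int i, fd;

        snprintf(name, sizeof(name), "f.%ld", (long)(intptr_t)arg);
        fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        for (i = 0; i < 100; i++) {
                pwrite(fd, "x", 1, i);
                fsync(fd);              /* each fsync wants a journal commit */
        }
        close(fd);
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        long i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}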

file permissions

When a process performs an operation on a file, the Linux kernel performs
the permission checks in the following order:

unix style permissions

The classical Unix-style permission check picks the owner, group, or
"other" bits of the file mode according to the process's credentials and
tests the requested access mask against them. The code of the permission
check is as follows (a userspace sketch of the same logic is shown after
the kernel excerpt),
do_last()
    -> may_open()
        -> inode_permission()
            -> sb_permission()
                 ---

                        /* Nobody gets write access to a read-only fs. */

                        if (sb_rdonly(sb) && (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
                                return -EROFS;
                 ---
            -> do_inode_permission()
                -> generic_permission()
                    -> acl_permission_check()


acl_permission_check()
---
        if (likely(uid_eq(current_fsuid(), inode->i_uid)))
                mode >>= 6;
        else {
                if (IS_POSIXACL(inode) && (mode & S_IRWXG)) {
                        int error = check_acl(inode, mask);
                        if (error != -EAGAIN)
                                return error;
                }

                if (in_group_p(inode->i_gid))
                        mode >>= 3;
        }

        if ((mask & ~mode & (MAY_READ | MAY_WRITE | MAY_EXEC)) == 0)
                return 0;

        return -EACCES;
---
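For illustration, a hedged userspace sketch of the same mode-bit selection.
mode_permits() is a made-up helper; POSIX ACLs and supplementary groups
(in_group_p()) are ignored, and the real kernel check uses fsuid/fsgid
rather than getuid()/getgid().

/* Pick owner/group/other bits like acl_permission_check() does and test
 * the requested access mask against them. */
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static bool mode_permits(const struct stat *st, uid_t uid, gid_t gid,
                         mode_t mask)
{
        mode_t mode = st->st_mode;

        if (uid == st->st_uid)
                mode >>= 6;             /* use the "owner" bits */
        else if (gid == st->st_gid)
                mode >>= 3;             /* use the "group" bits */
                                        /* otherwise: the "other" bits */
        /* S_IROTH/S_IWOTH/S_IXOTH are 4/2/1, same values as MAY_READ/WRITE/EXEC */
        return (mask & ~mode & (S_IROTH | S_IWOTH | S_IXOTH)) == 0;
}

int main(int argc, char **argv)
{
        struct stat st;

        if (argc < 2 || stat(argv[1], &st))
                return 1;
        printf("read allowed: %d\n",
               mode_permits(&st, getuid(), getgid(), S_IROTH));
        return 0;
}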

infrastructure


filesystem statistics

/proc/PID/io

When we read /proc/PID/io, we get output like the following,

rchar: 2814484553
wchar: 2326047
syscr: 689899
syscw: 7616
read_bytes: 507052032
write_bytes: 1105920
cancelled_write_bytes: 393216
What do these fields stand for ? Let's look at the code.
proc_tid_io_accounting/proc_tgid_io_accounting()
    -> do_io_accounting()
    ---
        struct task_io_accounting acct = task->ioac;
        ...
        if (whole && lock_task_sighand(task, &flags)) {
                struct task_struct *t = task;

                task_io_accounting_add(&acct, &task->signal->ioac);
                while_each_thread(task, t)
                        task_io_accounting_add(&acct, &t->ioac);

                unlock_task_sighand(task, &flags);
        }
        seq_printf(m,
                     "rchar: %llu\n"
                     "wchar: %llu\n"
                     "syscr: %llu\n"
                     "syscw: %llu\n"
                     "read_bytes: %llu\n"
                     "write_bytes: %llu\n"
                     "cancelled_write_bytes: %llu\n",
                     (unsigned long long)acct.rchar,
                     (unsigned long long)acct.wchar,
                     (unsigned long long)acct.syscr,
                     (unsigned long long)acct.syscw,
                     (unsigned long long)acct.read_bytes,
                     (unsigned long long)acct.write_bytes,
                     (unsigned long long)acct.cancelled_write_bytes);
    ---
Let's look into the fields one by one.
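Roughly, rchar/wchar count the bytes passed through read()/write()-style
syscalls, syscr/syscw count those syscalls, read_bytes/write_bytes count
what really went to the storage layer, and cancelled_write_bytes accounts
for dirty data whose writeback was cancelled (e.g. by truncating the file).
The same counters can be dumped for the current task with a trivial reader:

/* Print the I/O accounting of the current process, i.e. the fields the
 * seq_printf() above produces. */
#include <stdio.h>

int main(void)
{
        char line[128];
        FILE *f = fopen("/proc/self/io", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);
        return 0;
}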

iov_iter

What's iov_iter ?

struct iov_iter {
        unsigned int type;      /* type: IOVEC / KVEC / BVEC ...               */
        size_t iov_offset;      /* current offset, like bio.bi_iter.bi_sector  */
        size_t count;           /* residual count, like bio.bi_iter.bi_size    */
        union {
                const struct iovec *iov;
                const struct kvec *kvec;
                const struct bio_vec *bvec;
                ...
        };

        /*
         * Compare with bio_vec: iov and kvec play a similar role.
         * bio.bi_io_vec always points to the head of the bvec array, while
         * iov/kvec/bvec here always point to the *current* vector, like
         * bio.bi_io_vec[bio->bi_iter.bi_idx].
         */

        ...
};

Just refer to iterate_and_advance() to see how the iov_iter works.

Even though we compare the bio with the iov_iter, they are not the same:
a bio carries a mapping between the block device and buffers in memory,
while an iov_iter only describes the buffers, which makes it more similar
to an sglist.
Next, we will figure out what IOVEC/KVEC/BVEC mean and who uses them. The
modules that receive an iov_iter can use the following interfaces to
handle it,

size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
                         struct iov_iter *i)
{
        if (unlikely(!page_copy_sane(page, offset, bytes)))
                return 0;

        if (i->type & (ITER_BVEC|ITER_KVEC)) {

                void *kaddr = kmap_atomic(page);
                size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
                kunmap_atomic(kaddr);
                return wanted;
        } else if (unlikely(iov_iter_is_discard(i)))
                return bytes;
        else if (likely(!iov_iter_is_pipe(i)))
                return copy_page_to_iter_iovec(page, offset, bytes, i);
        else
                return copy_page_to_iter_pipe(page, offset, bytes, i);
}


size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
{
        const char *from = addr;
        if (unlikely(iov_iter_is_pipe(i)))
                return copy_pipe_to_iter(addr, bytes, i);
        if (iter_is_iovec(i))
                might_fault();
        iterate_and_advance(i, bytes, v,

                //IOVEC, userland buffers, the 'I' in iterate_and_advance

                copyout(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len),

                //BVEC, pages, the 'B' in iterate_and_advance

                memcpy_to_page(v.bv_page, v.bv_offset,
                                     (from += v.bv_len) - v.bv_len, v.bv_len),

                //KVEC, kernel buffers, the 'K' in iterate_and_advance

                memcpy(v.iov_base, (from += v.iov_len) - v.iov_len, v.iov_len)
        )

        return bytes;
}
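The ITER_IOVEC case above is just the kernel-side view of what userspace
hands in through readv()/writev(); for example:

/* A scattered userspace buffer described by an array of struct iovec,
 * written out with one writev(); the kernel wraps it in an ITER_IOVEC
 * iov_iter. */
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
        char a[] = "hello ";
        char b[] = "iov_iter\n";
        struct iovec iov[2] = {
                { .iov_base = a, .iov_len = strlen(a) },
                { .iov_base = b, .iov_len = strlen(b) },
        };

        writev(STDOUT_FILENO, iov, 2);
        return 0;
}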

page fault


inode lock and page fault

We cannot take the inode lock (inode.i_rwsem) in page fault path.
For example,

ext4_dax_vm_ops.ext4_dax_huge_fault()
---
        if (write) {
                ...
        } else {
                down_read(&EXT4_I(inode)->i_mmap_sem);
        }
        result = dax_iomap_fault(vmf, pe_size, &pfn, &error, &ext4_iomap_ops);
        if (write) {
                ...
        } else {
                up_read(&EXT4_I(inode)->i_mmap_sem);
        }
---

xfs_file_vm_ops.xfs_filemap_fault()
---

        // XFS_MMAPLOCK_SHARED -> xfs_inode_t.i_mmaplock

        xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
        if (IS_DAX(inode)) {
                pfn_t pfn;

                ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
                                (write_fault && !vmf->cow_page) ?
                                 &xfs_direct_write_iomap_ops :
                                 &xfs_read_iomap_ops);
                if (ret & VM_FAULT_NEEDDSYNC)
                        ret = dax_finish_sync_fault(vmf, pe_size, pfn);
        } else {
                if (write_fault)
                        ret = iomap_page_mkwrite(vmf,
                                        &xfs_buffered_write_iomap_ops);
                else
                        ret = filemap_fault(vmf);
        }
        xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
---
Why? Refer to https://lwn.net/Articles/548098:
The problem has to do with the order in which locks are acquired.
For normal filesystem operations, the filesystem code will obtain any locks it requires;
the memory management subsystem will then grab mmap_sem should that be required — to bring a read or write buffer into RAM, for example. 
When a page fault happens, though, the lock ordering is reversed:
first mmap_sem is taken, then the filesystem code is invoked to bring the needed page into RAM.
Let's look at the code that could show us the lock ordering,

mmap_sem in page fault

In the previous section, we saw that the whole page fault path runs under
mm->mmap_sem. In the page fault path we may do a lot of things, including
read or write IO, and this can cause problems, such as the one described
below,

Holding the mmap_sem while doing IO is problematic because it can cause
system-wide priority inversions.  Consider some large company that does a
lot of web traffic.  This large company has load balancing logic in it's
core web server, cause some engineer thought this was a brilliant plan.
This load balancing logic gets statistics from /proc about the system,
which trip over processes mmap_sem for various reasons.  Now the web
server application is in a protected cgroup, but these other processes may
not be, and if they are being throttled while their mmap_sem is held we'll
stall, and cause this nice death spiral.
Upstream solves this problem as follows,
static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
                            struct file *fpin)
{
    int flags = vmf->flags;

    if (fpin)
        return fpin;

    if (fault_flag_allow_retry_first(flags) && !(flags & FAULT_FLAG_RETRY_NOWAIT)) {

        fpin = get_file(vmf->vma->vm_file);
        mmap_read_unlock(vmf->vma->vm_mm);

    }
    return fpin;
}

(1) Get a reference on the file while still holding the mmap_sem
(2) Unlock the mmap_sem
To understand this solution, we need to know what the mmap_sem protects. The common case is that the vma is backed by a regular file.
vm_mmap_pgoff()
  -> mmap_write_lock_killable()
  -> do_mmap()
    -> mmap_region()
    ---
        vma = vm_area_alloc(mm);
        ...
        if (file) {
            vma->vm_file = get_file(file);
            error = call_mmap(file, vma);
            ...
        } 
    ---

A reference is grabbed here to keep it alive during the mapping

__do_munmap()
  -> unmap_region()
    -> unmap_vmas()
       ---
        for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
            unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);

            // Page table is removed here

       ---
  -> remove_vma_list()
      -> remove_vma()
      ---
        if (vma->vm_file)

            fput(vma->vm_file)

      ---
After maybe_unlock_mmap_for_io() pins the file and unlocks the mmap_sem,
it is guaranteed that the file will not go away in the meantime, and the
fault then looks like a normal IO through a read/write syscall. When the
IO is completed, we retry the page fault path; by that time the data is
already there.
do_user_addr_fault()
---
    if (unlikely(!mmap_read_trylock(mm))) {
        ...
retry:
        mmap_read_lock(mm);
    } else {
        might_sleep();
    }

    vma = find_vma(mm, address);
    ...
    fault = handle_mm_fault(vma, address, flags, regs);
    ...

    /*
     * If we need to retry the mmap_lock has already been released,
     * and if there is a fatal signal pending there is no guarantee
     * that we made any progress. Handle this case first.
     */

    if (unlikely((fault & VM_FAULT_RETRY) &&
             (flags & FAULT_FLAG_ALLOW_RETRY))) {
        flags |= FAULT_FLAG_TRIED;
        goto retry;
    }
---

steps of do_shared_fault

do_shared_fault() handles a write page fault on a shared, file-backed mapping.

The main steps of do_shared_fault() are as follows,

sendfile

sendfile is one of the feature to implement zero-copy


        read + write                                  sendfile

              |uuuuuu|  user buffer
               ^    \
  -------------|-----\--------------------------------------- user/kernel
              /       v
       |ssssss|      |dddddd|              |ssssss|  ->  |dddddd|

do_sendfile()
    -> splice_direct_to_actor()
        -> do_splice_to() //read from src into pipe
            -> f_op->splice_read()
        -> do_splice_from() //write from pipe into dest
            -> f_op->splice_write()
Let's see how the pipe is used here.
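From userland, by contrast, the whole dance is hidden behind a single
syscall; a minimal sketch (the paths are placeholders):

/* Copy "src" to "dst" without the data ever passing through a user
 * buffer. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        struct stat st;
        int in = open("src", O_RDONLY);
        int out = open("dst", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        off_t off = 0;

        if (in < 0 || out < 0 || fstat(in, &st))
                return 1;
        while (off < st.st_size)
                if (sendfile(out, in, &off, st.st_size - off) <= 0)
                        break;
        close(in);
        close(out);
        return 0;
}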

open_and_close

When you open or close a file, you take or drop references on many objects.
Please refer to the following comment from https://lwn.net/Articles/494158/

The management of file structure reference counts is done with calls to fget() and fput().
A file structure, which represents an open file, can depend on a lot of resources:
as long as a file is open, the kernel must maintain its underlying storage device,
filesystem, network protocol information, security-related information, user-space
notification requests, and more. An fget() call will ensure that all of those resources
stay around as long as they are needed. A call to fput(), instead, might result in the
destruction of any of those resources. For example, closing the last file on an unmounted
filesystem will cause that filesystem to truly go away.
Next, let's try to figure out what they are.
We can get some hints from __fput().
__fput() is deferred to task-work context, which is invoked before the
task returns to userland. Refer to the following comment to get the point:

----
What all this means is that a call to fput() can do a lot of work, and that
work may require the acquisition of a number of locks. The problem is that
fput() can also be called from any number of contexts; there are a few hundred
fput() and fput_light() calls in the kernel. Each of those call sites has its
own locking environment and, usually, no knowledge of what code in other
subsystems may be called from fput(). So the potential for problems like
locking-order violations is real.
----

void fput_many(struct file *file, unsigned int refs)
{
        if (atomic_long_sub_and_test(refs, &file->f_count)) {
                struct task_struct *task = current;

                if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
                        init_task_work(&file->f_u.fu_rcuhead, ____fput);
                        if (!task_work_add(task, &file->f_u.fu_rcuhead, TWA_RESUME))
                                return;
                }

                if (llist_add(&file->f_u.fu_llist, &delayed_fput_list))
                        schedule_delayed_work(&delayed_fput_work, 1);
        }
}

exit_to_user_mode_loop()
    -> tracehook_notify_resume()
        -> task_work_run()

Let's focus on the dput() and mntput() in __fput(). The mnt and dentry associated with the file are assigned in vfs_open()
path_openat()
    -> link_path_walk()
    -> open_last_lookups()
    -> do_open()
        -> vfs_open()
int vfs_open(const struct path *path, struct file *file)
{

        file->f_path = *path;

        return do_dentry_open(file, d_backing_inode(path->dentry), NULL);
}

mount

Let's first look at what will be done during a mount

writenotify

When a file is mapped, how does the kernel know the file is written ?
The answer is writenotify

synchronous page fault

What's synchronous page fault for ?

https://lwn.net/Articles/731706/
Normally, filesystems are in control of all I/O to the underlying storage
                                                                                ^^^^^^^
media; they use that control to ensure that the filesystem structure is
consistent at all times. Even when a file on a traditional storage device
is mapped into a process's virtual address space, the filesystem manages
the writeback of modified pages from the page cache to persistent storage.
Directly mapped persistent memory bypasses the filesystem, though, leading
to a number of potential problems including inconsistent metadata or data
corruption and loss if the filesystem relocates the file being modified.

For example, in ext4,
ext4_dax_huge_fault()
    -> dax_iomap_fault()
        -> dax_iomap_pte_fault()
            -> ext4_iomap_begin()
                -> ext4_iomap_alloc()
                    -> ext4_journal_start()
                    -> ext4_map_blocks()
                    -> ext4_journal_stop()

                    At this moment, the metadata has not been on persistent storage.

            -> vmf_insert_mixed_mkwrite() //insert page fault
     
After the page fault returns, the metadata may not yet have reached
persistent storage, due to the asynchronous journal employed by both xfs
and ext4. The userland process (usually a storage engine) can then write
data through the mapping. But if the system crashes before the fs metadata
gets flushed to the persistent storage medium, the written data can be
lost.

MAP_SYNC is a mmap flag that tells the kernel to do the equivalent of an
fsync before the page fault returns.

ext4_dax_huge_fault()
    -> dax_iomap_fault()
        -> dax_iomap_pte_fault()
            -> ext4_iomap_begin()
---

                /*
                 * If we are doing synchronous page fault and inode needs fsync,
                 * we can insert PTE into page tables only after that happens.
                 * Skip insertion for now and return the pfn so that caller can
                 * insert it after fsync is done.
                 */

                if (sync) {
                        *pfnp = pfn;
                        ret = VM_FAULT_NEEDDSYNC | major;
                        goto finish_iomap;
                }
---

ext4_dax_huge_fault()
--
        result = dax_iomap_fault(vmf, pe_size, &pfn, &error, &ext4_iomap_ops);
        if (write) {
                ext4_journal_stop(handle);

                /* Handling synchronous page fault? */
                if (result & VM_FAULT_NEEDDSYNC)
                        result = dax_finish_sync_fault(vmf, pe_size, pfn);

                            //guarantee the metadata gets flushed to the persistent storage medium

                            -> vfs_fsync_range()
                                -> file_write_and_wait_range()
                                -> ext4_fsync_journal()
                                -> blkdev_issue_flush()

                                // insert page table

                            -> dax_insert_pfn_mkwrite()
                up_read(&EXT4_I(inode)->i_mmap_sem);
                sb_end_pagefault(sb);
        } 
--
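From userspace, synchronous page faults are requested at mmap() time. A
hedged sketch: it assumes a DAX-capable filesystem (e.g. ext4/xfs mounted
with -o dax), the path is only an example, and on older glibc the
MAP_SYNC/MAP_SHARED_VALIDATE definitions may have to come from
<linux/mman.h>.

/* Ask for synchronous page faults on a DAX mapping. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/mnt/pmem/data", O_RDWR);
        char *p;

        if (fd < 0)
                return 1;
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");         /* e.g. EOPNOTSUPP without DAX */
                return 1;
        }
        /*
         * Once the fault triggered by this store has returned, the metadata
         * needed to find the block is already persistent; only the CPU cache
         * still has to be flushed to persist the data itself.
         */
        memcpy(p, "hello", 5);
        munmap(p, 4096);
        close(fd);
        return 0;
}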

backing dev info (bdi)

bdi, namely backing_dev_info, represents the underlying device (maybe a
virtual device for nfs or fuse). It mainly works for writeback, but it also
carries ra_pages, which is used for readahead.

writeback and cgroup blkio

In the past, only direct IO and buffered read IO could be controlled by
cgroup blkio. The reason is shown in the following diagram.


     a cgroup              VFS LAYER              root cgroup
      Task A                                   writeback kworker
        |                                             |
        | dirtying pages                              | flush dirty inodes
        v                                             v
  [D] [D] [D] [D]        ------->                writepages
  [D] [D] [D] [D]                                     |
  --------------------------------------------------- | --------
                         BLOCK LAYER                  |
                                                      v
                                                 submit_bio
                                                      | blkio qos/scheduler
                                                      v
                                              [R] [R] [R] [R]
  ---------------------------------------------------------------
                         SCSI/NVME/....
The writeback kworker does the real IO, but it belongs to the root cgroup.
For more details, refer to "Writeback and control groups".
Let's look into the code. Now we know that the bios queued by the writeback
kworker carry the information of the cgroup that dirtied the pages.
We still need to throttle the task that is dirtying pages so that the
dirtying rate matches the cleaning rate, and this is done in balance_dirty_pages()
balance_dirty_pages()
---
        struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
        struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? &mdtc_stor : NULL;

        //mdtc is for memcg dirty throttle control

                if (mdtc) {
                        unsigned long filepages, headroom, writeback;

                        /*
                         * If @wb belongs to !root memcg, repeat the same
                         * basic calculations for the memcg domain.
                         */
                        mem_cgroup_wb_stats(wb, &filepages, &headroom,
                                                &mdtc->dirty, &writeback);
                        mdtc->dirty += writeback;
                        mdtc_calc_avail(mdtc, filepages, headroom);

                        domain_dirty_limits(mdtc);
                }
                ...
                if (mdtc) {
                        ...
                        dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) &&
                                ((mdtc->dirty > mdtc->thresh) || strictlimit);

                        wb_position_ratio(mdtc);
                        if (mdtc->pos_ratio < gdtc->pos_ratio)
                                sdtc = mdtc;
                }
---

domain_dirty_limits() calculates the per-memcg limits
---
        /* gdtc is !NULL iff @dtc is for memcg domain */
        if (gdtc) {
                unsigned long global_avail = gdtc->avail;

                /*
                 * The byte settings can't be applied directly to memcg
                 * domains.    Convert them to ratios by scaling against
                 * globally available memory.    As the ratios are in
                 * per-PAGE_SIZE, they can be obtained by dividing bytes by
                 * number of pages.
                 */

                if (bytes)
                        ratio = min(DIV_ROUND_UP(bytes, global_avail),
                                        PAGE_SIZE);
                if (bg_bytes)
                        bg_ratio = min(DIV_ROUND_UP(bg_bytes, global_avail),
                                             PAGE_SIZE);
                bytes = bg_bytes = 0;
        }
---

Whether this feature is enabled is controlled by wb->memcg_css->parent,
namely, whether this bdi_writeback belongs to a non-root memcg.
Refer to MDTC_INIT() and mdtc_valid().

And which bdi_writeback a task is charged to is decided in
balance_dirty_pages_ratelimited()
---
        if (inode_cgwb_enabled(inode))
                wb = wb_get_create_current(bdi, GFP_KERNEL);
        if (!wb)
                wb = &bdi->wb;
---

partial write

Partial Write

/------------------------------------------------------------------/
__block_write_begin_int()
---
        for(bh = head, block_start = 0; bh != head || !block_start;
                block++, block_start=block_end, bh = bh->b_this_page) {
                block_end = block_start + blocksize;
                ...
                if (!buffer_uptodate(bh) && !buffer_delay(bh) &&
                        !buffer_unwritten(bh) &&
                         (block_start < from || block_end > to)) {
                        ll_rw_block(REQ_OP_READ, 0, 1, &bh);
                        *wait_bh++=bh;
                }
        }
---
Pages will be read in before being written.

__GFP_NOFS when alloc pagecache

When page cache is allocated, what gfp_mask is used?

inode_init_always()
---
        mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
---

#define GFP_HIGHUSER_MOVABLE    (GFP_HIGHUSER | __GFP_MOVABLE)
        GFP_HIGHUSER          = GFP_USER | __GFP_HIGHMEM
        GFP_USER              = __GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL

IOW, page cache allocation uses GFP_KERNEL-like flags (__GFP_FS is set).
However, xfs does not:
xfs_setup_inode()
---

        /*
         * Ensure all page cache allocations are done from GFP_NOFS context to
         * prevent direct reclaim recursion back into the filesystem and blowing
         * stacks or deadlocking.
         */

        gfp_mask = mapping_gfp_mask(inode->i_mapping);
        mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
---
And this can influence the action in shrink_page_list,
                if (PageWriteback(page)) {
                        /* Case 1 above */
                        if (current_is_kswapd() &&
                                PageReclaim(page) &&
                                test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
                                stat->nr_immediate++;
                                goto activate_locked;

                        /* Case 2 above */
                        } else if (writeback_throttling_sane(sc) ||
                                !PageReclaim(page) || !may_enter_fs) {
                                SetPageReclaim(page);
                                stat->nr_writeback++;
                                goto activate_locked;

                        /* Case 3 above */
                        } else {
                                unlock_page(page);
                                wait_on_page_writeback(page);
                                /* then go back and try same page again */
                                list_add_tail(&page->lru, page_list);
                                continue;
                        }
                }

This means that with buffered IO xfs can run into OOM more easily, while
ext4 does not; instead an ext4 writer may end up blocked in reclaim waiting
for writeback, as the stack below shows.
[<0>] io_schedule+0x12/0x40
[<0>] wait_on_page_bit+0x137/0x230
[<0>] shrink_page_list+0xbab/0xc50
[<0>] shrink_inactive_list+0x254/0x580
[<0>] shrink_node_memcg+0x1fa/0x720
[<0>] shrink_node+0xce/0x440
[<0>] do_try_to_free_pages+0xc3/0x360
[<0>] try_to_free_mem_cgroup_pages+0xf9/0x210
[<0>] try_charge+0x192/0x780
[<0>] mem_cgroup_try_charge+0x8b/0x1a0
[<0>] __add_to_page_cache_locked+0x64/0x240
[<0>] add_to_page_cache_lru+0x64/0x100
[<0>] pagecache_get_page+0xf2/0x2c0
[<0>] grab_cache_page_write_begin+0x1f/0x40
[<0>] ext4_da_write_begin+0xce/0x470 [ext4]
[<0>] generic_perform_write+0xf4/0x1b0
[<0>] __generic_file_write_iter+0xfe/0x1c0
[<0>] ext4_file_write_iter+0xc6/0x3b0 [ext4]
[<0>] new_sync_write+0x124/0x170
[<0>] vfs_write+0xa5/0x1a0
[<0>] ksys_write+0x4f/0xb0
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca
[<0>] 0xffffffffffffffff

umount

When we mount a filesystem, there are mainly 3 things to be set up,

A umount mainly does the following things,

buffer_head


bh vs its friends

bh is a legacy of older kernels. Nowadays, it mainly works with the following components,

jbd2


transaction


Order

The order here means the ext4's order mode

In data=ordered mode, ext4 only officially journals metadata, but it logically
groups metadata information related to data changes with the data blocks into a
single unit called a transaction.  When it's time to write the new metadata
out to disk, the associated data blocks are written first.
What's this mode for ?
After a system crash or a power failure, files that were written right before the
system went down could contain previously written data or other garbage.
With Ordered Mode, journal commits are deferred until the data blocks get written
to disk. This guarantees that any blocks in the file will be data written by the
application, avoiding a possibility of a security breach, which is especially
problematic on a multi-user system.
Ext3 Ordered VS Writeback mode
The stale-data exposure is not really an issue nowadays, because blocks can
be kept in the unwritten state (which reads back as zeroes) until the data
reaches the disk. Both xfs and ext4 support this,
ext4_map_blocks()
---
     ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
        if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
            map->m_pblk = ext4_es_pblock(&es) +
                    map->m_lblk - es.es_lblk;
            map->m_flags |= ext4_es_is_written(&es) ?
                    EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
        ...
}
---

xfs_vm_readpage()
  -> iomap_readpage()
    -> iomap_apply()
      -> iomap_readpage_actor()
      ---
    if (iomap_block_needs_zeroing(inode, iomap, pos)) {
        zero_user(page, poff, plen);
        iomap_set_range_uptodate(page, poff, plen);
        goto done;
    }
      ---
But why does ext4 still use 'ordered mode' as the default?
In addition, there are some applications which depend on data=ordered
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
to automatically force data blocks to be written to disk soon after the
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
file is written. Using Writeback Mode extends the time from when a file
^^^^^^^^^^^^^^^
is written to when it is pushed out to disk to 30 seconds. This can be
surprising for some users; however, it should be noted that such problems
can still be an issue with Ordered Mode (although they are much rarer).
Again, a careful application or library should always use fsync() at points
where the application is at a stable commit point. 
Let's look at how it is implemented
ext4_jbd2_inode_add_write() <- ext4_map_blocks()
                            <- __ext4_journalled_writepage()
                            <- __ext4_block_zero_page_range()
                            <- ext4_page_mkwrite()
                            <- move_extent_per_page()

We only need to order the page writing when the extent mapping changes


And ocfs2 is no different,
ocfs2_jbd2_inode_add_write()  <- ocfs2_write_failure()
                              <- ocfs2_map_and_dirty_page()
                              <- ocfs2_write_end_nolock()
                              <- ocfs2_zero_start_ordered_transaction()


Recovery


Revoke


BH

inode


reference of inode

There are two kinds of references on the inode,

In this section, we will look into the second kind of reference above. The main user of the inode is the dcache.
           fd         user
------------------------------
           file       kernel
            v         file->f_path.dentry
          dcache
            v         dentry->d_inode
          inode

lazytime mode

The commit is

commit 0ae45f63d4ef8d8eeec49c7d8b44a1775fff13e8
Author: Theodore Ts'o 
Date:     Mon Feb 2 00:37:00 2015 -0500

        vfs: add support for a lazytime mount option
        
        Add a new mount option which enables a new "lazytime" mode.    This mode
        causes atime, mtime, and ctime updates to only be made to the
        in-memory version of the inode.    The on-disk times will only get
        updated when (a) if the inode needs to be updated for some non-time
        related change, (b) if userspace calls fsync(), syncfs() or sync(), or
        (c) just before an undeleted inode is evicted from memory.
        
        This is OK according to POSIX because there are no guarantees after a
        crash unless userspace explicitly requests via a fsync(2) call.
        
        For workloads which feature a large number of random write to a
        preallocated file, the lazytime mount option significantly reduces
        writes to the inode table.    The repeated 4k writes to a single block
        will result in undesirable stress on flash devices and SMR disk
        drives.    Even on conventional HDD's, the repeated writes to the inode
        table block will trigger Adjacent Track Interference (ATI) remediation
        latencies, which very negatively impact long tail latencies --- which
        is a very big deal for web serving tiers (for example).
        
        Google-Bug-Id: 18297052
        
        Signed-off-by: Theodore Ts'o 
        Signed-off-by: Al Viro 

First, let's look at the 3 time fields in POSIX and the code paths that modify them. The time fields of an inode are updated in generic_update_time(),
generic_update_time()
    -> __mark_inode_dirty()
        -> sb->s_op->dirty_inode()
             ext4_dirty_inode()
         ---

                // If flags only contain I_DIRTY_TIME, just return and leave the
                // modified times fields in memory

                if (flags == I_DIRTY_TIME)
                        return;
                handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
                ...
                ext4_mark_inode_dirty(handle, inode);
                    -> __ext4_mark_inode_dirty()
                        -> ext4_mark_iloc_dirty()
                            -> ext4_do_update_inode() // synchronize the inode in-memory to the one on-disk

                ext4_journal_stop(handle);
         ---
If we look into __mark_inode_dirty(),
we can see that I_DIRTY_TIME is set in inode->i_state and the inode is inserted into wb->b_dirty_time.
There is no modification of the in-memory copy of the on-disk inode buffer, so what does the wb flush?
Well, a normal case that flushes out the time fields of the inode is iput():
void iput(struct inode *inode)
{
        if (!inode)
                return;
        BUG_ON(inode->i_state & I_CLEAR);
retry:
        if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
                if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
                        atomic_inc(&inode->i_count);
                        spin_unlock(&inode->i_lock);
                        trace_writeback_lazytime_iput(inode);

                        mark_inode_dirty_sync(inode);

                        goto retry;
                }
                iput_final(inode);
        }
}

inode cache

Where is the inode cache ?

        inode_hashtable =
                alloc_large_system_hash("Inode-cache",
                                        sizeof(struct hlist_head),
                                        ihash_entries,
                                        14,
                                        HASH_ZERO,
                                        &i_hash_shift,
                                        &i_hash_mask,
                                        0
The inode is inserted into this hash table so that it can be looked up quickly. Some filesystems, such as xfs, maintain an inode cache of their own.
xfs_lookup()
    -> xfs_dir_lookup()
    -> xfs_iget()
    ---

        // See it ? this is a fs private inode cache and it is more scalable

        pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ino));
        agino = XFS_INO_TO_AGINO(mp, ino);

again:
        error = 0;
        rcu_read_lock();
        ip = radix_tree_lookup(&pag->pag_ici_root, agino);

        if (ip) {
                error = xfs_iget_cache_hit(pag, ip, ino, flags, lock_flags);
                ...
        } else {
                rcu_read_unlock();
                ...

                error = xfs_iget_cache_miss(mp, pag, tp, ino, &ip,
                                                        flags, lock_flags);
                                                        
                ...
                ---

                // Allocate xfs_inode where a vfs inode is embedded in

                ip = xfs_inode_alloc(mp, ino);
                ...
                } else {

                        // get the xfs_buf for this inode

                        error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &dip, &bp, 0);

                        // fill the inode with data on disk

                        error = xfs_inode_from_disk(ip, dip);
                        xfs_trans_brelse(tp, bp);
                }
                ...

                iflags = XFS_INEW;

                if (flags & XFS_IGET_DONTCACHE)
                        d_mark_dontcache(VFS_I(ip));
                xfs_iflags_set(ip, iflags);

                /* insert the new inode */
                spin_lock(&pag->pag_ici_lock);
                error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
                spin_unlock(&pag->pag_ici_lock);
                ---
        }
        xfs_perag_put(pag);
    ---
Another thing that needs to be discussed is how the inode cache is reclaimed.

Dcache
shrink of dcache

lookup

dcache and metadata

The main points where the dcache interacts with the filesystem:

link_path_walk()
    -> walk_component()
        -> lookup_slow()
            -> inode_lock_shared()
            -> __lookup_slow()
                -> d_alloc_parallel() //allocate a dentry structure
                -> inode->i_op->lookup()

                    look up the dentry in the fs metadata;
                    if found, d_add() adds the dentry into the dcache hashtable

open_last_lookups()
    -> lookup_open()
        -> d_lookup() //do the lookup under the rename read seqlock
        -> dir_inode->i_op->create() //do create if needed

add to dcache hash

When a dentry is in the dcache hash table, namely dentry_hashtable,
__d_lookup() can find it and d_unhashed() returns false.
When is the dentry inserted into the dentry_hashtable?

Quota

There are two basic types of disk quotas

The disk quotas can be set for each of 3 roles (user, group, project), and there are two limits (soft and hard),

Where to store quota data


The quota data lives in memory at runtime and needs to be persisted on disk.
The on-disk quota entry format is,

struct v2r1_disk_dqblk {
    __le32 dqb_id;        /* id this quota applies to */
    __le32 dqb_pad;
    __le64 dqb_ihardlimit;    /* absolute limit on allocated inodes */
    __le64 dqb_isoftlimit;    /* preferred inode limit */
    __le64 dqb_curinodes;    /* current # allocated inodes */
    __le64 dqb_bhardlimit;    /* absolute limit on disk space (in QUOTABLOCK_SIZE) */
    __le64 dqb_bsoftlimit;    /* preferred limit on disk space (in QUOTABLOCK_SIZE) */
    __le64 dqb_curspace;    /* current space occupied (in bytes) */
    __le64 dqb_btime;    /* time limit for excessive disk use */
    __le64 dqb_itime;    /* time limit for excessive inode use */
};
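A hedged userspace sketch of reading this data back for the current uid via
quotactl(2); the block device path is an example and user quota must be
enabled on that filesystem:

/* Query the user quota of the calling uid. */
#include <stdio.h>
#include <sys/quota.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        struct dqblk q;

        if (quotactl(QCMD(Q_GETQUOTA, USRQUOTA), "/dev/sda1",
                     getuid(), (caddr_t)&q)) {
                perror("quotactl");
                return 1;
        }
        printf("space used: %llu bytes, soft: %llu, hard: %llu (blocks)\n",
               (unsigned long long)q.dqb_curspace,
               (unsigned long long)q.dqb_bsoftlimit,
               (unsigned long long)q.dqb_bhardlimit);
        printf("inodes used: %llu, soft: %llu, hard: %llu\n",
               (unsigned long long)q.dqb_curinodes,
               (unsigned long long)q.dqb_isoftlimit,
               (unsigned long long)q.dqb_ihardlimit);
        return 0;
}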

How to use quota data


Ext4


Layout

Block Group


For the filesystem block size == 4K, The 1st block group is as following,

|rrrrssss....|ggggggggg|rgrgrgrgrg|dd|ii|tttt|....|

rrrr: reserved 1024 bytes for x86 boot sectors
ssss: in the same fs block with rrrr, filesystem super block
gggg: block group descriptors in a contiguous space
rgrg: reserved group descriptors table for filesystem resize,
dd  : data block bitmap
ii  : inode block bitmap
tttt: inode table
      All of the positions above can be obtained from the block group descriptor
During .fill_super, the buffer_heads of all the block group descriptors are loaded into memory,
ext4_fill_super()
---

    /* Pre-read the descriptors into the buffer cache */

    for (i = 0; i < db_count; i++) {
        block = descriptor_loc(sb, logical_sb_block, i);
        ext4_sb_breadahead_unmovable(sb, block);
    }

    for (i = 0; i < db_count; i++) {
        struct buffer_head *bh;

        block = descriptor_loc(sb, logical_sb_block, i);
        bh = ext4_sb_bread_unmovable(sb, block);
        rcu_read_lock();
        rcu_dereference(sbi->s_group_desc)[i] = bh;
        rcu_read_unlock();
    }
    sbi->s_gdb_count = db_count;
---

descriptor_loc() is used to calculate the position of block group descriptors,
---

    //When meta block group is not enabled, it is very simple

    if (!ext4_has_feature_meta_bg(sb) || nr < first_meta_bg)
        return logical_sb_block + nr + 1;
---
ext4 use block group descriptor to get the position of data bitmap, inode bitmap
and inode table. For example,
__ext4_get_inode_loc()
---
    iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
    gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);

    /*
     * Figure out the offset within the block group inode table
     */

    inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
    inode_offset = ((ino - 1) %    EXT4_INODES_PER_GROUP(sb));
    block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
    iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);

    bh = sb_getblk(sb, block);
---
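A quick worked example of the same arithmetic. The constants are assumptions
(4KiB block, 256-byte inode, 8192 inodes per group); the real values come
from the superblock and the group descriptor:

/* Locate an inode on disk the way __ext4_get_inode_loc() does. */
#include <stdio.h>

int main(void)
{
        unsigned long ino = 12345;
        unsigned long inodes_per_group = 8192;      /* s_inodes_per_group */
        unsigned long inode_size = 256;             /* s_inode_size       */
        unsigned long block_size = 4096;
        unsigned long inodes_per_block = block_size / inode_size;       /* 16 */

        unsigned long group = (ino - 1) / inodes_per_group;             /* 1 */
        unsigned long index = (ino - 1) % inodes_per_group;
        unsigned long block_in_table = index / inodes_per_block;        /* 259 */
        unsigned long offset = (index % inodes_per_block) * inode_size; /* 2048 */

        printf("ino %lu -> group %lu, inode-table block %lu, offset %lu\n",
               ino, group, block_in_table, offset);
        return 0;
}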

Extent

Format


There are mainly 3 data structure here,
struct ext4_extent_header {
    __le16    eh_magic;    /* probably will support different formats */
    __le16    eh_entries;    /* number of valid entries */
    __le16    eh_max;            /* capacity of store in entries*/
    __le16    eh_depth;        /* has tree real underlying blocks? */
    __le32    eh_generation;    /* generation of the tree */
};

struct ext4_extent_idx {
    __le32    ei_block;    /* index covers logical blocks from 'block' */
    __le32    ei_leaf_lo;    /* pointer to the physical block of the nex level. leaf or next index could be there */
    __le16    ei_leaf_hi;    /* high 16 bits of physical block */
    __u16    ei_unused;
};

struct ext4_extent {
    __le32    ee_block;    /* first logical block extent covers */
    __le16    ee_len;        /* number of blocks covered by extent */
    __le16    ee_start_hi;    /* high 16 bits of physical block */
    __le32    ee_start_lo;    /* low 32 bits of physical block */
};

Note:
 - ext4_extent.ee_block is 32 bits, so the max file size of ext4 is 2^32 * 2^12 (4KiB fs block) = 2^44 (16TiB)
 - ee_start_hi/ee_start_lo form 48 bits, so the max block device size of ext4 is 2^48 * 2^12 = 2^60 (1EiB)
 - the highest bit of ee_len indicates the unwritten state, refer to ext4_ext_mark_unwritten()
 - if eh_depth > 0, ext4_extent_idx entries follow the header, otherwise ext4_extent entries
 - the max depth of the extent btree is 5 (EXT4_MAX_EXTENT_DEPTH; with 1KiB blocks, 4 * ((2^10 - 12)/12)^n >= 2^32 needs n = 5)


              inode->i_block  (inode inlined)
                   |            
                [hiiii] eh.eh_max == 4
                 /   \
          [hiiii]     [hiiii]   (4K - sizeof(eh)) / sizeof(ei) entries
           /  \          \
    [heeee]   [heeee]    [heeee]

ei and ee in the nodes are sorted by logical block
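The resulting mapping can be inspected from userspace with the FIEMAP ioctl;
a minimal dumper (error handling trimmed, room for 32 extents only):

/* Dump a file's extents via FIEMAP, the userspace view of the extent
 * tree described above. */
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct fiemap *fm;
        unsigned int i, n = 32;
        int fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;

        fm = calloc(1, sizeof(*fm) + n * sizeof(struct fiemap_extent));
        fm->fm_start = 0;
        fm->fm_length = ~0ULL;
        fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush delalloc first */
        fm->fm_extent_count = n;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0)
                for (i = 0; i < fm->fm_mapped_extents; i++)
                        printf("logical %llu physical %llu len %llu flags 0x%x\n",
                               (unsigned long long)fm->fm_extents[i].fe_logical,
                               (unsigned long long)fm->fm_extents[i].fe_physical,
                               (unsigned long long)fm->fm_extents[i].fe_length,
                               fm->fm_extents[i].fe_flags);
        free(fm);
        close(fd);
        return 0;
}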

Split


                [hiii-]
                 /  \
          [hiii-]    [hiiii]
           /   \         \         ^
    [hee-e]    [heee-]   [heeee]   | walk from down to up

ext4_ext_create_new_leaf()
---

    /* walk up to the tree and look for where need new index entry */

    curp = path + depth;
    while (i > 0 && !EXT_HAS_FREE_INDEX(curp)) {
        i--;
        curp--;
    }
---

                [hiiii]______
                 /  \        \
          [hiii-]    [hiii] [hii]
           /   \         \     \ 
    [hee-e]    [heee-]   [hee] [hee]

Split entries from the full ee and ei's and then build a new sub tree above.
Then insert it to tree.

Note: the new node [hee] is always 1st entry of new [hii]

Let's look at the code,
ext4_ext_split()
---

    //leaf node has been setup in similar way

    while (k--) {
        oldblock = newblock;
        newblock = ablocks[--a];
        bh = sb_getblk(inode->i_sb, newblock);
        lock_buffer(bh);
        err = ext4_journal_get_create_access(handle, bh);
        neh = ext_block_hdr(bh);

        //setup the new node's header

        neh->eh_entries = cpu_to_le16(1);
        neh->eh_magic = EXT4_EXT_MAGIC;
        neh->eh_max = cpu_to_le16(ext4_ext_space_block_idx(inode, 0));
        neh->eh_depth = cpu_to_le16(depth - i);
        neh->eh_generation = 0;
        fidx = EXT_FIRST_INDEX(neh);

        //insert the 1st one, namely, the newly split child
        //logical block is same across all new ei and ee

        fidx->ei_block = border;
        ext4_idx_store_pblock(fidx, oldblock);
        /* start copy indexes */
        m = EXT_MAX_INDEX(path[i].p_hdr) - path[i].p_idx++;
        if (m) {
            memmove(++fidx, path[i].p_idx,    sizeof(struct ext4_extent_idx) * m);
            le16_add_cpu(&neh->eh_entries, m);
        }

        //zero out unused area in the extent block
        //Is it necessary ? eh->eh_entries should be enough

        ext_size = sizeof(struct ext4_extent_header) + (sizeof(struct ext4_extent) * le16_to_cpu(neh->eh_entries));
        memset(bh->b_data + ext_size, 0, inode->i_sb->s_blocksize - ext_size);
        ext4_extent_block_csum_set(inode, neh);
        set_buffer_uptodate(bh);
        unlock_buffer(bh);

        err = ext4_handle_dirty_metadata(handle, inode, bh);

        //correct old index
        //See it ? only update eh->eh_entries here

        if (m) {
            err = ext4_ext_get_access(handle, inode, path + i);
            le16_add_cpu(&path[i].p_hdr->eh_entries, -m);
            err = ext4_ext_dirty(handle, inode, path + i);
        }
        i--;
    }

    /* insert new index */
    err = ext4_ext_insert_index(handle, inode, path + at,
                    le32_to_cpu(border), newblock);

---

Grow level


All of the ei and ee are full, including the top one that's in inode->i_block
                [hiiii]                     copy inode->i_block into new block          
                 /  \                                     [hiiii]         
          [hiiii]    [hiiii]         --\                   /  \           
           /   \         \           --/              [hiiii]    [hiiii]           
    [heeee]    [heeee]   [heeee]                     /   \         \             
                                              [heeee]    [heeee]   [heeee]

          inode->i_block [hi]                           ||
                          |                             \/
                        [hiiii]      link the new block to inode->i_block
                         /  \           
                    [hiiii]  [hiiii]           
                   /   \         \             
            [heeee]    [heeee]   [heeee]
Look at the code,
ext4_ext_grow_indepth()
---
    bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
    lock_buffer(bh);
    err = ext4_journal_get_create_access(handle, bh);
    ext_size = sizeof(EXT4_I(inode)->i_data);

    /* move top-level index/leaf into new block */

    memmove(bh->b_data, EXT4_I(inode)->i_data, ext_size);
    /* zero out unused area in the extent block */
    memset(bh->b_data + ext_size, 0, inode->i_sb->s_blocksize - ext_size);

    /* set size of new block */
    neh = ext_block_hdr(bh);

    //if ext_depth(inode) is zero, inode->i_block contain ee.
    //then new node carries the initial 4 ee's.

    if (ext_depth(inode))
        neh->eh_max = cpu_to_le16(ext4_ext_space_block_idx(inode, 0));
    else
        neh->eh_max = cpu_to_le16(ext4_ext_space_block(inode, 0));
    neh->eh_magic = EXT4_EXT_MAGIC;
    err = ext4_handle_dirty_metadata(handle, inode, bh);

    /* Update top-level index: num,max,pointer */

    neh = ext_inode_hdr(inode);
    neh->eh_entries = cpu_to_le16(1);
    ext4_idx_store_pblock(EXT_FIRST_INDEX(neh), newblock);
    if (neh->eh_depth == 0) {

        /* Root extent block becomes index block */

        neh->eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode, 0));
        EXT_FIRST_INDEX(neh)->ei_block = EXT_FIRST_EXTENT(neh)->ee_block;
    }
    le16_add_cpu(&neh->eh_depth, 1);
    err = ext4_mark_inode_dirty(handle, inode);
---

Remove


When do we need to remove blocks?
ext4_ext_truncate()
ext4_collapse_range()
ext4_punch_hole()

                          [hii]                             3
                        [hii] [hii]                         2
         [hii]     [hii]          [hii]      [hii]          1
    [hee] [hee]  [hee]  [hee]  [hee]  [hee]  [hee] [hee]    0
                                             <--- rm direction
                           ||
                           \/

                          [hii]                             3
                        [hii] [hii]                         2
         [hii]     [hi]               [h]      [h]          1
    [hee] [hee]  [hee]                                      0

                           ||
                           \/

                          [hii]                             3
                        [hii] [h]                          2
         [hii]     [hi]                                     1
    [hee] [hee]  [hee]                                      0


                           ||
                           \/

                          [hi]                              3
                        [hii]                               2
         [hii]     [hi]                                     1
    [hee] [hee]  [hee]                                      0


The code is very complicated, so we don't show all of it here, but just some
important part.
(1) continue or next level
                          [hii]                             3
                        [hii] [hii]                         2
         [hii]     [hii]          [hii]      [hii]          1
    [hee] [hee]  [hee]  [hee]  [hee]  [hee]  [hee] [hee]    0

ext4_ext_rm_leaf() can only handle one leaf node, namely one [hee]; when do
we move on to the next one?
ext4_ext_remove_space()
---
    while (i >= 0 && err == 0) {
        if (i == depth) {
            /* this is leaf block */
            err = ext4_ext_rm_leaf(handle, inode, path,
                           &partial, start, end);
            /* root level has p_bh == NULL, brelse() eats this */
            brelse(path[i].p_bh);
            path[i].p_bh = NULL;
            i--;
            continue;
        }

        if (!path[i].p_idx) {
        } else {
            /* we were already here, see at next index */
            path[i].p_idx--;
        }


        // determine based on path->p_idx and EXT_FIRST_INDEX(path->p_hdr)
                   first  now            first 
                     v    v                v  
                    heeeeeeeeeeee         heeeeeeeeeeee
                                          ^
                                          now
        // nothing more to remove here    the previous one needs to be removed

        if (ext4_ext_more_to_rm(path + i)) {
            struct buffer_head *bh;

            //Read in the ee or ei block

            bh = read_extent_tree_block(inode, ext4_idx_pblock(path[i].p_idx), depth - i - 1, EXT4_EX_NOCACHE);
            i++;
        } else {
        }
    }
---
   
(2) revoke ?
When freeing a metadata block, we need to revoke it in the journal,

ext4_remove_blocks()
---

    flags = get_default_free_blocks_flags(inode);
    ---
        if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode) ||
            ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE))
            return EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET;
    ---
    ...
    flags |= EXT4_FREE_BLOCKS_NOFREE_FIRST_CLUSTER;
    ext4_free_blocks(handle, inode, NULL, pblk, num, flags);
---
ext4_ext_rm_idx()
---
    ext4_free_blocks(handle, inode, NULL, leaf, 1,
             EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET);
---

Delayed Allocation

Delayed allocation, where the filesystem defers the allocation of blocks on disk for data
written by applications until that data is actually written to disk. The idea is to wait
until the application finishes its operations on the file, then allocate the actual number
of data blocks needed on the disk at once. This optimization limits unneeded operations
related to short-lived, small files, batches large writes, and helps ensure that data space
is allocated contiguously.

Xattr


In Linux, the ext2, ext3, ext4, JFS, Squashfs, Yaffs2, ReiserFS, Reiser4, XFS, Btrfs,
OrangeFS, Lustre, OCFS2 1.6, ZFS, and F2FS[11] filesystems support extended attributes 
(abbreviated xattr) when enabled in the kernel configuration. Any regular file or directory 
may have extended attributes consisting of a name and associated data. The name must be a 
null-terminated string prefixed by a namespace identifier and a dot character. Currently, 
four namespaces exist: user, trusted, security and system. The user namespace has no 
restrictions with regard to naming or contents. The system namespace is primarily used by 
the kernel for access control lists. The security namespace is used by SELinux, for example.

The Linux kernel allows extended attribute to have names of up to 255 bytes and values of 
up to 64 KiB,[15] as do XFS and ReiserFS, but ext2/3/4 and btrfs impose much smaller limits, 
requiring all the attributes (names and values) of one file to fit in one "filesystem block" 
(usually 4 KiB). Per POSIX.1e, the names are required to start with one of 
security, system, trusted, and user plus a period. This defines the four namespaces of 
extended attributes.[16]

Extended attributes can be accessed and modified using the getfattr and setfattr commands 
from the attr package on most distributions.[17] The APIs are called getxattr and setxattr. 
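
As a quick, self-contained illustration of these APIs (the file, "testfile" or argv[1],
is assumed to already exist on a filesystem with user xattrs enabled):

/* minimal user.* xattr demo; build with: cc -o xattr-demo xattr-demo.c */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "testfile";
        const char *name = "user.comment";      /* namespace "user." + name */
        const char *value = "hello xattr";
        char buf[256];
        ssize_t len;

        /* store the attribute (create or replace) */
        if (setxattr(path, name, value, strlen(value), 0) < 0) {
                perror("setxattr");
                return 1;
        }

        /* read it back */
        len = getxattr(path, name, buf, sizeof(buf) - 1);
        if (len < 0) {
                perror("getxattr");
                return 1;
        }
        buf[len] = '\0';
        printf("%s = %s\n", name, buf);
        return 0;
}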

Space for xattr

Where is the xattr of ext4 stored ? There are 2 places,
(1) inside the inode itself, in the extra space after the fixed inode fields
    (when the on-disk inode is larger than 128 bytes)
(2) in a separate xattr block pointed to by the inode's i_file_acl field

Dentry


DX

First let's look at some basic data structure

The directory file can be formatted as follows,
R: dx_root    N: dx_node    E: dx_entry     D: ext4_dir_entry_2

       {D/D/D/} {D/D/D/D} {D/D/D/D} {D/D/D/D}

or

                         {R E/E/E/E/E }        \
                      ...  /       \ ...       |
                  {N E/E/E/E}     {N E/E/E/E}   > index
                     /  \             ...      |
           {N E/E/E/E}  {N E/E/E/E}            /
                / \        ... 
       {D/D/D/D}  {D/D/D/D}                    }  dentry


(1) The E in index is sorted by hash
(2) The D in dentry is a list linked by rec_len

Obviously, the index tree is introduced to improve lookup performance.
The index entries are searched by hash; a minimal sketch of that lookup is shown below.
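
The sketch below is standalone and simplified (struct dx_entry_s and dx_find_child()
are made-up names for illustration; the real lookup is dx_probe() in fs/ext4/namei.c,
where entry 0 reuses the hash field to store count/limit):

/* simplified stand-in for the on-disk dx entries */
struct dx_entry_s {
        unsigned int hash;      /* lowest hash covered by this subtree */
        unsigned int block;     /* logical block of the child node/leaf */
};

/*
 * Find the child block whose hash range contains 'hash': entries are sorted
 * by hash, and entry[i] covers hashes in [entry[i].hash, entry[i+1].hash).
 */
static unsigned int dx_find_child(struct dx_entry_s *entries, int count,
                                  unsigned int hash)
{
        int low = 1, high = count - 1, mid;

        /* binary search for the last entry with entry->hash <= hash */
        while (low <= high) {
                mid = (low + high) / 2;
                if (entries[mid].hash > hash)
                        high = mid - 1;
                else
                        low = mid + 1;
        }
        /* loop exits with high == low - 1, which is that last entry */
        return entries[high].block;
}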

Inline

inline -> block -> index

inode inline            block             index
   +---+                +----+            +----+
   |   |            -\  |dddd|    -\      |    |
   |ddd| } dentry   -/  |dddd|    -/      +----+
   +---+           [1]  +----+   [2]      /    \
                                       +----+ +----+
                                       |dddd| |dddd|
                                       +----+ +----+
[1] inline to block
ext4_try_add_inline_entry()
  -> ext4_convert_inline_data_nolock()
  ---
    map.m_lblk = 0;
    map.m_len = 1;
    map.m_flags = 0;
    error = ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_CREATE);
    ...
    data_bh = sb_getblk(inode->i_sb, map.m_pblk);
    ...
    lock_buffer(data_bh);
    error = ext4_journal_get_create_access(handle, data_bh);
    memset(data_bh->b_data, 0, inode->i_sb->s_blocksize);

    if (!S_ISDIR(inode->i_mode)) {

        // Note that both the inode and the data block need to be
        // recorded in the log, because they must be completed in
        // one transaction

        memcpy(data_bh->b_data, buf, inline_size);
        set_buffer_uptodate(data_bh);
        error = ext4_handle_dirty_metadata(handle,
                           inode, data_bh);
    } else {

        // There are 3 steps here
        // (1) create '.' and '..' dentry
        // (2) copy the inlined dentries
        // (3) set the tail dentry of which rec_len covers the whole block

        error = ext4_finish_convert_inline_dir(handle, inode, data_bh,
                               buf, inline_size);
    }
    unlock_buffer(data_bh);
  ---
[2] block to index
ext4_add_entry()
  -> make_indexed_dir()
  ---
    /* The 0th block becomes the root, move the dirents out */
    fde = &root->dotdot;
    de = (struct ext4_dir_entry_2 *)((char *)fde +
        ext4_rec_len_from_disk(fde->rec_len, blocksize));
    len = ((char *) root) + (blocksize - csum_size) - (char *) de;
    /* Allocate new block for the 0th block's dirents */
    bh2 = ext4_append(handle, dir, &block);
    ext4_set_inode_flag(dir, EXT4_INODE_INDEX);
    data2 = bh2->b_data;

    memcpy(data2, de, len);
    memset(de, 0, len); /* wipe old data */
//Get the last dentry and make its rec_len cover the whole block
    de = (struct ext4_dir_entry_2 *) data2;
    top = data2 + len;
    while ((char *)(de2 = ext4_next_entry(de, blocksize)) < top)
        de = de2;
    de->rec_len = ext4_rec_len_to_disk(data2 + (blocksize - csum_size) -
                       (char *) de, blocksize);
    /* Initialize the root; the dot dirents already exist */
    de = (struct ext4_dir_entry_2 *) (&root->dotdot);
    de->rec_len = ext4_rec_len_to_disk(
            blocksize - ext4_dir_rec_len(2, NULL), blocksize);
    memset (&root->info, 0, sizeof(root->info));
    root->info.info_length = sizeof(root->info);
    if (ext4_hash_in_dirent(dir))
        root->info.hash_version = DX_HASH_SIPHASH;
    else
        root->info.hash_version =
                EXT4_SB(dir->i_sb)->s_def_hash_version;

    entries = root->entries;
// The first leaf block's offset is 1
    dx_set_block(entries, 1);
    dx_set_count(entries, 1);
    dx_set_limit(entries, dx_root_limit(dir, sizeof(root->info)));
    ...
    retval = ext4_handle_dirty_dx_node(handle, dir, frame->bh);
    retval = ext4_handle_dirty_dirblock(handle, dir, bh2);
// Split the leaf block as it is filled up
    de = do_split(handle,dir, &bh2, frame, &fname->hinfo);

    retval = add_dirent_to_buf(handle, fname, dir, inode, de, bh2);
  ---

Blocks


Buddy

Ext4 mballoc maintains an in-memory buddy system for every block group.

Every block group needs two blocks (4K each), whose format is as follows,

                      O0                   O1        O2   O3 
         |--------------------------|-------------|------|--|..|
         \____________ ____________/ \____________ ____________/
                      v                           v
                  block 0                      block 1
                  bd_bitmap                    bd_buddy

O0: bitmap for order 0
O1: bitmap for order 1
    ...
A set bit at position bit in the bitmap of a given order (whose bitmap starts at
offset off) describes the free extent ((bit - off) << order, 1 << order),
i.e. (start, length) in blocks of the group.

The region for every order is defined by ext4_sb_info.s_mb_offsets/s_mb_maxs,
which are calculated in ext4_mb_init().

mb_find_buddy() helps to find the bitmap address and range (max)
---
    /* at order 0 we see each particular block */
    if (order == 0) {
        *max = 1 << (e4b->bd_blkbits + 3);
        return e4b->bd_bitmap;
    }

    bb = e4b->bd_buddy + EXT4_SB(e4b->bd_sb)->s_mb_offsets[order];
    *max = EXT4_SB(e4b->bd_sb)->s_mb_maxs[order];
---

How does this buddy system work ?
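
At the core is the bit-to-extent mapping above; here is a minimal standalone sketch
(mb_extent and mb_bit_to_extent() are made-up names for illustration, and 'bit' is
taken relative to the start of that order's bitmap, i.e. the offset is already
subtracted):

struct mb_extent {
        unsigned int start;     /* first block, relative to the group */
        unsigned int len;       /* number of blocks */
};

static struct mb_extent mb_bit_to_extent(unsigned int bit, int order)
{
        struct mb_extent ex;

        /* one set bit at order N stands for 2^N physically contiguous
         * free blocks */
        ex.start = bit << order;
        ex.len   = 1U << order;
        return ex;
}

For example, at order 3 a set bit at position 5 describes the free extent starting
at block 40 of the group with length 8.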

Pre Allocation

What is ext4 preallocation for ?

The main goal is to provide better allocation for small and large files.
This is achieved by using a different strategy for different allocation
requests. For a relatively small allocation request, Ext4 tries to allocate
from a per-CPU locality group, which is shared by all allocations under
the same CPU, in order to try to keep these small files close to each other.
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A large allocation request is allocated from per-file preallocation first.
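
The small-vs-large decision is made in ext4_mb_group_or_file(); roughly (heavily
simplified, thresholds and details vary across kernel versions):

ext4_mb_group_or_file()
---
    /* only data allocations use the preallocation heuristics */
    if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
        return;
    ...
    /* don't use locality-group allocation for large files */
    if (size > sbi->s_mb_stream_request) {
        ac->ac_flags |= EXT4_MB_STREAM_ALLOC;
        return;
    }

    /* small request: allocate from the per-CPU locality group */
    ac->ac_flags |= EXT4_MB_GROUP_ALLOC;
---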

Tmpfs


Xfs Log


Let's take some relatively simple examples,

Framework

iclog ring and state machine

              .---.
            /       \
           |  iclog  |      (ring of in-core log buffers)
            \       /
              '---'
                |
                v                                  -->
|------------------------------------------------|
                 physical log space

The iclog ring has two parameters,
l_iclog_bufs            8
l_iclog_size            32K (max 256K)

iclog is allocated in xlog_alloc_log()
---
        for (i = 0; i < log->l_iclog_bufs; i++) {
                int align_mask = xfs_buftarg_dma_alignment(mp->m_logdev_targp);
                size_t bvec_size = howmany(log->l_iclog_size, PAGE_SIZE) *
                                sizeof(struct bio_vec);

                iclog = kmem_zalloc(sizeof(*iclog) + bvec_size, KM_MAYFAIL);
                ...        
                iclog->ic_data = kmem_alloc_io(log->l_iclog_size, align_mask,
                                                KM_MAYFAIL | KM_ZERO);
                ...
                iclog->ic_size = log->l_iclog_size - log->l_iclog_hsize;
                iclog->ic_state = XLOG_STATE_ACTIVE;
                ...

                // link all of the iclog in a ring

                iclogp = &iclog->ic_next;
        }
---

The iclogs in the ring are used one by one, and each iclog cycles through the
following states (roughly ACTIVE -> WANT_SYNC -> SYNCING -> DONE_SYNC -> CALLBACK,
then back to ACTIVE once the callbacks have run),

XLOG_STATE_ACTIVE Be able to receive log
    - xlog_alloc_log()
    - xlog_state_do_callback()
            -> xlog_state_clean_iclog()
                -> xlog_state_activate_iclogs()
                    -> xlog_state_activate_iclog()

XLOG_STATE_WANT_SYNC
    - xlog_state_switch_iclogs()
    
XLOG_STATE_SYNCING
    - xlog_state_release_iclog()
            -> __xlog_state_release_iclog()
            -> xlog_sync()

XLOG_STATE_DONE_SYNC
    - xlog_ioend_work()
        -> xlog_state_done_syncing()

XLOG_STATE_CALLBACK
    - xlog_state_done_syncing()
        -> xlog_state_do_callback()

Relog

Quote from https://www.infradead.org/~mchehab/kernel_docs/filesystems/xfs-delayed-logging-design.html

XFS allows multiple separate modifications to a single object to be carried in the log at any given time.
This allows the log to avoid needing to flush each change to disk before recording a new change to the object.
XFS does this via a method called “re-logging”. Conceptually, this is quite simple - all it requires is
that any new change to the object is recorded with a new copy of all the existing changes in the new
transaction that is written to the log.
Regarding the quoted design note, we can refer to the implementation of jbd2.
jbd2 can be regarded as a block-level WAL: before flushing dirty blocks to their real
position on disk, jbd2 records them in the journal first. In the common case, jbd2
shadows the original buffer_head to do the journal IO.

jbd2_journal_write_metadata_buffer()
---
        spin_lock(&jh_in->b_state_lock);
repeat:
        if (jh_in->b_frozen_data) {
                ...
        } else {
                new_page = jh2bh(jh_in)->b_page;
                new_offset = offset_in_page(jh2bh(jh_in)->b_data);
        }
        ...
        set_bh_page(new_bh, new_page, new_offset);
        new_bh->b_size = bh_in->b_size;
        new_bh->b_bdev = journal->j_dev;
        new_bh->b_blocknr = blocknr;
        new_bh->b_private = bh_in;
        set_buffer_mapped(new_bh);
        set_buffer_dirty(new_bh);

        *bh_out = new_bh;

        spin_lock(&journal->j_list_lock);
        __jbd2_journal_file_buffer(jh_in, transaction, BJ_Shadow);
        spin_unlock(&journal->j_list_lock);

        set_buffer_shadow(bh_in);

        spin_unlock(&jh_in->b_state_lock); //Protect this journal buffer head
---

do_get_write_access()
---
        spin_lock(&jh->b_state_lock);
        ...
        if (buffer_shadow(bh)) {
                spin_unlock(&jh->b_state_lock);
                wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
                goto repeat;
        }
        ...
---

A very important thing to note is that the modification has already been made in the
buffer_head, which is the in-memory cache of the on-disk block.

Deferred operations

Deferred Operations is a very bad name and could mislead the readers.
IMO, it should be called Big Transaction. The deferred operations,
cooperating with intent log, could split a complicated transaction into
multiple small transactions and still keep the atomicity of the original
transaction. Look at the following example,

To complete T, we need multiple operations T_A, T_B, T_C and T_D,
and each of them needs 3 sub-operations.

Deferred Operations would complete this work as following,

Intent log for T_A \
Intent log for T_B    \ t0
Intent log for T_C    /
Intent log for T_D /
-------------------
Done log for T_A    \
Real log for T_A0    \ t1
Real log for T_A1    /
Real log for T_A2 /
-------------------
Done log for T_B    \
Real log for T_B0    \ t2
Real log for T_B1    /
Real log for T_B2 /
-------------------
Done log for T_C    \
Real log for T_C0    \ t3
Real log for T_C1    /
Real log for T_C2 /
-------------------
Done log for T_D    \
Real log for T_D0    \ t4
Real log for T_D1    /
Real log for T_D2 /
-------------------

The intent log guarantees the whole big transaction's atomicity: if a crash happens
midway, log recovery finds intent items without a matching done item and redoes the
remaining work.
xfs_defer_finish_noroll() is to carry out the work,
xfs_defer_finish_noroll()
---
        /* Until we run out of pending work to finish... */
        while (!list_empty(&dop_pending) || !list_empty(&(*tp)->t_dfops)) {

                //Create intent log for every deferred_operations

                xfs_defer_create_intents(*tp);
                list_splice_init(&(*tp)->t_dfops, &dop_pending);

                //Roll the transaction

                error = xfs_defer_trans_roll(tp);
                ...
                dfp = list_first_entry(&dop_pending, struct xfs_defer_pending,
                                             dfp_list);

                //Pick a deferred operation and finish it
                // - create done
                // - finish_item do the real work

                error = xfs_defer_finish_one(*tp, dfp);
        }
---

log space


       tail                head
       cycle=100           cycle=100
         |                   |
         v                   v
|------xxxxxxxxxxxxxxxxxxxx-----|



        head             tail
        cycle=101        cycle=100
          |                |
          v                v
|xxxxxxx-----------------xxxxxxx|
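
The free space left in the log follows directly from these two pictures. A minimal
standalone sketch of the arithmetic (log_space_left() is a made-up stand-in for the
real xlog_space_left(), which also handles corner cases such as head and tail being
more than one cycle apart):

/*
 * head/tail are byte offsets into the physical log, each tagged with the
 * cycle number of its current pass over the log.
 */
static long log_space_left(long log_size,
                           int head_cycle, long head_offset,
                           int tail_cycle, long tail_offset)
{
        if (tail_cycle == head_cycle)
                /* picture 1: used space is [tail, head) */
                return log_size - (head_offset - tail_offset);

        /* picture 2: head has wrapped, used space is [tail, end) + [0, head) */
        return tail_offset - head_offset;
}

/*
 * e.g. a 1 MiB log with head at 700K in cycle 101 and tail at 900K in
 * cycle 100 has 200K of free space left.
 */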

Xfs Inode


Inode management

The inode number of xfs is composed of 3 parts,

         AG number        Bn in AG       In in B
    |----------------|---------------|-------------|

Bn : block number in an AG
In : inode number in a block (a block could carry multiple inodes)
This inode number tells us the position of the inode on disk.
xfs's inodes are allocated dynamically instead of being preallocated in static
positions as in ext4.
How to allocate them dynamically ?
Allocate blocks for a new inode chunk in that AG, then create an xfs_inobt_rec
as follows,
struct xfs_inobt_rec {
        __be32     ir_startino;     /* start ino num of this chunk */
        __be32     ir_freecount;    /* number of free inodes */
        __be64     ir_free;         /* free inode bitmap */
};
and insert it into the AG inode b+tree
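
A minimal standalone sketch of decoding such an inode number (xfs_ino_parts and
decode_ino() are made-up names; the real code is the XFS_INO_TO_AGNO()/
XFS_INO_TO_AGBNO()/XFS_INO_TO_OFFSET() macros, with the field widths taken from
sb_agblklog and sb_inopblog in the superblock; the geometry in main() is hypothetical):

#include <stdio.h>

struct xfs_ino_parts {
        unsigned long long agno;     /* allocation group */
        unsigned long long agbno;    /* block within the AG */
        unsigned long long offset;   /* inode within the block */
};

static struct xfs_ino_parts decode_ino(unsigned long long ino,
                                       int agblklog,   /* log2(blocks per AG) */
                                       int inopblog)   /* log2(inodes per block) */
{
        struct xfs_ino_parts p;

        /* low bits: inode within the block */
        p.offset = ino & ((1ULL << inopblog) - 1);
        /* middle bits: block within the AG */
        p.agbno  = (ino >> inopblog) & ((1ULL << agblklog) - 1);
        /* high bits: AG number */
        p.agno   = ino >> (inopblog + agblklog);
        return p;
}

int main(void)
{
        /* hypothetical geometry: 2^16 blocks per AG, 2^4 inodes per block */
        struct xfs_ino_parts p = decode_ino(128, 16, 4);

        printf("agno=%llu agbno=%llu offset=%llu\n", p.agno, p.agbno, p.offset);
        return 0;
}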

xfs buf


xfs uses xfs_buf to manage its metadata instead of using vfs pagecache.

xfs bmap


xfs bmap unwritten

An allocated block extent has two states in xfs: normal and unwritten.

The meaning of the state is just what the name suggests: an unwritten extent has disk
space allocated but no valid data written into it yet, so reads of that range return zeros.
When a block extent is newly preallocated, it is unwritten.
xfs_bmapi_allocate()
---

    if (bma->flags & XFS_BMAPI_PREALLOC)
        bma->got.br_state = XFS_EXT_UNWRITTEN;

    if (bma->wasdel)
        error = xfs_bmap_add_extent_delay_real(bma, whichfork);
---
These are the two main allocation paths. The unwritten state influences xfs as follows:
after the write IO completes, xfs converts the extent from unwritten to the normal state,
xfs_end_bio()
  -> queue_work i_ioend_work

xfs_end_io()
  -> xfs_end_ioend()
    -> xfs_iomap_write_unwritten()
    ---
        error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
                    XFS_BMAPI_CONVERT, resblks, &imap,
                    &nimaps);
    ---
       -> xfs_bmapi_convert_unwritten()
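
From user space, the usual way to end up with unwritten extents is fallocate(); a small
demo follows (the file name is arbitrary; on xfs, `xfs_bmap -v prealloc.dat` should show
the preallocated range flagged as unwritten until it is written):

/* build with: cc -o prealloc-demo prealloc-demo.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("prealloc.dat", O_CREAT | O_RDWR, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* preallocate 1 MiB: blocks are allocated but stay unwritten,
         * so reading them back returns zeros, not stale disk contents */
        if (fallocate(fd, 0, 0, 1 << 20) < 0) {
                perror("fallocate");
                return 1;
        }

        /* a real write to part of the range triggers the unwritten ->
         * normal conversion shown above once its IO completes */
        if (pwrite(fd, "data", 4, 0) != 4) {
                perror("pwrite");
                return 1;
        }

        close(fd);
        return 0;
}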