ZFS

Build

objects

COW

ZIL

lwb write
log record
checkpoint
checksum of the zil block

space management

COW of spacemap
original blocks

ARC

L2ARC
arc state machine

Scrub and Resilvering

DTL
DSL_Scan
Mirror

Talking

self healing
update_uberblock
dmu transaction quiescing
multiple DVAs of blkptr
raidz dynamic stripe layout
Compression
prefetch
deduplication table
write throttle
dynamically stripes

Build


https://github.com/zfsonlinux/zfs/wiki/Building-ZFS
https://github.com/zfsonlinux/zfs/wiki/Custom-Packages

sudo dnf install autoconf automake libtool rpm-build
$ sudo dnf install zlib-devel libuuid-devel libattr-devel libblkid-devel libselinux-devel libudev-devel
$ sudo dnf install libacl-devel libaio-devel device-mapper-devel openssl-devel libtirpc-devel elfutils-libelf-devel
$ sudo dnf install kernel-devel-$(uname -r)

# To enable the pyzfs packages additionally install the following:

# Fedora
$ sudo dnf install python3 python3-devel python3-setuptools python3-cffi 

# For Red Hat / CentOS 7
$ sudo yum install epel-release
$ sudo yum install python36 python36-devel python36-setuptools python36-cffi

There are 3 kinds of mode to build the rpm packages
DKMS  kmods  kABI-tracking kmod

We use the kmods
           ^^^^^^
kmods packages are binary kernel modules which are compiled against 
a specific version of the kernel. This means that if you update the 
kernel you must compile and install a new kmod package. If you don't 
frequently update your kernel, or if you're managing a large number 
of systems, then kmod packages are a good choice.

$ cd zfs
$ ./configure --with-config=srpm
$ make -j1 pkg-utils rpm-dkms
$ sudo yum localinstall *.$(uname -p).rpm *.noarch.rpm


./configure
LC_TIME=C make -j1 pkg-utils pkg-kmod //LC_TIME=C kill the bug of bad changelog


When install the rpm packages,
The zfs module actually only need these 3 packages
zfs-0.7.12-1.el7.centos.x86_64
libzfs2-0.7.12-1.el7.centos.x86_64
kmod-zfs-3.10.0-327.el7.centos.scst72.x86_64-0.7.12-1.el7.centos.x86_64
And the zfs and kmod-zfs depends on each other

DMU

objects

A object in DMU is described by dnode


   +----------+
   | dn_type  |
   | dn_indblkshift
   | dn_nlevels = 2
   | dn_nblkptr = 2
   | .......  |     +-------+-------+-------+
   |          |   / |blkptr0|blkptr1|blkptr2|
   | dn_blkptr[3]   |       |       |       |
   |          |   \ |       |       |       |
   +----------+     +-------+-------+-------+
                       |||
                       v||  
                    +---v|-+      indirect blocks (metatdata), there are 3 replicas in blkptr for it
                    | +--v---+
                    | | +------+  the size of indirect block is determined by dn_indblkshift
                    +-| |      |  it is a array of blkptrs that point to another  level indirect block or block
                      +-|      | ---+
                        +------+    | for regular data, there is only 1 replica
                                    v
                                 +-----+
                                 |     |  the blocks described by a blkptr in the indirect block
                                 |     |
                                 +-----+


                          I I I                         level 2    Every 'I' or 'D' here is a blkptr, the
                    I I I I I I I I I                   level 1    space that the blkptr points
        D D D D D D D D D D D D D D D D D D D D D       level 0    contains an array of blkptrs
                         

        Every 'D' here points to a block, it has a block id, the linear index in level 0
                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

        We could calculate the block id of the level 1 indirect blocks  through

        level 1 blkid = (level 0 blkid) % ((size of level 1 indirect block)/(size of blkptr))

The objects are collected in a object set.
The interesting thing is the type of the object set

DMU_OST_NONE
DMU_OST_META, DSL object set
DMU_OST_ZFS, ZPL object set
DMU_OST_ZVOL, ZVOL object set


    +-----------+
    | metadnode |
    | os_zil_header
    | os_type   |
    | os_pad[]  |
    +-----------+

    The metadnode here points to an object that contains array of dnodes that
    describe the objects in this object set.


    Every object in an object set is uniquely identified by a 64bits integer
    called object number. We could address the location of the dnode structure
    of an object through this object number.

COW

How does the COW happen, especially, it need to iterate the bp tree from buttom
to top to update the new blkptr_t which includes the new position and checksum ?


spa_sync
  -> spa_sync_iterate_to_convergence
    -> dsl_pool_sync
      -> dsl_dataset_sync
        -> dmu_objset_sync
          -> dnode_sync
            -> list_t *list = &dn->dn_dirty_records[txgoff]
            -> dbuf_sync_list // itertate down to the level 0
               ---
                if (dr->dr_dbuf->db_level > 0)
                    dbuf_sync_indirect(dr, tx);
                else
                    dbuf_sync_leaf(dr, tx);
               ---
               -> dbuf_sync_indirect
                 -> dbuf_write // indirect dbuf
                 -> dbuf_sync_list(&dr->dt.di.dr_children, db->db_level - 1, tx)
                   -> dbuf_sync_leaf


The block is finally issued by dbuf_write

dbuf_write
---
    dr->dr_zio = arc_write(zio, os->os_spa, txg,
            &dr->dr_bp_copy, data, DBUF_IS_L2CACHEABLE(db),
            &zp, dbuf_write_ready,
            children_ready_cb, dbuf_write_physdone,
            dbuf_write_done, db, ZIO_PRIORITY_ASYNC_WRITE,
            ZIO_FLAG_MUSTSUCCEED, &zb);
---
  ->  arc_write
      ---
        callback->awcb_ready = ready;
        callback->awcb_children_ready = children_ready;
        callback->awcb_physdone = physdone;
        callback->awcb_done = done;
        callback->awcb_private = private;
        callback->awcb_buf = buf;
        ...
    
        zio = zio_write(pio, spa, txg, bp,
            abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),
            HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,
            (children_ready != NULL) ? arc_write_children_ready : NULL,
            arc_write_physdone, arc_write_done, callback,
                priority, zio_flags, zb);
      ---
      -> zio_write
        ---
            zio = zio_create(pio, spa, txg, bp, data, lsize, psize, done, private,
                   ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
                   ZIO_STAGE_OPEN, (flags & ZIO_FLAG_DDT_CHILD) ?
                   ZIO_DDT_CHILD_WRITE_PIPELINE : ZIO_WRITE_PIPELINE);


            zio->io_ready = ready;
            zio->io_children_ready = children_ready;
            zio->io_physdone = physdone;
            zio->io_prop = *zp;
        ---


The pipeline of the write is
#define    ZIO_INTERLOCK_STAGES            \
    (ZIO_STAGE_READY |            \
    ZIO_STAGE_DONE)

#define    ZIO_WRITE_COMMON_STAGES            \
    (ZIO_INTERLOCK_STAGES |            \
    ZIO_VDEV_IO_STAGES |            \
    ZIO_STAGE_ISSUE_ASYNC |            \

    ZIO_STAGE_CHECKSUM_GENERATE)


#define    ZIO_WRITE_PIPELINE            \
    (ZIO_WRITE_COMMON_STAGES |        \
    ZIO_STAGE_WRITE_BP_INIT |        \
    ZIO_STAGE_WRITE_COMPRESS |        \
    ZIO_STAGE_ENCRYPT |            \
    ZIO_STAGE_DVA_THROTTLE |        \

    ZIO_STAGE_DVA_ALLOCATE)


zio_dva_allocate
  -> metaslab_alloc
    -> metaslab_alloc_dva
    ---
            DVA_SET_VDEV(&dva[d], vd->vdev_id);
            DVA_SET_OFFSET(&dva[d], offset);
            DVA_SET_GANG(&dva[d],
                ((flags & METASLAB_GANG_HEADER) ? 1 : 0));
            DVA_SET_ASIZE(&dva[d], asize);
    ---

We could see that the new checksum calculating and new block allocation all
happen during the write pipeline. These new information would be saved in the
zio->io_bp


But how does the parent block knows these newly updated zio->io_bp ?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

zio_pipeline
---
    zio_dva_allocate,
    zio_dva_free,
    zio_dva_claim,
    zio_ready,        //After the zio_dva_allocate
---

zio_ready
  -> zio->io_ready
     arc_write_ready
      -> callback->awcb_ready
         dbuf_write_ready
         ---
            rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
            *db->db_blkptr = *bp;
            rw_exit(&dn->dn_struct_rwlock);
         ---

    The db->db_blkptr points to blkptr_t structure in the dn's buffer.
    And at this moment, the new checksum has been calculated. The parent knows
    the new data location and new data checksum.

For example,
dnode_increase_indirection
---
        child->db_parent = db;
        dbuf_add_ref(db, child);
        if (db->db.db_data)
            child->db_blkptr = (blkptr_t *)db->db.db_data + i;
        else
            child->db_blkptr = NULL;

---

Or

dbuf_check_blkptr
---
    if (db->db_level == dn->dn_phys->dn_nlevels-1) {
        db->db_parent = dn->dn_dbuf;

    //dn_phys pointers into dn->dn_dbuf->db.db_data

        db->db_blkptr = &dn->dn_phys->dn_blkptr[db->db_blkid];
        DBUF_VERIFY(db);
    } 
---

Another question is that how to ensure the zios' pipeline of different level are
executed from buttom to top ?

To figure it out, what we need to know first is that no matter zio_write or
arc_write will not kick off the zio pipeline but just create a zio.
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If we want to start the zio, we need to invoke zio_wait or zio_nowait.

Look at the dbuf_sync_indirect and dbuf_sync_leaf which are typical example,

dbuf_sync_indirect
---

    // The zio is not kicked off but just created here

    dbuf_write(dr, db->db_buf, tx);

    zio = dr->dr_zio;
    mutex_enter(&dr->dt.di.dr_mtx);

    // Iterate the lower level

    dbuf_sync_list(&dr->dt.di.dr_children, db->db_level - 1, tx);
    mutex_exit(&dr->dt.di.dr_mtx);

    zio_nowait(zio);

---

dbuf_sync_leaf
---
    dbuf_write(dr, *datap, tx);

    if (dn->dn_object == DMU_META_DNODE_OBJECT) {
        ...
    } else {
        /*
         * Although zio_nowait() does not "wait for an IO", it does
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         * initiate the IO. If this is an empty write it seems plausible
           ^^^^^^^^^^^^^^^
         * that the IO could actually be completed before the nowait
         * returns. We need to DB_DNODE_EXIT() first in case
         * zio_nowait() invalidates the dbuf.
         */
        DB_DNODE_EXIT(db);

        zio_nowait(dr->dr_zio);

    }
---

The zio of the upper level is always kicked off after iterate the underlying level.

Question:
The write zio's pipeline will enter zio_issue_async, all zios will be executed
in parallel with multiple threads. Then we could get bigger throughput.

Refer to taskq_create and zio_taskqs

But this also means the zios of children and parents maybe executed out of order.
This is a big problem because the zio_checksum_generate must be executed after the
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
children zios are ready, then the new checksum could include the new blkptrs of children.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

How to handle this ?

The answer is ZIO_STAGE_WRITE_COMPRESS
zio_write_compress
---
    /*
     * If our children haven't all reached the ready stage,
     * wait for them and then repeat this pipeline stage.
     */
    if (zio_wait_for_children(zio, ZIO_CHILD_LOGICAL_BIT |
        ZIO_CHILD_GANG_BIT, ZIO_WAIT_READY)) {
        return (NULL);
    }
---

zio_write
---
    zio = zio_create(pio, spa, txg, bp, data, lsize, psize, done, private,
        ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
                                         ^^^^
        ZIO_STAGE_OPEN, (flags & ZIO_FLAG_DDT_CHILD) ?
        ZIO_DDT_CHILD_WRITE_PIPELINE : ZIO_WRITE_PIPELINE);
---

zio_create
---
    if (vd != NULL)
        zio->io_child_type = ZIO_CHILD_VDEV;
    else if (flags & ZIO_FLAG_GANG_CHILD)
        zio->io_child_type = ZIO_CHILD_GANG;
    else if (flags & ZIO_FLAG_DDT_CHILD)
        zio->io_child_type = ZIO_CHILD_DDT;
    else
        zio->io_child_type = ZIO_CHILD_LOGICAL;
---

All of the zio created in DMU should be logical one

 mirro and raidz use zio_vdev_child_io which has vd assigned
 vdisk uses __vdev_disk_physio which doesn't create zio but dio_request

ZIL

Why does the ZILexist?

Writes in ZFS are "write-back"
Data is first written and stored in-memory, in DMU layer
Later, data for whole pool written to disk via spa_sync()
Without the ZIL, sync operations could wait for spa_sync()

spa_sync() can take tens of seconds (or more) to complete

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Further, with the ZIL, write amplification can be mitigated
A single ZPL operation can cause many writes to occur
ZIL allows operation to "complete" with minimal data written
ZIL needed to provide "fast" synchronous semantics to applications
Correctness could be acheived without it, but would be "too slow"

lwb write

Log block chain



Header -> +-----------+      +--->  +-----------+      +--->  +-----------+
          | zil_chain | -----+      | zil_chain | -----+      | zil_chain |
          +-----------+             +-----------+             +-----------+
          |    LR     |             |    LR     |             |    LR     |
          +-----------+             +-----------+             +-----------+
          |    LR     |             |    LR     |             |    LR     |
          +-----------+             +-----------+             +-----------+

typedef struct zil_chain {
    uint64_t zc_pad;
    blkptr_t zc_next_blk;    /* next block in chain */
    uint64_t zc_nused;    /* bytes in log block used */
    zio_eck_t zc_eck;    /* block trailer */
} zil_chain_t;

zil_lwb_write_issue will issue the old lwb and allocate a new one.
The log record chain are also built up here.

zil_lwb_write_issue
---
    if (BP_GET_CHECKSUM(&lwb->lwb_blk) == ZIO_CHECKSUM_ZILOG2) {
        zilc = (zil_chain_t *)lwb->lwb_buf;
        bp = &zilc->zc_next_blk;
    }
    ...

    // the bp points to the previous log block's zil_chain_t.zc_next_blk

    error = zio_alloc_zil(spa, zilog->zl_os, txg, bp, zil_blksz, &slog);
    ...

    // update the lwb_vdev_tree which includes vdevs to flush after lwb write

    zil_lwb_add_block(lwb, &lwb->lwb_blk);
    lwb->lwb_issued_timestamp = gethrtime();
    lwb->lwb_state = LWB_STATE_ISSUED;

    zio_nowait(lwb->lwb_root_zio);
    zio_nowait(lwb->lwb_write_zio);
---

The lwb_root_zio and lwb_write_zio is created here

zil_lwb_commit
  -> zil_lwb_write_open
  ---
        lwb->lwb_root_zio = zio_root(zilog->zl_spa,
            zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);

        lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
            zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
            BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
            prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
            ZIO_FLAG_FASTWRITE, &zb);

        lwb->lwb_state = LWB_STATE_OPENED;
  ---
The zio pipeline of the zio_rewrite is special.

#define    ZIO_REWRITE_PIPELINE            \
    (ZIO_WRITE_COMMON_STAGES |        \
    ZIO_STAGE_WRITE_COMPRESS |        \
    ZIO_STAGE_ENCRYPT |            \
    ZIO_STAGE_WRITE_BP_INIT)

There is no ZIO_STAGE_DVA_ALLOCATE. So the zil seems not COWed.

zil_lwb_write_done would trigger flush on the vdevs involved in this zio
---
    while ((zv = avl_destroy_nodes(t, &cookie)) != NULL) {
        vdev_t *vd = vdev_lookup_top(spa, zv->zv_vdev);
        if (vd != NULL)
            zio_flush(lwb->lwb_root_zio, vd);
        kmem_free(zv, sizeof (*zv));
    }
---

After these flushes are done,
zil_lwb_flush_vdevs_done
---
    while ((zcw = list_head(&lwb->lwb_waiters)) != NULL) {
        mutex_enter(&zcw->zcw_lock);

        list_remove(&lwb->lwb_waiters, zcw);

        zcw->zcw_lwb = NULL;

        zcw->zcw_zio_error = zio->io_error;

        zcw->zcw_done = B_TRUE;

        //Notify the waiter in zil_commit_waiter

        cv_broadcast(&zcw->zcw_cv);

        mutex_exit(&zcw->zcw_lock);
    }

    mutex_exit(&zilog->zl_lock);


    /*
     * Now that we've written this log block, we have a stable pointer
     * to the next block in the chain, so it's OK to let the txg in
     * which we allocated the next block sync.
     */

    dmu_tx_commit(tx);
---

Why must we stop the next block's allocation tgx ?

Think about following scene,

before zil block lwb_A is written to disk, the zil block lwb_B is allocated in
tgx T.

lwb_A is issued to disk and concurrently, tgx T is synced.

If the tgx T is synced to disk before the lwb_A and system crash at the moment,

lwb_B's allocation is persistent on disk but noone knows it any more.
lwb_A is lost
then we leak t he lwb_B

This could lead to a new scene

lwb_A is on disk, but the tgx T is not synced successfully before crash,
isn't lwb_B a invalid block ?

spa_ld_claim_log_blocks will claim these blocks for us.

And there is another amazing facts that we even don't need to do any real allocation on disk for the normal case.

The lwbs we allocated in txg T would be freed finally
zil_sync
---
    while ((lwb = list_head(&zilog->zl_lwb_list)) != NULL) {
        zh->zh_log = lwb->lwb_blk;
        if (lwb->lwb_buf != NULL || lwb->lwb_max_txg > txg)
            break;
        list_remove(&zilog->zl_lwb_list, lwb);

        zio_free(spa, txg, &lwb->lwb_blk);

        zil_free_lwb(zilog, lwb);
    }

---
The zil is per objset and the spacemap's objset is MOS.

log record

The zil means zfs intend log.
The core here is the intend What will be recorded in the zil ?
Let's first look at a example

zfs_rmdir
---
    if (error == 0) {
        uint64_t txtype = TX_RMDIR;
        if (flags & FIGNORECASE)
            txtype |= TX_CI;
        zfs_log_remove(zilog, tx, txtype, dzp, name, ZFS_NO_OBJECT);
    }


    // zfs_log_remove will construct the lr_remove_t
typedef struct {
    lr_t        lr_common;    /* common portion of log record */
    uint64_t    lr_doid;    /* obj id of directory */
    /* name of object to remove follows this */
} lr_remove_t;



---

The write operation is more complicated.

The specific header for write intent log is

typedef struct {
    lr_t        lr_common;    /* common portion of log record */
    uint64_t    lr_foid;    /* file object to write */
    uint64_t    lr_offset;    /* offset to write to */
    uint64_t    lr_length;    /* user data length to write */
    uint64_t    lr_blkoff;    /* no longer used */
    blkptr_t    lr_blkptr;    /* spa block pointer for replay */

    /* write data will follow for small writes */

} lr_write_t;


There are two flavors of writing log records

 immediate
For small writes it's cheaper to store the data with the log record
zfs_get_data
---
    if (buf != NULL) { /* immediate write */
        zgd->zgd_lr = rangelock_enter(&zp->z_rangelock,
            offset, size, RL_READER);
        /* test for truncation needs to be done while range locked */
        if (offset >= zp->z_size) {
            error = SET_ERROR(ENOENT);
        } else {

            // the object here is the target of the write operation which this
            // ZIL want to record. dmu_read will read the content into the buf
            which should be part of the ZIL block following the lr_write_t


            error = dmu_read(os, object, offset, size, buf,
                DMU_READ_NO_PREFETCH);

        }
    } 
---


 indirect
for large writes it's cheaper to sync the data and get a pointer
to it (indirect) so that we don't have to write the data twice.

        if (error == 0)


            error = dmu_buf_hold(os, object, offset, zgd, &db,
                DMU_READ_NO_PREFETCH);

        if (error == 0) {

            // the lr_blkptr is not pointer but part of the lr_write_t structure
            // we get the pointer here.

            blkptr_t *bp = &lr->lr_blkptr;

            zgd->zgd_db = db;
            zgd->zgd_bp = bp;


            error = dmu_sync(zio, lr->lr_common.lrc_txg,
                zfs_get_done, zgd);
            ...
        }
        dmu_sync
          -> arc_write // the parameter bp is zgd->zgd_bp
            -> zio_write
               the bp pointer will be set to zio->io_bp
          -> zio_nowait

      in the zio_dva_allocate, the zio->io_bp will be assigned with new value
      where the io will be store on disk. And finally, the lr_write_t.lr_blkptr
      will be set during this.




zil_commit
  -> zil_commit_impl
    -> zil_commit_writer
      -> zil_process_commit_list
        -> zil_lwb_commit
        ---
            error = zilog->zl_get_data(itx->itx_private,
                lrwb, dbuf, lwb, lwb->lwb_write_zio);

        ---
zfs_get_data  //Get data to generate a TX_WRITE intent log record

checkpoint


zil_sync
---
    zil_header_t *zh = zil_header_in_syncing_context(zilog);
    ...
    while ((lwb = list_head(&zilog->zl_lwb_list)) != NULL) {

        // update the zil header0->zh_log
        // this is where we start to replay the log

        zh->zh_log = lwb->lwb_blk;
        if (lwb->lwb_buf != NULL || lwb->lwb_max_txg > txg)
            break;
        list_remove(&zilog->zl_lwb_list, lwb);
        zio_free(spa, txg, &lwb->lwb_blk);
        zil_free_lwb(zilog, lwb);
        ...
    }

---

The zil_header_t points to os->os_zil_header

dmu_objset_open_impl
  -> os->os_zil = zil_alloc(os, &os->os_zil_header)
    -> zilog->zl_header = zh_phys 


zil_header_t is parts of the objset_phys_t
typedef struct objset_phys {
    dnode_phys_t os_meta_dnode;

    zil_header_t os_zil_header;

    uint64_t os_type;
    uint64_t os_flags;
    ...
    }


dmu_objset_sync
---
    /*
     * Free intent log blocks up to this tx.
     */
    zil_sync(os->os_zil, tx);
    os->os_phys->os_zil_header = os->os_zil_header;
    zio_nowait(zio)
---

The zil_chain_t of a zil block contains the next block's blkptr.
Due to the previous zil block is always written before the next, it cannot get
the correct checksum of the next zil block.
How to calculate and verify the checksum of the zil block ?

The answer is embedded checksum.

See zio_checksum_table
    {{abd_fletcher_2_native,    abd_fletcher_2_byteswap},
        NULL, NULL, ZCHECKSUM_FLAG_EMBEDDED, "zilog"},

    {{abd_fletcher_4_native,    abd_fletcher_4_byteswap},
        NULL, NULL, ZCHECKSUM_FLAG_EMBEDDED, "zilog2"},

ZCHECKSUM_FLAG_EMBEDDED is there.

zio_checksum_error_impl
---
    if (ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED) {
        zio_cksum_t verifier;
        size_t eck_offset;

        if (checksum == ZIO_CHECKSUM_ZILOG2) {
            zil_chain_t zilc;
            uint64_t nused;

            abd_copy_to_buf(&zilc, abd, sizeof (zil_chain_t));

            eck = zilc.zc_eck;
            eck_offset = offsetof(zil_chain_t, zc_eck) +
                offsetof(zio_eck_t, zec_cksum);
            ...
        } else {
            eck_offset = size - sizeof (zio_eck_t);
            abd_copy_to_buf_off(&eck, abd, eck_offset,
                sizeof (zio_eck_t));
            eck_offset += offsetof(zio_eck_t, zec_cksum);
        }
        ...

        expected_cksum = eck.zec_cksum;


        ci->ci_func[byteswap](abd, size,
            spa->spa_cksum_tmpls[checksum], &actual_cksum);
    }
---


When compute the checksum,
zio_checksum_compute
---
    if (ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED) {
        zio_eck_t eck;
        size_t eck_offset;

        bzero(&saved, sizeof (zio_cksum_t));

        if (checksum == ZIO_CHECKSUM_ZILOG2) {
            zil_chain_t zilc;
            abd_copy_to_buf(&zilc, abd, sizeof (zil_chain_t));

            size = P2ROUNDUP_TYPED(zilc.zc_nused, ZIL_MIN_BLKSZ,
                uint64_t);
            eck = zilc.zc_eck;
            eck_offset = offsetof(zil_chain_t, zc_eck);
        } else {
            eck_offset = size - sizeof (zio_eck_t);
            abd_copy_to_buf_off(&eck, abd, eck_offset,
                sizeof (zio_eck_t));
        }
        ...
        ci->ci_func[0](abd, size, spa->spa_cksum_tmpls[checksum],
            &cksum);
        ...
        abd_copy_from_buf_off(abd, &cksum,
            eck_offset + offsetof(zio_eck_t, zec_cksum),
            sizeof (zio_cksum_t));
    } else {
        saved = bp->blk_cksum;
        ci->ci_func[0](abd, size, spa->spa_cksum_tmpls[checksum],
            &cksum);
        if (BP_USES_CRYPT(bp) && BP_GET_TYPE(bp) != DMU_OT_OBJSET)
            zio_checksum_handle_crypt(&cksum, &saved, insecure);
        bp->blk_cksum = cksum;
    }

---

space management

COW of spacemap

Where is the spacemap stored ?

The vdev lable

vdev_tree nvlist 

Name: “metaslab_array”
Value: DATA_TYPE_UINT64
Description: Object number of an object containing an array of object numbers.
Each element of this array (ma[i]) is, in turn, an object number of a space map
for metaslab 'i'. 

Name: “metaslab_shift” 
Value: DATA_TYPE_UINT64
Description: log base 2 of the metaslab size


The objset of the object above is the MOS (meta object set)
plus,
every spacemap of a metaslab is an object (dnode)
      ^^^^^^^^                     ^^^^^^


Look at vdev_metaslab_init
---
    for (m = oldc; m < newc; m++) {
        uint64_t object = 0;

        if (txg == 0 && vd->vdev_ms_array != 0) {
            error = dmu_read(mos, vd->vdev_ms_array,
                m * sizeof (uint64_t), sizeof (uint64_t), &object,
                DMU_READ_PREFETCH);
                ...
        }

        error = metaslab_init(vd->vdev_mg, m, object, txg,
            &(vd->vdev_ms[m]));
            ...
    }
---

When it do dmu_read, the 3 parameters are very important

 objset         ->  mos (Yes, it is the MOS)
 object id      ->  vd->vdev_ms_array (the metaslab array object id)
 offset and len ->  m * sizeof (uint64_t), sizeof (uint64_t)


space_map_write_impl only dirty the associated dmu_buf_t and thus finally dirty
the MOS.

Look at the spa_sync_iterate_to_convergence
---
    do {
        int pass = ++spa->spa_sync_pass;
        ...
        dsl_pool_sync(dp, txg);
          -> dsl_pool_sync_mos
        ...
        vdev_t *vd = NULL;
        while ((vd = txg_list_remove(&spa->spa_vdev_txg_list, txg))
            != NULL)
            vdev_sync(vd, txg);

            // The dsl_pool_sync could cause new allocation/free operation,
            // so the metaslab sync must be invoked it.

              -> metaslab_sync
        ...
        spa_sync_deferred_frees(spa, tx);
    } while (dmu_objset_is_dirty(mos, txg));

---


Here is a question answered by Matthew Ahrens which is the core developer of zfs

Q: Space map store on disk as dnode, writing dnode blocks again needs allocation and free, 
   is this a feedback loop and how to break this cycling dependence ?
   
A: Yes, we call this "sync to convergence". The cycle is broken by overwriting the block in place, 
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   thus not requiring any change of the allocation information. This is safe since we are only
   overwriting blocks that were allocated in the current txg, so if we crash it's as if nothing happened.


Where does the 'broken by overwriting the block in place' happen ?


Aha, it is the magic zio_write_compress

zio_write_compress
---

    /*
     * The final pass of spa_sync() must be all rewrites, but the first
     * few passes offer a trade-off: allocating blocks defers convergence,
     * but newly allocated blocks are sequential, so they can be written
     * to disk faster.  Therefore, we allow the first few passes of
     * spa_sync() to allocate new blocks, but force rewrites after that.
     * There should only be a handful of blocks after pass 1 in any case.
     */

    if (!BP_IS_HOLE(bp) && bp->blk_birth == zio->io_txg &&
        BP_GET_PSIZE(bp) == psize &&
        pass >= zfs_sync_pass_rewrite) {
        enum zio_stage gang_stages = zio->io_pipeline & ZIO_GANG_STAGES;

        zio->io_pipeline = ZIO_REWRITE_PIPELINE | gang_stages;
        zio->io_flags |= ZIO_FLAG_IO_REWRITE;
    } 
---

There are two critical conditions

 bp->blk_birth == zio->io_txg
     This indicates that this bp is born in this txg

 pass >= zfs_sync_pass_rewrite
     Note, the operator here is GREATER OR EQUAL. The comment has
     explained this.
    The zfs_sync_pass_rewrite is 2

original blocks

How does the zfs handle the original unused blocks after COWed ?
When to free them ?
Look at here

dbuf_write
---
    dr->dr_zio = arc_write(zio, os->os_spa, txg,
            &dr->dr_bp_copy, data, DBUF_IS_L2CACHEABLE(db),
            &zp, dbuf_write_ready,
            children_ready_cb, dbuf_write_physdone,
            dbuf_write_done, db, ZIO_PRIORITY_ASYNC_WRITE,
            ZIO_FLAG_MUSTSUCCEED, &zb);
---


zio_done
  -> zio->io_done
     arc_write_done
       -> callback->awcb_done
          dbuf_write_done
          ---

        blkptr_t *bp_orig = &zio->io_bp_orig;
        if (zio->io_flags & (ZIO_FLAG_IO_REWRITE | ZIO_FLAG_NOPWRITE)) {
            ASSERT(BP_EQUAL(bp, bp_orig));
        } else {
            dsl_dataset_t *ds = os->os_dsl_dataset;

            (void) dsl_dataset_block_kill(ds, bp_orig, tx, B_TRUE);

            dsl_dataset_block_born(ds, bp, tx);
        }

          ---

dsl_dataset_block_kill
  -> dsl_free
    -> zio_free
    ---
    if (BP_IS_GANG(bp) || BP_GET_DEDUP(bp) ||
        txg != spa->spa_syncing_txg ||
        spa_sync_pass(spa) >= zfs_sync_pass_deferred_free) {
        bplist_append(&spa->spa_free_bplist[txg & TXG_MASK], bp);
    } else {
        VERIFY0(zio_wait(zio_free_sync(NULL, spa, txg, bp, 0)));
    }
    ---

ARC


                                  c
                  ________________^________________
                 /                                 \
     MRU_Ghost          MRU               MFU           MFU_Ghost
|_ _ _ _ _ _ _ _ |________________|________________|_ _ _ _ _ _ _ _ |
                  \_______ _______/
                          v
                          p

MRU          Most Recently Used (arc header and data)
MRU_Ghost    Most Recently Used (arc header, no data)
MFU          Most Frequently Used (arc header and data)
MFU_Ghost    Most Frequently Used (arc header, no data)

The initial state of c and p
   c_min = MAX(1/32 of all mem, 64Mb)
   c_max = MAX(3/4 of all mem, all but 1Gb)
   c = MIN(1/8 physmem, 1/8 VM size)
   p = arc_c / 2


The core of the ARC is that adapts c ( cache size ) and p ( used pages in MRU ) in response to workloads 
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                  c
                                  
     MRU_Ghost          MRU               MFU           MFU_Ghost
|_ _ _ _ _ _ _ _ |________________|________________|_ _ _ _ _ _ _ _ |
                         p        |     c-p
                                     |   
-9- - -8- - -7- - [5] - [3] - [0] |  [9] - [8] - [7] - -6- - -3- - -2-
                                  |    
            Acess time              |           Access frequency
                                  |

When evicting during cache insert, then:

 Inserting in MRU & MRU < p then arc_evict(MFU)
 Inserting in MRU & MRU > p then arc_evict(MRU)
 Inserting in MFU & MFU < (c-p) then arc_evict(MRU)
 Inserting in MFU & MFU > (c-p) then arc_evict(MFU) 

Try to keep MRU close to p
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^

When adding new content:

 If (hit in MRU_Ghost) then increase p
 If (hit in MFU_Ghost) then decrease p
 If (arc_size within (2*maxblocksize) of c) then increase c


When shrinking or reclaiming: 

 If (MRU > p) then arc_evict(MRU)
 If (MRU+MRU_Ghost > c) then arc_evict(MRU_Ghost)
 If (arc_size > c) then arc_evict(MFU)
 If (arc_size + Ghosts > 2*c) then arc_evict(MFU_Ghost)


In conclusion,

 The bigger the MRU is, the higher the hit rate of second access
 The bigger the MFU is, the higher the hit rate of third access

dmu_tx_try_assign -> dsl_dir_tempreserve_space -> arc_tempreserve_space // Throttle writes when the amount of dirty data in the cache // gets too large. We try to keep the cache less than half full // of dirty blocks so that our sync times don't grow too large.

L2ARC


 
                  +-----------------------+
                  |         ARC           |
                  +-----------------------+
                     |         ^     ^
                     |         |     |
       l2arc_feed_thread()    arc_read()
                     |         |     |
                     |  l2arc read   |
                     V         |     |
                +---------------+    |
                |     L2ARC     |    |
                +---------------+    |
                    |    ^           |
           l2arc_write() |           |
                    |    |           |
                    V    |           |
                  +-------+      +-------+
                  | vdev  |      | vdev  |
                  | cache |      | cache |
                  +-------+      +-------+
                  +=========+     .-----.
                  :  L2ARC  :    |-_____-|
                  : devices :    | Disks |
                  +=========+    `-_____-'
 


           head -->                        tail
            +---------------------+----------+
    ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
            +---------------------+----------+   |   o L2ARC eligible
    ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
            +---------------------+----------+   |
                 15.9 Gbytes      ^ 32 Mbytes    |
                               headroom          |
                                          l2arc_feed_thread()
                                                 |
                     l2arc write hand <--[oooo]--'
                             |           8 Mbyte
                             |          write max
                             V
          +==============================+
    L2ARC dev |####|#|###|###|    |####| ... |
              +==============================+
                         32 Gbytes


Note:
  The main role of this cache is to boost the performance of random read workloads.
  The L2ARC does not store dirty content, it never needs to flush write buffers
  back to disk based storage.
  ITOW, we needn't save any metadata for the mapping on disk
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  But only save them in memory.

How does L2ARC only select the clean arc buf ?

  Note that buffers can be in one of 6 states:
     ARC_anon    - anonymous (discussed below)
     ARC_mru        - recently used, currently cached
     ARC_mru_ghost    - recentely used, no longer in cache
     ARC_mfu        - frequently used, currently cached
     ARC_mfu_ghost    - frequently used, no longer in cache
     ARC_l2c_only    - exists in L2ARC but not other states

Anonymous buffers are buffers that are not associated with
a DVA.  These are buffers that hold dirty block copies
before they are written to stable storage.  By definition,
they are "ref'd" and are considered part of arc_mru
that cannot be freed.  Generally, they will acquire a DVA
as they are written and migrate onto the arc_mru list.

The l2arc only cares about the arc_mru and arc_mfu.

static multilist_sublist_t *
l2arc_sublist_lock(int list_num)
{
    multilist_t *ml = NULL;
    unsigned int idx;

    ASSERT(list_num >= 0 && list_num < L2ARC_FEED_TYPES);

    switch (list_num) {
    case 0:
        ml = arc_mfu->arcs_list[ARC_BUFC_METADATA];
        break;
    case 1:
        ml = arc_mru->arcs_list[ARC_BUFC_METADATA];
        break;
    case 2:
        ml = arc_mfu->arcs_list[ARC_BUFC_DATA];
        break;
    case 3:
        ml = arc_mru->arcs_list[ARC_BUFC_DATA];
        break;
    default:
        return (NULL);
    }

    /*
     * Return a randomly-selected sublist. This is acceptable
     * because the caller feeds only a little bit of data for each
     * call (8MB). Subsequent calls will result in different
     * sublists being selected.
     */
    idx = multilist_get_random_index(ml);
    return (multilist_sublist_lock(ml, idx));
}

Let's look at how does the l2arc write data out

l2arc_write_buffers
---
    for (int try = 0; try < L2ARC_FEED_TYPES; try++) {
        multilist_sublist_t *mls = l2arc_sublist_lock(try);
        uint64_t passed_sz = 0;

         * L2ARC fast warmup.
         *
         * Until the ARC is warm and starts to evict, read from the
         * head of the ARC lists rather than the tail.
         */
        if (arc_warm == B_FALSE)
            hdr = multilist_sublist_head(mls);
        else
            hdr = multilist_sublist_tail(mls);

        headroom = target_sz * l2arc_headroom;
        if (zfs_compressed_arc_enabled)
            headroom = (headroom * l2arc_headroom_boost) / 100;

        for (; hdr; hdr = hdr_prev) {
            kmutex_t *hash_lock;
            abd_t *to_write = NULL;

            if (arc_warm == B_FALSE)
                hdr_prev = multilist_sublist_next(mls, hdr);
            else
                hdr_prev = multilist_sublist_prev(mls, hdr);

            // HASH Lock

            hash_lock = HDR_LOCK(hdr);
            if (!mutex_tryenter(hash_lock)) {
                /*
                 * Skip this buffer rather than waiting.
                 */
                continue;
            }

            passed_sz += HDR_GET_LSIZE(hdr);
            if (passed_sz > headroom) {
                /*
                 * Searched too far.
                 */
                mutex_exit(hash_lock);
                break;
            }

            if (!l2arc_write_eligible(guid, hdr)) {
                mutex_exit(hash_lock);
                continue;
            }
            ...
            if (pio == NULL) {
                /*
                 * Insert a dummy header on the buflist so
                 * l2arc_write_done() can find where the
                 * write buffers begin without searching.
                 */
                mutex_enter(&dev->l2ad_mtx);
                list_insert_head(&dev->l2ad_buflist, head);
                mutex_exit(&dev->l2ad_mtx);

                cb = kmem_alloc(
                    sizeof (l2arc_write_callback_t), KM_SLEEP);
                cb->l2wcb_dev = dev;
                cb->l2wcb_head = head;
                pio = zio_root(spa, l2arc_write_done, cb,
                    ZIO_FLAG_CANFAIL);
            }

            hdr->b_l2hdr.b_dev = dev;
            hdr->b_l2hdr.b_hits = 0;

            hdr->b_l2hdr.b_daddr = dev->l2ad_hand;
            arc_hdr_set_flags(hdr, ARC_FLAG_HAS_L2HDR);

            mutex_enter(&dev->l2ad_mtx);
            list_insert_head(&dev->l2ad_buflist, hdr);
            mutex_exit(&dev->l2ad_mtx);

            (void) zfs_refcount_add_many(&dev->l2ad_alloc,
                arc_hdr_size(hdr), hdr);

            wzio = zio_write_phys(pio, dev->l2ad_vdev,
                hdr->b_l2hdr.b_daddr, asize, to_write,
                ZIO_CHECKSUM_OFF, NULL, hdr,
                ZIO_PRIORITY_ASYNC_WRITE,
                ZIO_FLAG_CANFAIL, B_FALSE);

            ...
            mutex_exit(hash_lock);

            (void) zio_nowait(wzio);
        }

        multilist_sublist_unlock(mls);

        if (full == B_TRUE)
            break;
    }

    ...
    dev->l2ad_writing = B_TRUE;
    (void) zio_wait(pio);
    dev->l2ad_writing = B_FALSE;

    return (write_asize);
}

How does the data in l2arc read in

arc_read
---
        if (HDR_HAS_L2HDR(hdr) &&
            (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {
            devw = hdr->b_l2hdr.b_dev->l2ad_writing;
            addr = hdr->b_l2hdr.b_daddr;
            /*
             * Lock out L2ARC device removal.
             */
            if (vdev_is_dead(vd) ||
                !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
                vd = NULL;
        }
        ...
        if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
            /*
             * Read from the L2ARC if the following are true:
             * 1. The L2ARC vdev was previously cached.
             * 2. This buffer still has L2ARC metadata.
             * 3. This buffer isn't currently writing to the L2ARC.
             * 4. The L2ARC entry wasn't evicted, which may
             *    also have invalidated the vdev.
             * 5. This isn't prefetch and l2arc_noprefetch is set.
             */
            if (HDR_HAS_L2HDR(hdr) &&
                !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
                !(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
                l2arc_read_callback_t *cb;
                abd_t *abd;
                uint64_t asize;

                atomic_inc_32(&hdr->b_l2hdr.b_hits);

                cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
                    KM_SLEEP);
                cb->l2rcb_hdr = hdr;
                cb->l2rcb_bp = *bp;
                cb->l2rcb_zb = *zb;
                cb->l2rcb_flags = zio_flags;

                asize = vdev_psize_to_asize(vd, size);
                if (asize != size) {
                    abd = abd_alloc_for_io(asize,
                        HDR_ISTYPE_METADATA(hdr));
                    cb->l2rcb_abd = abd;
                } else {
                    abd = hdr_abd;
                }

                /*
                 * l2arc read.  The SCL_L2ARC lock will be
                 * released by l2arc_read_done().
                 * Issue a null zio if the underlying buffer
                 * was squashed to zero size by compression.
                 */
                rzio = zio_read_phys(pio, vd, addr,
                    asize, abd,
                    ZIO_CHECKSUM_OFF,
                    l2arc_read_done, cb, priority,
                    zio_flags | ZIO_FLAG_DONT_CACHE |
                    ZIO_FLAG_CANFAIL |
                    ZIO_FLAG_DONT_PROPAGATE |
                    ZIO_FLAG_DONT_RETRY, B_FALSE);
                acb->acb_zio_head = rzio;

                if (hash_lock != NULL)
                    mutex_exit(hash_lock);

                if (*arc_flags & ARC_FLAG_NOWAIT) {
                    zio_nowait(rzio);
                    goto out;
                }

                if (zio_wait(rzio) == 0)
                    goto out;

                /* l2arc read error; goto zio_read() */
                if (hash_lock != NULL)
                    mutex_enter(hash_lock);
            } 
---


l2arc_read_done
---
    zio->io_bp_copy = cb->l2rcb_bp;    /* XXX fix in L2ARC 2.0    */
    zio->io_bp = &zio->io_bp_copy;    /* XXX fix in L2ARC 2.0    */


    // checksum is checked here

    valid_cksum = arc_cksum_is_equal(hdr, zio);

    /*
     * b_rabd will always match the data as it exists on disk if it is
     * being used. Therefore if we are reading into b_rabd we do not
     * attempt to untransform the data.
     */
    if (valid_cksum && !using_rdata)
        tfm_error = l2arc_untransform(zio, cb);

    if (valid_cksum && tfm_error == 0 && zio->io_error == 0 &&
        !HDR_L2_EVICTED(hdr)) {
        mutex_exit(hash_lock);
        zio->io_private = hdr;

        arc_read_done(zio);

    } else {
        mutex_exit(hash_lock);
        /*
         * Buffer didn't survive caching.  Increment stats and
         * reissue to the original storage device.
         */
         ...
        /*
         * If there's no waiter, issue an async i/o to the primary
         * storage now.  If there *is* a waiter, the caller must
         * issue the i/o in a context where it's OK to block.
         */
        if (zio->io_waiter == NULL) {
            zio_t *pio = zio_unique_parent(zio);
            void *abd = (using_rdata) ?
                hdr->b_crypt_hdr.b_rabd : hdr->b_l1hdr.b_pabd;

            zio_nowait(zio_read(pio, zio->io_spa, zio->io_bp,
                abd, zio->io_size, arc_read_done,
                hdr, zio->io_priority, cb->l2rcb_flags,
                &cb->l2rcb_zb));
        }
    }


---

arc state machine

There are following state for an arc buffer,

ARC_anon        - anonymous (not associated with a DVA,
                  hold dirty block copies)
ARC_mru            - recently used, currently cached
ARC_mru_ghost    - recentely used, no longer in cache
ARC_mfu            - frequently used, currently cached
ARC_mfu_ghost    - frequently used, no longer in cache
ARC_l2c_only    - exists in L2ARC but not other states

Another thing we need to know is the method of organizing arc_buf_hdr_t.

buf_hash_find
---
    const dva_t *dva = BP_IDENTITY(bp);
    uint64_t birth = BP_PHYSICAL_BIRTH(bp);
    uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
    kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
    arc_buf_hdr_t *hdr;

    mutex_enter(hash_lock);
    for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;
        hdr = hdr->b_hash_next) {
        if (HDR_EQUAL(spa, dva, birth, hdr)) {
            *lockp = hash_lock;
            return (hdr);
        }
    }
    mutex_exit(hash_lock);
    *lockp = NULL;
    return (NULL);
---

The buf_hash_table.ht_table is a hash table.
The hash_lock is also a array contains locks.


#define    BUF_LOCKS 8192

typedef struct buf_hash_table {
    uint64_t ht_mask;
    arc_buf_hdr_t **ht_table;
    struct ht_lock ht_locks[BUF_LOCKS];
} buf_hash_table_t;

This is a very common implementation is zfs.

The entry of the arc state machine is arc_access.
Let's first look at it.

All of the policy that how does the arc state machine run is here.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


arc_access
---
    if (hdr->b_l1hdr.b_state == arc_anon) {

        /*
         * This buffer is not in the cache, and does not
         * appear in our "ghost" list.  Add the new buffer
         * to the MRU state.
         */

        hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
        arc_change_state(arc_mru, hdr, hash_lock);

    } else if (hdr->b_l1hdr.b_state == arc_mru) {
        now = ddi_get_lbolt();
        ...

        /*
         * This buffer has been "accessed" only once so far,
         * but it is still in the cache. Move it to the MFU
         * state.
         */

        #define    ARC_MINTIME    (hz>>4) /* 62 ms */

        if (ddi_time_after(now, hdr->b_l1hdr.b_arc_access +
            ARC_MINTIME)) {
            /*
             * More than 125ms have passed since we
             * instantiated this buffer.  Move it to the
             * most frequently used state.
             */
            hdr->b_l1hdr.b_arc_access = now;
            arc_change_state(arc_mfu, hdr, hash_lock);

            // Note here, it is ddi_time_after here,
            // The arc buf is only moved to mfu after ARC_MINTIME
                                                ^^^^^

        }
        atomic_inc_32(&hdr->b_l1hdr.b_mru_hits);
    } else if (hdr->b_l1hdr.b_state == arc_mru_ghost) {
        arc_state_t    *new_state;


        /*
         * This buffer has been "accessed" recently, but
         * was evicted from the cache.  Move it to the
         * MFU state.
         */

        if (HDR_PREFETCH(hdr) || HDR_PRESCIENT_PREFETCH(hdr)) {
            ...
        } else {
            new_state = arc_mfu;
        }

        hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
        arc_change_state(new_state, hdr, hash_lock);

        atomic_inc_32(&hdr->b_l1hdr.b_mru_ghost_hits);
    } else if (hdr->b_l1hdr.b_state == arc_mfu) {

        /*
         * This buffer has been accessed more than once and is
         * still in the cache.  Keep it in the MFU state.
         *
         * NOTE: an add_reference() that occurred when we did
         * the arc_read() will have kicked this off the list.
         * If it was a prefetch, we will explicitly move it to
         * the head of the list now.
         */

        atomic_inc_32(&hdr->b_l1hdr.b_mfu_hits);
        hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
    } else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) {
        arc_state_t    *new_state = arc_mfu;

        /*
         * This buffer has been accessed more than once but has
         * been evicted from the cache.  Move it back to the
         * MFU state.
         */
>
        hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
        arc_change_state(new_state, hdr, hash_lock);
        atomic_inc_32(&hdr->b_l1hdr.b_mfu_ghost_hits);
    } else if (hdr->b_l1hdr.b_state == arc_l2c_only) {

        /*
         * This buffer is on the 2nd Level ARC.
         */

        hdr->b_l1hdr.b_arc_access = ddi_get_lbolt();
        arc_change_state(arc_mfu, hdr, hash_lock);
    } else {
        cmn_err(CE_PANIC, "invalid arc state 0x%p",
            hdr->b_l1hdr.b_state);
    }
---

Where to push the arc state machine forward ?

dbuf_hold_impl_arg
  -> arc_buf_access

This should be the hotest one.

arc_read
arc_write_done

Scrub and Resilvering

DTL

The DTL means 'Dirty Time Logging' which is based on the blkptr_t.blk_birth and the txg.

For each drive in a storage pool, ZFS keeps track of which transaction groups have been applied,
so if a drive is offline for a period of time the same birth time comparison used in replication
is used to identify what parts of the file system changes need to be applied to the drive when
it comes back online

The most important thing here is that the DTL will not be written on disk but kept in memory.
Look at the code,

when a new devie is added

spa_vdev_attach
---

    /*
     * Set newvd's DTL to [TXG_INITIAL, dtl_max_txg) so that we account
     * for any dmu_sync-ed blocks.  It will propagate upward when
     * spa_vdev_exit() calls vdev_dtl_reassess().
     */

    dtl_max_txg = txg + TXG_CONCURRENT_STATES;

    vdev_dtl_dirty(newvd, DTL_MISSING, TXG_INITIAL,
        dtl_max_txg - TXG_INITIAL);
---

IO runtime

zio_done
  -> vdev_stat_update
---
    if (spa->spa_load_state == SPA_LOAD_NONE &&
        type == ZIO_TYPE_WRITE && txg != 0 &&
        (!(flags & ZIO_FLAG_IO_REPAIR) ||
        (flags & ZIO_FLAG_SCAN_THREAD) ||
        spa->spa_claiming)) {

        /*
         * This is either a normal write (not a repair), or it's
         * a repair induced by the scrub thread, or it's a repair
         * made by zil_claim() during spa_load() in the first txg.
         * In the normal case, we commit the DTL change in the same
         * txg as the block was born.  In the scrub-induced repair
         * case, we know that scrubs run in first-pass syncing context,
         * so we commit the DTL change in spa_syncing_txg(spa).
         * In the zil_claim() case, we commit in spa_first_txg(spa).
         *
         * We currently do not make DTL entries for failed spontaneous
         * self-healing writes triggered by normal (non-scrubbing)
         * reads, because we have no transactional context in which to
         * do so -- and it's not clear that it'd be desirable anyway.
         */

        if (vd->vdev_ops->vdev_op_leaf) {
            uint64_t commit_txg = txg;
            if (flags & ZIO_FLAG_SCAN_THREAD) {
                vdev_dtl_dirty(vd, DTL_SCRUB, txg, 1);
                commit_txg = spa_syncing_txg(spa);
            } else if (spa->spa_claiming) {
                ASSERT(flags & ZIO_FLAG_IO_REPAIR);
                commit_txg = spa_first_txg(spa);
            }
            if (vdev_dtl_contains(vd, DTL_MISSING, txg, 1))
                return;
            for (pvd = vd; pvd != rvd; pvd = pvd->vdev_parent)
                vdev_dtl_dirty(pvd, DTL_PARTIAL, txg, 1);
            vdev_dirty(vd->vdev_top, VDD_DTL, vd, commit_txg);
        }
        if (vd != rvd)
            vdev_dtl_dirty(vd, DTL_MISSING, txg, 1);
    }

---

DSL_Scan

The core of the scan is dsl_scan_visitbp.

dsl_scan_visitbp
  -> dsl_scan_recurse

---

    if (BP_GET_LEVEL(bp) > 0) {

        arc_flags_t flags = ARC_FLAG_WAIT;
        int i;
        blkptr_t *cbp;
        int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;
        arc_buf_t *buf;

        err = arc_read(NULL, dp->dp_spa, bp, arc_getbuf_func, &buf,
            ZIO_PRIORITY_SCRUB, zio_flags, &flags, zb);
        if (err) {
            scn->scn_phys.scn_errors++;
            return (err);
        }
        for (i = 0, cbp = buf->b_data; i < epb; i++, cbp++) {
            zbookmark_phys_t czb;

            SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
                zb->zb_level - 1,
                zb->zb_blkid * epb + i);

            dsl_scan_visitbp(cbp, &czb, dnp,
                ds, scn, ostype, tx);

        }
        arc_buf_destroy(buf, &buf);


    } else if (BP_GET_TYPE(bp) == DMU_OT_DNODE) {

        arc_flags_t flags = ARC_FLAG_WAIT;
        dnode_phys_t *cdnp;
        int i;
        int epb = BP_GET_LSIZE(bp) >> DNODE_SHIFT;
        arc_buf_t *buf;
        ...
        err = arc_read(NULL, dp->dp_spa, bp, arc_getbuf_func, &buf,
            ZIO_PRIORITY_SCRUB, zio_flags, &flags, zb);
        if (err) {
            scn->scn_phys.scn_errors++;
            return (err);
        }
        for (i = 0, cdnp = buf->b_data; i < epb;
            i += cdnp->dn_extra_slots + 1,
            cdnp += cdnp->dn_extra_slots + 1) {

            dsl_scan_visitdnode(scn, ds, ostype,
                cdnp, zb->zb_blkid * epb + i, tx);

        }

        arc_buf_destroy(buf, &buf);
    } else if (BP_GET_TYPE(bp) == DMU_OT_OBJSET) {
        arc_flags_t flags = ARC_FLAG_WAIT;
        objset_phys_t *osp;
        arc_buf_t *buf;

        err = arc_read(NULL, dp->dp_spa, bp, arc_getbuf_func, &buf,
            ZIO_PRIORITY_SCRUB, zio_flags, &flags, zb);
        if (err) {
            scn->scn_phys.scn_errors++;
            return (err);
        }

        osp = buf->b_data;

        dsl_scan_visitdnode(scn, ds, osp->os_type,
            &osp->os_meta_dnode, DMU_META_DNODE_OBJECT, tx);
        ...
        arc_buf_destroy(buf, &buf);
    }
---

In one word, it will recurr into the tree and check every bp.

dsl_scan_visitbp
  -> scan_funcs[scn->scn_phys.scn_func](dp, bp, zb);
     dsl_scan_scrub_cb
    ---
    if (phys_birth <= scn->scn_phys.scn_min_txg ||
        phys_birth >= scn->scn_phys.scn_max_txg) {
        count_block(scn, dp->dp_blkstats, bp);
        return (0);
    }

    if (scn->scn_phys.scn_func == POOL_SCAN_SCRUB) {
        zio_flags |= ZIO_FLAG_SCRUB;
        needs_io = B_TRUE;
    } else {
        zio_flags |= ZIO_FLAG_RESILVER;
        needs_io = B_FALSE;
    }
    ...
    for (int d = 0; d < BP_GET_NDVAS(bp); d++) {
        const dva_t *dva = &bp->blk_dva[d];

        /*
         * Keep track of how much data we've examined so that
         * zpool(1M) status can make useful progress reports.
         */
        scn->scn_phys.scn_examined += DVA_GET_ASIZE(dva);
        spa->spa_scan_pass_exam += DVA_GET_ASIZE(dva);


        /* if it's a resilver, this may not be in the target range */

        if (!needs_io)
            needs_io = dsl_scan_need_resilver(spa, dva, psize,
                phys_birth);
    }

    if (needs_io && !zfs_no_scrub_io) {
        dsl_scan_enqueue(dp, bp, zio_flags, zb);
    } else {
        count_block(scn, dp->dp_blkstats, bp);
    }
    ---

dsl_scan_enqueue
---
    if (!dp->dp_scan->scn_is_sorted || BP_IS_GANG(bp)) {
        scan_exec_io(dp, bp, zio_flags, zb, NULL);
        return;
    }

    for (int i = 0; i < BP_GET_NDVAS(bp); i++) {
        dva_t dva;
        vdev_t *vdev;

        dva = bp->blk_dva[i];
        vdev = vdev_lookup_top(spa, DVA_GET_VDEV(&dva));

        mutex_enter(&vdev->vdev_scan_io_queue_lock);
        if (vdev->vdev_scan_io_queue == NULL)
            vdev->vdev_scan_io_queue = scan_io_queue_create(vdev);
        scan_io_queue_insert(vdev->vdev_scan_io_queue, bp,
            i, zio_flags, zb);
        mutex_exit(&vdev->vdev_scan_io_queue_lock);
    }
---

scan_exec_io
---
    if (queue == NULL) {
        mutex_enter(&spa->spa_scrub_lock);
        while (spa->spa_scrub_inflight >= scn->scn_maxinflight_bytes)
            cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
        spa->spa_scrub_inflight += BP_GET_PSIZE(bp);
        mutex_exit(&spa->spa_scrub_lock);
    } else {
        kmutex_t *q_lock = &queue->q_vd->vdev_scan_io_queue_lock;

        mutex_enter(q_lock);
        while (queue->q_inflight_bytes >= queue->q_maxinflight_bytes)
            cv_wait(&queue->q_zio_cv, q_lock);
        queue->q_inflight_bytes += BP_GET_PSIZE(bp);
        mutex_exit(q_lock);
    }

    count_block(scn, dp->dp_blkstats, bp);
    zio_nowait(zio_read(scn->scn_zio_root, spa, bp, data, size,
        dsl_scan_scrub_done, queue, ZIO_PRIORITY_SCRUB, zio_flags, zb));

---

Mirror

vdev_mirror_io_start --- if (zio->io_type == ZIO_TYPE_READ) { if (zio->io_bp != NULL && (zio->io_flags & ZIO_FLAG_SCRUB) && !mm->mm_resilvering) { /* * For scrubbing reads (if we can verify the * checksum here, as indicated by io_bp being * non-NULL) we need to allocate a read buffer for * each child and issue reads to all children. If * any child succeeds, it will copy its data into ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * zio->io_data in vdev_mirror_scrub_done. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ But do we need to copy for every good one ? */ for (c = 0; c < mm->mm_children; c++) { mc = &mm->mm_child[c]; zio_nowait(zio_vdev_child_io(zio, zio->io_bp, mc->mc_vd, mc->mc_offset, abd_alloc_sametype(zio->io_abd, zio->io_size), zio->io_size, zio->io_type, zio->io_priority, 0, vdev_mirror_scrub_done, mc)); } zio_execute(zio); return; } /* * For normal reads just pick one child. */ c = vdev_mirror_child_select(zio); children = (c >= 0); } --- The child zio will do the checksum verification. vdev_mirror_io_done --- if (good_copies && spa_writeable(zio->io_spa) && (unexpected_errors || (zio->io_flags & ZIO_FLAG_RESILVER) || ((zio->io_flags & ZIO_FLAG_SCRUB) && mm->mm_resilvering))) { /* * Use the good data we have in hand to repair damaged children. */ for (c = 0; c < mm->mm_children; c++) { /* * Don't rewrite known good children. * Not only is it unnecessary, it could * actually be harmful: if the system lost * power while rewriting the only good copy, * there would be no good copies left! */ mc = &mm->mm_child[c]; if (mc->mc_error == 0) { // For the scrub, it will read every child, the successful one // would have a 'mc_error == 0' and 'mc_tried = 1'. Only the failed one // will be repaired. // For the resilvering one, only one is readin, it will try to repair // the one which is missing in DTL table. if (mc->mc_tried) continue; /* * We didn't try this child. We need to * repair it if: * 1. it's a scrub (in which case we have * tried everything that was healthy) * - or - * 2. it's an indirect vdev (in which case * it could point to any other vdev, which * might have a bad DTL) * - or - * 3. the DTL indicates that this data is * missing from this vdev */ if (!(zio->io_flags & ZIO_FLAG_SCRUB) && mc->mc_vd->vdev_ops != &vdev_indirect_ops && !vdev_dtl_contains(mc->mc_vd, DTL_PARTIAL, zio->io_txg, 1)) continue; mc->mc_error = SET_ERROR(ESTALE); } zio_nowait(zio_vdev_child_io(zio, zio->io_bp, mc->mc_vd, mc->mc_offset, zio->io_abd, zio->io_size, ZIO_TYPE_WRITE, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_IO_REPAIR | (unexpected_errors ? ZIO_FLAG_SELF_HEAL : 0), NULL, NULL)); } } ---

Talking

We call this section as 'Talking' because we have not got a global view of
of the ZFS. So we have to just setup some small, separate and independent
sections here.

self healing

Mirror

There are two layers of vdev

              vdev of mirror
                  /  \
                 /    \
              vdev   vdev
              sda     sdb

vdev_mirror_io_start
---
    if (zio->io_type == ZIO_TYPE_READ) {
        ...
        /*
         * For normal reads just pick one child.
         */

        c = vdev_mirror_child_select(zio);
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        children = (c >= 0);

    } else {
        ASSERT(zio->io_type == ZIO_TYPE_WRITE);

        /*
         * Writes go to all children.
         */
        c = 0;
        children = mm->mm_children;
    }

    // send out the child IOs

    while (children--) {
        mc = &mm->mm_child[c];
        zio_nowait(zio_vdev_child_io(zio, zio->io_bp,
            mc->mc_vd, mc->mc_offset, zio->io_abd, zio->io_size,
            zio->io_type, zio->io_priority, 0,
            vdev_mirror_child_done, mc));
        c++;
    }

    zio_execute(zio);
---

vdev_mirror_child_select
---
    for (c = 0; c < mm->mm_children; c++) {
        mirror_child_t *mc;

        mc = &mm->mm_child[c];

        if (mc->mc_tried || mc->mc_skipped)

            continue;
        ...
    
        mc->mc_load = vdev_mirror_load(mm, mc->mc_vd, mc->mc_offset);
        if (mc->mc_load > lowest_load)
            continue;

        if (mc->mc_load < lowest_load) {
            lowest_load = mc->mc_load;
            mm->mm_preferred_cnt = 0;
        }
        mm->mm_preferred[mm->mm_preferred_cnt] = c;
        mm->mm_preferred_cnt++;
    }
---

Every child IO (multiple for write, one for read) would go through the zio_pipeline.

VDEV_IO_START -> VDEV_IO_DONE -> VDEV_IO_ASSESS -> CHECKSUM_VERIFY -> DONE

There are some special things about the child zio.

zio_vdev_child_io
---
    enum zio_stage pipeline = ZIO_VDEV_CHILD_PIPELINE;

    #define    ZIO_VDEV_IO_STAGES            \
        (ZIO_STAGE_VDEV_IO_START |        \
        ZIO_STAGE_VDEV_IO_DONE |        \
        ZIO_STAGE_VDEV_IO_ASSESS)

    #define    ZIO_VDEV_CHILD_PIPELINE            \
        (ZIO_VDEV_IO_STAGES |            \
        ZIO_STAGE_DONE)


    ...
    if (type == ZIO_TYPE_READ && bp != NULL) {
        /*

         * If we have the bp, then the child should perform the
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         * checksum and the parent need not.  This pushes error
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

         * detection as close to the leaves as possible and
         * eliminates redundant checksums in the interior nodes.
         */
        pipeline |= ZIO_STAGE_CHECKSUM_VERIFY;
        pio->io_pipeline &= ~ZIO_STAGE_CHECKSUM_VERIFY;
    }
    ...
    flags |= ZIO_VDEV_CHILD_FLAGS(pio);

    #define    ZIO_VDEV_CHILD_FLAGS(zio)                \
        (((zio)->io_flags & ZIO_FLAG_VDEV_INHERIT) |        \
        ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_CANFAIL)

---

When the zio_done is invoked for the child io of mirror,

zio_done
  -> zio->io_done
     vdev_mirror_child_done
     ---
        mc->mc_error = zio->io_error;
        mc->mc_tried = 1;
        mc->mc_skipped = 0;
     ---
  -> zio_notify_parent
  ---
    uint64_t *countp = &pio->io_children[zio->io_child_type][wait];
    ...
    mutex_enter(&pio->io_lock);
    ...
    (*countp)--;

    if (*countp == 0 && pio->io_stall == countp) {
        zio_taskq_type_t type =
            pio->io_stage < ZIO_STAGE_VDEV_IO_START ? ZIO_TASKQ_ISSUE :
            ZIO_TASKQ_INTERRUPT;
        pio->io_stall = NULL;
        mutex_exit(&pio->io_lock);

        if (next_to_executep != NULL && *next_to_executep == NULL) {

            *next_to_executep = pio;
            ^^^^^^^^^^^^^^^^^^^^^^^

        } else {
            zio_taskq_dispatch(pio, type, B_FALSE);
        }
    } 
  ---

zio_done of child zio would return its parent zio which will be executed next.

The zio_vdev_io_done will be invoked for mirror zio.
zio_vdev_io_done
  -> ops->vdev_op_io_done
     vdev_mirror_io_done
     ---
    for (c = 0; c < mm->mm_children; c++) {
        mc = &mm->mm_child[c];

        if (mc->mc_error) {
            if (!mc->mc_skipped)
                unexpected_errors++;
        } else if (mc->mc_tried) {
            good_copies++;
        }
    }

    ...

    /*
     * If we don't have a good copy yet, keep trying other children.
     */

    /* XXPOLICY */
    if (good_copies == 0 && (c = vdev_mirror_child_select(zio)) != -1) {
        ASSERT(c >= 0 && c < mm->mm_children);
        mc = &mm->mm_child[c];
        zio_vdev_io_redone(zio);
        zio_nowait(zio_vdev_child_io(zio, zio->io_bp,
            mc->mc_vd, mc->mc_offset, zio->io_abd, zio->io_size,
            ZIO_TYPE_READ, zio->io_priority, 0,
            vdev_mirror_child_done, mc));
        return;
    }
    ...
    if (good_copies && spa_writeable(zio->io_spa) &&
        (unexpected_errors ||
        (zio->io_flags & ZIO_FLAG_RESILVER) ||
        ((zio->io_flags & ZIO_FLAG_SCRUB) && mm->mm_resilvering))) {
        /*
         * Use the good data we have in hand to repair damaged children.
         */
        for (c = 0; c < mm->mm_children; c++) {
            /*
             * Don't rewrite known good children.
             * Not only is it unnecessary, it could
             * actually be harmful: if the system lost
             * power while rewriting the only good copy,
             * there would be no good copies left!
             */
            mc = &mm->mm_child[c];

            if (mc->mc_error == 0) {
                if (mc->mc_tried)
                    continue;
                ...
            }

            zio_nowait(zio_vdev_child_io(zio, zio->io_bp,
                mc->mc_vd, mc->mc_offset,
                zio->io_abd, zio->io_size,
                ZIO_TYPE_WRITE, ZIO_PRIORITY_ASYNC_WRITE,
                ZIO_FLAG_IO_REPAIR | (unexpected_errors ?
                ZIO_FLAG_SELF_HEAL : 0), NULL, NULL));
        }

In conclusion, the silent data corruption and self-healing is done in vdev mirror layer.
^^^^^^^^^^^^^^^^^ There is nothing to do with zio reexecute.

RAIDZ

One of the most important thing we should know is that

RAIDZ has dynamic stripe width

                 +--+--+--+--+--+
                 |P0|D0|D2|D4|D6|
                 +--+--+--+--+--+
                 |P1|D1|D3|D5|D7|
                 +--+--+--+--+--+
                 |P0|D1|D2|D3|P0|
                 +--+--+--+--+--+
                 |D1|D2|D3|P0|D0|
                 +--+--+--+--+--+
                 |P0|D0|D1|D2|D3|
                 +--+--+--+--+--+

 variable block size from 512 byte to 16M
 every logical block has its own stripe
 every write is a full stripe write

And the checksum is against block
So in RAIDZ, the checksum isn't checked on the child zio under it.

vdev_raidz_io_start
---
    for (c = rm->rm_cols - 1; c >= 0; c--) {
        rc = &rm->rm_col[c];
        cvd = vd->vdev_child[rc->rc_devidx];
        ...
        if (c >= rm->rm_firstdatacol || rm->rm_missingdata > 0 ||
            (zio->io_flags & (ZIO_FLAG_SCRUB | ZIO_FLAG_RESILVER))) {
            zio_nowait(zio_vdev_child_io(zio, NULL, cvd,

                                               /\
                                               ||
                                            The block pointer here is NULL

                rc->rc_offset, rc->rc_abd, rc->rc_size,
                zio->io_type, zio->io_priority, 0,
                vdev_raidz_child_done, rc));
        }
    }
---
zio_vdev_child_io
---
    if (type == ZIO_TYPE_READ && bp != NULL) {
        pipeline |= ZIO_STAGE_CHECKSUM_VERIFY;
        pio->io_pipeline &= ~ZIO_STAGE_CHECKSUM_VERIFY;
    }
---
Only the bp is provided, zfs check the checksum of the child zio. (Mirror is
that case)

The checksum is checked in vdev_raidz_io_done with raidz_checksum_verify.
If data errors occurred:

Try to reassemble the data from the parity available.
If we haven't yet read the parity drives, read them now.
If all parity drives have been read but the data still doesn't
reassemble with a correct checksum, then try combinatorial reconstruction.
If that doesn't work, return an error.

Let's look at the code segment that read in all of the columns and perform combinatorial reconstruction over all possible combinations.

vdev_raidz_io_done
---
    for (c = 0; c < rm->rm_cols; c++) {
        if (rm->rm_col[c].rc_tried)// updated by vdev_raidz_child_done
            continue;

        zio_vdev_io_redone(zio);
        do {
            rc = &rm->rm_col[c];
            if (rc->rc_tried)
                continue;
            zio_nowait(zio_vdev_child_io(zio, NULL,
                vd->vdev_child[rc->rc_devidx],
                rc->rc_offset, rc->rc_abd, rc->rc_size,
                zio->io_type, zio->io_priority, 0,
                vdev_raidz_child_done, rc));
        } while (++c < rm->rm_cols);

        return;
    }

    if (total_errors > rm->rm_firstdatacol) {
        zio->io_error = vdev_raidz_worst_error(rm);

    } else if (total_errors < rm->rm_firstdatacol &&

        (code = vdev_raidz_combrec(zio, total_errors, data_errors)) != 0) {

        if (code != (1 << rm->rm_firstdatacol) - 1)
            (void) raidz_parity_verify(zio, rm);
    }
---
vdev_raidz_combrec
---
            code = vdev_raidz_reconstruct(rm, tgts, n);
            if (raidz_checksum_verify(zio) == 0) {
---

If we get valid data after reconstruction attempts, vdev_raidz_io_done would try to repair the errors.

    if (zio->io_error == 0 && spa_writeable(zio->io_spa) &&
        (unexpected_errors || (zio->io_flags & ZIO_FLAG_RESILVER))) {
        /*
         * Use the good data we have in hand to repair damaged children.
         */
        for (c = 0; c < rm->rm_cols; c++) {
            rc = &rm->rm_col[c];
            cvd = vd->vdev_child[rc->rc_devidx];

            if (rc->rc_error == 0)
                continue;

            zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
                rc->rc_offset, rc->rc_abd, rc->rc_size,
                ZIO_TYPE_WRITE, ZIO_PRIORITY_ASYNC_WRITE,
                ZIO_FLAG_IO_REPAIR | (unexpected_errors ?
                ZIO_FLAG_SELF_HEAL : 0), NULL, NULL));
        }
    }

update_uberblock

The structure of the uberblock

struct uberblock {
    uint64_t    ub_magic;    /* UBERBLOCK_MAGIC        */
    uint64_t    ub_version;    /* SPA_VERSION            */
    uint64_t    ub_txg;        /* txg of last sync        */
    uint64_t    ub_guid_sum;    /* sum of all vdev guids    */
    uint64_t    ub_timestamp;    /* UTC time of last sync    */
    blkptr_t    ub_rootbp;    /* MOS objset_phys_t        */
    ...
    }

The MOS (Meta Object set) is unique around one pool.

The process of update uberblock

spa_sync_iterate_to_convergence
  -> dsl_pool_sync
    -> dsl_pool_sync_mos
     ---
    zio_t *zio = zio_root(dp->dp_spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED);
    dmu_objset_sync(dp->dp_meta_objset, zio, tx);
    VERIFY0(zio_wait(zio));
    dprintf_bp(&dp->dp_meta_rootbp, "meta objset rootbp is %s", "");
    spa_set_rootblkptr(dp->dp_spa, &dp->dp_meta_rootbp);
     ---

dmu_objset_sync
---
    zio = arc_write(pio, os->os_spa, tx->tx_txg,
        blkptr_copy, os->os_phys_buf, DMU_OS_IS_L2CACHEABLE(os),
        &zp, dmu_objset_write_ready, NULL, NULL, dmu_objset_write_done,
             ^^^^^^^^^^^^^^^^^^^^^^
        os, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);

---
dmu_objset_write_ready
---
    if (os->os_dsl_dataset != NULL)
        rrw_enter(&os->os_dsl_dataset->ds_bp_rwlock, RW_WRITER, FTAG);
    *os->os_rootbp = *bp;
    if (os->os_dsl_dataset != NULL)
        rrw_exit(&os->os_dsl_dataset->ds_bp_rwlock, FTAG);
---

This os_rootbp points to the dsl_pool_t.dp_meta_rootbp
dsl_pool_init
---
    err = dmu_objset_open_impl(spa, NULL, &dp->dp_meta_rootbp,
        &dp->dp_meta_objset);
---

dsl_pool_sync_mos will set this dp->dp_meta_rootbp to uberblock.
void
spa_set_rootblkptr(spa_t *spa, const blkptr_t *bp)
{
    spa->spa_uberblock.ub_rootbp = *bp;
}

The spa->spa_uberblock will be set to disk
spa_sync
  -> spa_sync_iterate_to_convergence
  -> spa_sync_rewrite_vdev_config
    -> vdev_config_sync
      -> vdev_label_sync_list
        -> vdev_uberblock_sync_list
        ---
        for (int v = 0; v < svdcount; v++)
            vdev_uberblock_sync(zio, &good_writes, ub, svd[v], flags);

        (void) zio_wait(zio);

        /*
         * Flush the uberblocks to disk.  This ensures that the odd labels
         * are no longer needed (because the new uberblocks and the even
         * labels are safely on disk), so it is safe to overwrite them.
         */
        zio = zio_root(spa, NULL, NULL, flags);

        for (int v = 0; v < svdcount; v++) {
            if (vdev_writeable(svd[v])) {
                zio_flush(zio, svd[v]);
            }
        }

        (void) zio_wait(zio);
        ---

vdev_uberblock_sync
---
    /* Copy the uberblock_t into the ABD */
    abd_t *ub_abd = abd_alloc_for_io(VDEV_UBERBLOCK_SIZE(vd), B_TRUE);
    abd_zero(ub_abd, VDEV_UBERBLOCK_SIZE(vd));
    abd_copy_from_buf(ub_abd, ub, sizeof (uberblock_t));

    for (int l = 0; l < VDEV_LABELS; l++)
        vdev_label_write(zio, vd, l, ub_abd,
            VDEV_UBERBLOCK_OFFSET(vd, n), VDEV_UBERBLOCK_SIZE(vd),
            vdev_uberblock_sync_done, good_writes,
            flags | ZIO_FLAG_DONT_PROPAGATE);
        -> 
        ---
        zio_nowait(zio_write_phys(zio, vd,
            vdev_label_offset(vd->vdev_psize, l, offset),
            size, buf, ZIO_CHECKSUM_LABEL, done, private,
            ZIO_PRIORITY_SYNC_WRITE, flags, B_TRUE));
        ---
---

dmu transaction quiescing

A quiescing txg doesn't accept new tx

All of its entering txs need to be completed before hand off a quiesced txg to sync thread

Next, let's look at how to do that.

dmu_tx_assign
  -> dmu_tx_try_assign
    -> txg_hold_open
    ---
    kpreempt_disable();
    tc = &tx->tx_cpu[CPU_SEQID];
    kpreempt_enable();

    mutex_enter(&tc->tc_open_lock);

    txg = tx->tx_open_txg;

    mutex_enter(&tc->tc_lock);
    tc->tc_count[txg & TXG_MASK]++;
    mutex_exit(&tc->tc_lock);
    ---

dmu_tx_commit
  -> txg_rele_to_sync
  ---
      tx_cpu_t *tc = th->th_cpu;
    int g = th->th_txg & TXG_MASK;

    mutex_enter(&tc->tc_lock);
    if (--tc->tc_count[g] == 0)
        cv_broadcast(&tc->tc_cv[g]);
    mutex_exit(&tc->tc_lock);
  ---

txg_quiesce_thread
  ---

        // we can only have one txg in "quiescing" or
        // "quiesced, waiting to sync" state.  So we wait until
        // the "quiesced, waiting to sync" txg has been consumed
        // by the sync thread.

        while (!tx->tx_exiting &&
            (tx->tx_open_txg >= tx->tx_quiesce_txg_waiting ||
            txg_has_quiesced_to_sync(dp)))
            txg_thread_wait(tx, &cpr, &tx->tx_quiesce_more_cv, 0);

        txg = tx->tx_open_txg;
        tx->tx_quiescing_txg = txg;

        mutex_exit(&tx->tx_sync_lock);

        txg_quiesce(dp, txg);

        mutex_enter(&tx->tx_sync_lock);

        /*
         * Hand this txg off to the sync thread.
         */
        tx->tx_quiescing_txg = 0;
        tx->tx_quiesced_txg = txg;
        cv_broadcast(&tx->tx_sync_more_cv); //Wake up the sync thread
  ---

txg_quiesce
---
    for (c = 0; c < max_ncpus; c++)
        mutex_enter(&tx->tx_cpu[c].tc_open_lock);

    tx->tx_open_txg++;

    tx->tx_open_time = tx_open_time = gethrtime();

    /*
     * Now that we've incremented tx_open_txg, we can let threads
     * enter the next transaction group.
     */

    for (c = 0; c < max_ncpus; c++)
        mutex_exit(&tx->tx_cpu[c].tc_open_lock);

    /*
     * Quiesce the transaction group by waiting for everyone to txg_exit().
     */
    for (c = 0; c < max_ncpus; c++) {
        tx_cpu_t *tc = &tx->tx_cpu[c];
        mutex_enter(&tc->tc_lock);
        while (tc->tc_count[g] != 0)
            cv_wait(&tc->tc_cv[g], &tc->tc_lock);
        mutex_exit(&tc->tc_lock);
    }
---

multiple DVAs of blkptr

Look at here

zio_write
---
    zio = zio_create(pio, spa, txg, bp, data, lsize, psize, done, private,
        ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
        ZIO_STAGE_OPEN, (flags & ZIO_FLAG_DDT_CHILD) ?
        ZIO_DDT_CHILD_WRITE_PIPELINE : ZIO_WRITE_PIPELINE);
---

zio_read
---
    zio = zio_create(pio, spa, BP_PHYSICAL_BIRTH(bp), bp,
        data, size, size, done, private,
        ZIO_TYPE_READ, priority, flags, NULL, 0, zb,
        ZIO_STAGE_OPEN, (flags & ZIO_FLAG_DDT_CHILD) ?
        ZIO_DDT_CHILD_READ_PIPELINE : ZIO_READ_PIPELINE);
---

The vdev and io_offset parameter are both zero.
And look at the bottom of the zfs io stack,

vdev_disk_io_start
---

    // the bio is issued here !!!

    error = __vdev_disk_physio(vd->vd_bdev, zio,
        zio->io_size, zio->io_offset, rw, flags);
---

Where is the vdev and io_offset set ?
The zfs code is really tricky.
Look at the ZIO_STAGE_VDEV_IO_START of zio_pipeline,

zio_vdev_io_start
---
    if (vd == NULL) {
        if (!(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
            spa_config_enter(spa, SCL_ZIO, zio, RW_READER);


        // The mirror_ops handle multiple DVAs in a single BP.
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        // This also includes the single DVA one

        vdev_mirror_ops.vdev_op_io_start(zio);
        return (NULL);
    }
---


vdev_mirror_io_start
  -> vdev_mirror_map_init
---
    vdev_t *vd = zio->io_vd;
    ...
    if (vd == NULL) {
        dva_t *dva = zio->io_bp->blk_dva;
        spa_t *spa = zio->io_spa;
        dsl_scan_t *scn = spa->spa_dsl_pool->dp_scan;
        dva_t dva_copy[SPA_DVAS_PER_BP];

        if ((zio->io_flags & ZIO_FLAG_SCRUB) &&
            !(zio->io_flags & ZIO_FLAG_IO_RETRY) &&
            dsl_scan_scrubbing(spa->spa_dsl_pool) &&
            scn->scn_is_sorted) {
            c = 1;
        } else {

            c = BP_GET_NDVAS(zio->io_bp);

        }
        ...
        mm = vdev_mirror_map_alloc(c, B_FALSE, B_TRUE);
        for (c = 0; c < mm->mm_children; c++) {
            mc = &mm->mm_child[c];

            mc->mc_vd = vdev_lookup_top(spa, DVA_GET_VDEV(&dva[c]));
            mc->mc_offset = DVA_GET_OFFSET(&dva[c]);
        }
    } 
---

Then vdev_mirror_io_start
---
    while (children--) {
        mc = &mm->mm_child[c];
        zio_nowait(zio_vdev_child_io(zio, zio->io_bp,
            mc->mc_vd, mc->mc_offset, zio->io_abd, zio->io_size,
            zio->io_type, zio->io_priority, 0,
            vdev_mirror_child_done, mc));
        c++;
    }

    zio_execute(zio);
---

Then if the underlying vdev is still mirror, it will enter vdev_mirror_io_start

vdev_mirror_io_start
  -> vdev_mirror_map_init

---
    if (vd == NULL) {
        ...
    } else {
        boolean_t replacing = (vd->vdev_ops == &vdev_replacing_ops ||
            vd->vdev_ops == &vdev_spare_ops) &&
            spa_load_state(vd->vdev_spa) == SPA_LOAD_NONE &&
            dsl_scan_resilvering(vd->vdev_spa->spa_dsl_pool);

        mm = vdev_mirror_map_alloc(vd->vdev_children, replacing,
            B_FALSE);
        for (c = 0; c < mm->mm_children; c++) {
            mc = &mm->mm_child[c];
            mc->mc_vd = vd->vdev_child[c];
            mc->mc_offset = zio->io_offset;
        }
    }
---

So all of the things look like this

          blkptr with multiple DVAs

          +----+   +----+
          :    :   :    :    dumy mirror layer for the multiple DVAs
          :    :   :    :    different position on the underlying vdev
          +----+   +----+
             \       /   
              +-----+
              |     |
              |     |        the real mirror vdev
              +-----+
             /       \
          +----+   +----+
          |    |   |    |    the physical disk
          +----+   +----+

raidz dynamic stripe layout


RAIDZ has dynamic stripe width

                 +--+--+--+--+--+
                 |P0|D0|D2|D4|D6|
                 +--+--+--+--+--+
                 |P1|D1|D3|D5|D7|
                 +--+--+--+--+--+
                 |P0|D1|D2|D3|P0|
                 +--+--+--+--+--+
                 |D1|D2|D3|P0|D0|
                 +--+--+--+--+--+
                 |P0|D0|D1|D2|D3|
                 +--+--+--+--+--+

dynamic stripe allocation of raidz is done in vdev_raidz_map_alloc

vdev_raidz_map_alloc

The code is a bit tricky. Let's look at the following example,

dcols     vd->vdev_children     4
nparity   vd->vdev_nparity      1

device physical block size is 4K

There is a IO is 20K

    s = 5

The 3 critical value of the dynamic stripe layout is
    q = s / (dcols - nparity)
    r = s - q * (dcols - nparity);
    bc = (r == 0 ? 0 : r + nparity);

    q = 1
    r = 2
    bc = 3


    (r must be less than (dcols - nparity))


The stripe layout is

         Parity   Data    Data    Data
          +--+    +--+    +--+    +--+
          |  |    |00|    |02|    |04|
          +--+    +--+    +--+    +--+
          |  |    |01|    |03|
          +--+    +--+    +--+

    +--+
    |00|    data block with offset in the IO
    +--+
    Parity  The parity is always at first, see here

Based on the following code segment
---
    for (c = 0; c < scols; c++) {
        col = f + c;
        coff = o;
        if (col >= dcols) {
            col -= dcols;
            coff += 1ULL << ashift;
        }
        rm->rm_col[c].rc_devidx = col;
        ...
        if (c >= acols)
            rm->rm_col[c].rc_size = 0;
        else if (c < bc)
            rm->rm_col[c].rc_size = (q + 1) << ashift;
        else
            rm->rm_col[c].rc_size = q << ashift;

        asize += rm->rm_col[c].rc_size;
    }
    ...

//parity is always at first

    for (c = 0; c < rm->rm_firstdatacol; c++)
        rm->rm_col[c].rc_abd =
            abd_alloc_linear(rm->rm_col[c].rc_size, B_FALSE);

    for (; c < acols; c++) {
        rm->rm_col[c].rc_abd = abd_get_offset_size(zio->io_abd, off,
            rm->rm_col[c].rc_size);
        off += rm->rm_col[c].rc_size;
    }
---


The only question is the IO is spread across the child vdev's based on the raidz
vdev->vdev_ashift which come from vdev_raidz_open
---
    for (c = 0; c < vd->vdev_children; c++) {
        cvd = vd->vdev_child[c];

        if (cvd->vdev_open_error != 0) {
            lasterror = cvd->vdev_open_error;
            numerrors++;
            continue;
        }

        *asize = MIN(*asize - 1, cvd->vdev_asize - 1) + 1;
        *max_asize = MIN(*max_asize - 1, cvd->vdev_max_asize - 1) + 1;
        *ashift = MAX(*ashift, cvd->vdev_ashift);
    }
---

It is the physical block size of the child vdev.
Is it too fine ?

Compression

Do compression with some overhead of cpu cycles could get following benefits

save disk space, this maybe a good thing for replicas ?

less IO, may get higher performance with fixed bandwith

zio_write_compress is responsible for compression.

Actually, it is a common part of the zio write pipeline,
except for the compression, it will do following things,

wait for all of children IO to ready stage

This is a very important point of constructing COWed tree of blkptr_t.
Because the parent zio needs the new blkptr_t of the children zio

    /*
     * If our children haven't all reached the ready stage,
     * wait for them and then repeat this pipeline stage.
     */
    if (zio_wait_for_children(zio, ZIO_CHILD_LOGICAL_BIT |
        ZIO_CHILD_GANG_BIT, ZIO_WAIT_READY)) {
        return (NULL);
    }

    if (zio->io_children_ready != NULL) {
        /*
         * Now that all our children are ready, run the callback
         * associated with this zio in case it wants to modify the
         * data to be written.
         */
        ASSERT3U(zp->zp_level, >, 0);
        zio->io_children_ready(zio);
    }

update some fields of blkptr_t

        BP_SET_LSIZE(bp, lsize);
        BP_SET_TYPE(bp, zp->zp_type);
        BP_SET_LEVEL(bp, zp->zp_level);
        BP_SET_PSIZE(bp, psize);
        BP_SET_COMPRESS(bp, compress);
        BP_SET_CHECKSUM(bp, zp->zp_checksum);
        BP_SET_DEDUP(bp, zp->zp_dedup);
        BP_SET_BYTEORDER(bp, ZFS_HOST_BYTEORDER);

In normal case,

       lsize = pisze = zio->io_size

Let's come back and focus on the compression

zio_write_compress
---
    if (compress != ZIO_COMPRESS_OFF &&
        !(zio->io_flags & ZIO_FLAG_RAW_COMPRESS)) {

        // cbuf is new buffer allocated for compressed data
        // lsize is the size of the original size
        // psize is the size of the compressed size

        void *cbuf = zio_buf_alloc(lsize);
        psize = zio_compress_data(compress, zio->io_abd, cbuf, lsize);


        if (psize == 0 || psize == lsize) {
            compress = ZIO_COMPRESS_OFF;
            zio_buf_free(cbuf, lsize);
        } else if (!zp->zp_dedup && !zp->zp_encrypt &&
            psize <= BPE_PAYLOAD_SIZE &&
            zp->zp_level == 0 && !DMU_OT_HAS_FILL(zp->zp_type) &&
            spa_feature_is_enabled(spa, SPA_FEATURE_EMBEDDED_DATA)) {


        // Normally, block pointers point (via their DVAs) to a block which holds data.
        // If the data that we need to store is very small, this is an inefficient
        // use of space, because a block must be at minimum 1 sector (typically 512
        // bytes or 4KB).  Additionally, reading these small blocks tends to generate
        // more random reads.
 
        // Embedded-data Block Pointers allow small pieces of data (the "payload",
        // up to 112 bytes) to be stored in the block pointer itself, instead of
        // being pointed to. 

            encode_embedded_bp_compressed(bp,
                cbuf, compress, lsize, psize);
            BPE_SET_ETYPE(bp, BP_EMBEDDED_TYPE_DATA);
            BP_SET_TYPE(bp, zio->io_prop.zp_type);
            BP_SET_LEVEL(bp, zio->io_prop.zp_level);
            zio_buf_free(cbuf, lsize);
            bp->blk_birth = zio->io_txg;
            zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
            return (zio);
        } else {
            /*
             * Round up compressed size up to the ashift
             * of the smallest-ashift device, and zero the tail.
             * This ensures that the compressed size of the BP
             * (and thus compressratio property) are correct,
             * in that we charge for the padding used to fill out
             * the last sector.
             */
            size_t rounded = (size_t)P2ROUNDUP(psize,
                1ULL << spa->spa_min_ashift);
            if (rounded >= lsize) {
                compress = ZIO_COMPRESS_OFF;
                zio_buf_free(cbuf, lsize);
                psize = lsize;
            } else {
                abd_t *cdata = abd_get_from_buf(cbuf, lsize);
                abd_take_ownership_of_buf(cdata, B_TRUE);
                abd_zero_off(cdata, psize, rounded - psize);

                psize = rounded;
                zio_push_transform(zio, cdata,
                    psize, lsize, NULL);

            }
        }
---

zio_push_transform is very important
---
    zio_transform_t *zt = kmem_alloc(sizeof (zio_transform_t), KM_SLEEP);

    zt->zt_orig_abd = zio->io_abd;
    zt->zt_orig_size = zio->io_size;
    zt->zt_bufsize = bufsize;
    zt->zt_transform = transform;

    zt->zt_next = zio->io_transform_stack;
    zio->io_transform_stack = zt;

    zio->io_abd = data;
    zio->io_size = size;
---

It saves the original io_abd into zio->io_transform_stack and
put the newly compressed data into the zio.

The original io_abd will be poped out in zio_done

Finally, zio_dva_allocate won't feel whether the zio is compressed.

zio_dva_allocate
---
    error = metaslab_alloc(spa, mc, zio->io_size, bp,
        zio->io_prop.zp_copies, zio->io_txg, NULL, flags,
        &zio->io_alloc_list, zio, zio->io_allocator);
---


The required size is the zio->io_size

The decompress process is plugged in read zio pipeline but very different with compress part.

zio_read_bp_init
---
    if (BP_GET_COMPRESS(bp) != ZIO_COMPRESS_OFF &&
        zio->io_child_type == ZIO_CHILD_LOGICAL &&

        !(zio->io_flags & ZIO_FLAG_RAW_COMPRESS)) {

        zio_push_transform(zio, abd_alloc_sametype(zio->io_abd, psize),
            psize, psize, zio_decompress);
    }
---

This ZIO_FLAG_RAW_COMPRESS will be discussed in Deferred Decompression
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Then

zio_done
  -> zio_pop_transforms
  ---
    while ((zt = zio->io_transform_stack) != NULL) {
        if (zt->zt_transform != NULL)
            zt->zt_transform(zio,
                zt->zt_orig_abd, zt->zt_orig_size);

        if (zt->zt_bufsize != 0)
            abd_free(zio->io_abd);

        zio->io_abd = zt->zt_orig_abd;
        zio->io_size = zt->zt_orig_size;
        zio->io_transform_stack = zt->zt_next;

        kmem_free(zt, sizeof (zio_transform_t));
    }

  ---

Deferred Decompression In this feature, the decompression is deferred util data is read from cache.
It sacrifices the cpu cycles to save the more limited system memory.
Look at the code

arc_read
---
        if (hdr == NULL) {
            /*
             * This block is not in the cache or it has
             * embedded data.
             */
            arc_buf_hdr_t *exists = NULL;
            arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
            hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,
                BP_IS_PROTECTED(bp), BP_GET_COMPRESS(bp), type,
                encrypted_read);
            ...
        }
        ...
        if (encrypted_read) {
            ...
        } else {
            size = arc_hdr_size(hdr);
            hdr_abd = hdr->b_l1hdr.b_pabd;

            if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) {
                zio_flags |= ZIO_FLAG_RAW_COMPRESS;
            }

        //arc_hdr_alloc will set the hdr compress with BP_GET_COMPRESS(bp)
        //in our scene, it should be set

            ...
        }
---

So it seems that the zio_done->zio_pop_transforms will not decompress the data.
When to do that ?

dbuf_read
---
    if (db->db_state == DB_CACHED) {
        spa_t *spa = dn->dn_objset->os_spa;
        ...

        /*
         * If the arc buf is compressed or encrypted and the caller
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         * requested uncompressed data, we need to untransform it
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
         * before returning. We also call arc_untransform() on any
         * unauthenticated blocks, which will verify their MAC if
         * the key is now available.
         */

        if (err == 0 && db->db_buf != NULL &&
            (flags & DB_RF_NO_DECRYPT) == 0 &&
            (arc_is_encrypted(db->db_buf) ||
            arc_is_unauthenticated(db->db_buf) ||

            arc_get_compression(db->db_buf) != ZIO_COMPRESS_OFF)) {

            err = arc_untransform(db->db_buf, spa, &zb, B_FALSE);
            dbuf_set_data(db, db->db_buf);
        }
        mutex_exit(&db->db_mtx);
    } 
---

arc_buf_fill --- boolean_t hdr_compressed = (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF); boolean_t compressed = (flags & ARC_FILL_COMPRESSED) != 0; ... // The data in the buf is what we want if (hdr_compressed == compressed) { if (!arc_buf_is_shared(buf)) { abd_copy_to_buf(buf->b_data, hdr->b_l1hdr.b_pabd, arc_buf_size(buf)); } } else { if (arc_buf_is_shared(buf)) { ... } else if (ARC_BUF_COMPRESSED(buf)) { /* We need to reallocate the buf's b_data */ arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr), buf); buf->b_data = arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf); } buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED; /* * Try copying the data from another buf which already has a * decompressed version. If that's not possible, it's time to * bite the bullet and decompress the data from the hdr. */ if (arc_buf_try_copy_decompressed_data(buf)) { return (0); } else { error = zio_decompress_data(HDR_GET_COMPRESS(hdr), hdr->b_l1hdr.b_pabd, buf->b_data, HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr)); ... } } ---

prefetch

There are two level prefetch in zfs stack

File level prefetch

Vdev level prefetch

This prefetch mechanism works during the zio pipeline

zio_vdev_io_start
---
    if (vd->vdev_ops->vdev_op_leaf && (zio->io_type == ZIO_TYPE_READ ||
        zio->io_type == ZIO_TYPE_WRITE || zio->io_type == ZIO_TYPE_TRIM)) {

        if (zio->io_type == ZIO_TYPE_READ && vdev_cache_read(zio))
            return (zio);
        ...
    }
---

vdev_cache_read
---
    mutex_enter(&vc->vc_lock);

    ve_search = kmem_alloc(sizeof (vdev_cache_entry_t), KM_SLEEP);
    ve_search->ve_offset = cache_offset;
    ve = avl_find(&vc->vc_offset_tree, ve_search, NULL);
    kmem_free(ve_search, sizeof (vdev_cache_entry_t));

    if (ve != NULL) {
        if (ve->ve_missed_update) {
            mutex_exit(&vc->vc_lock);
            return (B_FALSE);
        }

        if ((fio = ve->ve_fill_io) != NULL) {
            zio_vdev_io_bypass(zio);
            zio_add_child(zio, fio);
            mutex_exit(&vc->vc_lock);
            VDCSTAT_BUMP(vdc_stat_delegations);
            return (B_TRUE);
        }

        vdev_cache_hit(vc, ve, zio);
        zio_vdev_io_bypass(zio);

        mutex_exit(&vc->vc_lock);
        VDCSTAT_BUMP(vdc_stat_hits);
        return (B_TRUE);
    }

    ve = vdev_cache_allocate(zio);

    if (ve == NULL) {
        mutex_exit(&vc->vc_lock);
        return (B_FALSE);
    }

    fio = zio_vdev_delegated_io(zio->io_vd, cache_offset,
        ve->ve_abd, VCBS, ZIO_TYPE_READ, ZIO_PRIORITY_NOW,
        ZIO_FLAG_DONT_CACHE, vdev_cache_fill, ve);

    ve->ve_fill_io = fio;
    zio_vdev_io_bypass(zio);
    zio_add_child(zio, fio);

    mutex_exit(&vc->vc_lock);
    zio_nowait(fio);
    VDCSTAT_BUMP(vdc_stat_misses);

    return (B_TRUE);

---

deduplication table

dedup is a process of eliminating duplicate data copies.

            [ block X ]    [ block Y ]    [ block Z ]
                   \            |            /
                    \           |           /
                     
                          [ block ON DISK ]

zfs ddt write

Look at the pipeline of a standard zio ddt write

#define    ZIO_DDT_WRITE_PIPELINE            \
    (ZIO_INTERLOCK_STAGES |            \
    ZIO_STAGE_WRITE_BP_INIT |        \
    ZIO_STAGE_ISSUE_ASYNC |            \
    ZIO_STAGE_WRITE_COMPRESS |        \
    ZIO_STAGE_ENCRYPT |            \
    ZIO_STAGE_CHECKSUM_GENERATE |        \
    ZIO_STAGE_DDT_WRITE)


NOTE:
     There is no ZIO_VDEV_IO_STAGES which includes IO_START and IO_DONE
                                                   ^^^^^^^^     ^^^^^^^

This seems to indicate that if one write hit the ddt, it will not trigger any IO.
We will look into the code to prove this.

Another important thing is that
static zio_pipe_stage_t *zio_pipeline[] = {
    NULL,
    zio_read_bp_init,
    zio_write_bp_init,
    zio_free_bp_init,
    zio_issue_async,
    zio_write_compress,
    zio_encrypt,

    zio_checksum_generate,

    zio_nop_write,
    zio_ddt_read_start,
    zio_ddt_read_done,

    zio_ddt_write,

    ...
    }

zio_ddt_write is after the zio_checksum_generate in the zio pipeline because the
zfs ddt need the checksum to identify the duplicated data.

the checksum should be strong enough to dedup without verification (compare byte
by byte), see ZCHECKSUM_FLAG_DEDUP

Let's look at the code of zfs_ddt_write

zfs_ddt_write
---
    blkptr_t *bp = zio->io_bp;
    int p = zp->zp_copies;
    ddt_t *ddt = ddt_select(spa, bp);

    ddt_enter(ddt); //A mutex lock
    dde = ddt_lookup(ddt, bp, B_TRUE);

      -> ddt_key_fill(&dde_search.dde_key, bp);
         ---
            ddk->ddk_cksum = bp->blk_cksum;
            ddk->ddk_prop = 0;

            DDK_SET_LSIZE(ddk, BP_GET_LSIZE(bp));
            DDK_SET_PSIZE(ddk, BP_GET_PSIZE(bp));
            DDK_SET_COMPRESS(ddk, BP_GET_COMPRESS(bp));
            DDK_SET_CRYPT(ddk, BP_USES_CRYPT(bp));
         ---
      -> dde = avl_find(&ddt->ddt_tree, &dde_search, &where);

    ddp = &dde->dde_phys[p];
    ...
    if (ddp->ddp_phys_birth != 0 || dde->dde_lead_zio[p] != NULL) {
        if (ddp->ddp_phys_birth != 0)

            // ddt hit, fill the zio->io_bp
            // actually, at this moment, the write io is deemed to be on disk.


            ddt_bp_fill(ddp, bp, txg);


        // The duplicate block IO is ongoing
        // Take it as our child IO, then we will be notified after it is completed.

        if (dde->dde_lead_zio[p] != NULL)
            zio_add_child(zio, dde->dde_lead_zio[p]);
        else
            ddt_phys_addref(ddp);
    } else if (zio->io_bp_override) {
        ...
    } else {

        // ddt miss, we need to issue a IO to disk.

        cio = zio_write(zio, spa, txg, bp, zio->io_orig_abd,
            zio->io_orig_size, zio->io_orig_size, zp,
            zio_ddt_child_write_ready, NULL, NULL,
            zio_ddt_child_write_done, dde, zio->io_priority,
            ZIO_DDT_CHILD_FLAGS(zio), &zio->io_bookmark);

        zio_push_transform(cio, zio->io_abd, zio->io_size, 0, NULL);

        dde->dde_lead_zio[p] = cio;

    }

    ddt_exit(ddt);

    if (cio)
        zio_nowait(cio);
    if (dio)
        zio_nowait(dio);

    return (zio);

---

write throttle

The dirty data in fly is limited by the write throttle mechanism.

dmu_tx_assign
---
    while ((err = dmu_tx_try_assign(tx, txg_how)) != 0) {
        dmu_tx_unassign(tx);

        if (err != ERESTART || !(txg_how & TXG_WAIT))
            return (err);

        dmu_tx_wait(tx);
    }
---

dmu_tx_try_assign
---
    if (!tx->tx_dirty_delayed &&
        dsl_pool_need_dirty_delay(tx->tx_pool)) {
        tx->tx_wait_dirty = B_TRUE;
        DMU_TX_STAT_BUMP(dmu_tx_dirty_delay);
        return (SET_ERROR(ERESTART));
    }
---

dsl_pool_need_dirty_delay
---
    uint64_t delay_min_bytes =
        zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100;
    uint64_t dirty_min_bytes =
        zfs_dirty_data_max * zfs_dirty_data_sync_percent / 100;
    boolean_t rv;

    mutex_enter(&dp->dp_lock);
    if (dp->dp_dirty_total > dirty_min_bytes)
        txg_kick(dp);
    rv = (dp->dp_dirty_total > delay_min_bytes);
    mutex_exit(&dp->dp_lock);
---

The dsl_pool.dp_dirty_total is updated in following path
dbuf_dirty
  -> dmu_objset_willuse_space
    -> dsl_pool_dirty_space
      -> dsl_pool_dirty_delta
      ---
    dp->dp_dirty_total += delta;


    /*
     * Note: we signal even when increasing dp_dirty_total.
     * This ensures forward progress -- each thread wakes the next waiter.
     */

    if (dp->dp_dirty_total < zfs_dirty_data_max)
        cv_signal(&dp->dp_spaceavail_cv);

      ---
Who would decrease the dsl_pool->dp_dirty_total ?

 dbuf_write_physdone
    dbuf_write_physdone
      -> dsl_pool_undirty_space
        -> dsl_pool_dirty_delta(dp, -space);

 dsl_pool_sync

    /*
     * We have written all of the accounted dirty data, so our
     * dp_space_towrite should now be zero.  However, some seldom-used
     * code paths do not adhere to this (e.g. dbuf_undirty(), also
     * rounding error in dbuf_write_physdone).
     * Shore up the accounting of any dirtied space now.
     */

    dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);





Look at the dmu_tx_wait
---
    if (tx->tx_wait_dirty) {
        uint64_t dirty;

        /*
         * dmu_tx_try_assign() has determined that we need to wait
         * because we've consumed much or all of the dirty buffer
         * space.
         */

        mutex_enter(&dp->dp_lock);
        if (dp->dp_dirty_total >= zfs_dirty_data_max)
            DMU_TX_STAT_BUMP(dmu_tx_dirty_over_max);

        // A hard limit, dsl_pool_dirty_delta will notify us

        while (dp->dp_dirty_total >= zfs_dirty_data_max)
            cv_wait(&dp->dp_spaceavail_cv, &dp->dp_lock);
        dirty = dp->dp_dirty_total;
        mutex_exit(&dp->dp_lock);

        dmu_tx_delay(tx, dirty);

        tx->tx_wait_dirty = B_FALSE;

        /*
         * Note: setting tx_dirty_delayed only has effect if the
         * caller used TX_WAIT.  Otherwise they are going to
         * destroy this tx and try again.  The common case,
         * zfs_write(), uses TX_WAIT.
         */
        tx->tx_dirty_delayed = B_TRUE;
    }...
---

dmu_tx_delay is used to limit the incoming writes when the backend storage
cannot accommodate.

dmu_tx_delay(dmu_tx_t *tx, uint64_t dirty)
{
    dsl_pool_t *dp = tx->tx_pool;

    uint64_t delay_min_bytes =
        zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100;

    hrtime_t wakeup, min_tx_time, now;

    if (dirty <= delay_min_bytes)
        return;

    now = gethrtime();

    min_tx_time = zfs_delay_scale *
        (dirty - delay_min_bytes) / (zfs_dirty_data_max - dirty);
    min_tx_time = MIN(min_tx_time, zfs_delay_max_ns);


    // The closer we are to the zfs_dirty_data_max, the longer will'll wait.

    if (now > tx->tx_start + min_tx_time)
        return;

    mutex_enter(&dp->dp_lock);
    wakeup = MAX(tx->tx_start + min_tx_time,
        dp->dp_last_wakeup + min_tx_time);
    dp->dp_last_wakeup = wakeup;
    mutex_exit(&dp->dp_lock);

    zfs_sleep_until(wakeup);
}

More detailed, please refer to the comment above the dmu_tx_delay.

Note that the value of zfs_dirty_data_max is relevant when sizing a separate intent log device (SLOG).
zfs_dirty_data_max puts a hard limit on the amount of data in memory that has yet been written to the
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
main pool; at most, that much data is active on the SLOG at any given time. This is why small, fast
^^^^^^^^^           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
devices such as the DDRDrive make for great log devices.

dynamically stripes

ZFS dynamically stripes data across all top-level virtual devices.

The decision about where to place data is done at write time, so no
fixed-width stripes are created at allocation time.

When new virtual devices are added to a pool, ZFS gradually allocates
data to the new device in order to maintain performance and disk space
allocation policies. Each virtual device can also be a mirror or a RAID-Z
device that contains other disk devices or files. This configuration gives
you flexibility in controlling the fault characteristics of your pool.
For example, you could create the following configurations out of four disks:

 Four disks using dynamic striping
 One four-way RAID-Z configuration
 Two two-way mirrors using dynamic striping

Although ZFS supports combining different types of virtual devices within
the same pool, avoid this practice. For example, you can create a pool with
a two-way mirror and a three-way RAID-Z configuration. However, your fault
tolerance is as good as your worst virtual device, RAID-Z in this case.
A best practice is to use top-level virtual devices of the same type with
the same redundancy level in each device.

The layout of vdevs of a pool with 4 two-way RAID-Z is

                                           root vdev
     _________________________________________^_______________________________________
    /                                                                                 \

        raidz1-0              raidz1-1                raidz1-2              raidz1-3      
    _______^________      _______^________        _______^________      _______^________    
   /                \    /                \      /                \    /                \   
   disk0 disk1 disk2     disk3 disk4 disk5       disk6 disk7 disk8     disk9 disk10 disk11    


construct_spec
spa_config_parse

The raidz1-0 raidz1-1 raidz1-2 raid1-3 are the top level vdevs.

Where does the code of dynamically stripe work?

The anwser is metaslab

Look at following comment,

A metaslab class encompasses a category of allocatable top-level vdevs.
Each top-level vdev is associated with a metaslab group which defines
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
the allocatable region for that vdev. Examples of these categories include
"normal" for data block allocations (i.e. main pool allocations) or "log"
for allocations designated for intent log devices (i.e. slog devices).
When a block allocation is requested from the SPA it is associated with a
metaslab_class_t, and only top-level vdevs (i.e. metaslab groups) belonging
to the class can be used to satisfy that request. Allocations are done
by traversing the metaslab groups that are linked off of the mc_rotor field.
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This rotor points to the next metaslab group where allocations will be
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
attempted. 
^^^^^^^^^^

Allocating a block is a 3 step process:
-- select the metaslab group
-- select the metaslab
-- allocate the block
The metaslab class defines the low-level block allocator that will be used
as the final step in allocation. These allocators are pluggable allowing each
class to use a block allocator that best suits that class.

metaslab_alloc_dva
---
    if (hintdva) {
        ...
    } else if (d != 0) {
        vd = vdev_lookup_top(spa, DVA_GET_VDEV(&dva[d - 1]));
        mg = vd->vdev_mg->mg_next;
    } else if (flags & METASLAB_FASTWRITE) {
        ...
    } else {
        mg = mc->mc_rotor;
    }

    rotor = mg;
top:
    do {
        boolean_t allocatable;

        vd = mg->mg_vd;
        ...
        uint64_t asize = vdev_psize_to_asize(vd, psize);
        uint64_t offset = metaslab_group_alloc(mg, zal, asize, txg,
            !try_hard, dva, d, allocator);

        if (offset != -1ULL) {
            ...
            if ((flags & METASLAB_FASTWRITE) ||
                atomic_add_64_nv(&mc->mc_aliquot, asize) >=
                mg->mg_aliquot + mg->mg_bias) {

                mc->mc_rotor = mg->mg_next;

                mc->mc_aliquot = 0;
            }
            ...
            DVA_SET_VDEV(&dva[d], vd->vdev_id);
            DVA_SET_OFFSET(&dva[d], offset);
            DVA_SET_GANG(&dva[d],
                ((flags & METASLAB_GANG_HEADER) ? 1 : 0));
            DVA_SET_ASIZE(&dva[d], asize);

            return (0);
        }
next:
        mc->mc_rotor = mg->mg_next;
        mc->mc_aliquot = 0;
    } while ((mg = mg->mg_next) != rotor);


---