Block Basis

concepts
blk-mq

Block legacy plug
BIO Merge
FLUSH and FUA
Queue state flags
WBT
blkdev gendisk hd
blk sysfs
request_queue cleanup and release
blk_integrity
blk loop blk-stats
blk-timeout
blk-throttle bsg
direct_IO
blk RPM
blk and hardware DISCARD

    concepts


    EIO is fatal for fs

    Whether EIO is fatal or not depends on the component that receives it,
    and each component behaves accordingly. If a file system encounters an EIO
    error during normal I/O (no metadata updates are involved), the error is
    bubbled back to user space. There, the userspace application can choose how
    to behave: it can resubmit the I/O if possible, or abort if the I/O is part
    of its own recovery.
    
    If the EIO error is returned during a journal update (a metadata update),
    the file system has 2 choices: 1) remount the FS read-only or 2) crash the
    node. If the FS is in single-user mode, it can go read-only; if it is in
    clustered mode, it has to evict itself, hoping that at least the other
    nodes can continue.
    So avoid IO errors as much as possible.
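
    As a hedged user-space illustration of that choice (the file name, open flags and
    policy below are made up for this sketch, not taken from any particular application):

    /* Userspace sketch: how an application might react to EIO from write()/fsync(). */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "payload";
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf) || fsync(fd) != 0) {
            if (errno == EIO) {
                /* The block layer gave up on the request.  The application can
                 * resubmit later, switch to a spare device, or abort if this
                 * write was part of its own recovery. */
                fprintf(stderr, "EIO: data may not be on stable storage\n");
            } else {
                perror("write/fsync");
            }
            close(fd);
            return 1;
        }

        close(fd);
        return 0;
    }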
    

    blk-mq


    sbitmap


    There are two parts in a sbitmap_queue: the bitmap itself (struct sbitmap) and the wait queues.

    wait queue

    The core idea of the sbitmap_queue wait queues is 'batch' and 'scatter'.
    scatter

    Each caller of sbq_wait_ptr owns its own wait_index.
    static inline struct sbq_wait_state *sbq_wait_ptr(struct sbitmap_queue *sbq,
                              atomic_t *wait_index)
    {
        struct sbq_wait_state *ws;
    
        ws = &sbq->ws[atomic_read(wait_index)];
    
        /* the wait_index is advanced every time */
        sbq_index_atomic_inc(wait_index);
    
        return ws;
    }
    Every time the caller gets a sbq_wait_state, its wait_index is incremented by 1.
    Take blk_mq_get_request as an example: when multiple tasks try to allocate a tag
    and all of them fail, each of them will get a wait queue and sleep on it.
    sbq_wait_ptr ensures they get different wait queues, so there is no contention
    when the wait entries are added to the wait queues.
    
    We can check this in /sys/kernel/debug/block/nvme0n1/hctx0/tags  (driver tags):
    wake_index=0
    ws={
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
    }
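
    A minimal user-space sketch of the 'scatter' idea (a standalone model, not the
    kernel structures; the 8 queues match SBQ_WAIT_QUEUES): because every owner of a
    wait_index advances it on each use, successive waiters land on different wait queues.

    /* Userspace model of sbq_wait_ptr()'s round-robin selection. */
    #include <stdatomic.h>
    #include <stdio.h>

    #define SBQ_WAIT_QUEUES 8

    struct sbq_wait_state { int id; };

    static struct sbq_wait_state ws[SBQ_WAIT_QUEUES];

    /* pick a wait queue and advance the owner's index, like sbq_wait_ptr() */
    static struct sbq_wait_state *wait_ptr(atomic_int *wait_index)
    {
        struct sbq_wait_state *w = &ws[atomic_load(wait_index) % SBQ_WAIT_QUEUES];

        atomic_fetch_add(wait_index, 1);
        return w;
    }

    int main(void)
    {
        /* each owner (a hctx in blk-mq) embeds its own wait_index */
        atomic_int owner_a = 0, owner_b = 3;
        int i;

        for (i = 0; i < SBQ_WAIT_QUEUES; i++)
            ws[i].id = i;

        for (i = 0; i < 4; i++)
            printf("owner A sleeps on ws[%d], owner B sleeps on ws[%d]\n",
                   wait_ptr(&owner_a)->id, wait_ptr(&owner_b)->id);
        return 0;
    }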
    
    
    batch
    static void sbq_wake_up(struct sbitmap_queue *sbq)
    {
        ...
        ws = sbq_wake_ptr(sbq);
        if (!ws)
            return;
    
        wait_cnt = atomic_dec_return(&ws->wait_cnt);
        if (wait_cnt <= 0) {
            wake_batch = READ_ONCE(sbq->wake_batch);
            smp_mb__before_atomic();
            atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wait_cnt + wake_batch);
            sbq_index_atomic_inc(&sbq->wake_index);
            wake_up_nr(&ws->wait, wake_batch);
        }
    }
    
    wake_index=0
    ws={
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},         only one wait queue is woken up each time a wait_cnt is exhausted
    }
    
    Does the wake_batch introduce delay on a high speed device ?
    
    There is an interesting bug about wake_batch.
    The wake_batch is calculated based on the sbitmap_queue depth, which is actually
    the tagset depth.
    But the runtime depth can be reduced by shallow_depth and the
    .limit_depth callback.
    
    BFQ can end up limiting shallow_depth to something lower than the wake batch
    sizing for sbitmap; then we can run into cases where we never wake up
    folks waiting for a tag. The end result is an idle system with no IO pending,
    but with tasks waiting for a tag and no one to wake them up, because wait_cnt
    never reaches zero (see the sketch below).
    
    Kyber could run into the same issue, if the async depth is limited low enough.
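
    A user-space sketch of that failure mode (a simplified model with made-up numbers,
    not the sbitmap code): with shallow_depth smaller than wake_batch, the completions
    of the few in-flight requests can never bring wait_cnt down to 0, so wake_up_nr()
    is never called and the sleepers are stranded.

    /* Userspace model: the batch only fires if wake_batch tags can be completed. */
    #include <stdio.h>

    int main(void)
    {
        int wake_batch = 8;            /* derived from the full sbitmap depth      */
        int shallow_depth = 4;         /* runtime limit imposed by BFQ/Kyber       */

        int wait_cnt = wake_batch;     /* ws->wait_cnt starts at wake_batch        */
        int in_flight = shallow_depth; /* allocation stopped at shallow_depth      */

        /* every completion decrements wait_cnt, exactly like sbq_wake_up() */
        while (in_flight > 0) {
            in_flight--;
            wait_cnt--;
        }

        if (wait_cnt > 0)
            printf("wait_cnt stuck at %d: nobody calls wake_up_nr(), waiters hang\n",
                   wait_cnt);
        else
            printf("batch reached, waiters are woken\n");
        return 0;
    }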
    

    tag


    There are two types of tags: scheduler tags and driver tags.

    In the commit message of the commit which added the MQ-capable IO scheduler framework (bd166ef), Jens Axboe said:
    We split driver and scheduler tags, so we can run the scheduling independently of device queue depth.
    
      sched tags  sched tags  sched tags  sched tags
      
      Queue0      Queue1      Queue2      Queue3
    
                   shared driver tags
    
               HBA cmd queue [C][C][C][C]
    
       LUN0        LUN1       LUN2        LUN3
    
    

    tag allocation

    blk_mq_get_tag is used to allocate a tag.
    The following points need to be noted:

    If the tags are used up, there are mainly two ways to wait for a tag.

    tag sharing

    One HBA can connect to multiple LUs; every LU has a request_queue, and all of these request_queues share the tagset of the HBA.
    From the view of the scsi source code:

    scsi_alloc_sdev
      -> scsi_mq_alloc_queue
    ---
        sdev->request_queue = blk_mq_init_queue(&sdev->host->tag_set);
    
        /* all of the scsi devs (LUs) share the same tagset of the host (HBA) */
    
        if (IS_ERR(sdev->request_queue))
            return NULL;
    
        sdev->request_queue->queuedata = sdev;
        __scsi_init_queue(sdev->host, sdev->request_queue);
        blk_queue_flag_set(QUEUE_FLAG_SCSI_PASSTHROUGH, sdev->request_queue);
        return sdev->request_queue;
    ---
    
    For shared tag users, we track the number of currently active users and attempt to provide a fair share of the tag depth for each of them.
    blk_mq_get_request/blk_mq_get_driver_tag
      -> blk_mq_get_tag
        -> __blk_mq_get_tag
          -> hctx_may_queue
    static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
                      struct sbitmap_queue *bt)
    {
        unsigned int depth, users;
    
        if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
            return true;
        if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
            return true;
    
        /*
         * Don't try dividing an ant
         */
        if (bt->sb.depth == 1)
            return true;
    
        users = atomic_read(&hctx->tags->active_queues);
        if (!users)
            return true;
    
        /*
         * Allow at least some tags
         */
    
        depth = max((bt->sb.depth + users - 1) / users, 4U);
    
        return atomic_read(&hctx->nr_active) < depth;
    }
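
    A quick user-space check of the fair-share formula above (pure arithmetic; the
    depth of 62 tags is just an example): each active queue gets roughly depth/users
    tags, but never fewer than 4.

    /* depth = max((bt->sb.depth + users - 1) / users, 4U) from hctx_may_queue() */
    #include <stdio.h>

    static unsigned int fair_share(unsigned int depth, unsigned int users)
    {
        unsigned int d = (depth + users - 1) / users;

        return d < 4 ? 4 : d;
    }

    int main(void)
    {
        unsigned int depth = 62;    /* e.g. a SCSI host with 62 tags */
        unsigned int users;

        for (users = 1; users <= 32; users *= 2)
            printf("%2u active queues -> %2u tags each\n",
                   users, fair_share(depth, users));
        return 0;
    }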
    
    There are two key points here. The first: where are the queues marked active ?
    blk_mq_rq_ctx_init
    ---
        if (data->flags & BLK_MQ_REQ_INTERNAL) {
            rq->tag = -1;
            rq->internal_tag = tag;
        } else {
    
            if (blk_mq_tag_busy(data->hctx)) {
                rq_flags = RQF_MQ_INFLIGHT;
                atomic_inc(&data->hctx->nr_active);
            }
    
            rq->tag = tag;
            rq->internal_tag = -1;
            data->hctx->tags->rqs[rq->tag] = rq;
        }
    ---
    blk_mq_get_driver_tag
    ---
        rq->tag = blk_mq_get_tag(&data);
        if (rq->tag >= 0) {
            if (blk_mq_tag_busy(data.hctx)) {
                rq->rq_flags |= RQF_MQ_INFLIGHT;
                atomic_inc(&data.hctx->nr_active);
            }
            data.hctx->tags->rqs[rq->tag] = rq;
        }
    ---
    blk_mq_tag_busy
      -> __blk_mq_tag_busy
      ---
        if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
            !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
            atomic_inc(&hctx->tags->active_queues);
      ---
    
    The second: when are they deactivated ? An interesting question:
    BLK-MQ
            q of LUN0  q of LUN1   q of LUN2   q of LUN3
                                             
            hctx       hctx        hctx        hctx
    
            active     active      active      inactive
    
                          driver tags
    ------------------------------------------------------
    LLDD    
                             HBA
    All the driver tags have been used up by the 3 active queues.
    At this moment we submit a bio to the inactive queue of LUN3; it cannot get a driver tag,
    so the req is queued on the hctx->dispatch list.
    When will this hctx of LUN3 be woken up ?
    
    blk_mq_mark_tag_wait will put this hctx of LUN3 on the shared tags' wait queue.
    When a driver tag is freed, the waiters on the tags' wait queues are woken up
    in round-robin fashion.
    The active_queues count of the shared tags has changed, so reqs to LUN0/1/2 have
    to wait for their budget even if the hctxs of LUN0/1/2 are woken up prior to LUN3's.
    
    

    blk-mq io scheduler


    Here is part of the discussion of IO scheduling for blk-mq from the paper [Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems]:

    While global sequential re-ordering is still possible across the multiple
    software queues, it is only necessary for HDD based devices, where the additional
    latency and locking overhead required to achieve total ordering does not hurt IOPS
    performance. It can be argued that, for many users, it is no longer necessary to
    employ advanced fairness scheduling as the speed of the devices are often
    exceeding the ability of even multiple applications to saturate their performance.
    If fairness is essential, it is possible to design a scheduler that exploits the 
    characteristics of SSDs at coarser granularity to achieve lower performance overhead.
    Whether the scheduler should reside in the block layer or on the SSD controller
    is an open issue. If the SSD is responsible for fair IO scheduling, it can leverage
    internal device parallelism, and lower latency, at the cost of additional interface
    complexity between disk and OS
    
    We can take a few points from the quote above.
    [blk-mq io scheduler framework]
            [scheduler init]
            elevator_switch_mq
                -> blk_mq_init_sched //freezed and quiesced
                  -> [.init_sched]
                  -> [.init_hctx]
    
            [bio submit]
            blk_mq_make_request
              -> blk_mq_sched_bio_merge
                -> __blk_mq_sched_bio_merge
                  -> [.bio_merge]
                    -> blk_mq_sched_try_merge //bfq and mq-deadline use it to merge a bio into an existing request
                      elv_merge // get the merge decision and req
                        -> [.request_merge]
                      if ELEVATOR_BACK_MERGE
                         blk_mq_sched_allow_merge
                           -> [.allow_merge]
                         bio_attempt_back_merge // merge the bio to the tail of req
                     attempt_back_merge // the new bio may have filled the hole between req and the latter req
                           -> elv_latter_request
                             -> [.next_request]
                           -> attempt_merge
                             -> [.requests_merged] // notify the io  scheduler that the two reqs have been merged
                     elv_merged_request // if attempt_back_merge does nothing
                           -> [.request_merged] // one bio is merged into this req
                      else if ELEVATOR_FRONT_MERGE
                         blk_mq_sched_allow_merge
                           -> [.allow_merge]
                         bio_attempt_front_merge // merge the bio to the head of req
                     attempt_front_merge // the new bio may have filled the hole between req and the former req
                           -> elv_former_request
                             -> [.former_request]
                           -> attempt_merge
                             -> [.requests_merged] // notify the io  scheduler that the two reqs have been merged
                     elv_merged_request // if attempt_front_merge does nothing
                           -> [.request_merged]
                -> if request merging happened, invoke blk_mq_free_request to free the merged request
                      blk_mq_free_request
                        -> [.finish_request]
    
            [request allocation]
              blk_mq_get_request
                -> [.limit_depth] //update the blk_mq_alloc_data->shallow_depth
                -> blk_mq_get_tag
                  -> shallow_depth? __sbitmap_queue_get_shallow : __sbitmap_queue_get
                -> blk_mq_rq_ctx_init
                -> blk_mq_sched_assign_ioc
                  -> ioc_create_icq
                    -> [.init_icq] // only bfq use it
                -> [.prepare_request]
    
            [request enqueue]
              blk_mq_sched_insert_request
                -> [.insert_requests]
                  -> blk_mq_sched_try_merge
                    -> elv_attempt_insert_merge
                    try blk_attempt_req_merge on q->last_merge or req from elv_rqhash tree
                      -> attempt_merge
                        -> [.requests_merged] // notify the io  scheduler that the two reqs have been merged
                  //if request merging happened, invoke blk_mq_free_request to free the merged request
                      -> blk_mq_free_request
                        -> [.finish_request]
    
            [dispatch request]
            blk_mq_sched_dispatch_requests
              -> blk_mq_do_dispatch_sched
                -> [.has_work] // blk_mq_sched_has_work
                -> [.dispatch_request]
            blk_mq_start_request
              -> blk_mq_sched_started_request
                -> [.started_request]
    
            [requeue request]
            blk_mq_requeue_request
              -> __blk_mq_requeue_request
                -> blk_mq_put_driver_tag // very important
              -> blk_mq_sched_requeue_request
                -> [.requeue_request]
            blk_mq_requeue_work
              -> blk_mq_sched_insert_request
    
    
        Note: in blk-mq, a requeued request will be inserted into the io scheduler
        again; this is very different from blk-legacy. For the io schedulers of
        blk-mq, .requeue_request is the same as .finish_request (bfq and kyber).
    
            [complete request]
            __blk_mq_complete_request
              -> blk_mq_sched_completed_request
                -> [.completed_request]
            blk_mq_free_request
              -> [.finish_request]
    
        We should note: an LLDD does not always complete a request with blk_mq_complete_request;
        it may also use blk_mq_end_request. In that case, .completed_request will not be invoked.
    
    

    hctx


    issue directly

    This is a special path for high speed devices.

    blk_mq_make_request
      -> blk_mq_try_issue_directly
        -> __blk_mq_try_issue_directly
    ---
        if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
            run_queue = false;
            bypass_insert = false;
            goto insert;
        }
    
        // No io scheduler
    
        if (q->elevator && !bypass_insert)
            goto insert;
    
        // No .get_budget
    
        if (!blk_mq_get_dispatch_budget(hctx))
            goto insert;
    
    // No io scheduler, so the driver tag has already been acquired
    
        if (!blk_mq_get_driver_tag(rq, NULL, false)) {
            blk_mq_put_dispatch_budget(hctx);
            goto insert;
        }
    
        return __blk_mq_issue_directly(hctx, rq, cookie);
    
        // invoke .queue_rq directly here
    
    insert:
        if (bypass_insert)
            return BLK_STS_RESOURCE;
    
        // if io scheduler is set, fallback to normal path
    
        blk_mq_sched_insert_request(rq, false, run_queue, false);
        return BLK_STS_OK;
    ---
    
    W/o an io scheduler attached, sync io can nearly bypass the whole blk-mq stack.
    
                submit_bio
    ----------------|---------------------
    BLK-MQ          v
                blk_mq_make_request
                    |
                ----^---- insert to ctx
                    |
                ----^---- run hctx
    ----------------|--------------------
    LLDD            v
                .queue_rq
    

    Where to run hctx

    Where do we run the hctx ? In other words, will a hctx be run on a cpu which is not mapped to it ?
    Let's see the two basic scenarios in which the hctx is run.

    Will the hctx be executed on different mapped cpus concurrently ?
      cpu0    cpu1    cpu2    cpu3  
       .      flush   i_d     run_work
       .       .       .       .
       v       .       .       v
               v hctx0 .
    -------------------.---------------
                       v
                 HBA
    
    i_d  issue directly
    
    
    The possible concurrent path:

    hctx restart

    There are some cases where the requests cannot be dispatched immediately.

    hctx restart is a supplement to the tag wakeup hook, because not all dispatch deferring is due to a lack of driver tags.

    Let's look into the hctx restart next.
    Mark restart
    Currently, blk_mq_sched_mark_restart_hctx is only invoked by blk_mq_sched_dispatch_requests when there are requests on the hctx->dispatch list. Requests can be inserted into the hctx->dispatch list in the following cases:
    static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
    {
        if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
            return;
    
        if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
            struct request_queue *q = hctx->queue;
    
            if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                atomic_inc(&q->shared_hctx_restart);
    
            //if not set, increase the q->shared_hctx_restart
            // shared_hctx_restart counts the number of hctxs that need to be restarted.
    
        } else
            set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
    }
    


    Restart
    For the non-shared tag case it is very simple: just invoke blk_mq_run_hw_queue(hctx, true).
    But the shared tag case is a bit more complicated.
    We do the hctx restart across all the hctxs that share the same tags in round-robin fashion.
    
    Why do we need this ?
    
    To share the resources of the LLDD fairly:
    if we always restart the hctx which the freed request points to,
    other hctxs that share the same tagset will be starved.
    
                        restart
                /'---------------------------------------,
    BLK-MQ     V                                          \
            q of LUN0  q of LUN1   q of LUN2   q of LUN3   |
                                                           |
            hctx       hctx        hctx        hctx        |
                                                ^          |
                          driver tags           | blk_mq_free_request
    ------------------------------------------------------
    LLDD    
                             HBA
    
    We needn't worry about fair sharing of the driver tags:
    the sbitmap wakeup hook and tag sharing (hctx_may_queue) already work well for that.
    
    Looping over every q and hctx sharing the same tagset causes a massive performance regression if you have a lot of
    shared devices. 8e8320c (blk-mq: fix performance regression with shared tags) fixes this.
    
    An atomic counter shared_hctx_restart is added to the request_queue to mark that there are hctxs needing restart in this
    request_queue. Then blk_mq_sched_restart_hctx doesn't need to loop every time.
    
    There is a question here.
    The round-robin hctx restart check only happens when:
     - there is a hctx marked as needing restart
     - a req is freed on the request_queue
    
    What if there is no other req in flight when the hctx restart is marked ?
    Who restarts the hctx ?  The others sharing the same tagset will not do it, because no restart is
    marked in their q->shared_hctx_restart.
    
    This is a general issue whether the tags are shared or not.
    If there is no in-flight request and .queue_rq needs to requeue the request, either:
     - it returns BLK_STS_RESOURCE, or
     - the LLDD reruns the hw queue itself
    
    In fact, it looks like we don't always need to restart the hctxs in round-robin fashion:
     - if we fail to get a driver tag, the tags wakeup hook can save us
     - if we have reqs on hctx->dispatch which were inserted directly, it doesn't matter to other hctxs
    
    
    There are also some special cases, look at the code segment in blk_mq_dispatch_rq_list:
    if (!list_empty(list)) {
            bool needs_restart;
    
        // we reach here, because the .queue_rq returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE
    
            spin_lock(&hctx->lock);
            list_splice_init(list, &hctx->dispatch);
            spin_unlock(&hctx->lock);
    
            needs_restart = blk_mq_sched_needs_restart(hctx);
            if (!needs_restart ||
                (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
                blk_mq_run_hw_queue(hctx, true);
            else if (needs_restart && (ret == BLK_STS_RESOURCE))
                blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
        }
    
    When there are requests left on the hctx->dispatch list, some cases need to be handled:

    requeue

    __blk_mq_requeue_request is used to prepare for a requeue.

    ---
    
        //w/ io scheduler attached, there will be no in-queue req that
        //holds driver tag.
    
        blk_mq_put_driver_tag(rq);
    
        trace_block_rq_requeue(q, rq);
        wbt_requeue(q->rq_wb, &rq->issue_stat);
    
        if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
    
        // switch to IDLE state
    
            blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
        ...
        }
    ---
    
    Where will the req be requeued ? Question: why is blk_mq_sched_requeue_request only invoked from blk_mq_requeue_request ?
    Look at bfq and kyber: the callbacks for .requeue_request and .finish_request are the same one.
    
    For blk_mq_dispatch_rq_list, the request is not queued back to the io scheduler; we can say the request
    is still being dispatched, so there is no need to invoke the .requeue_request callback.
    
    For __blk_mq_try_issue_directly, the direct issue path only works w/o an io scheduler attached.
    
    Only in the blk_mq_requeue_request case is the request dequeued from the io scheduler and requeued
    back to the io scheduler.
    
    In fact, there is a big difference between block legacy and blk-mq in requeue.
    blk_requeue_request
      -> elv_requeue_request
        -> __elv_add_request //ELEVATOR_INSERT_REQUEUE
          -> list_add(&rq->queuelist, &q->queue_head);
    The request is requeued to q->queue_head, which is similar to hctx->dispatch.
    
    

    Block legacy


    Tag

    There is also a tag mechanism in block legacy. To quote a comment from blk-mq about tagging:

    Device command tagging was first introduced with hardware supporting native command queuing. A tag is an integer value that uniquely identifies the position of the block IO in the driver submission queue, so when completed the tag is passed back from the device indicating which IO has been completed. This eliminates the need to perform a linear search of the in-flight window to determine which IO has completed.
    
    We won't look into how it is implemented, just how it is employed in block legacy, with some comparison to tagging in blk-mq.
    How is it used at the driver level ?
    static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
            int tag)
    {
        struct request *req = NULL;
    
        if (tag == SCSI_NO_TAG)
            return NULL;
    
        if (shost_use_blk_mq(shost)) {
            u16 hwq = blk_mq_unique_tag_to_hwq(tag);
    
            if (hwq < shost->tag_set.nr_hw_queues) {
                req = blk_mq_tag_to_rq(shost->tag_set.tags[hwq],
                    blk_mq_unique_tag_to_tag(tag));
            }
        } else {
            req = blk_map_queue_find_tag(shost->bqt, tag);
        }
    
        if (!req)
            return NULL;
        return blk_mq_rq_to_pdu(req);
    }
    
    A reverse mapping: tag -> req -> driver pdu.
    How is a tag assigned to a req ?
    scsi_request_fn()
    >>>>
            /*
             * Remove the request from the request list.
             */
            if (!(blk_queue_tagged(q) && !blk_queue_start_tag(q, req)))
                blk_start_request(req);
            /*
         blk_queue_tagged() checks QUEUE_FLAG_QUEUED in q->queue_flags, which means the hardware supports native command queuing.
         blk_queue_start_tag() tries to assign a tag to this rq; if the tags have been used up, it returns 1.
             otherwise,
             bqt->next_tag = (tag + 1) % bqt->max_depth;
             rq->rq_flags |= RQF_QUEUED; //indicates tag has been assigned
             rq->tag = tag;
             bqt->tag_index[tag] = rq;
             blk_start_request(rq);
             list_add(&rq->queuelist, &q->tag_busy_list);
             */
    >>>>
            /*
             * We hit this when the driver is using a host wide
             * tag map. For device level tag maps the queue_depth check
             * in the device ready fn would prevent us from trying
             * to allocate a tag. Since the map is a shared host resource
             * we add the dev to the starved list so it eventually gets
             * a run when a tag is freed.
             */
            if (blk_queue_tagged(q) && !(req->rq_flags & RQF_QUEUED)) {
                spin_lock_irq(shost->host_lock);
                if (list_empty(&sdev->starved_entry))
                    list_add_tail(&sdev->starved_entry,
                              &shost->starved_list);
                spin_unlock_irq(shost->host_lock);
                goto not_ready;
            }
    >>>>
     not_ready:
        /*
         * The tag here looks like the driver tag in blk-mq.
         * In block legacy, the req is requeued and inserted to the head of q->queue_head directly.
         * In blk-mq, the action is similar, refer to blk_mq_dispatch_rq_list (but __blk_mq_try_issue_directly does not seem to follow this).
         */
        spin_lock_irq(q->queue_lock);
        blk_requeue_request(q, req);
        atomic_dec(&sdev->device_busy);
    >>>>
    

    plug

    There are mainly two aspects to the blk plug's benefit.
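
    A sketch of the caller-side pattern, for reference (bio1/bio2 stand in for bios the
    caller has already built; error handling is omitted):

    struct blk_plug plug;

    blk_start_plug(&plug);
    /*
     * Requests built from bios submitted by this task are now collected on the
     * per-task plug list (current->plug) instead of being sent down one by one.
     */
    submit_bio(bio1);
    submit_bio(bio2);
    blk_finish_plug(&plug);    /* flush the plugged requests to the lower layer */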

    Where is the plug list flushed from schedule ?
    
    schedule
      -> sched_submit_work
        -> blk_schedule_flush_plug
    
    io_schedule_timeout/io_schedule
      -> io_schedule_prepare
        -> blk_schedule_flush_plug
    
    
    However, the preempt schedule path doesn't flush the plug list:
    
    asmlinkage __visible void __sched preempt_schedule_irq(void)
    {
        enum ctx_state prev_state;
    
        /* Catch callers which need to be fixed */
        BUG_ON(preempt_count() || !irqs_disabled());
    
        prev_state = exception_enter();
    
        do {
            preempt_disable();
            local_irq_enable();
            __schedule(true);
            local_irq_disable();
            sched_preempt_enable_no_resched();
        } while (need_resched());
    
        exception_exit(prev_state);
    }
    
    
    

    BIO


    Let's look into the _basic unit_ of the block layer, the bio.
    We could say there is a bio layer between the fs and the block layer.

                             
                    FS LAYER
         ------------------------------------------------
                              | submit_bio 
                              |
                              V generic_make_request <-------+
         ------------------------------------------------    |
                                 blk-throttl                 |
                    BIO LAYER    bio remap +--> partition    |
                                           |                 |
                                           +--> bio based device mapper (stackable)
        -------------------------------------------------    |
                              |                              |
                              V  blk_queue_bio/blk_mq_make_request
    
                    BLOCK LEGACY/BLK-MQ
    
    The basic architecture of a bio.
    request->bio __                    
                   \                  
                    \     bio        
                     \   ________    
                      ->| bi_next        next bio in one request, the blocks in these bios should be contiguous on disk
                        |
                        | bi_disk        gendisk->request_queue 
                        |
                        | bi_partno      partition NO.
                        |
                        | bi_opf         bio_op, req_flag_bits, same with req->cmd_flags
                        |
                        | bi_phys_segments  Number of segments in this BIO after physical address coalescing is performed.
                        |
                        | bi_end_io   blk_update_request->req_bio_endio->bio_endio
                        |
                        | bi_vcnt        how many bio_vec's
                        | bi_max_vecs    max bio_vecs can hold
                        | bi_io_vec      pointer to bio_io_vec list    
                        |         \      ________    
                        |          --->  | bv_page       
                        |                | bv_len        
                        |                | bv_offset     
                        |                 ________       
                        |                | bv_page       
                        |                | bv_len        
                    |                | bv_offset    These two pages may not be physically contiguous,
                    |                               but the corresponding blocks on the storage disk should be contiguous.
                        | bi_pool        as its name
                        | 
                        | bi_iter        the current iterating status in bio_vec list
                                          ___________
                                         | bi_sector    device address in 512 byte sectors
                                         | bi_size      residual I/O count
                                         | bi_idx       current index into bvl_vec
                                         | bi_done      number of bytes completed
                                         | bi_bvec_done number of bytes completed in current bvec
    
    
    (Some members associated with cgroup,blk-throttle,merge-assistant are ignored here.)
    

    Setup and complete a bio

    Let's take submit_bh_wbc() as an example to show how to set up a bio:

    static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
                 enum rw_hint write_hint, struct writeback_control *wbc)
    {
        struct bio *bio;
        >>>>
        bio = bio_alloc(GFP_NOIO, 1); // the second parameter is the count of bvec
    
        if (wbc) {
            wbc_init_bio(wbc, bio);
            wbc_account_io(wbc, bh->b_page, bh->b_size);
        }
    
        bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
        bio_set_dev(bio, bh->b_bdev);
        //(bio)->bi_disk = (bdev)->bd_disk;
        //(bio)->bi_partno = (bdev)->bd_partno;
        bio->bi_write_hint = write_hint;
    
        bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
        >>>>//FSes with a blocksize smaller than the pagesize can reach here.
            if (bio->bi_vcnt > 0) {
                bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
    
                if (page == bv->bv_page &&
                    offset == bv->bv_offset + bv->bv_len) {
                    bv->bv_len += len;
                    goto done;
                } 
            } //merged with previous one 
    
            if (bio->bi_vcnt >= bio->bi_max_vecs)
                return 0;
    
            bv        = &bio->bi_io_vec[bio->bi_vcnt];
            bv->bv_page    = page;
            bv->bv_len    = len;
            bv->bv_offset    = offset;
    
            bio->bi_vcnt++;
        done:
            bio->bi_iter.bi_size += len;
        >>>>
        BUG_ON(bio->bi_iter.bi_size != bh->b_size);
    
        bio->bi_end_io = end_bio_bh_io_sync;
        bio->bi_private = bh; //reverse mapping to the bh
    
        /* Take care of bh's that straddle the end of the device */
        guard_bio_eod(op, bio);
    
        if (buffer_meta(bh))
            op_flags |= REQ_META;
        if (buffer_prio(bh))
            op_flags |= REQ_PRIO;
        bio_set_op_attrs(bio, op, op_flags);
        
        submit_bio(bio);
        return 0;
    }
    
    Most of the information used to construct a bio comes from the bh. If we want to dig deeper, we have to look into how to set up a bh.
    static int
    grow_dev_page(struct block_device *bdev, sector_t block,
              pgoff_t index, int size, int sizebits, gfp_t gfp)
    {
        >>>>
        page = find_or_create_page(inode->i_mapping, index, gfp_mask);
            -> pagecache_get_page()
                -> __page_cache_alloc() //no_page case
                    -> __alloc_pages_node(n, gfp, 0);
        /*
         The pages of the page cache are allocated one by one. That makes them more flexible to
         map and unmap, page in and swap out. Also, in the past memory was limited; there were not
         enough contiguous pages to take advantage of.
         */
        BUG_ON(!PageLocked(page));
        >>>>
        /*
         * Allocate some buffers for this page
         */
        bh = alloc_page_buffers(page, size, true);
    
        /*
         * Link the page to the buffers and initialise them.  Take the
         * lock to be atomic wrt __find_get_block(), which does not
         * run under the page lock.
         */
        spin_lock(&inode->i_mapping->private_lock);
        link_dev_buffers(page, bh);
        end_block = init_page_buffers(page, bdev, (sector_t)index << sizebits,
                size);
        >>>>
        do {
            if (!buffer_mapped(bh)) {
                init_buffer(bh, NULL, NULL);
                bh->b_bdev = bdev;
                bh->b_blocknr = block;
                if (uptodate)
                    set_buffer_uptodate(bh);
                if (block < end_block)
                    set_buffer_mapped(bh);
            }
            block++;
            bh = bh->b_this_page;
        } while (bh != head);
        >>>>
        spin_unlock(&inode->i_mapping->private_lock);
    done:
        ret = (block < end_block) ? 1 : -ENXIO;
    failed:
        unlock_page(page);
        put_page(page);
        return ret;
    }
    
    One page from the pagecache can be broken up into several bh's based on the blocksize of the associated filesystem (sb->s_blocksize). One bh corresponds to one block on disk. Each bh is then used to construct a bio which is submitted to the block layer. At this point, the bio only contains one bio_vec pointing to the page of the bh. This is the classical path to set up a bio. Nowadays, some filesystems prefer to create bios themselves; during that procedure, a bio containing multiple bio_vecs may be created. For example:
    static int io_submit_add_bh(struct ext4_io_submit *io,
                    struct inode *inode,
                    struct page *page,
                    struct buffer_head *bh)
    {
        int ret;
    
        if (io->io_bio && bh->b_blocknr != io->io_next_block) {
    submit_and_retry:
            ext4_io_submit(io);
        }
        if (io->io_bio == NULL) {
            ret = io_submit_init_bio(io, bh);
            if (ret)
                return ret;
            io->io_bio->bi_write_hint = inode->i_write_hint;
        }
        ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
        if (ret != bh->b_size)
            goto submit_and_retry;
        wbc_account_io(io->io_wbc, page, bh->b_size);
        io->io_next_block++;
        return 0;
    }
    
    We can see that one bio_vec can correspond to part of a page or the whole page.

    Bio operations

    bio advance

    static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
                        unsigned bytes)
    {
        iter->bi_sector += bytes >> 9;
        /* This is why bi_sector is located in bio->bi_iter: it can be
         * advanced. */
        if (bio_no_advance_iter(bio)) {
            /* REQ_OP_DISCARD/SECURE_ERASE/WRITE_SAME/WRITE_ZEROES */
            iter->bi_size -= bytes;
            iter->bi_done += bytes;
        } else {
            bvec_iter_advance(bio->bi_io_vec, iter, bytes);
            /* TODO: It is reasonable to complete bio with error here. */
        }
    }
    
    static inline bool bvec_iter_advance(const struct bio_vec *bv,
            struct bvec_iter *iter, unsigned bytes)
    {
        >>>>
        while (bytes) {
            unsigned iter_len = bvec_iter_len(bv, *iter);
            unsigned len = min(bytes, iter_len);
    
            bytes -= len;
            iter->bi_size -= len; // remaining length
            iter->bi_bvec_done += len; //completed length of current bvec
            iter->bi_done += len; //completed length of this bio
    
            if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
                iter->bi_bvec_done = 0;
                iter->bi_idx++; //push forward the bvec table here
            }
        }
        return true;
    }
    
    After invoking this function, we can tell that a bio has been finished via (bio->bi_iter.bi_size == 0). For example, in blk_update_request():
    blk_mq_end_request()
        -> blk_update_request()
            -> req_bio_endio()
    >>>>
        bio_advance(bio, nbytes);
    
        /* don't actually finish bio if it's part of flush sequence */
    // when RQF_FLUSH_SEQ is set, the req->end_io will be invoked instead of
    // bio_endio.
        if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
            bio_endio(bio);
    >>>>
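
    A small user-space walk-through of the iterator bookkeeping (a standalone model of
    the fields, not the kernel structs): advancing in 512-byte steps across two
    1024-byte bvecs shows how bi_bvec_done wraps and bi_idx moves to the next bvec.

    /* Userspace model of bvec_iter_advance() bookkeeping. */
    #include <stdio.h>

    struct bvec  { unsigned bv_len; };
    struct biter { unsigned bi_size, bi_idx, bi_done, bi_bvec_done; };

    static void iter_advance(const struct bvec *bv, struct biter *it, unsigned bytes)
    {
        while (bytes) {
            unsigned left = bv[it->bi_idx].bv_len - it->bi_bvec_done;
            unsigned len = bytes < left ? bytes : left;

            bytes            -= len;
            it->bi_size      -= len;   /* remaining bytes of the whole bio */
            it->bi_done      += len;   /* completed bytes of the whole bio */
            it->bi_bvec_done += len;   /* completed bytes of current bvec  */

            if (it->bi_bvec_done == bv[it->bi_idx].bv_len) {
                it->bi_bvec_done = 0;
                it->bi_idx++;          /* step to the next bvec            */
            }
        }
    }

    int main(void)
    {
        struct bvec  vec[2] = { { 1024 }, { 1024 } };
        struct biter it = { .bi_size = 2048 };

        while (it.bi_size) {
            iter_advance(vec, &it, 512);
            printf("bi_size=%4u bi_idx=%u bi_bvec_done=%4u\n",
                   it.bi_size, it.bi_idx, it.bi_bvec_done);
        }
        return 0;
    }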
    
    bio clone
    In the device mapper stack, the bio will be cloned. Let's look at how that is done. clone_bio() clones a new bio containing the sector ~ (sector+len) range of the original one.
    static int clone_bio(struct dm_target_io *tio, struct bio *bio,
                 sector_t sector, unsigned len)
    {
        struct bio *clone = &tio->clone;
    
        __bio_clone_fast(clone, bio);
        >>>>
            bio->bi_disk = bio_src->bi_disk;
            bio->bi_partno = bio_src->bi_partno;
            bio_set_flag(bio, BIO_CLONED); // a cloned bio
            bio->bi_opf = bio_src->bi_opf;
            bio->bi_write_hint = bio_src->bi_write_hint;
            bio->bi_iter = bio_src->bi_iter;
            bio->bi_io_vec = bio_src->bi_io_vec;
            //The cloned bio shares the same bvec table as the original one.
            bio_clone_blkcg_association(bio, bio_src);
        >>>>
        if (bio_op(bio) != REQ_OP_ZONE_REPORT)
            bio_advance(clone, to_bytes(sector - clone->bi_iter.bi_sector));
        clone->bi_iter.bi_size = to_bytes(len);
        //cut out the sector ~ (sector+len) part of original one here
        if (unlikely(bio_integrity(bio) != NULL))
            bio_integrity_trim(clone);
    
        return 0;
    }
    

    Bio split

    A bio may be split in blk_mq_make_request. Why ?
    The associated commit is:
    54efd50b ( block: make generic_make_request handle arbitrarily sized bios)
    
    ---
        The way the block layer is currently written, it goes to great lengths
        to avoid having to split bios; upper layer code (such as bio_add_page())
        checks what the underlying device can handle and tries to always create
        bios that don't need to be split.
        
        But this approach becomes unwieldy and eventually breaks down with
        stacked devices and devices with dynamic limits, and it adds a lot of
        complexity.
    ---
    
    Then the FS layer can submit arbitrarily sized bios.
    
    How is it done ?
    
    blk_queue_split
      -> blk_bio_segment_split
        -> bio_split
    ---
        split = bio_clone_fast(bio, gfp, bs);
          -> __bio_clone_fast
          ---
            bio->bi_disk = bio_src->bi_disk;
            bio->bi_partno = bio_src->bi_partno;
            bio_set_flag(bio, BIO_CLONED);
            if (bio_flagged(bio_src, BIO_THROTTLED))
                bio_set_flag(bio, BIO_THROTTLED);
            bio->bi_opf = bio_src->bi_opf;
            bio->bi_write_hint = bio_src->bi_write_hint;
            bio->bi_iter = bio_src->bi_iter;
    
            bio->bi_io_vec = bio_src->bi_io_vec;
    
            ...
          ---
        split->bi_iter.bi_size = sectors << 9;
    
        if (bio_integrity(split))
            bio_integrity_trim(split);
    
        bio_advance(bio, split->bi_iter.bi_size);
    ---
                  |  sectors  |
       bi_io_vec  [  bv  ] [  bv  ] [  bv  ] [  bv  ]
                  \____  _____/\________  __________/
                        V                V
              split->bi_iter         bio->bi_iter
    
    blk_queue_split
    ---
        if (split) {
            /* there is no chance to merge the split bio */
            split->bi_opf |= REQ_NOMERGE;
    
            /*
             * Since we're recursing into make_request here, ensure
             * that we mark this bio as already having entered the queue.
             * If not, and the queue is going away, we can get stuck
             * forever on waiting for the queue reference to drop. But
             * that will never happen, as we're already holding a
             * reference to it.
             */
            bio_set_flag(*bio, BIO_QUEUE_ENTERED);
    
            bio_chain(split, *bio);
            trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
    
                    a big bio
            |  max  |
            |__________________________|
            \___ ___/\________ ________/
                v             v
              submit      go back to
                         generic_make_request
    
    
            generic_make_request(*bio);
            *bio = split;
        }
    ---
    

    stacked bio layer

    bios from stacked devices

    How does the generic_make_request handle bios from stacked devices ?

    Two important code fragments:
    
    #1
    ---
        if (current->bio_list) {
            bio_list_add(&current->bio_list[0], bio);
            goto out;
        }
    
    ---
    
    #2
    ---
        do {
            bool enter_succeeded = true;
    
            if (unlikely(q != bio->bi_disk->queue)) {
                if (q)
                    blk_queue_exit(q);
                q = bio->bi_disk->queue;
                flags = 0;
                if (bio->bi_opf & REQ_NOWAIT)
                    flags = BLK_MQ_REQ_NOWAIT;
                if (blk_queue_enter(q, flags) < 0) {
                    enter_succeeded = false;
                    q = NULL;
                }
            }
    
            if (enter_succeeded) {
                struct bio_list lower, same;
    
                /* Create a fresh bio_list for all subordinate requests */
                bio_list_on_stack[1] = bio_list_on_stack[0];
                bio_list_init(&bio_list_on_stack[0]);
                ret = q->make_request_fn(q, bio);
    
                /* sort new bios into those for a lower level
                 * and those for the same level
                 */
                bio_list_init(&lower);
                bio_list_init(&same);
                while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
                    if (q == bio->bi_disk->queue)
                        bio_list_add(&same, bio);
                    else
                        bio_list_add(&lower, bio);
                /* now assemble so we handle the lowest level first */
                bio_list_merge(&bio_list_on_stack[0], &lower);
                bio_list_merge(&bio_list_on_stack[0], &same);
                bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
            } else {
                if (unlikely(!blk_queue_dying(q) &&
                        (bio->bi_opf & REQ_NOWAIT)))
                    bio_wouldblock_error(bio);
                else
                    bio_io_error(bio);
            }
            bio = bio_list_pop(&bio_list_on_stack[0]);
        } while (bio);
    ---
    
    Let's take a stripe device as an example:
    
    
           stripe_dev
    
           bio 0 ~ 31
      |--------------------|
      +--+  +--+  +--+  +--+
      |  |  |  |  |  |  |  | } 4K (8 sectors)
      +--+  +--+  +--+  +--+
      |  |  |  |  |  |  |  |
      +--+  +--+  +--+  +--+
      |  |  |  |  |  |  |  |
      +--+  +--+  +--+  +--+
    
      dev0  dev1  dev2  dev3
    
    Round #1
    
    bio[0, 31].stripe_dev
    q->make_request_fn
    then,
    bio_list_on_stack[0] -> bio[0, 7].dev0 -> bio[8, 31].stripe_dev
    then,
    lower -> bio[0, 7].dev0
    same -> bio[8, 31].stripe_dev
    then
    bio_list_on_stack[0] ->  bio[0, 7].dev0 ->  bio[8, 31].stripe_dev
    
    Round #2
    
    bio[0, 7].dev0 is picked up to handle
    bio_list_on_stack[1] -> bio[8, 31].stripe_dev
    q->make_request_fn
    bio_list_on_stack[0] is NULL
    then
    bio_list_on_stack[1] is merged into bio_list_on_stack[0]
    bio_list_on_stack[0] -> bio[8, 31].stripe_dev
    
    Round #3
    
    bio[8, 31].stripe_dev is picked up to handle
    q->make_request_fn
    then
    bio_list_on_stack[0] -> bio[8, 15].dev1 -> bio[16, 31].stripe_dev
    then
    lower ->  bio[8, 15].dev1
    same -> bio[16, 31].stripe_dev
    then
    bio_list_on_stack[0] -> bio[8, 15].dev1 -> bio[16, 31].stripe_dev
    
    Round #4
    
    bio[8, 15].dev1 is picked up to handle
    bio_list_on_stack[1] ->bio[16, 31].stripe_dev
    ....
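
    A user-space model of the generic_make_request() loop above applied to this stripe
    example (the 8-sector chunking, the stripe-to-member mapping and the list helpers
    are assumptions of the sketch): running it prints the member-device bios in the
    same order as Rounds #1..#4.

    #include <stdio.h>
    #include <stdlib.h>

    struct bio { int dev; int start, len; struct bio *next; };   /* dev -1 == stripe */
    struct bio_list { struct bio *head, *tail; };

    static void bl_init(struct bio_list *l) { l->head = l->tail = NULL; }

    static void bl_add(struct bio_list *l, struct bio *b)
    {
        b->next = NULL;
        if (l->tail) l->tail->next = b; else l->head = b;
        l->tail = b;
    }

    static struct bio *bl_pop(struct bio_list *l)
    {
        struct bio *b = l->head;
        if (b) { l->head = b->next; if (!l->head) l->tail = NULL; b->next = NULL; }
        return b;
    }

    static void bl_merge(struct bio_list *dst, struct bio_list *src)
    {
        struct bio *b;
        while ((b = bl_pop(src))) bl_add(dst, b);
    }

    static struct bio *mkbio(int dev, int start, int len)
    {
        struct bio *b = calloc(1, sizeof(*b));
        b->dev = dev; b->start = start; b->len = len;
        return b;
    }

    /* ->make_request_fn: the stripe dev remaps one 8-sector chunk to a member dev
     * and resubmits the rest to itself; both children go on bio_list_on_stack[0].
     * A member dev just "issues" the bio. */
    static void make_request(struct bio *b, struct bio_list *onstack0)
    {
        if (b->dev == -1) {
            int chunk = b->len < 8 ? b->len : 8;
            bl_add(onstack0, mkbio((b->start / 8) % 4, b->start, chunk));
            if (b->len > chunk)
                bl_add(onstack0, mkbio(-1, b->start + chunk, b->len - chunk));
        } else {
            printf("issue dev%d bio[%d, %d]\n", b->dev, b->start, b->start + b->len - 1);
        }
        free(b);
    }

    int main(void)
    {
        struct bio_list onstack[2], lower, same;
        struct bio *bio = mkbio(-1, 0, 32);          /* bio[0, 31].stripe_dev */

        bl_init(&onstack[0]); bl_init(&onstack[1]);
        do {
            /* create a fresh bio_list for all subordinate requests */
            onstack[1] = onstack[0];
            bl_init(&onstack[0]);
            make_request(bio, &onstack[0]);

            /* sort new bios: lower level first, then same level, then older ones */
            struct bio *b;
            bl_init(&lower); bl_init(&same);
            while ((b = bl_pop(&onstack[0])))
                bl_add(b->dev == -1 ? &same : &lower, b);
            bl_merge(&onstack[0], &lower);
            bl_merge(&onstack[0], &same);
            bl_merge(&onstack[0], &onstack[1]);
        } while ((bio = bl_pop(&onstack[0])));
        return 0;
    }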
    
    
    
    

    Merge

    The main merging points:

    blk_mq_sched_try_merge
    This is used to merge a bio with a req.
    It is usually used in the bio submitting path.
    elv_merge chooses a rq which could merge with the new bio
    and returns how to merge.
    (bio) / (req) indicates the new one.
    
    if ELEVATOR_BACK_MERGE
        req -> bio -> (bio)
        then try to merge this req with latter one.
        (req) -?-> req
    
    if ELEVATOR_FRONT_MERGE
        req -> (bio) -> bio
        then try to merge this req with former one.
        req -?-> (req)
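    
    A tiny user-space illustration of the merge decision itself (sector arithmetic only;
    the real elv_merge also consults q->last_merge and the elv rqhash): a bio that starts
    right after the request's last sector is a back merge candidate, and a bio that ends
    right at the request's first sector is a front merge candidate.

    /* Userspace sketch: classify how a new bio could merge into an existing request. */
    #include <stdio.h>

    enum { NO_MERGE, BACK_MERGE, FRONT_MERGE };

    struct rq  { unsigned long pos;  unsigned sectors; };   /* like blk_rq_pos/sectors */
    struct bio { unsigned long sect; unsigned sectors; };

    static int classify(const struct rq *rq, const struct bio *bio)
    {
        if (rq->pos + rq->sectors == bio->sect)
            return BACK_MERGE;                  /* req -> bio -> (bio) */
        if (bio->sect + bio->sectors == rq->pos)
            return FRONT_MERGE;                 /* req -> (bio) -> bio */
        return NO_MERGE;
    }

    int main(void)
    {
        struct rq req = { .pos = 1000, .sectors = 8 };
        struct bio back  = { .sect = 1008, .sectors = 8 };
        struct bio front = { .sect =  992, .sectors = 8 };
        struct bio far   = { .sect = 2048, .sectors = 8 };

        const char *name[] = { "no merge", "back merge", "front merge" };
        printf("bio@1008: %s\n", name[classify(&req, &back)]);
        printf("bio@992 : %s\n", name[classify(&req, &front)]);
        printf("bio@2048: %s\n", name[classify(&req, &far)]);
        return 0;
    }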
    
    elv_attempt_insert_merge
    This is used to merge a req with another req.
    It is usually used in the req inserting path.
    
    Both elv_merge and elv_attempt_insert_merge employ q->last_merge
    and the request_queue's elv rqhash to find contiguous reqs.
    
    
    Note: a req is just a package; the real payload is the bios inside it.

    attempt_merge is used to merge two reqs (req, next).
    The main work it does:
    if the two requests can be merged with each other:
        req->biotail->bi_next = next->bio;
        req->biotail = next->biotail;
    
        req->__data_len += blk_rq_bytes(next);
    
        elv_merge_requests(q, req, next);
    
        /*
         * 'next' is going away, so update stats accordingly
         */
        blk_account_io_merge(next);
    
        req->ioprio = ioprio_best(req->ioprio, next->ioprio);
        if (blk_rq_cpu_valid(next))
            req->cpu = next->cpu;
    
        /*
         * ownership of bio passed from next to req, return 'next' for
         * the caller to free
         */
        next->bio = NULL;
    
    Then the 'next' request will be freed through __blk_put_request().

    FLUSH and FUA


    First, we need to know the volatile write cache.
    Quote from Documentation/block/writeback_cache_control.txt

    Many storage devices, especially in the consumer market, come with volatile
    write back caches.  That means the devices signal I/O completion to the
    operating system before data actually has hit the non-volatile storage.  This
    behavior obviously speeds up various workloads, but it means the operating
    system needs to force data out to the non-volatile storage when it performs
    a data integrity operation like fsync, sync or an unmount.
    
    There are two flags set in a bio or req to indicate which operation on the volatile write cache will be carried out. The block device driver needs to notify the queue whether it supports REQ_FLUSH and REQ_FUA through blk_queue_write_cache(), and the flags will be set in queue->queue_flags.
    void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
    {
        spin_lock_irq(q->queue_lock);
        if (wc)
            queue_flag_set(QUEUE_FLAG_WC, q);
        else
            queue_flag_clear(QUEUE_FLAG_WC, q);
        if (fua)
            queue_flag_set(QUEUE_FLAG_FUA, q);
        else
            queue_flag_clear(QUEUE_FLAG_FUA, q);
        spin_unlock_irq(q->queue_lock);
    
        wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
    }
    
    How is the flush operation implemented ?
    There are 4 flush sequence flags. A flush request's life cycle can include any of them, and the blk core will execute them in sequence. blk_flush_policy() is used to construct this sequence. Let's see it.
    static unsigned int blk_flush_policy(unsigned long fflags, struct request *rq)
    {
        unsigned int policy = 0;
    
        if (blk_rq_sectors(rq))
            policy |= REQ_FSEQ_DATA;
    
        if (fflags & (1UL << QUEUE_FLAG_WC)) {
            if (rq->cmd_flags & REQ_PREFLUSH)
                policy |= REQ_FSEQ_PREFLUSH;
            if (!(fflags & (1UL << QUEUE_FLAG_FUA)) &&
                (rq->cmd_flags & REQ_FUA))
                policy |= REQ_FSEQ_POSTFLUSH;
        }
        return policy;
    }
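    
    A user-space rendering of this policy (the flag values below are local stand-ins,
    not the kernel's REQ_*/QUEUE_FLAG_* bits): it prints which flush sequence steps a
    PREFLUSH+FUA write with data needs on queues with different write cache capabilities.

    /* Userspace model of blk_flush_policy(): which flush steps does a request need? */
    #include <stdio.h>

    #define QF_WC   (1u << 0)   /* queue has a volatile write cache */
    #define QF_FUA  (1u << 1)   /* queue supports FUA natively      */

    #define R_PREFLUSH (1u << 0)
    #define R_FUA      (1u << 1)

    #define FSEQ_PREFLUSH  (1u << 0)
    #define FSEQ_DATA      (1u << 1)
    #define FSEQ_POSTFLUSH (1u << 2)

    static unsigned policy(unsigned qflags, unsigned cmd_flags, unsigned sectors)
    {
        unsigned p = 0;

        if (sectors)
            p |= FSEQ_DATA;
        if (qflags & QF_WC) {
            if (cmd_flags & R_PREFLUSH)
                p |= FSEQ_PREFLUSH;
            if (!(qflags & QF_FUA) && (cmd_flags & R_FUA))
                p |= FSEQ_POSTFLUSH;    /* emulate FUA with a post flush */
        }
        return p;
    }

    static void show(const char *queue, unsigned qflags)
    {
        unsigned p = policy(qflags, R_PREFLUSH | R_FUA, 8 /* sectors of data */);

        printf("%-22s:%s%s%s\n", queue,
               p & FSEQ_PREFLUSH  ? " PREFLUSH"  : "",
               p & FSEQ_DATA      ? " DATA"      : "",
               p & FSEQ_POSTFLUSH ? " POSTFLUSH" : "");
    }

    int main(void)
    {
        show("no write cache", 0);                 /* DATA only, skips flush machinery */
        show("write cache, no FUA", QF_WC);        /* PREFLUSH DATA POSTFLUSH          */
        show("write cache + FUA", QF_WC | QF_FUA); /* PREFLUSH DATA (FUA in hardware)  */
        return 0;
    }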
    
    Two things need to be emphasized here. If blk_flush_policy() returns just REQ_FSEQ_DATA, the request can be processed directly without going through the flush machinery; for blk-mq, it will be inserted at the tail of hctx->dispatch.
    Otherwise, a flush sequence will be started.
    The flush sequence is carried out based on blk_flush_queue->flush_queue[2]. In addition, there are two indexes indicating the current state of the flush_queue; both of them only take the values 0/1. In the initial state, pending == running. After a flush sequence is kicked, the pending_idx is toggled; the pending_idx then differs from the running_idx, which means a flush is in flight. While the flush is in flight, new flushes are queued on the pending_idx, which is different from the running_idx. After the flush completes, the running_idx is toggled, so the running_idx equals the pending_idx again.
    A preallocated request, the flush_rq, does the actual flush work on behalf of the FLUSH requests. When it completes, all the FLUSH requests on the running queue are pushed forward to the next step.
    
    blk_flush_queue->flush_queue[2]
                     running 0
                     pending 0
    rq0 (PREFLUSH + DATA)
    rq1 (DATA + POSTFLUSH)
    rq2 (PREFLUSH + DATA)
    
    Time 0: running 0, pending 0
    
                     (seq = PREFLUSH)   
    flush_queue[0] - rq0
    
    blk_kick_flush toggle the pending_idx and send out
    the flush_rq.
    Time 1: running 0, pending 1
    
                     (seq = PREFLUSH)   
    flush_queue[0] - rq0
    
    hctx->dispatch - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
                     requeue -> bypass insert
    
    rq1 is inserted by blk_insert_flush
    Time 2: running 0, pending 1
    
                     (seq = PREFLUSH)   
    flush_queue[0] - rq0
                           (seq = DATA)
    flush_data_in_flight - rq1
    
    hctx->dispatch - rq1 (RQF_FLUSH_SEQ) - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
                     both requeue -> bypass insert
    
    rq2 is inserted by blk_insert_flush
    Time 3: running 0, pending 1
    
                     (seq = PREFLUSH)   
    flush_queue[1] - rq2
                     (seq = PREFLUSH)   
    flush_queue[0] - rq0
                           (seq = DATA)
    flush_data_in_flight - rq1
    
    hctx->dispatch - rq1 (RQF_FLUSH_SEQ) - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
                     both requeue -> bypass insert
    
    rq1 is completed firstly, due to POSTFLUSH, it is inserted to pending
    Time 4: running 0, pending 1
    
                     (seq = PREFLUSH)   (seq = POSTFLUSH) 
    flush_queue[1] - rq2              - rq1 
                     (seq = PREFLUSH)   
    flush_queue[0] - rq0
    
    hctx->dispatch - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
                     
    
    flush_rq is completed
    get running list flush_queue[0]
    toggle running running = 1
    iterate running_list flush_queue[0] to invoke blk_flush_complete_seq
    rq0 is inserted into flush_data_in_flight and requeued, finally added at the head of hctx->dispatch
    another flush is issued by blk_kick_flush due to rq1 and rq2
    Time 5: running 1, pending 1
    
                     (seq = PREFLUSH)   (seq = POSTFLUSH) 
    flush_queue[1] - rq2              - rq1 
                           (seq = DATA)
    flush_data_in_flight - rq1
    
    hctx->dispatch -  rq0 (RQF_FLUSH_SEQ) - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
    
    Question:
    The flush_rq can pass through the io scheduler with RQF_FLUSH_SEQ, but why does
    the original rq do the same ?
    Does that mean all rqs with FLUSH or FUA pass through the io scheduler ?
    
    
    A sequenced PREFLUSH/FUA request with DATA is completed twice.
    Once while executing DATA and again after the whole sequence is complete.
    The first completion updates the contained bio but doesn't finish it so that the 
    bio submitter is notified only after the whole sequence is complete.
    This is implemented by testing RQF_FLUSH_SEQ in req_bio_endio().
    
    Talking about the borrowed tag
    ('FLUSH reqs' below means requests with FLUSH or FUA operations.)
    Why does the flush_rq borrow tags from the FLUSH requests ?
    
    The flush_rq is allocated separately, so it is not in the tag_set of blk-mq.
    
    For the non-scheduler case, a FLUSH req occupies a driver tag and
    depends on the completion of the flush_rq. Assume the scenario where all the driver tags
    are held by FLUSH reqs; consequently, the flush_rq cannot get a driver tag
    any more and cannot move the flush sequence forward. An IO hang comes up. To
    avoid this, the flush_rq should borrow a driver tag from the FLUSH reqs.
    
    Recently,
    a commit 923218f (blk-mq: don't allocate driver tag upfront for flush rq)
    was introduced; it changes the way tag borrowing is handled in blk-mq.
    
    Before this patch, with an io scheduler attached, blk-mq would allocate a driver tag before
    delivering the request to blk-flush. blk-flush could then lend this driver tag to the proxy
    flush_rq, and this flush_rq would be queued to hctx->dispatch.
    
    blk_mq_make_request()
    ---
        if (unlikely(is_flush_fua)) {
            blk_mq_put_ctx(data.ctx);
            blk_mq_bio_to_request(rq, bio);
            if (q->elevator) {
                blk_mq_sched_insert_request(rq, false, true, true,
                        true);
            } 
    ---
    
    blk_mq_sched_insert_request()
    ---
        if (rq->tag == -1 && op_is_flush(rq->cmd_flags)) {
            blk_mq_sched_insert_flush(hctx, rq, can_block);
            return;
        }
    ---
    static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx,
                          struct request *rq, bool can_block)
    {
    
        if (blk_mq_get_driver_tag(rq, &hctx, can_block)) {
    
            blk_insert_flush(rq);
            blk_mq_run_hw_queue(hctx, true);
        } else
            blk_mq_add_to_requeue_list(rq, false, true);
    }
    
    And this can cause an issue. Look at the comment of reorder_tags_to_front():
    ---
    If we fail getting a driver tag because all the driver tags are already
    assigned and on the dispatch list, BUT the first entry does not have a
    tag, then we could deadlock. For that case, move entries with assigned
    driver tags to the front, leaving the set of tagged requests in the
    same order, and the untagged set in the same order.
    ---
    If the driver tags are all occupied by FLUSH reqs, and other reqs have to be
    queued on hctx->dispatch because of the shortage of driver tags,
    the flush_rq with its borrowed driver tag will be queued at the tail of hctx->dispatch.
    
    Then we get the scenario described above.
    
    The patch changes the way this case is handled: let the flush_rq get a driver tag
    just before .queue_rq() in blk_mq_dispatch_rq_list().
    This will not cause the IO hang described above, because the FLUSH requests only
    occupy sched tags. But the flush_rq still needs to borrow a sched tag to cheat
    blk-mq.
    
    blk_kick_flush()
    >>>>
        if (q->mq_ops) {
            struct blk_mq_hw_ctx *hctx;
    
            flush_rq->mq_ctx = first_rq->mq_ctx;
    
            if (!q->elevator) {
                fq->orig_rq = first_rq;
                flush_rq->tag = first_rq->tag;
                hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
                blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
            } else {
                flush_rq->internal_tag = first_rq->internal_tag;
    >>>>
    

    Queue state flags


    Let's look at 3 similar state flags of the request_queue.

    WBT


    WBT = Writeback Throttling
    Why do we need wbt ?
    Let's quote some comments from Jens, the developer of this feature:

    When we do background buffered writeback, it should have little impact
    on foreground activity. That's the definition of background activity...
    But for as long as I can remember, heavy buffered writers have not
    behaved like that. For instance, if I do something like this:
    
    $ dd if=/dev/zero of=foo bs=1M count=10k
    
    on my laptop, and then try and start chrome, it basically won't start
    before the buffered writeback is done. Or, for server oriented
    workloads, where installation of a big RPM (or similar) adversely
    impacts database reads or sync writes. When that happens, I get people
    yelling at me.
    
    In conclusion, foreground IOs should be prioritized over background ones.
    Who will be throttled ?
    wbt_should_throttle() gives the answer.
    static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
    {
        const int op = bio_op(bio);
    
        /*
         * If not a WRITE, do nothing
         */
        if (op != REQ_OP_WRITE)
            return false;
    
        /*
         * Don't throttle WRITE_ODIRECT
         */
        if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) == (REQ_SYNC | REQ_IDLE))
            return false;
    
        return true;
    }
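    As a hedged userspace illustration of the exemption above: direct writes are
    submitted as WRITE_ODIRECT (REQ_SYNC | REQ_IDLE), so wbt_should_throttle() lets
    them pass, whereas the same data written through the page cache would be throttled
    once background writeback kicks in. The file path and size below are arbitrary.
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;

        /* O_DIRECT requires alignment; 4096 covers typical logical block sizes */
        if (posix_memalign(&buf, 4096, 4096))
            return 1;
        memset(buf, 0xab, 4096);

        int fd = open("./wbt-test.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return 1;

        /* bypasses the page cache; treated as foreground I/O by wbt */
        ssize_t ret = pwrite(fd, buf, 4096, 0);

        close(fd);
        free(buf);
        return ret == 4096 ? 0 : 1;
    }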
    
    The remaining question is: what about synchronous writes,
    for example filesystem metadata updates ?
    How is it implemented ?
    Let's first look at the hooks across the blk-mq layer.
              blk_mq_make_request()
                    wbt_wait()
                        if !may_queue()
                            sleep
    
                    wbt_track()
                        save track info 
                        on rq->issue_stat
    
              blk_mq_start_request()                        wb_timer_fn()
                    wbt_issue()                                 account the latency of sync IO
                        sync issue time                         and adjust the limits of different IO type
    
              blk_mq_free_request()/__blk_mq_end_request()
                    wbt_done()
                        dec inflight
                        wake up
    
              __blk_mq_requeue_request()
                    wbt_requeue()
                        clear sync issue time
    
    Yes, it looks like the kyber IO scheduler,
    but there is a big difference regarding the action taken when the limit is reached.

    blkdev gendisk hd

    When we access a block device directly, for example /dev/sda1, we do not go through the bdev fs first: /dev/ is devtmpfs, not the bdev fs. Refer to init_special_inode to see this.

            sda1    sda2    sda3    sda4              devtmpfs
                         | [1]
                         V
        blkdev1 blkdev2 blkdev3 blkdev4           blkdev fs
    
    
    
    blkdev - block_device
    disk   - gendisk
    hd     - hd_struct
    [1]    - bdget get blkdev with inode->i_rdev (block devt) from blkdev fs
             get_gendisk get gendisk and partno with block devt and install
             them on blkdev->bd_disk and blkdev->bd_partno
             
    
    In a real workload, the flow is as follows:
    mount_bdev
      sget
        set_bdev_super       xxx_get_block
          set sb->s_bdev       map_bh
                                 bh->bdev = sb->s_bdev
                                 |
                                 V
                             submit_bh_wbc
                               bio_set_dev(bio, bh->b_bdev)
                                 bio->bi_disk = bdev->bd_disk 
                                 bio->bi_partno = bdev->bd_partno
                                 |
                                 V
                             generic_make_request
                               generic_make_request_checks
                                 blk_partition_remap
                                   bio->bi_iter.bi_sector += hd->start_sect |
                                      bio->bi_partno = 0;
                               queue->make_request_fn
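    A hedged sketch of what blk_partition_remap does at step [1] above: the
    partition-relative sector in the bio is rebased onto the whole disk by adding the
    partition's start_sect, and bi_partno is cleared so the request now addresses the
    gendisk itself. The struct and function names below are illustrative only.
    #include <stdio.h>

    typedef unsigned long long sector_t;

    struct fake_hd_struct { sector_t start_sect; };               /* like hd_struct  */
    struct fake_bio       { sector_t bi_sector; int bi_partno; }; /* like bio fields */

    static void partition_remap(struct fake_bio *bio, const struct fake_hd_struct *hd)
    {
        bio->bi_sector += hd->start_sect;   /* partition offset -> disk offset */
        bio->bi_partno  = 0;                /* now addressed against the disk  */
    }

    int main(void)
    {
        /* e.g. sda1 starting at sector 2048, a bio for sector 100 inside sda1 */
        struct fake_hd_struct sda1 = { .start_sect = 2048 };
        struct fake_bio bio = { .bi_sector = 100, .bi_partno = 1 };

        partition_remap(&bio, &sda1);
        printf("absolute sector = %llu, partno = %d\n", bio.bi_sector, bio.bi_partno);
        return 0;
    }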
    
    

    blk sysfs

    Let's look at how the following sysfs interfaces are added.

        /sys/block/nvme/queue/
    	      ^     ^     ^
    		 [1]   [2]   [3]
    
        /sys/block/nvme/mq
    	                ^
    				   [4]
    

    request_queue cleanup and release

    The first thing blk_cleanup_queue needs to do is prevent others from entering the blk path again. This is achieved by invoking blk_set_queue_dying.

    void blk_set_queue_dying(struct request_queue *q)
    {
        blk_queue_flag_set(QUEUE_FLAG_DYING, q);
    
        /*
         * When queue DYING flag is set, we need to block new req
         * entering queue, so we call blk_freeze_queue_start() to
         * prevent I/O from crossing blk_queue_enter().
         */
        blk_freeze_queue_start(q);
    
        if (q->mq_ops)
            blk_mq_wake_waiters(q);
    
        wake up the tag waiters; the hw queues will be run.
        The DYING flag is not the same as QUIESCED; the latter prevents requests from
        entering the lldd.
    
        else {
        ...
        }
    
        /* Make blk_queue_enter() reexamine the DYING flag. */
    
        wake_up_all(&q->mq_freeze_wq);
    }
    
    blk_queue_dying and blk_queue_enter gate the other contexts out of the blk path
    (see the sketch after this list). blk_queue_dying gates:

    • sysfs interface
    • blk_execute_rq_nowait (blk-mq does not appear to do this)
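    A hedged userspace sketch of the blk_queue_enter()/blk_queue_exit() gate. The
    kernel uses a percpu_ref (q->q_usage_counter) plus the mq_freeze_wq waitqueue;
    here a plain atomic counter and two flags stand in for them, only to show the
    ordering: a started freeze makes new entries back off, DYING refuses them, and
    every exit drops the reference the freezer waits to reach zero.
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct fake_queue {
        atomic_int  usage;    /* stands in for q_usage_counter   */
        atomic_bool dying;    /* QUEUE_FLAG_DYING                */
        atomic_bool frozen;   /* set by blk_freeze_queue_start() */
    };

    static bool queue_enter(struct fake_queue *q)
    {
        if (!atomic_load(&q->frozen)) {
            atomic_fetch_add(&q->usage, 1);
            if (!atomic_load(&q->frozen))     /* re-check after taking a ref */
                return true;
            atomic_fetch_sub(&q->usage, 1);   /* raced with a freeze, back off */
        }
        if (atomic_load(&q->dying))
            return false;                     /* like -ENODEV from blk_queue_enter() */
        /* the real code sleeps on mq_freeze_wq and retries once unfrozen;
         * this sketch simply gives up */
        return false;
    }

    static void queue_exit(struct fake_queue *q)
    {
        atomic_fetch_sub(&q->usage, 1);       /* the last exit lets the drain finish */
    }

    int main(void)
    {
        struct fake_queue q;

        atomic_init(&q.usage, 0);
        atomic_init(&q.dying, false);
        atomic_init(&q.frozen, false);

        if (queue_enter(&q)) {
            printf("I/O admitted, usage=%d\n", atomic_load(&q.usage));
            queue_exit(&q);
        }

        /* blk_set_queue_dying() sets DYING and starts the freeze */
        atomic_store(&q.dying, true);
        atomic_store(&q.frozen, true);
        printf("after dying: enter %s\n", queue_enter(&q) ? "succeeds" : "fails");
        return 0;
    }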

    Then blk_cleanup_queue will invoke blk_freeze_queue. It fences off any new requests and also drains all existing requests, whether pending or outstanding.
    Even after the queue has been drained, there may still be contexts that access the request_queue resources, such as the blk-mq run work and the requeue work. blk_sync_queue is used to flush them.
    void blk_sync_queue(struct request_queue *q)
    {
        del_timer_sync(&q->timeout);
        cancel_work_sync(&q->timeout_work);
    
        if (q->mq_ops) {
            struct blk_mq_hw_ctx *hctx;
            int i;
    
            cancel_delayed_work_sync(&q->requeue_work);
            queue_for_each_hw_ctx(q, hctx, i)
                cancel_delayed_work_sync(&hctx->run_work);
        } else {
            cancel_delayed_work_sync(&q->delay_work);
        }
    }
    
    Finally, blk_put_queue puts a reference of q->kobj.
    When the reference reaches zero, blk_queue_ktype's release callback, blk_release_queue, is invoked. It queues __blk_release_queue, which does the final release.

    What needs to be noted is that the gendisk takes an extra ref on its request_queue in __device_add_disk and puts it in disk_release. So the request_queue sticks around as long as the gendisk does.

    blk_integrity

    What is blk_integrity for ?

    
           [ system memory ]
                   |   
                   | D  
                   | M   path1
                   | A
                   V        sas/fc/iscsi
             [ HBA memory]- - - - - - - - ->[ storage volume ]
                                 path2
    
    The data integrity on path2 can be ensured by the transport protocol. Path1 is protected by blk_integrity, which is what we will talk about next.


    How is blk_integrity implemented ?
    Quote from Documentation/block/data-integrity.txt
    Because the format of the protection data is tied to the physical
    disk, each block device has been extended with a block integrity
    profile (struct blk_integrity).  This optional profile is registered
    with the block layer using blk_integrity_register().
    
    The profile contains callback functions for generating and verifying
    the protection data, as well as getting and setting application tags.
    The profile also contains a few constants to aid in completing,
    merging and splitting the integrity metadata.
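    As a hedged illustration of what a generate_fn computes per interval: the
    t10_pi_type*_crc profiles use the T10-DIF CRC16 (polynomial 0x8BB7) as the guard
    tag of each 512-byte sector. The bitwise version below is for illustration only;
    the kernel uses the optimized crc_t10dif() helper.
    #include <stddef.h>
    #include <stdio.h>

    /* CRC16 T10-DIF: poly 0x8BB7, init 0, no reflection, no final xor */
    static unsigned short t10_crc16(const unsigned char *buf, size_t len)
    {
        unsigned short crc = 0;

        for (size_t i = 0; i < len; i++) {
            crc ^= (unsigned short)(buf[i] << 8);
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x8000) ? (crc << 1) ^ 0x8BB7 : crc << 1;
        }
        return crc;
    }

    int main(void)
    {
        unsigned char sector[512];

        for (size_t i = 0; i < sizeof(sector); i++)
            sector[i] = (unsigned char)i;        /* dummy sector payload */

        /* this is what would be stored in the guard field of the PI tuple */
        printf("guard tag = 0x%04x\n", t10_crc16(sector, sizeof(sector)));
        return 0;
    }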
    
    Let's look at how scsi sd implements this.
    sd_probe_async
      -> sd_dif_config_host
    --
        /* Enable DMA of protection information */
        if (scsi_host_get_guard(sdkp->device->host) & SHOST_DIX_GUARD_IP) {
            if (type == T10_PI_TYPE3_PROTECTION)
                bi.profile = &t10_pi_type3_ip;
            else
                bi.profile = &t10_pi_type1_ip;
    
            bi.flags |= BLK_INTEGRITY_IP_CHECKSUM;
        } else
            if (type == T10_PI_TYPE3_PROTECTION)
                bi.profile = &t10_pi_type3_crc;
            else
                bi.profile = &t10_pi_type1_crc;
    
        bi.tuple_size = sizeof(struct t10_pi_tuple);
        sd_printk(KERN_NOTICE, sdkp,
              "Enabling DIX %s protection\n", bi.profile->name);
    
        if (dif && type) {
            bi.flags |= BLK_INTEGRITY_DEVICE_CAPABLE;
    
            if (!sdkp->ATO)
                goto out;
    
            if (type == T10_PI_TYPE3_PROTECTION)
                bi.tag_size = sizeof(u16) + sizeof(u32);
            else
                bi.tag_size = sizeof(u16);
    
            sd_printk(KERN_NOTICE, sdkp, "DIF application tag size %u\n",
                  bi.tag_size);
        }
    
    out:
        blk_integrity_register(disk, &bi);
    --
    


    The process of blk_integrity
    blk_mq_make_request
      -> bio_integrity_prep
        -> bio_integrity_add_page  //bio->bi_integrity
        -> bio_integrity_process(bio, &bio->bi_iter, bi->profile->generate_fn); //bio_data_dir(bio) == WRITE)
    
    bio_endio
      -> bio_integrity_endio
        -> __bio_integrity_endio
    --
        if (bio_op(bio) == REQ_OP_READ && !bio->bi_status &&
            (bip->bip_flags & BIP_BLOCK_INTEGRITY) && bi->profile->verify_fn) {
            INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
            queue_work(kintegrityd_wq, &bip->bip_work);
            return false;
        }
    --
    
    static void bio_integrity_verify_fn(struct work_struct *work)
    {
        struct bio_integrity_payload *bip =
            container_of(work, struct bio_integrity_payload, bip_work);
        struct bio *bio = bip->bip_bio;
        struct blk_integrity *bi = blk_get_integrity(bio->bi_disk);
        struct bvec_iter iter = bio->bi_iter;
    
        /*
         * At the moment verify is called bio's iterator was advanced
         * during split and completion, we need to rewind iterator to
         * it's original position.
         */
        if (bio_rewind_iter(bio, &iter, iter.bi_done)) {
            bio->bi_status = bio_integrity_process(bio, &iter,
                                   bi->profile->verify_fn);
        } else {
            bio->bi_status = BLK_STS_IOERR;
        }
    
        bio_integrity_free(bio);
        bio_endio(bio);
    }
    
    


    blk_integrity and fs
    After the request is issued to the HBA, the data is transferred to the HBA's internal buffer through DMA and then verified against the protection metadata. During the DMA transfer, the data in the sglist (the page cache pages) must not be modified. This is guaranteed by the fs.
    Steps of writing data to a file:
    1. writing into the page cache
    aops.write_begin
      -> lock page
      -> wait_for_stable_page
        -> if bdi_cap_stable_pages_required //BDI_CAP_STABLE_WRITES
             wait_on_page_writeback
    copy from user buffer to page cache
    aops.write_end
    
    2. writeback the pagecache to disk
    lock page
    set page writeback
    submit_bio
    unlock page
    
    3. io completion
    end bio
      -> end_page_writeback
        -> test_clear_page_writeback
        -> wake_up_page(page, PG_writeback)
    
    BDI_CAP_STABLE_WRITES is set in blk_integrity_register.

    blk loop

    What's blk-loop for ?

    
        /dev/loopX     /home/ubuntu-16.04.4-desktop-amd64.iso
             |         ^         |              |
             v         |         v              v
        +-------------C-------------------+  +-------+
        |     vfs cache|                  |  |  DIO  |
        +-------------C-------------------+  +-------+
             |         |         |              |
             v         |         v              v
        +-------------C------------------------------+
        |  block layer |                             |
        +-------------C------------------------------+
             |         |         |
             v         |         v
            blk-loop driver    SCSI layer
    
    The backend of a block device could be an HDD, an SSD or a storage subsystem attached via FC or iSCSI, and it could also be a local file.

    There is another concept here: direct IO.
    Data from applications goes directly to the block layer, bypassing the system
    file cache.
    

    How to create

    Step 1

    /dev/loop-control 
    loop_ctl_fops
      -> loop_control_ioctl //LOOP_CTL_ADD
        -> loop_add
    There are a lot of interesting things in loop_add; let's look at it.
    static int loop_add(struct loop_device **l, int i)
    {
        struct loop_device *lo;
        struct gendisk *disk;
        int err;
    
        err = -ENOMEM;
        lo = kzalloc(sizeof(*lo), GFP_KERNEL);
        if (!lo)
            goto out;
    
        lo->lo_state = Lo_unbound; //This means no file is bound on this device
    
        /* allocate id, if @id >= 0, we're requesting that specific id */
        if (i >= 0) {
            err = idr_alloc(&loop_index_idr, lo, i, i + 1, GFP_KERNEL);
            if (err == -ENOSPC)
                err = -EEXIST;
        } else {
            err = idr_alloc(&loop_index_idr, lo, 0, 0, GFP_KERNEL);
        }
        if (err < 0)
            goto out_free_dev;
        i = err;
    
        err = -ENOMEM;
        lo->tag_set.ops = &loop_mq_ops;
        lo->tag_set.nr_hw_queues = 1;
        /*
        It should be an interesting theme to find out how many hw_queues to be
        required to get better performance.
        The real work is done in loop kthread, what .queue_rq does is just to insert
        a work or wakeup the kthread.
         */
        lo->tag_set.queue_depth = 128;
        lo->tag_set.numa_node = NUMA_NO_NODE;
        lo->tag_set.cmd_size = sizeof(struct loop_cmd);
        lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
        lo->tag_set.driver_data = lo;
    
        err = blk_mq_alloc_tag_set(&lo->tag_set);
        if (err)
            goto out_free_idr;
    
        lo->lo_queue = blk_mq_init_queue(&lo->tag_set);
        if (IS_ERR_OR_NULL(lo->lo_queue)) {
            err = PTR_ERR(lo->lo_queue);
            goto out_cleanup_tags;
        }
        lo->lo_queue->queuedata = lo;
    
        blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
    
        /*
         * By default, we do buffer IO, so it doesn't make sense to enable
         * merge because the I/O submitted to backing file is handled page by
         * page. For directio mode, merge does help to dispatch bigger request
         * to underlayer disk. We will enable merge once directio is enabled.
         */
        queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
    
        err = -ENOMEM;
        disk = lo->lo_disk = alloc_disk(1 << part_shift);
        ...
        disk->fops        = &lo_fops; //this the fops for /dev/loopX
        disk->private_data    = lo;
        disk->queue        = lo->lo_queue;
        sprintf(disk->disk_name, "loop%d", i);
        add_disk(disk);
        *l = lo;
        return lo->lo_number;
        ...
    }
    
    
    Step 2
    /dev/loopX
    lo_fops
      -> lo_ioctl //LOOP_SET_FD
        -> loop_set_fd
    static int loop_set_fd(struct loop_device *lo, fmode_t mode,
                   struct block_device *bdev, unsigned int arg)
    {
        ...
        file = fget(arg);
        if (!file)
            goto out;
        ...
        mapping = file->f_mapping;
        inode = mapping->host;
        //regular file or block file
        if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
            goto out_putf;
    
        if (!(file->f_mode & FMODE_WRITE) || !(mode & FMODE_WRITE) ||
            !file->f_op->write_iter)
            lo_flags |= LO_FLAGS_READ_ONLY;
    
        error = -EFBIG;
        size = get_loop_size(lo, file);
        if ((loff_t)(sector_t)size != size)
            goto out_putf;
        error = loop_prepare_queue(lo);
        
                kthread_init_worker(&lo->worker);
                lo->worker_task = kthread_run(loop_kthread_worker_fn,
                        &lo->worker, "loop%d", lo->lo_number);
                if (IS_ERR(lo->worker_task))
                return -ENOMEM;
                set_user_nice(lo->worker_task, MIN_NICE);
        
    
        set_device_ro(bdev, (lo_flags & LO_FLAGS_READ_ONLY) != 0);
    
        lo->use_dio = false;
        lo->lo_device = bdev;
        lo->lo_flags = lo_flags;
        lo->lo_backing_file = file;
        lo->transfer = NULL;
        lo->ioctl = NULL;
        lo->lo_sizelimit = 0;
        lo->old_gfp_mask = mapping_gfp_mask(mapping);
        mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
    
        if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
            blk_queue_write_cache(lo->lo_queue, true, false);
    
        loop_update_dio(lo);
        set_capacity(lo->lo_disk, size);
        bd_set_size(bdev, size << 9);
        loop_sysfs_init(lo);
        /* let user-space know about the new size */
        kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
    
        set_blocksize(bdev, S_ISBLK(inode->i_mode) ?
                  block_size(inode->i_bdev) : PAGE_SIZE);
    
        lo->lo_state = Lo_bound;
        ...
    }
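    A hedged userspace sketch of the two steps above, using the real loop ioctls:
    ask /dev/loop-control for a free minor (LOOP_CTL_GET_FREE is the convenience
    variant of LOOP_CTL_ADD), then bind a backing file to /dev/loopN with
    LOOP_SET_FD. The backing file path is just the example used earlier; run as root.
    #include <fcntl.h>
    #include <linux/loop.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int ctl = open("/dev/loop-control", O_RDWR);
        if (ctl < 0)
            return 1;

        int nr = ioctl(ctl, LOOP_CTL_GET_FREE);   /* ends up in loop_add() */
        if (nr < 0)
            return 1;

        char path[32];
        snprintf(path, sizeof(path), "/dev/loop%d", nr);

        int loopfd  = open(path, O_RDWR);
        int backing = open("/home/ubuntu-16.04.4-desktop-amd64.iso", O_RDWR);
        if (loopfd < 0 || backing < 0)
            return 1;

        /* loop_set_fd(): Lo_unbound -> Lo_bound, capacity taken from the file */
        if (ioctl(loopfd, LOOP_SET_FD, backing) < 0)
            return 1;

        printf("%s is now backed by the iso file\n", path);

        ioctl(loopfd, LOOP_CLR_FD, 0);            /* tear it down again */
        close(backing);
        close(loopfd);
        close(ctl);
        return 0;
    }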
    

    Kthread or Workqueue ?

    When a request enters .queue_rq, how is it handled next ?
    It needs to be handled in another context, because we already own a deep stack from vfs_read/write down to the driver's .queue_rq. This context could be a kworker or a standalone kthread. But which one should we use ?
    commit e03a3d7 ( block: loop: use kthread_work ) changed block loop from work to kthread context. Let's look at what block loop does before and after this patch.

    Work based.
    
               Concurrently                   Sequentially                         
        Read   Read   Read   Read      Write<->Write<->Write<->Write
        +---+  +---+  +---+  +---+     +---+
        | W |  | W |  | W |  | W |     | W |
        +---+  +---+  +---+  +---+     +---+
          |      |      |      |         |
       + -v- - - v - - -v- - - v - - - - v - - +
       |          Unbound worker pool          |
       + - - - - - - - - - - - - - - - - - - - +
    
    +---+
    | W |  work instance
    +---+
    
    
    For reads, block loop issues them concurrently as far as possible. This is because read operations often need to wait for the page cache to be filled, i.e. a read is usually synchronous. Issuing reads concurrently is good for random reads, but it is not as useful for sequential reads, which often hit the page cache.
    For writes, block loop issues them sequentially, because writes usually only reach the page cache and are therefore fast enough.
                 Write<->Write<->Read<->Read<->Write ....
              +- - - - -+
              | kthread |
              +- - - - -+
    
    When DIO/AIO is introduced, reads and writes on the backing file are no longer blocking operations.

    DIO & AIO on backing file

    In Linux, read operations are almost synchronous unless the required data is already in the page cache; otherwise the reader has to wait for the page cache to be filled by the block device through the block layer and the blk driver. Even with the readahead mechanism, the page cache is often missed with random reads.
    Consequently, the loop driver execution context (kworker or standalone kthread) has to wait, and this delays the other requests whose page cache may already be populated.
    On the other hand, two layers of page cache are involved: one for the file over the loop device, and one for the backing file. This is unnecessary and wastes memory.

    Ming Lei introduced backing-file DIO and AIO support in block loop.

    commit bc07c10a3603a5ab3ef01ba42b3d41f9ac63d1b6
    Author: Ming Lei 
    Date:   Mon Aug 17 10:31:51 2015 +0800
    
        block: loop: support DIO & AIO
        
        There are at least 3 advantages to use direct I/O and AIO on
        read/write loop's backing file:
        
        1) double cache can be avoided, then memory usage gets
        decreased a lot
        
        2) not like user space direct I/O, there isn't cost of
        pinning pages
        
        3) avoid context switch for obtaining good throughput
        - in buffered file read, random I/O top throughput is often obtained
        only if they are submitted concurrently from lots of tasks; but for
        sequential I/O, most of times they can be hit from page cache, so
        concurrent submissions often introduce unnecessary context switch
        and can't improve throughput much. There was such discussion[1]
        to use non-blocking I/O to improve the problem for application.
        - with direct I/O and AIO, concurrent submissions can be
        avoided and random read throughput can't be affected meantime
        
        xfstests(-g auto, ext4) is basically passed when running with
        direct I/O(aio), one exception is generic/232, but it failed in
        loop buffered I/O(4.2-rc6-next-20150814) too.
        
        Follows the fio test result for performance purpose:
            4 jobs fio test inside ext4 file system over loop block
        
        1) How to run
            - KVM: 4 VCPUs, 2G RAM
            - linux kernel: 4.2-rc6-next-20150814(base) with the patchset
            - the loop block is over one image on SSD.
            - linux psync, 4 jobs, size 1500M, ext4 over loop block
            - test result: IOPS from fio output
        
        2) Throughput(IOPS) becomes a bit better with direct I/O(aio)
                -------------------------------------------------------------
                test cases          |randread   |read   |randwrite  |write  |
                -------------------------------------------------------------
                base                |8015       |113811 |67442      |106978
                -------------------------------------------------------------
                base+loop aio       |8136       |125040 |67811      |111376
                -------------------------------------------------------------
        
        - somehow, it should be caused by more page cache avaiable for
        application or one extra page copy is avoided in case of direct I/O
        
        3) context switch
                - context switch decreased by ~50% with loop direct I/O(aio)
            compared with loop buffered I/O(4.2-rc6-next-20150814)
        
        4) memory usage from /proc/meminfo
                -------------------------------------------------------------
                                           | Buffers       | Cached
                -------------------------------------------------------------
                base                       | > 760MB       | ~950MB
                -------------------------------------------------------------
                base+loop direct I/O(aio)  | < 5MB         | ~1.6GB
                -------------------------------------------------------------
        
        - so there are much more page caches available for application with
        direct I/O
        
        [1] https://lwn.net/Articles/612483/
        
        Signed-off-by: Ming Lei 
        Reviewed-by: Christoph Hellwig 
        Signed-off-by: Jens Axboe 
    
    After that, we get the following diagram.
        /dev/loopX            > /home/ubuntu-16.04.4-desktop-amd64.iso
             |               /         |
             v              /          v
        +-------------+    /       +-------+
        | vfs cache|  |   /        |  DIO  |
        +-------------+  /         +-------+
             |          /              |
             v         /               v
        +-------------C-----------------------------+
        | block layer  |                            |
        +-------------C-----------------------------+
             |         |               |
             v         |               v
            blk-loop driver        SCSI layer
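    A hedged usage sketch: with the AIO/DIO support above in place, userspace can ask
    the loop driver to switch the backing file to direct I/O via the
    LOOP_SET_DIRECT_IO ioctl (losetup --direct-io does the same). The device name is
    illustrative, and the ioctl may fail if the backing file or its block size does
    not allow direct I/O.
    #include <fcntl.h>
    #include <linux/loop.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/loop0", O_RDWR);
        if (fd < 0)
            return 1;

        /* 1 = do direct I/O (and AIO) against the backing file, 0 = buffered */
        int ret = ioctl(fd, LOOP_SET_DIRECT_IO, 1);

        close(fd);
        return ret ? 1 : 0;
    }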
    

    blk-stats

    Before looking into the implementation of blk-stat in the kernel, let's first look at how the information provided by blk-stats is utilized by iostat.

    #iostat -c -d -x /dev/sda2 2 100
    Linux 4.16.0-rc3+ (will-ThinkPad-L470)     03/20/2018     _x86_64_    (4 CPU)
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              12.61    0.03    2.23    0.82    0.00   84.31
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sda2              0.20     5.86    2.46    4.04    23.54    56.83    24.72     0.14   20.56    6.17   29.31   5.67   3.69
    
    
    rrqm/s      The number of read requests merged per second queued to the device.
    wrqm/s      The number of write requests merged per second queued to the device.
    r/s         The number of read requests issued to the device per second.
    w/s         The number of write requests issued to the device per second.
    avgrq-sz    The average size (in sectors) of the requests issued to the device.
    avgqu-sz    The average queue length of the requests issued to the device.
    await       The average time (milliseconds) for I/O requests issued to the device to be served.
                This includes the time spent by the requests in queue and the time spent servicing them.
    r_await     The average time (in milliseconds) for read requests issued to the device to be served.
                This includes the time spent by the requests in queue and the time spent servicing them.
    w_await     The average time (in milliseconds) for write requests issued to the device to be served.
                This includes the time spent by the requests in queue and the time spent servicing them.
    svctm       The average service time (in milliseconds) for I/O requests issued to the device.
                Warning! Do not trust this field; it will be removed in a future version of sysstat.
    %util       Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device).
                Device saturation occurs when this values is close to 100%.
    
    How are they calculated ? Based on write_ext_stat:
    ioj, ioi     two samples, j = i + 1 (ioj is the later one)
    itv          interval between the two samples

    rrqm/s    (ioj->rd_merges - ioi->rd_merges)/itv
    wrqm/s    (ioj->wr_merges - ioi->wr_merges)/itv
    r/s       (ioj->rd_ios - ioi->rd_ios)/itv
    w/s       (ioj->wr_ios - ioi->wr_ios)/itv
    avgrq-sz  ((ioj->rd_sect - ioi->rd_sect) + (ioj->wr_sect - ioi->wr_sect))/
              (ioj->nr_ios - ioi->nr_ios)
    avgqu-sz  (ioj->rq_ticks - ioi->rq_ticks)/itv
    await     ((ioj->rd_ticks - ioi->rd_ticks) + (ioj->wr_ticks - ioi->wr_ticks))/
              (ioj->nr_ios - ioi->nr_ios)

    r_await    similar to await, reads only
    w_await    similar to await, writes only

    %util      (ioj->tot_ticks - ioi->tot_ticks)/itv
    
    We can refer to read_diskstats_stat to see where this data comes from.
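    A small sketch of the arithmetic above: sample /proc/diskstats twice and derive
    r/s, w/s, await and %util for one device. The field layout follows the sscanf()
    in read_diskstats_stat below; the device name and the 2-second interval are
    assumptions.
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    struct ds { unsigned long rd_ios, rd_merges, rd_sec, rd_ticks,
                              wr_ios, wr_merges, wr_sec, wr_ticks,
                              ios_pgr, tot_ticks, rq_ticks; };

    static int sample(const char *dev, struct ds *d)
    {
        char line[256], name[32];
        unsigned int major, minor;
        FILE *fp = fopen("/proc/diskstats", "r");

        if (!fp)
            return -1;
        while (fgets(line, sizeof(line), fp)) {
            sscanf(line, "%u %u %31s %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
                   &major, &minor, name,
                   &d->rd_ios, &d->rd_merges, &d->rd_sec, &d->rd_ticks,
                   &d->wr_ios, &d->wr_merges, &d->wr_sec, &d->wr_ticks,
                   &d->ios_pgr, &d->tot_ticks, &d->rq_ticks);
            if (!strcmp(name, dev)) {
                fclose(fp);
                return 0;
            }
        }
        fclose(fp);
        return -1;
    }

    int main(void)
    {
        const char *dev = "sda2";          /* assumed device name */
        double itv = 2.0;                  /* seconds between samples */
        struct ds a, b;

        if (sample(dev, &a))
            return 1;
        sleep((unsigned int)itv);
        if (sample(dev, &b))
            return 1;

        unsigned long nr_ios = (b.rd_ios - a.rd_ios) + (b.wr_ios - a.wr_ios);

        printf("r/s=%.2f w/s=%.2f await=%.2fms util=%.2f%%\n",
               (b.rd_ios - a.rd_ios) / itv,
               (b.wr_ios - a.wr_ios) / itv,
               nr_ios ? ((b.rd_ticks - a.rd_ticks) +
                         (b.wr_ticks - a.wr_ticks)) / (double)nr_ios : 0.0,
               /* tot_ticks is in ms: ms spent doing I/O over the interval */
               (b.tot_ticks - a.tot_ticks) / (itv * 1000.0) * 100.0);
        return 0;
    }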

    Next, let's find out how this statistics data is generated in the kernel.
    Based on diskstats_show, the counters are members of hd_struct.dkstats (a percpu variable). For reference:
    read_diskstats_stat
    void read_diskstats_stat(int curr)
    {
        ...
        if ((fp = fopen(DISKSTATS, "r")) == NULL) //  proc/diskstats
            return;
    
        while (fgets(line, 256, fp) != NULL) {
    
            /* major minor name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq */
            i = sscanf(line, "%u %u %s %lu %lu %lu %lu %lu %lu %lu %u %u %u %u",
                   &major, &minor, dev_name,
                   &rd_ios, &rd_merges_or_rd_sec, &rd_sec_or_wr_ios, &rd_ticks_or_wr_sec,
                   &wr_ios, &wr_merges, &wr_sec, &wr_ticks, &ios_pgr, &tot_ticks, &rq_ticks);
    
            if (i == 14) {
                /* Device or partition */
                if (!dlist_idx && !DISPLAY_PARTITIONS(flags) &&
                    !is_device(dev_name, ACCEPT_VIRTUAL_DEVICES))
                    continue;
                sdev.rd_ios     = rd_ios;
                sdev.rd_merges  = rd_merges_or_rd_sec;
                sdev.rd_sectors = rd_sec_or_wr_ios;
                sdev.rd_ticks   = (unsigned int) rd_ticks_or_wr_sec;
                sdev.wr_ios     = wr_ios;
                sdev.wr_merges  = wr_merges;
                sdev.wr_sectors = wr_sec;
                sdev.wr_ticks   = wr_ticks;
                sdev.ios_pgr    = ios_pgr;
                sdev.tot_ticks  = tot_ticks;
                sdev.rq_ticks   = rq_ticks;
            }
            ...
            save_stats(dev_name, curr, &sdev, iodev_nr, st_hdr_iodev);
        }
        ...
    }
    
    diskstats_show
    static int diskstats_show(struct seq_file *seqf, void *v)
    {
        struct gendisk *gp = v;
        struct disk_part_iter piter;
        struct hd_struct *hd;
        char buf[BDEVNAME_SIZE];
        unsigned int inflight[2];
        int cpu;
    
        /*
        if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
            seq_puts(seqf,    "major minor name"
                    "     rio rmerge rsect ruse wio wmerge "
                    "wsect wuse running use aveq"
                    "\n\n");
        */
    
        disk_part_iter_init(&piter, gp, DISK_PITER_INCL_EMPTY_PART0);
        while ((hd = disk_part_iter_next(&piter))) {
            cpu = part_stat_lock();
            part_round_stats(gp->queue, cpu, hd);
            part_stat_unlock();
            part_in_flight(gp->queue, hd, inflight);
            seq_printf(seqf, "%4d %7d %s %lu %lu %lu "
                   "%u %lu %lu %lu %u %u %u %u\n",
                   MAJOR(part_devt(hd)), MINOR(part_devt(hd)),
                   disk_name(gp, hd->partno, buf),
                   part_stat_read(hd, ios[READ]),
                   part_stat_read(hd, merges[READ]),
                   part_stat_read(hd, sectors[READ]),
                   jiffies_to_msecs(part_stat_read(hd, ticks[READ])),
                   part_stat_read(hd, ios[WRITE]),
                   part_stat_read(hd, merges[WRITE]),
                   part_stat_read(hd, sectors[WRITE]),
                   jiffies_to_msecs(part_stat_read(hd, ticks[WRITE])),
                   inflight[0],
                   jiffies_to_msecs(part_stat_read(hd, io_ticks)),
                   jiffies_to_msecs(part_stat_read(hd, time_in_queue))
                );
        }
        disk_part_iter_exit(&piter);
    
        return 0;
    }
    
    write_ext_stat
    void write_ext_stat(int curr, unsigned long long itv, int fctr,
                struct io_hdr_stats *shi, struct io_stats *ioi,
                struct io_stats *ioj)
    {
        char *devname = NULL;
        struct stats_disk sdc, sdp;
        struct ext_disk_stats xds;
        double r_await, w_await;
        
        /*
         * Counters overflows are possible, but don't need to be handled in
         * a special way: The difference is still properly calculated if the
         * result is of the same type as the two values.
         * Exception is field rq_ticks which is incremented by the number of
         * I/O in progress times the number of milliseconds spent doing I/O.
         * But the number of I/O in progress (field ios_pgr) happens to be
         * sometimes negative...
         */
        sdc.nr_ios    = ioi->rd_ios + ioi->wr_ios;
        sdp.nr_ios    = ioj->rd_ios + ioj->wr_ios;
    
        sdc.tot_ticks = ioi->tot_ticks;
        sdp.tot_ticks = ioj->tot_ticks;
    
        sdc.rd_ticks  = ioi->rd_ticks;
        sdp.rd_ticks  = ioj->rd_ticks;
        sdc.wr_ticks  = ioi->wr_ticks;
        sdp.wr_ticks  = ioj->wr_ticks;
    
        sdc.rd_sect   = ioi->rd_sectors;
        sdp.rd_sect   = ioj->rd_sectors;
        sdc.wr_sect   = ioi->wr_sectors;
        sdp.wr_sect   = ioj->wr_sectors;
        
        compute_ext_disk_stats(&sdc, &sdp, itv, &xds);
        
        r_await = (ioi->rd_ios - ioj->rd_ios) ?
              (ioi->rd_ticks - ioj->rd_ticks) /
              ((double) (ioi->rd_ios - ioj->rd_ios)) : 0.0;
        w_await = (ioi->wr_ios - ioj->wr_ios) ?
              (ioi->wr_ticks - ioj->wr_ticks) /
              ((double) (ioi->wr_ios - ioj->wr_ios)) : 0.0;
    
        /* Print device name */
        if (DISPLAY_PERSIST_NAME_I(flags)) {
            devname = get_persistent_name_from_pretty(shi->name);
        }
        if (!devname) {
            devname = shi->name;
        }
        if (DISPLAY_HUMAN_READ(flags)) {
            printf("%s\n%13s", devname, "");
        }
        else {
            printf("%-13s", devname);
        }
    
        /*       rrq/s wrq/s   r/s   w/s  rsec  wsec  rqsz  qusz await r_await w_await svctm %util */
        printf(" %8.2f %8.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %7.2f %7.2f %6.2f %6.2f\n",
               S_VALUE(ioj->rd_merges, ioi->rd_merges, itv),
               S_VALUE(ioj->wr_merges, ioi->wr_merges, itv),
               S_VALUE(ioj->rd_ios, ioi->rd_ios, itv),
               S_VALUE(ioj->wr_ios, ioi->wr_ios, itv),
               ll_s_value(ioj->rd_sectors, ioi->rd_sectors, itv) / fctr,
               ll_s_value(ioj->wr_sectors, ioi->wr_sectors, itv) / fctr,
               xds.arqsz,
               S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0,
               xds.await,
               r_await,
               w_await,
               /* The ticks output is biased to output 1000 ticks per second */
               xds.svctm,
               /*
                * Again: Ticks in milliseconds.
            * In the case of a device group (option -g), shi->used is the number of
            * devices in the group. Else shi->used equals 1.
            */
               shi->used ? xds.util / 10.0 / (double) shi->used
                         : xds.util / 10.0);    /* shi->used should never be null here */
    }
    
    

    blk-timeout

    There is a per-request_queue timer to defend against a block device that stops responding.

    The timer is armed by blk_add_timer.
    The timer is request_queue.timeout and timeout fn is blk_rq_timed_out_timer.
    static void blk_rq_timed_out_timer(struct timer_list *t)
    {
        struct request_queue *q = from_timer(q, t, timeout);
    
        kblockd_schedule_work(&q->timeout_work);
    }
    
    The main timeout work is executed in kworker context.
    There is a difference between blk-legacy and blk-mq.
    In blk-legacy, when the timer is armed, the request is added to request_queue.timeout_list,
    and when the request is completed, it is removed from that list:
    blk_requeue_request/blk_finish_request
      -> blk_delete_timer
    The blk_timeout_work will check the requests on request_queue.timeout_list.
    
    In blk-mq, the request_queue.timeout_list is not used any more; instead, it
    employs blk_mq_queue_tag_busy_iter, which uses the occupied
    driver tags to track the requests.
    
    static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
    {
        struct bt_iter_data *iter_data = data;
        struct blk_mq_hw_ctx *hctx = iter_data->hctx;
    
        struct blk_mq_tags *tags = hctx->tags;
    
        bool reserved = iter_data->reserved;
        struct request *rq;
    
        if (!reserved)
            bitnr += tags->nr_reserved_tags;
        rq = tags->rqs[bitnr];
    
        /*
         * We can hit rq == NULL here, because the tagging functions
         * test and set the bit before assining ->rqs[].
         */
        if (rq && rq->q == hctx->queue)
            iter_data->fn(hctx, rq, iter_data->data, reserved);
        return true;
    }
    
    When there is no io scheduler, a request always occupies a driver tag. If the lldd prevents new requests from entering, through blk_mq_quiesce_queue or other means, and request_queue.timeout has been armed, will the requests sitting in the blk-mq queues be expired ?
    Also, when a request is completed, we don't see a blk_delete_timer counterpart in __blk_mq_complete_request or __blk_mq_end_request.

    Another difference is the method used to handle the race between timeout completion and regular completion.
    blk-legacy employs blk_mark_rq_complete.
    void blk_complete_request(struct request *req)
    {
        if (unlikely(blk_should_fake_timeout(req->q)))
            return;
        if (!blk_mark_rq_complete(req))
            __blk_complete_request(req);
    }
    static void blk_rq_check_expired(struct request *rq, unsigned long *next_timeout,
                  unsigned int *next_set)
    {
        const unsigned long deadline = blk_rq_deadline(rq);
    
        if (time_after_eq(jiffies, deadline)) {
            list_del_init(&rq->timeout_list);
    
            /*
             * Check if we raced with end io completion
             */
            if (!blk_mark_rq_complete(rq))
                blk_rq_timed_out(rq);
        } else if (!*next_set || time_after(*next_timeout, deadline)) {
            *next_timeout = deadline;
            *next_set = 1;
        }
    }
    
    In blk-mq, after Tejun's "blk-mq: reimplement timeout handling" (https://lkml.org/lkml/2018/1/9/761), blk_mark_rq_complete has been discarded.
    rcu/srcu is employed to synchronize the timeout path and the regular completion path instead of atomic operations. In addition, it avoids the following scenario.
    blk_mq_check_expired
    ---
        deadline = READ_ONCE(rq->deadline);
    
    A delay may be introduced here by preemption, an interrupt or something else; during that window the rq is
    completed and freed, then allocated and reinitialized again by someone else.
    And we could time out the new instance here.
    
        if (time_after_eq(jiffies, deadline)) {
            if (!blk_mark_rq_complete(rq)) {
                blk_mq_rq_timed_out(rq, reserved);
            }
    ---
    
    After Tejun's commit, things look like this:
    blk_mq_check_expired
    ---
        /* read coherent snapshots of @rq->state_gen and @rq->deadline */
        while (true) {
            start = read_seqcount_begin(&rq->gstate_seq);
            gstate = READ_ONCE(rq->gstate);
            deadline = blk_rq_deadline(rq);
            if (!read_seqcount_retry(&rq->gstate_seq, start))
                break;
            cond_resched();
        }
    
    A delay may be introduced here by preemption, an interrupt or something else; during that window the rq is
    completed and freed, then allocated and reinitialized again by someone else.
    
        /* if in-flight && overdue, mark for abortion */
        if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
            time_after_eq(jiffies, deadline)) {
            blk_mq_rq_update_aborted_gstate(rq, gstate);
            data->nr_expired++;
            hctx->nr_expired++;
        } 
    ---
    static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
            struct request *rq, void *priv, bool reserved)
    {
    
        /*
         * We marked @rq->aborted_gstate and waited for RCU.  If there were
         * completions that we lost to, they would have finished and
         * updated @rq->gstate by now; otherwise, the completion path is
         * now guaranteed to see @rq->aborted_gstate and yield.  If
         * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
         */
    Note: the rcu/srcu synchronization happens between blk_mq_check_expired and
    blk_mq_terminate_expired.
    
        if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
            READ_ONCE(rq->gstate) == rq->aborted_gstate)
    
    There are two parts in the gstate: generation and state.
    When we save the gstate into aborted_gstate, the state is MQ_RQ_IN_FLIGHT.
    If the recycled new instance has not been started yet, the state will not match,
    because it is MQ_RQ_IDLE; if it has started, the generation will not match, because the
    generation part of gstate is incremented when the state switches to
    MQ_RQ_IN_FLIGHT.
    
            blk_mq_rq_timed_out(rq, reserved);
    }
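    A hedged userspace sketch of the gstate_seq idea: the writer bumps a sequence
    counter around updates of (gstate, deadline); readers retry until they observe an
    even, unchanged counter, which guarantees the two fields were read as one coherent
    snapshot. Plain C11 atomics stand in for the kernel's seqcount API, and the names
    are illustrative.
    #include <stdatomic.h>
    #include <stdio.h>

    struct fake_rq {
        atomic_uint   seq;        /* gstate_seq        */
        unsigned long gstate;     /* generation|state  */
        unsigned long deadline;
    };

    static void writer_update(struct fake_rq *rq, unsigned long g, unsigned long d)
    {
        atomic_fetch_add_explicit(&rq->seq, 1, memory_order_release); /* odd: update in progress */
        rq->gstate   = g;
        rq->deadline = d;
        atomic_fetch_add_explicit(&rq->seq, 1, memory_order_release); /* even: update done */
    }

    static void reader_snapshot(struct fake_rq *rq,
                                unsigned long *g, unsigned long *d)
    {
        unsigned int start;

        do {
            /* wait until no update is in flight (even counter) */
            do {
                start = atomic_load_explicit(&rq->seq, memory_order_acquire);
            } while (start & 1);

            *g = rq->gstate;
            *d = rq->deadline;
            /* retry if the writer ran while we were reading */
        } while (atomic_load_explicit(&rq->seq, memory_order_acquire) != start);
    }

    int main(void)
    {
        struct fake_rq rq;
        unsigned long g, d;

        atomic_init(&rq.seq, 0);
        rq.gstate = 0;
        rq.deadline = 0;

        writer_update(&rq, 0x11 /* pretend gen|IN_FLIGHT */, 12345);
        reader_snapshot(&rq, &g, &d);
        printf("gstate=%#lx deadline=%lu\n", g, d);
        return 0;
    }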
    

    blk-throttle

    Basis

                                   generic_make_request
                                           |
                                           V
       tg_A->sq->queued (qn_A_r_self (bio, bio, bio))    tg_B->sq->queued (qn_B_r_self (bio, bio, bio))
                                           |   
                                           V
                     tg_ABg->sq->queued (qn_ABg_r_self(bio, bio) qn_A_r_parent (bio), qn_B_r_parent (bio bio))
                                            |
                                            V
                                 td->sq->queued (qn_ABg_r_parent(bio))
                                            | 
                                            V                                 
                                    generic_make_request (td->dispatch_work context)
    
                                    bio (w/ BIO_THROTTLED) will not pass
                                    through blk-throttle again.
    
    qn  per-tg, contains throttled bios.
        Bios are dispatched qn by qn rather than bio by bio; otherwise one tg could
        fill up the budget and starve the others (throtl_pop_queued).
        There are two dimensions of qn:
        r/w: when dispatching, 75% reads, 25% writes (throtl_dispatch_tg)
        self/parent: during dispatching, some bios may be queued upwards to the parent's
        sq while others are not. The parent qn is used to contain the bios
        queued to the parent's sq; the self qn contains the others.
    
    sq  throtl_service_queue, per-tg or td
        construct the hierarchy, td->sq is the root node
        queued throl_qnode
        first_pending_disptime
        pending_timer, dispatch bios upwards to parent sq until td->sq, queue td dispatch_work
    
    tg  throtl_grp, per (blk-throt cgroup - request_queue)
        bps,iops limits, bytes, ios dispatched number
    
    td  throtl_data, per-request_queue
        queued[r/w] qn list; only the bios that have reached here can be issued.
        dispatch_work, generic_make_request
        limit_index (LOW/MAX)
    
    
    
    How to account the bps and iops ? 
                                current
                                   |
     tg->slice_start               v         tg->slice_end
              |-------|------|-------|------| ....
              |< - - - -   - - - - ->|
                         V
                    elapsed_rnd
    
    
              limit = tg_bps/iops_limit(tg, rw) * elapsed_rnd
    
    
    
    | - - - |  td->throtl_slice
    Refer to tg_with_in_bps_limit / tg_with_in_iops_limit
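    A hedged sketch of the accounting above: given the bytes already dispatched since
    slice_start, check whether one more bio fits into the budget that the rounded-up
    elapsed slice allows, and if not, estimate how long it has to wait. Jiffies are
    modeled as milliseconds and all names are illustrative; the real logic lives in
    tg_with_in_bps_limit.
    #include <stdbool.h>
    #include <stdio.h>

    #define THROTL_SLICE_MS 100              /* stands in for td->throtl_slice */

    static bool within_bps_limit(unsigned long long bps_limit,
                                 unsigned long long bytes_disp,
                                 unsigned long long bio_bytes,
                                 unsigned long slice_start_ms,
                                 unsigned long now_ms,
                                 unsigned long *wait_ms)
    {
        /* round the elapsed time up to a whole number of throtl slices */
        unsigned long elapsed = now_ms - slice_start_ms;
        unsigned long elapsed_rnd =
            ((elapsed + THROTL_SLICE_MS) / THROTL_SLICE_MS) * THROTL_SLICE_MS;

        unsigned long long budget = bps_limit * elapsed_rnd / 1000;

        if (bytes_disp + bio_bytes <= budget) {
            *wait_ms = 0;
            return true;                      /* charge and dispatch directly */
        }

        /* extra time needed for the budget to cover the overshoot */
        unsigned long long extra = bytes_disp + bio_bytes - budget;
        *wait_ms = (unsigned long)(extra * 1000 / bps_limit)
                   + (elapsed_rnd - elapsed);
        return false;
    }

    int main(void)
    {
        unsigned long wait;

        /* 1 MB/s limit, 80 KB already sent 50 ms into the slice, 64 KB bio */
        bool ok = within_bps_limit(1024 * 1024, 80 * 1024, 64 * 1024, 0, 50, &wait);
        printf("dispatch now: %s, wait %lums\n", ok ? "yes" : "no", wait);
        return 0;
    }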
    
    When the tg->bytes/io_disp is over the limit:
    blk_throtl_bio
      -> throtl_add_bio_tg
        -> set THROTL_TG_WAS_EMPTY when sq->nr_queued == 0
        -> throtl_qnode_add_bio(bio, qn, &sq->queued[rw]);
          -> add bio to qn, add qn to sq
          -> blkg_get(tg_to_blkg(qn->tg))
             throttled bio dispatching is an asynchronous event;
             we need a reference on the blkg to prevent it from being freed
             
        -> add tg to parent sq pending rb tree with tg->disptime as key
      if THROTL_TG_WAS_EMPTY is set
      -> tg_update_disptime
      
      next dispatch time will be calculated here through tg_may_dispatch
      
      -> throtl_schedule_next_dispatch(tg->service_queue.parent_sq, true);
        -> update_min_dispatch_time
          -> pick up the leftest node from the parent sq pending rb tree
             and update parent_sq->first_pending_disptime
          -> throtl_schedule_pending_timer
            -> schedule parent_sq pending_timer on first_pending_disptime
    
    Think of a case here:
    A bio is throttled and its dispatch time is 5 jiffies. What if a new bio comes in with a 3 jiffies dispatch time ?
    Why does every tg need a dispatch time ?
    
    bio size
            ^
            |  o - bio
            |
            |                o2
            |                |     o3
            |  o0   o1       |     |
            |   |   |        |     |
            +-----------------------------------------> time
                    t0       t1
    If we issue o2 at t0, the bps limit would be exceeded; we have to delay it to
    t1 so that the bps limit is complied with.
    
    
    However, what about the following case:
    
    bio size
            ^
            |  o - bio
            |
            |                o2 (planed)
            |                |     
            |  o0   o1o3     |    
            |   |   | |      |     
            +-----------------------------------------> time
                    t0       t1
    
    We have scheduled the parent_sq pending timer to t1 to dispatch o2. When o3 arrives
    at t0, the pending_timer would need to expire earlier to dispatch o3, otherwise o3
    is delayed. How is this case handled in blk-throtl ?

    There is no such issue,
    unless o3 has a higher priority than o2. What blk-throtl does here is just to
    limit the bps.
    In fact, blk-throtl maintains the queued bio lists of reads and writes separately, so write
    bios will not block read bios. And blk-throtl will try to dispatch 75% READS and
    25% WRITES; refer to throtl_dispatch_tg.
    
    We have illustrated the hierarchical structure of blk-throtl. Let's walk through the source code.
    submit path
    generic_make_request
      -> generic_make_request_checks
        -> blkcg_bio_issue_check
          -> blk_throtl_bio
    ---
        while (true) {
            if (tg->last_low_overflow_time[rw] == 0)
                tg->last_low_overflow_time[rw] = jiffies;
            throtl_downgrade_check(tg);
            throtl_upgrade_check(tg);
    /* throtl is FIFO - if bios are already queued, should queue */
            if (sq->nr_queued[rw])
                break;
    
            /* if above limits, break to queue */
            if (!tg_may_dispatch(tg, bio, NULL)) {
                tg->last_low_overflow_time[rw] = jiffies;
                if (throtl_can_upgrade(td, tg)) {
                    throtl_upgrade_state(td);
                    goto again;
                }
                break;
            }
    
            /* within limits, let's charge and dispatch directly */
            throtl_charge_bio(tg, bio);
    
            /*
             * We need to trim slice even when bios are not being queued
             * otherwise it might happen that a bio is not queued for
             * a long time and slice keeps on extending and trim is not
             * called for a long time. Now if limits are reduced suddenly
             * we take into account all the IO dispatched so far at new
             * low rate and * newly queued IO gets a really long dispatch
             * time.
             *
             * So keep on trimming slice even if bio is not queued.
             */ 
            throtl_trim_slice(tg, rw);
    
            /*
             * @bio passed through this layer without being throttled.
             * Climb up the ladder.  If we''re already at the top, it
             * can be executed directly.
             */
            qn = &tg->qnode_on_parent[rw];
            sq = sq->parent_sq;    // check limit upward
            tg = sq_to_tg(sq);
            if (!tg)
                goto out_unlock;
        }
    ---
    
    Dispatch path:
    
    static void throtl_pending_timer_fn(struct timer_list *t)
    {
        ...
    again:
        parent_sq = sq->parent_sq;
        dispatched = false;
    
        while (true) {
            throtl_log(sq, "dispatch nr_queued=%u read=%u write=%u",
                   sq->nr_queued[READ] + sq->nr_queued[WRITE],
                   sq->nr_queued[READ], sq->nr_queued[WRITE]);
    
            ret = throtl_select_dispatch(sq);
              -> throtl_dispatch_tg // if tg_may_dispatch
                -> tg_dispatch_one_bio
                  -> throtl_pop_queued
                  -> throtl_charge_bio
                  -> add to sq of parent tg or td
            if (ret) {
                throtl_log(sq, "bios disp=%u", ret);
                dispatched = true;
            }
    
        there may still be queued bios in the tg
            if (throtl_schedule_next_dispatch(sq, false))
                break;
    
            /* this dispatch windows is still open, relax and repeat */
            spin_unlock_irq(q->queue_lock);
        cpu_relax(); //give others a chance to get in.
        the queued spinlock ensures the waiters get this lock in turn.
            spin_lock_irq(q->queue_lock);
        }
    
        if (!dispatched)
            goto out_unlock;
    
        if (parent_sq) {
            /* @parent_sq is another throl_grp, propagate dispatch */
            if (tg->flags & THROTL_TG_WAS_EMPTY) {
                tg_update_disptime(tg);
                if (!throtl_schedule_next_dispatch(parent_sq, false)) {
                    /* window is already open, repeat dispatching */
                    sq = parent_sq;
                    tg = sq_to_tg(sq);
                    goto again;
                }
            }
        } else {
            /* reached the top-level, queue issueing */
            queue_work(kthrotld_workqueue, &td->dispatch_work);
        }
    out_unlock:
        spin_unlock_irq(q->queue_lock);
    }
    
    
    


    low limit

    The io.low limit is only available in cgroup2. A cgroup with an io.max limit will never dispatch more IO than its max limit, but that cannot ensure the cgroup always gets an appropriate bps or iops. For example:

    Tasks in cgroup_read have a very high read workload, and tasks in cgroup_write
    have a very high write workload. They both issue requests to the same disk with wbt
    enabled. The write operations will be limited due to wbt, and IO performance in
    cgroup_write will be very poor while cgroup_read keeps issuing read operations.

    Neither cgroup exceeds its io.max, but cgroup_write gets very poor
    performance. This is not fair for cgroup_write.
    
    Or another example from https://lwn.net/Articles/709474/
    
    An example usage is we have a high prio cgroup with high 'low' limit and a low
    prio cgroup with low 'low' limit. If the high prio cgroup isn't running, the low
    prio can run above its 'low' limit, so we don't waste the bandwidth. When the
    high prio cgroup runs and is below its 'low' limit, low prio cgroup will run
    under its 'low' limit. This will protect high prio cgroup to get more
    resources.
    
    The final goal is to keep the bps/iops between io.low and io.max.
    There are two questions that need to be figured out.
    When to switch to the io.low limit
    Related variables in tg:
    
    • last_check_time
    • last_bytes/io_disp[R/W] (throtl_charge_bio)
    • last_low_overflow_time[R/W]
    Check the bps or iops through last_bytes/io_disp / (jiffies - last_check_time).
    If the result > io.low limit, set last_low_overflow_time, which means the bps/iops
    is higher than io.low during the last period. If jiffies >= tg->last_low_overflow_time
    + td->throtl_slice, we say the io.low limit is reached. This is done by
    throtl_downgrade_check. throtl_downgrade_state switches the limit to LOW.
    static void throtl_downgrade_state(struct throtl_data *td, int new)
    {
        td->scale /= 2;

        throtl_log(&td->service_queue, "downgrade, scale %d", td->scale);
        if (td->scale) {
            td->low_upgrade_time = jiffies - td->scale * td->throtl_slice;
            return;
        }

        td->limit_index = new;
        td->low_downgrade_time = jiffies;
    }
    After switching to the io.low limit, when do we get back to io.max ?
    When switched to the io.low limit,
    blk_throtl_bio -> tg_may_dispatch -> tg_with_in_bps_limit -> tg_bps_limit will
    return the io.low limit through tg->bps[rw][td->limit_index],
    and then more bios will be throttled and queued.
    
    
    last_low_overflow_time (bps/iops is higher than the limit) is updated in the following places:
    
                  if limit_index == MAX
     ^                throttled and queued,   blk_throtl_bio updates last_low_overflow_time
     |                
     |           ----------------------------------------------   LIMIT_MAX
     |
     |           if limit_index == MAX
     |              charge and dispatch,    throtl_downgrade_check updates last_low_overflow_time
     |disp       if limit_index == LOW
     |bps/          throttled and queue,    blk_throtl_bio updates last_low_overflow_time
     |iops       
     |
     |
     |            ----------------------------------------------  LIMIT_LOW
     |             if limit_index == MAX && time_after(now, last_low_overflow_time + throtl_slice)
     |               downgrade
     |             if limit_index == LOW
     |                 charge and dispatch
     |             if limit_index == LOW && time_after(now, last_low_overflow_time + throtl_slice)
                    upgrade
    position:
    
  • throtl_downgrade_check (only makes sense when LIMIT_MAX)
  • tg_may_dispatch returns false, which indicates the bps/iops is above the limit, no matter MAX or LOW
  • before queueing a throttled bio, tg_may_dispatch may be skipped because sq->nr_queued > 0

    Are the 2nd and 3rd cases necessary ?
    last_low_overflow_time indicates the bps/iops is above the low limit during the past
    period of time. For the 2nd and 3rd cases, if the limit_index is MAX, beyond question,
    the bps/iops is above the low limit, because the blk-throtl pending timer ensures the
    dispatching bps/iops is equal to the max limit. However, if the limit_index is LOW, a bio
    being throttled and queued indicates that the submit bps/iops is above the low limit,
    not the dispatch bps/iops, which is the one ensured to be equal to the low limit.

             submit bps/iops is                       dispatch bps/iops is
             above limit                              equal to limit

        vfs                 push                 pop    sq->pending_timer
        blk_throtl_bio     ----->  sq->queued[] ----->  throtl_dispatch_tg
        sq->nr_queued > 0                                 tg_dispatch_one_bio
        throtl_add_bio_tg                                 (charge, queue up, trim)

    The condition to switch to MAX:
    throtl_upgrade_check
    --
        if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
            return;

        tg->last_check_time = now;
    --
    ...
      -> throtl_tg_can_upgrade
        -> time_after_eq(jiffies,
               tg_last_low_overflow_time(tg) + tg->td->throtl_slice) &&
           throtl_tg_is_idle(tg))
           ^^^^
           Should it be a '||' ?

    throtl_upgrade_state does the real work.
    static void throtl_upgrade_state(struct throtl_data *td)
    {
        struct cgroup_subsys_state *pos_css;
        struct blkcg_gq *blkg;

        throtl_log(&td->service_queue, "upgrade to max");
        td->limit_index = LIMIT_MAX;
        td->low_upgrade_time = jiffies;
        td->scale = 0;
        rcu_read_lock();
        blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
            struct throtl_grp *tg = blkg_to_tg(blkg);
            struct throtl_service_queue *sq = &tg->service_queue;

            tg->disptime = jiffies - 1;      //force this tg to be dispatched
            throtl_select_dispatch(sq);      //Move the bios of child tgs upward
            throtl_schedule_next_dispatch(sq, true);
        }
        rcu_read_unlock();
        //Dispatch !!!
        throtl_select_dispatch(&td->service_queue);
        throtl_schedule_next_dispatch(&td->service_queue, true);
        queue_work(kthrotld_workqueue, &td->dispatch_work);
    }

    After the io limit upgrades, blk-throtl tries to ramp dispatching up smoothly.
    Let's look at tg_bps_limit and throtl_adjusted_limit.
    ---
        if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_LOW] &&
            tg->bps[rw][LIMIT_LOW] != tg->bps[rw][LIMIT_MAX]) {
            uint64_t adjusted;

            adjusted = throtl_adjusted_limit(tg->bps[rw][LIMIT_LOW], td);
            ret = min(tg->bps[rw][LIMIT_MAX], adjusted);
        }
    ---
    static uint64_t throtl_adjusted_limit(uint64_t low, struct throtl_data *td)
    {
        /* arbitrary value to avoid too big scale */
        if (td->scale < 4096 &&
            time_after_eq(jiffies,
                          td->low_upgrade_time + td->scale * td->throtl_slice))
            td->scale = (jiffies - td->low_upgrade_time) / td->throtl_slice;

        return low + (low >> 1) * td->scale;
    }

    throtl_adjusted_limit will re-balance the bandwidth between tgs.
    throtl_upgrade_state has updated td->scale and td->low_upgrade_time,
    so the limit will not reach io.max immediately after throtl_upgrade_state.
    The actual limit is:

        limit = low + (low >> 1) * (now - td->low_upgrade_time)/td->throtl_slice

    The tg that has the higher low limit will get more bandwidth because its limit
    grows faster; this should be the core idea of io.low.
  • When a cgroup is free or even idle, it does stay below its io.low limit, but that should not count as being starved. How do we tell?
    Quote from comment of throtl_tg_is_idle:
    
    cgroup is idle if:
    - single idle is too long, longer than a fixed value (in case user
      configure a too big threshold) or 4 times of idletime threshold
    - average think time is more than threshold
    - IO latency is largely below threshold
    
    
    Think time
    The interval between the completion of the previous IO and the submission of the
    next IO. blk_throtl_bio_endio records the completion time in tg->last_finish_time.
    Then blk_throtl_bio -> blk_throtl_update_idletime calculates the average think time.

    static void blk_throtl_update_idletime(struct throtl_grp *tg)
    {
        unsigned long now = ktime_get_ns() >> 10;
        unsigned long last_finish_time = tg->last_finish_time;

        if (now <= last_finish_time || last_finish_time == 0 ||
            last_finish_time == tg->checked_last_finish_time)
            return;

        tg->avg_idletime = (tg->avg_idletime * 7 + now - last_finish_time) >> 3;
        tg->checked_last_finish_time = last_finish_time;
    }

    Latency
    The latency here is the interval between issuing a request to the device and its
    completion, so it reflects the processing capability of the storage device. If a
    cgroup's IO latency is below the IO latency threshold, the cgroup is being handled
    fairly by the device.

    My question is: if a cgroup is below its low limit but its IO latency is acceptable,
    we could say this cgroup is served fairly by the device, but not served fairly by
    the block layer, right?

    Commit comment of b9147dd (blk-throttle: add a mechanism to estimate IO latency):

    User configures latency target, but the latency threshold for each request size
    isn't fixed. For a SSD, the IO latency highly depends on request size. To calculate
    latency threshold, we sample some data, eg, average latency for request size 4k,
    8k, 16k, 32k .. 1M. The latency threshold of each request size will be the sample
    latency (I'll call it base latency) plus latency target. For example, the base
    latency for request size 4k is 80us and user configures latency target 60us. The
    4k latency threshold will be 80 + 60 = 140us.

    To sample data, we calculate the order base 2 of rounded up IO sectors. If the IO
    size is bigger than 1M, it will be accounted as 1M. Since the calculation does
    round up, the base latency will be slightly smaller than actual value. Also if
    there isn't any IO dispatched for a specific IO size, we will use the base latency
    of smaller IO size for this IO size.

    But we shouldn't sample data at any time. The base latency is supposed to be
    latency where disk isn't congested, because we use latency threshold to schedule
    IOs between cgroups. If disk is congested, the latency is higher, using it for
    scheduling is meaningless. Hence we only do the sampling when block throttling is
    in the LOW limit, with assumption disk isn't congested in such state. If the
    assumption isn't true, eg, low limit is too high, calculated latency threshold
    will be higher.

    Hard disk is completely different. Latency depends on spindle seek instead of
    request size. Currently this feature is SSD only, we probably can use a fixed
    threshold like 4ms for hard disk though.

    td keeps an average latency for each request size separately; every tg has its own
    latency_target, IOW, a tolerance. For an SSD, when td's average latency is low, we
    can say the device is relatively relaxed. This explains the '&&' with
    throtl_tg_is_idle: the upgrade also requires the device itself to be idle enough.

    The sample collection is hooked in blk_stat_add.

    blk_stat_add  // the latency here is the interval between blk_mq_start_request
                  // and __blk_mq_complete_request
      -> blk_throtl_stat_add
        -> throtl_track_latency

    static void throtl_track_latency(struct throtl_data *td, sector_t size,
                     int op, unsigned long time)
    {
        struct latency_bucket *latency;
        int index;

        if (!td || td->limit_index != LIMIT_LOW ||
            !(op == REQ_OP_READ || op == REQ_OP_WRITE) ||
            !blk_queue_nonrot(td->queue))
            // We assume there is no congestion when LIMIT_LOW,
            // and the latency makes sense only when the device is not congested.
            return;

        index = request_bucket_index(size);

        latency = get_cpu_ptr(td->latency_buckets[op]);
        latency[index].total_latency += time;
        latency[index].samples++;
        put_cpu_ptr(td->latency_buckets[op]);
    }
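
    As promised above, here is a standalone userspace sketch (not kernel code) of how
    the io.low limit ramps up after throtl_upgrade_state. The numbers are assumptions:
    low = 100 MB/s, max = 400 MB/s, throtl_slice = 100 ms.
    ---
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t low = 100, max = 400;      /* MB/s, hypothetical limits */
        unsigned int throtl_slice = 100;    /* ms, hypothetical slice */

        for (unsigned int elapsed = 0; elapsed <= 700; elapsed += 100) {
            /* td->scale = (jiffies - td->low_upgrade_time) / td->throtl_slice */
            unsigned int scale = elapsed / throtl_slice;
            /* throtl_adjusted_limit(): low + (low >> 1) * scale, capped at io.max */
            uint64_t adjusted = low + (low >> 1) * scale;
            uint64_t limit = adjusted < max ? adjusted : max;

            printf("%4u ms after upgrade: limit = %3llu MB/s\n",
                   elapsed, (unsigned long long)limit);
        }
        return 0;
    }
    ---
    The output climbs 100, 150, 200, ... until it hits io.max, which is exactly the
    "grow faster if your low limit is higher" behaviour described above.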

    bsg

    The Linux sg driver is an upper-level SCSI subsystem device driver that is used primarily to handle devices _not_ covered by the other upper
    level drivers: sd (disks), st (tapes) and sr (CDROMs and DVDs). The sg driver is used for enclosure management, cd writers,
    applications that read cd audio digitally and scanners. Sg can also be used for less usual tasks performed on disks, tapes and cdroms.
    Sg is a character device driver which, in some contexts, gives it advantages over block device drivers such as sd and sr. The interface of sg
    is at the level of SCSI command requests and their associated responses.
    
    From about Linux kernel 2.6.24, there is an alternate SCSI pass-through driver called "bsg" (block SCSI generic driver). The bsg driver has
    device names of the form /dev/bsg/0:1:2:3 and supports the SG_IO ioctl with the sg version 3 interface. The bsg driver also supports the sg
    version 4 interface which at this time the sg driver does not. Amongst other improvements the sg version 4 interface supports SCSI bidirectional commands.
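    
    For reference, a hedged userspace sketch of driving bsg through the sg v4 SG_IO
    ioctl (a 6-byte INQUIRY). The /dev/bsg/0:0:0:0 node name is an assumption and
    error handling is minimal.
    ---
    #include <fcntl.h>
    #include <linux/bsg.h>
    #include <scsi/sg.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 };  /* INQUIRY, 96 bytes */
        unsigned char inq[96], sense[32];
        struct sg_io_v4 hdr;

        int fd = open("/dev/bsg/0:0:0:0", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        memset(&hdr, 0, sizeof(hdr));
        hdr.guard = 'Q';                         /* marks the sg v4 interface */
        hdr.protocol = BSG_PROTOCOL_SCSI;
        hdr.subprotocol = BSG_SUB_PROTOCOL_SCSI_CMD;
        hdr.request_len = sizeof(cdb);
        hdr.request = (uintptr_t)cdb;
        hdr.din_xfer_len = sizeof(inq);          /* data-in: device -> host */
        hdr.din_xferp = (uintptr_t)inq;          /* mapped by blk_rq_map_user */
        hdr.max_response_len = sizeof(sense);
        hdr.response = (uintptr_t)sense;
        hdr.timeout = 10000;                     /* ms */

        if (ioctl(fd, SG_IO, &hdr) < 0)
            perror("SG_IO");
        else
            printf("vendor: %.8s\n", (const char *)inq + 8);

        close(fd);
        return 0;
    }
    ---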
    
    How does it work ?
    • setup
      bsg_setup_queue
      ---
      
          // A new request_queue
      
          q = blk_alloc_queue(GFP_KERNEL);
          if (!q)
              return ERR_PTR(-ENOMEM);
          q->cmd_size = sizeof(struct bsg_job) + dd_job_size;
          q->init_rq_fn = bsg_init_rq;
          q->exit_rq_fn = bsg_exit_rq;
          q->initialize_rq_fn = bsg_initialize_rq;
      
          q->request_fn = bsg_request_fn;
      
      
          ret = blk_init_allocated_queue(q);
          if (ret)
              goto out_cleanup_queue;
      
          q->queuedata = dev;
          q->bsg_job_fn = job_fn;
          blk_queue_flag_set(QUEUE_FLAG_BIDI, q);
          blk_queue_softirq_done(q, bsg_softirq_done);
          blk_queue_rq_timeout(q, BLK_DEFAULT_SG_TIMEOUT);
      
          ret = bsg_register_queue(q, dev, name, &bsg_transport_ops, release);
      ---
      
    • issue request
      take write as an example:
      bsg_write
        -> __bsg_write
          -> bsg_map_hdr
            -> blk_get_request
            -> q->bsg_dev.ops->fill_hdr
            -> blk_rq_map_user //hdr->dout_xferp points to userland buffer
              -> blk_rq_map_user_iov // userland buffer will be mapped directly for zero copy I/O
         -> bsg_add_command
           -> blk_execute_rq_nowait
      
      bsg_request_fn
        -> blk_fetch_request
          -> blk_peek_request
          -> blk_start_request
        -> bsg_prepare_job // kref_init(&job->kref)
        -> q->bsg_job_fn
      
    • complete request
      bsg_softirq_done
        -> bsg_job_put
          -> kref_put(&job->kref, bsg_teardown_job)
      bsg_teardown_job
        -> blk_end_request_all
       
      This is a very interesting method: the bsg request is not completed until
      job->kref drops to zero, which closes the race between the blk-timeout path
      and the normal completion path.
      Look at the following code:
      fc_bsg_job_timeout
      ---
          inflight = bsg_job_get(job);
      
          if (inflight && i->f->bsg_timeout) {
              /* call LLDD to abort the i/o as it has timed out */
              err = i->f->bsg_timeout(job);
              if (err == -EAGAIN) {
                  bsg_job_put(job);
                  return BLK_EH_RESET_TIMER;
              } else if (err)
                  printk(KERN_ERR "ERROR: FC BSG request timeout - LLD "
                      "abort failed with status %d\n", err);
          }
      
          /* the blk_end_sync_io() doesn't check the error */
          if (!inflight)
              return BLK_EH_NOT_HANDLED;
          else
              return BLK_EH_HANDLED;
      ---
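      
      Below is a minimal userspace sketch (my own illustration, not kernel code) of
      this refcount-guarded completion pattern: both the completion path and the
      timeout path drop a reference, and only the final put performs the teardown,
      so the request cannot be completed twice.
      ---
      #include <stdatomic.h>
      #include <stdio.h>

      struct job {
          atomic_int ref;                   /* plays the role of job->kref */
      };

      static void job_teardown(struct job *j)   /* plays the role of bsg_teardown_job() */
      {
          (void)j;
          printf("last reference dropped: complete the request\n");
      }

      static void job_get(struct job *j)
      {
          atomic_fetch_add(&j->ref, 1);
      }

      static void job_put(struct job *j)        /* plays the role of bsg_job_put() */
      {
          if (atomic_fetch_sub(&j->ref, 1) == 1)
              job_teardown(j);
      }

      int main(void)
      {
          struct job j;

          atomic_init(&j.ref, 1);   /* bsg_prepare_job(): kref_init() */

          job_get(&j);              /* timeout path takes a reference before aborting */
          job_put(&j);              /* normal completion races in: no teardown yet */
          job_put(&j);              /* timeout path finishes: teardown happens once */
          return 0;
      }
      ---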
      
    bidi request
    bidi aka bidirectional commands: such a command carries both a data-out and a
    data-in transfer at the same time.
    Look at bsg_map_hdr
    ---
        if (hdr->dout_xfer_len && hdr->din_xfer_len) {
            if (!test_bit(QUEUE_FLAG_BIDI, &q->queue_flags)) {
                ret = -EOPNOTSUPP;
                goto out;
            }
    
            next_rq = blk_get_request(q, REQ_OP_SCSI_IN, GFP_KERNEL);
            if (IS_ERR(next_rq)) {
                ret = PTR_ERR(next_rq);
                goto out;
            }
    
            rq->next_rq = next_rq;
    
            ret = blk_rq_map_user(q, next_rq, NULL, uptr64(hdr->din_xferp),
                           hdr->din_xfer_len, GFP_KERNEL);
            if (ret)
                goto out_free_nextrq;
        }
    ---
    
    

    direct_IO

    What happens when we do direct IO on a block device?

    __generic_file_write_iter
    ---
        if (iocb->ki_flags & IOCB_DIRECT) {
            loff_t pos, endbyte;
    
            written = generic_file_direct_write(iocb, from);
            if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
                goto out;
    
            // if direct_IO doesn't complete all of the IO, fallback to buffered IO.
    
            status = generic_perform_write(file, from, pos = iocb->ki_pos);
            ...
    
            /*
             * We need to ensure that the page cache pages are written to
             * disk and invalidated to preserve the expected O_DIRECT
             * semantics.
             */
    
            endbyte = pos + status - 1;
            err = filemap_write_and_wait_range(mapping, pos, endbyte);
            if (err == 0) {
                iocb->ki_pos = endbyte + 1;
                written += status;
                invalidate_mapping_pages(mapping,
                             pos >> PAGE_SHIFT,
                             endbyte >> PAGE_SHIFT);
            } else {
                /*
                 * We don't know how much we wrote, so just return
                 * the number of bytes which were direct-written
                 */
            }
        }
    ---
    
    generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
    {
        ...
        if (iocb->ki_flags & IOCB_NOWAIT) {
            /* If there are pages to writeback, return */
            if (filemap_range_has_page(inode->i_mapping, pos,
                           pos + iov_iter_count(from)))
                return -EAGAIN;
        } else {
    
            written = filemap_write_and_wait_range(mapping, pos,
                                pos + write_len - 1);
    
            if (written)
                goto out;
        }
    
        /*
         * After a write we want buffered reads to be sure to go to disk to get
         * the new data.  We invalidate clean cached page from the region we're
         * about to write.  We do this *before* the write so that we can return
         * without clobbering -EIOCBQUEUED from ->direct_IO().
         */
    
        written = invalidate_inode_pages2_range(mapping,
                        pos >> PAGE_SHIFT, end);
        ...
        written = mapping->a_ops->direct_IO(iocb, from);
        ...
        if (written > 0) {
            pos += written;
            write_len -= written;
    
            // Interesting thing here: if the file is expanded by the direct IO,
            // we have to update the inode size ourselves.
            if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
                i_size_write(inode, pos);
                mark_inode_dirty(inode);
            }
            iocb->ki_pos = pos;
        }
    
        iov_iter_revert(from, write_len - iov_iter_count(from));
    out:
        return written;
    }
    
    blkdev_direct_IO
      -> __blkdev_direct_IO_simple // Let's look at the simpler case.
    ---
        ...
        struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs, *bvec;
        ...
        if (nr_pages <= DIO_INLINE_BIO_VECS)
            vecs = inline_vecs;
        else {
            vecs = kmalloc_array(nr_pages, sizeof(struct bio_vec),
                         GFP_KERNEL);
            if (!vecs)
                return -ENOMEM;
        }
    
        bio_init(&bio, vecs, nr_pages);
        bio_set_dev(&bio, bdev);
        bio.bi_iter.bi_sector = pos >> 9;
        bio.bi_write_hint = iocb->ki_hint;
        bio.bi_private = current;
        bio.bi_end_io = blkdev_bio_end_io_simple;
        bio.bi_ioprio = iocb->ki_ioprio;
    
        // The most important thing here is to fill the bi_io_vec
                                    /
                                    | bv_page
        bio->bi_io_vec [ bio_vec ] <  bv_len
                       [ bio_vec ]  | bv_offset
                       [ bio_vec ]  \
                       ...
        bio_iov_iter_get_pages
          -> iov_iter_get_pages
            -> get_user_pages_fast
        It gets and pins the pages behind the userland buffers.
        These pages are then handed to the block layer directly,
        so we can say this is zero-copy.
        Note: get_user_pages_fast does not guarantee that all of the requested pages
              are grabbed and pinned.
    
        ret = bio_iov_iter_get_pages(&bio, iter);
        if (unlikely(ret))
            return ret;
        ret = bio.bi_iter.bi_size;
    
        if (iov_iter_rw(iter) == READ) {
            bio.bi_opf = REQ_OP_READ;
            if (iter_is_iovec(iter))
                should_dirty = true;
        } else {
            bio.bi_opf = dio_bio_write_op(iocb);
            task_io_account_write(ret);
        }
    
        qc = submit_bio(&bio);
        for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (!READ_ONCE(bio.bi_private))
                break;
            if (!(iocb->ki_flags & IOCB_HIPRI) ||
                !blk_poll(bdev_get_queue(bdev), qc))
                io_schedule();
        }
    
        // We sleep in the loop above waiting for the completion;
        // blkdev_bio_end_io_simple wakes us up.
    
        __set_current_state(TASK_RUNNING);
    
        bio_for_each_segment_all(bvec, &bio, i) {
            if (should_dirty && !PageCompound(bvec->bv_page))
                set_page_dirty_lock(bvec->bv_page);
            put_page(bvec->bv_page);
        }
    
        if (vecs != inline_vecs)
            kfree(vecs);
    
        if (unlikely(bio.bi_status))
            ret = blk_status_to_errno(bio.bi_status);
    
        bio_uninit(&bio);
    ---
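    
    A userspace sketch of issuing direct IO against a block device, which ends up in
    blkdev_direct_IO via the path above. /dev/nvme0n1 and the 4096-byte alignment are
    assumptions; O_DIRECT generally requires the buffer, offset and length to be
    aligned to the logical block size.
    ---
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        void *buf;
        int fd;

        /* O_DIRECT needs an aligned buffer; plain malloc() is usually not enough. */
        if (posix_memalign(&buf, 4096, len))
            return 1;

        fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* This read bypasses the page cache and goes through blkdev_direct_IO. */
        if (pread(fd, buf, len, 0) != (ssize_t)len)
            perror("pread");

        close(fd);
        free(buf);
        return 0;
    }
    ---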
    

    blk RPM

    RPM

    Traditional suspend/resume

    Runtime suspend/resume
    * Once the subsystem-level suspend callback (or the driver suspend callback, 
      if invoked directly) has completed successfully for the given device, the PM 
      core regards the device as suspended, which need not mean that it has been 
      put into a low power state.  It is supposed to mean, however, that the 
      device will not process data and will not communicate with the CPU(s) and 
      RAM until the appropriate resume callback is executed for it.  The runtime 
      PM status of a device after successful execution of the suspend callback is 
      'suspended'.
    

    Hooks in BLK

    Hooks in blk-legacy

    __elv_add_request
      -> blk_pm_add_request
    ---
        if    q->dev // support RPM
       && !(rq->rq_flags & RQF_PM) // not a PM command
           && q->nr_pending++ == 0
           && (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
    
           pm_request_resume(q->dev) // start resume
    ---
    elv_requeue_request
      -> blk_pm_requeue_request
        ---
        if (rq->q->dev && !(rq->rq_flags & RQF_PM))
            rq->q->nr_pending--;
        ---
      -> __elv_add_request()//ELEVATOR_INSERT_REQUEUE
    
    __blk_put_request
      -> blk_pm_put_request
    ---
        if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
            pm_runtime_mark_last_busy(rq->q->dev);
    ---
    
    blk_peek_request
      -> elv_next_request
        -> iterate q->queue_head
           if blk_pm_allow_request
             return it
        ---
        switch (rq->q->rpm_status) {
        case RPM_RESUMING:
        case RPM_SUSPENDING:
            return rq->rq_flags & RQF_PM;
        case RPM_SUSPENDED:
            return false;
        }
    
        return true;
        ---
    
    Don't process normal requests when queue is suspended
    or in the process of suspending/resuming
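    
    These hooks only take effect when q->dev is set. A hedged sketch of how a driver
    could opt its request_queue into runtime PM, roughly what the SCSI sd driver does
    at probe time; the 5 s autosuspend delay is an arbitrary example, and in practice
    userspace still has to write "auto" to the device's power/control attribute.
    ---
    #include <linux/blkdev.h>
    #include <linux/pm_runtime.h>

    static void example_enable_rpm(struct request_queue *q, struct device *dev)
    {
        /* Associate the queue with the device: sets q->dev and q->rpm_status,
         * which is what the blk_pm_* hooks above key off. */
        blk_pm_runtime_init(q, dev);

        /* Suspend only after the queue has been idle for a while; the timer
         * is re-armed from dev->power.last_busy (see blk_pm_put_request). */
        pm_runtime_set_autosuspend_delay(dev, 5000);
        pm_runtime_use_autosuspend(dev);
    }
    ---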
    
    

    Work process

    The normal runtime PM flow in the block layer is:

            blk_pre_runtime_suspend
              if q->nr_pending is zero
                 set q->rpm_status to RPM_SUSPENDING
                   |
                   v
            sdev_runtime_suspend
              -> pm->runtime_suspend
                   |
                   v
            blk_post_runtime_suspend
              -> set state to RPM_SUSPENDED
    
    When a new request is added:
            __elv_add_request
              -> blk_pm_add_request
              ---
        if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
            (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
            pm_request_resume(q->dev);
              ---
    The resume process will be started here.
    
    Before the resume is completed, requests will not be issued to the LLDD.
    
            blk_peek_request
              -> elv_next_request
              ---
            list_for_each_entry(rq, &q->queue_head, queuelist) {
                if (blk_pm_allow_request(rq))
                    return rq;
              ---
    
    During the process of pm runtime resuming:
            blk_pre_runtime_resume
              -> set rpm_status to RPM_RESUMING
            pm->runtime_resume
            blk_post_runtime_resume
            ---
            q->rpm_status = RPM_ACTIVE;
            __blk_run_queue(q);
            pm_runtime_mark_last_busy(q->dev);
            pm_request_autosuspend(q->dev);
            ---
    
    rpm_suspend // if RPM_AUTO
      -> pm_runtime_autosuspend_expiration
        -> last_busy = READ_ONCE(dev->power.last_busy);
    
        It checks whether the device has been idle for some time. If yes,
        the suspend process proceeds; otherwise the suspend_timer is set up.

        The check depends on dev->power.last_busy, which is updated in several
        places in the blk-legacy layer; the most important one is
        blk_pm_put_request.
    
    
    pm_suspend_timer_fn
    ---
        if (expires > 0 && !time_after(expires, jiffies)) {
            dev->power.timer_expires = 0;
            rpm_suspend(dev, dev->power.timer_autosuspends ?
                (RPM_ASYNC | RPM_AUTO) : RPM_ASYNC);
        }
    ---
    

    RPM Core

    pm_runtime_put
      -> __pm_runtime_idle //RPM_GET_PUT | RPM_ASYNC
        ---
        if (rpmflags & RPM_GET_PUT) {
            if (!atomic_dec_and_test(&dev->power.usage_count))
                return 0;
        }
    
        might_sleep_if(!(rpmflags & RPM_ASYNC) && !dev->power.irq_safe);
    
        spin_lock_irqsave(&dev->power.lock, flags); 
        //This spinlock will serialize all the things
        retval = rpm_idle(dev, rpmflags);
        spin_unlock_irqrestore(&dev->power.lock, flags);
        ---
    
    rpm_idle
    ---
        ...
        callback = RPM_GET_CALLBACK(dev, runtime_idle);
    
        if (callback)
            retval = __rpm_callback(callback, dev);
    
        // __rpm_callback will unlock the dev->power.lock before invokes the
        // driver's callback.
    
        ...
        return retval ? retval : rpm_suspend(dev, rpmflags | RPM_AUTO);
    ---
    scsi_runtime_idle always returns -EBUSY.
    Let's then look at rpm_suspend.
    ---
     repeat:
        retval = rpm_check_suspend_allowed(dev);
          -> if dev->power.runtime_status == RPM_SUSPENDED, return 1
        ...
        if (retval)
            goto out;
    
        ...
        /* Other scheduled or pending requests need to be canceled. */
        pm_runtime_cancel_pending(dev);
    
        if (dev->power.runtime_status == RPM_SUSPENDING) {
            DEFINE_WAIT(wait);
            ...
    
            /* Wait for the other suspend running in parallel with us. */
    
            for (;;) {
                prepare_to_wait(&dev->power.wait_queue, &wait,
                        TASK_UNINTERRUPTIBLE);
                if (dev->power.runtime_status != RPM_SUSPENDING)
                    break;
    
                spin_unlock_irq(&dev->power.lock);
    
                schedule();
    
                spin_lock_irq(&dev->power.lock);
            }
            finish_wait(&dev->power.wait_queue, &wait);
            goto repeat;
        }
    
        __update_runtime_status(dev, RPM_SUSPENDING);
    
        callback = RPM_GET_CALLBACK(dev, runtime_suspend);
    
        dev_pm_enable_wake_irq_check(dev, true);
        retval = rpm_callback(callback, dev);
        if (retval)
            goto fail;
    
     no_callback:
    
        __update_runtime_status(dev, RPM_SUSPENDED);
    
        pm_runtime_deactivate_timer(dev);
    
        if (dev->parent) {
            parent = dev->parent;
            atomic_add_unless(&parent->power.child_count, -1, 0);
        }
        wake_up_all(&dev->power.wait_queue);
    
    
    ---
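    
    For reference, a hedged sketch of the usual driver-side get/put pattern that
    drives the rpm_idle/rpm_suspend paths above; example_do_io() is a placeholder
    for the real work.
    ---
    #include <linux/pm_runtime.h>

    static int example_do_io(struct device *dev)
    {
        /* placeholder for the real work submitted to the device */
        return 0;
    }

    static int example_issue(struct device *dev)
    {
        int ret;

        /* Resume the device (or bump the usage count if already active). */
        ret = pm_runtime_get_sync(dev);
        if (ret < 0) {
            pm_runtime_put_noidle(dev);
            return ret;
        }

        ret = example_do_io(dev);

        /* Drop the reference; the autosuspend timer restarts from last_busy. */
        pm_runtime_mark_last_busy(dev);
        pm_runtime_put_autosuspend(dev);
        return ret;
    }
    ---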
    

    blk and hardware

    dma alignment

    Some storage controllers have DMA alignment requirements, which are often set through blk_queue_dma_alignment, for example 512 bytes.

    One of the usages of dma_alignment of request_queue.

    blk_rq_map_kern
    ---
        do_copy = !blk_rq_aligned(q, addr, len) || object_is_on_stack(kbuf);
    
        //unsigned int alignment = queue_dma_alignment(q) | q->dma_pad_mask;
        //return !(addr & alignment) && !(len & alignment);
    
        if (do_copy)
            bio = bio_copy_kern(q, kbuf, len, gfp_mask, reading);
    
        //New page will be allocated and copy data in it.
        //When bio is done, the data will be copied back to the original buffer.
        //Refer to bio_copy_kern_endio_read
    
        else
            bio = bio_map_kern(q, kbuf, len, gfp_mask);
    
        //Add the page associated with the buffer into bio.
    
    ---
    
    The callers of blk_rq_map_kern:
     - __scsi_execute
     - __nvme_submit_sync_cmd
    
    Another similar interface is blk_rq_map_user_iov.
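    
    For illustration, a standalone sketch (not kernel code) of the alignment test
    that blk_rq_aligned performs. The mask of 511 is the usual way to express a
    512-byte alignment requirement (blk_queue_dma_alignment takes the mask).
    ---
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool rq_aligned(uintptr_t addr, size_t len, unsigned int dma_alignment)
    {
        /* in the kernel, the mask is also OR-ed with q->dma_pad_mask */
        unsigned int alignment = dma_alignment;

        return !(addr & alignment) && !(len & alignment);
    }

    int main(void)
    {
        unsigned int align = 511;   /* blk_queue_dma_alignment(q, 511): 512B */

        printf("%d\n", rq_aligned(0x1000, 4096, align)); /* 1: mapped directly      */
        printf("%d\n", rq_aligned(0x1003, 4096, align)); /* 0: bounce, bio_copy_kern */
        printf("%d\n", rq_aligned(0x1000, 4100, align)); /* 0: length not aligned   */
        return 0;
    }
    ---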

    block size

    The blocksize of filesystem and block device.
    Block: The smallest unit writable by a disk or file system. Everything a file system does is
    composed of operations done on blocks. A file system block is always the same size as or larger
    (in integer multiples) than the disk block size.

    
    
    The bdev_logical_block_size is q->limits.logical_block_size. Let's look at how nvme sets it.
    __nvme_revalidate_disk
    ---
        ns->lba_shift = id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ds;
        ...
        nvme_update_disk_info
        ---
            unsigned short bs = 1 << ns->lba_shift;
    
            blk_mq_freeze_queue(disk->queue);
            blk_integrity_unregister(disk);
    
            blk_queue_logical_block_size(disk->queue, bs);
            blk_queue_physical_block_size(disk->queue, bs);
            blk_queue_io_min(disk->queue, bs);
        ---
    ---
    
    The most important point here is that the filesystem blocksize is chosen at mkfs time.
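    
    A userspace sketch of how to read the sizes discussed above: the device's
    logical/physical block size and the filesystem's block size. /dev/nvme0n1 and
    /mnt are assumptions.
    ---
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/statvfs.h>
    #include <unistd.h>

    int main(void)
    {
        int lbs = 0;
        unsigned int pbs = 0;
        struct statvfs st;

        int fd = open("/dev/nvme0n1", O_RDONLY);
        if (fd >= 0) {
            ioctl(fd, BLKSSZGET, &lbs);   /* q->limits.logical_block_size  */
            ioctl(fd, BLKPBSZGET, &pbs);  /* q->limits.physical_block_size */
            printf("logical %d, physical %u\n", lbs, pbs);
            close(fd);
        }

        /* The fs block size was chosen at mkfs time (e.g. mkfs.ext4 -b 4096). */
        if (statvfs("/mnt", &st) == 0)
            printf("fs block size %lu\n", st.f_bsize);

        return 0;
    }
    ---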

    virt boundary

    What is the gap ?
    It is indicated by queue_virt_boundary.

    The NVMe PRP descriptor requires PAGE_SIZE-aligned segments:
                                             
      page A+-----+                page A+-----+            
            |     | \ PAGE_SIZE          |     | \ PAGE_SIZE
            |     | /                    |     | /              
      page B+-----+                page B+-----+                
            |     | \ PAGE_SIZE          |_ _ _| > PAGE_SIZE/2
            |     | /                    | GAP |
      page C+-----+                page C+-----+                
            |     | \ PAGE_SIZE          |     | \ PAGE_SIZE
            |     | /                    |     | /              
            +-----+                      +-----+ 
    
    So if we want to handle IO that is not PAGE_SIZE aligned, we need to
    split it into 3 parts as follows:
    
    page A+-----+              page B+-----+                page C+-----+                
          |     | \ PAGE_SIZE        |_ _ _| > PAGE_SIZE/2        |     | \ PAGE_SIZE
          |     | /                                               |     | /              
          +-----+                                                 +-----+ 
    
    This is done by blk_queue_split.
    
    blk_queue_split
      -> blk_bio_segment_split
      ---
        bio_for_each_segment(bv, bio, iter) {
            /*
             * If the queue doesn't support SG gaps and adding this
             * offset would create a gap, disallow it.
             */
            if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
                goto split;
            ....
        }
    split:
        *segs = nsegs;
    
        if (do_split) {
            new = bio_split(bio, sectors, GFP_NOIO, bs);
            if (new)
                bio = new;
        }
      ---
    
    Other places that need to check for gaps:
    // the buffer may come from userspace and may not be aligned
    blk_rq_map_user_iov
    // don't merge bios or requests if it would create a gap
    bio_will_gap <- req_gap_back_merge <- ll_back_merge_fn
                                       <- ll_merge_requests_fn
    
    bvec_gap_to_prev <- bio_integrity_add_page
             <- bio_add_pc_page
                     <- integrity_req_gap_back_merge
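    
    For reference, a standalone sketch (not kernel code, paraphrasing the helper) of
    the virt-boundary gap test these callers rely on: two adjacent bvecs can only
    stay in one segment if the previous one ends on the boundary and the next one
    starts on it.
    ---
    #include <stdbool.h>
    #include <stdio.h>

    struct bvec { unsigned int bv_offset, bv_len; };

    static bool gap_to_prev(unsigned long boundary_mask,
                            const struct bvec *prev, unsigned int next_offset)
    {
        if (!boundary_mask)
            return false;   /* queue has no virt boundary restriction */

        return (next_offset & boundary_mask) ||
               ((prev->bv_offset + prev->bv_len) & boundary_mask);
    }

    int main(void)
    {
        unsigned long mask = 4096 - 1;                    /* PAGE_SIZE-aligned PRPs */
        struct bvec full = { .bv_offset = 0, .bv_len = 4096 };
        struct bvec half = { .bv_offset = 0, .bv_len = 2048 };

        printf("%d\n", gap_to_prev(mask, &full, 0));  /* 0: no gap, can merge     */
        printf("%d\n", gap_to_prev(mask, &half, 0));  /* 1: gap, must split here  */
        return 0;
    }
    ---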
    
    Before queue_virt_boundary was introduced, QUEUE_FLAG_SG_GAPS was used instead.
    That flag was checked in the following places:
    
    __bio_add_page
    ll_merge_requests_fn
    blk_rq_merge_ok
    

    DISCARD

    What is discard

    write amplification

     
        |----|                         Write granularity (e.g 32K)
        |----------------------------| Erase granularity (e.g 128K)
     
        These are contiguous user data blocks.
        If we want to write a 32K block within them, we have to
         - read in the 128K of data and update part of it
         - erase the 128K block
         - write the 128K back
    
    Wear leveling
        A write can only occur to pages that are erased, therefore host write commands
        invoke flash erase cycles prior to writing to the flash. This write/erase cycling
        causes cell wear, which imposes the limited write-life. Host write accesses can hit
        any location, which can cause hot-spots, and hot-spots cause premature wear in
        those locations. Wear-leveling is used to prevent the hot-spots.
    
        Mapping
    
        In most cases, the controller maintains a lookup table to translate the memory array physical
        block address (PBA) to the logical block address (LBA) used by the host system. The controller's
        wear-leveling algorithm determines which physical block to use each time data is programmed,
        eliminating the relevance of the physical location of data and enabling data to be stored
        anywhere within the memory array.
    
        Selecting
    
        The controller typically either writes to the available erased block with the lowest erase count
        (dynamic wear leveling); or it selects an available target block with the lowest overall erase
        count, erases the block if necessary
    
        Garbage collection
    
        Given that previously written-to blocks must be erased before they are able to receive data again,
        the SSD controller must, for performance, actively pre-erase blocks so new write commands can always
        get an empty block. 
    
    What is the discard command for ?
    If the user or operating system erases a file (not just removes parts of it), the file
    will typically be marked for deletion, but the actual contents on the disk are never
    actually erased. Because of this, the SSD does not know that it can erase the LBAs
    previously occupied by the file, so the SSD will keep including such LBAs in the
    garbage collection.
    
    Enables the operating system to tell an SSD which blocks of previously saved data are
    no longer needed as a result of file deletions or volume formatting. When an LBA is
    replaced by the OS, as with an overwrite of a file, the SSD knows that the original
    LBA can be marked as stale or invalid and it will not save those blocks during Garbage
    collection.
    
    
    A simple example of SSD writes
    (assume the application writes only in whole erase blocks):
    
    |----| erase block
      -    free
      o    used
      i    invalid 
    
       |ooooo|-----|-----|-----|-----|
       \__ __/                 \__ __/
          v                       v
    	File1                  Reserved
    
    
    When we write to File1,
    
    	    RMW
          .-----.
         /      v
       |iiiii|ooooo|-----|-----|-----|
             \__ __/           \__ __/
                v                 v
              File1            Reserved
    
    The original position of File1 will be reclaimed then.
    
    If we delete File1 in filesystem layer,
    
       |-----|ooooo|-----|-----|-----|
                               \__ __/
                                  v
                              Reserved
       
    The SSD controller doesn't know that File1 has been deleted,
    so it still thinks there is valid data in the block. If
    this happens multiple times, we end up with:
    
       |ooooo|ooooo|ooooo|ooooo|-----|
       \__ __/     \__ __/     \__ __/
          v           v           v
        File2       File3      Reserved
    
    And only two of them hold a valid file. (The filesystem knows
    which blocks are free.)
    
    When we write data to File2 and File3 in parallel, the
    SSD controller has to use the Reserved block. However, there
    is only one in our case, so while one write is ongoing the other
    has to wait.
    
    This is why SSDs become slower as they fill up.
    
    With DISCARD support in the filesystem, when a file is deleted the filesystem
    tells the SSD controller that the associated blocks are invalid
    and can be reclaimed. Then we would have:
    
       |ooooo|-----|ooooo|-----|-----|
       \__ __/     \__ __/     \__ __/
          v           v           v
        File2       File3      Reserved
    
    

    Another useful link about this: "Block layer discard requests".

    Linux calls this DISCARD.
    Different storage protocols use different names, e.g. TRIM (ATA), UNMAP (SBC), Deallocate (NVMe).
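    
    From userspace, a discard can be issued directly to a block device. A hedged
    sketch using the BLKDISCARD ioctl (what blkdiscard(8) does); the device path and
    the 1 MiB range at offset 0 are assumptions, and this destroys data in that range.
    ---
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        uint64_t range[2] = { 0, 1024 * 1024 };  /* { start, length } in bytes */

        int fd = open("/dev/nvme0n1", O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Turns into a REQ_OP_DISCARD bio in the block layer. */
        if (ioctl(fd, BLKDISCARD, &range) < 0)
            perror("BLKDISCARD");

        close(fd);
        return 0;
    }
    ---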

    discard in blk

    discard in fs

    There is a danger with discard in the fs:

    the filesystem may well discard a set of sectors, then write new data to them once they are allocated to
    a new file. It would be a serious mistake to reorder the new writes ahead of the discard operation,
    causing the newly-written data to be lost.
    
    Let's look at how individual filesystems handle this.

    The trouble with discard

    https://lwn.net/Articles/347511/
    
    At the ATA protocol level, a discard request is implemented by a "TRIM" command sent to the device.
    For reasons unknown to your editor, the protocol committee designed TRIM as a non-queued command.
    That means that, before sending a TRIM command to the device, the block layer must first wait for
    all outstanding I/O operations on that device to complete; no further operations can be started
    until the TRIM command completes. So every TRIM operation stalls the request queue. Even if TRIM 
    were completely free, its non-queued nature would impose a significant I/O performance cost. (It's
    worth noting that the SCSI equivalent to TRIM is a tagged command which doesn't suffer from this
    problem).
    
    With current SSDs, TRIM appears to be anything but free. Mark Lord has measured regular delays of
    hundreds of milliseconds. Delays on that scale would be most unwelcome on a rotating storage device.
    On an SSD, hundred-millisecond latencies are simply intolerable.
    
    In short, discard is not free.

    Someone complained that
    XFS has had async discard support, but it has been problematic for our
    fleet. We were seeing bursts of large discard requests caused by async
    discard in XFS. This resulted in degraded drive performance increasing
    latency for dependent services.
    
    And proposed an alternative: the filesystem layer could reuse blocks that have just been freed.
       |ooooo|-----|-----|-----|-----|
       \__ __/                 \__ __/
          v                       v
    	File1                  Reserved
    
    Deleted File1 and then create File2,
    
       |ooooo|-----|-----|-----|-----|
       \__ __/                 \__ __/
          v                       v
    	File2                  Reserved