Block Basis

concepts
blk-mq

Block legacy plug
BIO Merge
FLUSH and FUA
Queue state flags
WBT
blkdev gendisk hd
blk sysfs
request_queue cleanup and release
blk_integrity
blk loop blk-stats
blk-timeout
blk-throttle bsg
direct_IO
blk RPM
blk and hardware DISCARD

    concepts


    EIO is fatal for fs

    Whether EIO is fatal or not depends on the component that receives it,
    and each component behaves accordingly. If a file system encounters an EIO
    error during normal I/O (no metadata updates are involved), the error is
    bubbled back to user space. There, the userspace application can choose how
    to behave: it can resubmit the I/O if possible, or abort if the I/O is part
    of its own recovery.
    
    If the EIO error is returned during a journal update (a metadata update),
    the file system has 2 choices: 1) remount the FS read-only or 2) crash the
    node. If the FS is in single-user mode, it can go read-only; if it is in
    clustered mode, it has to evict itself, hoping that at least the other
    nodes can continue.
    So avoid IO errors as much as possible.
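
    As a hedged user-space illustration of that choice (the file name, open flags and
    policy below are made up for this sketch, not taken from any particular application):

    /* Userspace sketch: how an application might react to EIO from write()/fsync(). */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "payload";
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf) || fsync(fd) != 0) {
            if (errno == EIO) {
                /* The block layer gave up on the request.  The application can
                 * resubmit later, switch to a spare device, or abort if this
                 * write was part of its own recovery. */
                fprintf(stderr, "EIO: data may not be on stable storage\n");
            } else {
                perror("write/fsync");
            }
            close(fd);
            return 1;
        }

        close(fd);
        return 0;
    }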
    

    blk-mq


    sbitmap


    There are two parts in a sbitmap_queue: the bitmap itself (struct sbitmap) and the wait queues.

    wait queue

    The core idea of the sbitmap_queue wait queues is 'batch' and 'scatter'.
    scatter

    Each caller of sbq_wait_ptr owns its own wait_index.
    static inline struct sbq_wait_state *sbq_wait_ptr(struct sbitmap_queue *sbq,
                              atomic_t *wait_index)
    {
        struct sbq_wait_state *ws;
    
        ws = &sbq->ws[atomic_read(wait_index)];
    
        /* the wait_index is advanced every time */
        sbq_index_atomic_inc(wait_index);
    
        return ws;
    }
    Every time the caller gets a sbq_wait_state, its wait_index is incremented by 1.
    Take blk_mq_get_request as an example: when multiple tasks try to allocate a tag
    and all of them fail, each of them will get a wait queue and sleep on it.
    sbq_wait_ptr ensures they get different wait queues, so there is no contention
    when the wait entries are added to the wait queues.
    
    We can check this in /sys/kernel/debug/block/nvme0n1/hctx0/tags  (driver tags):
    wake_index=0
    ws={
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
    }
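
    A minimal user-space sketch of the 'scatter' idea (a standalone model, not the
    kernel structures; the 8 queues match SBQ_WAIT_QUEUES): because every owner of a
    wait_index advances it on each use, successive waiters land on different wait queues.

    /* Userspace model of sbq_wait_ptr()'s round-robin selection. */
    #include <stdatomic.h>
    #include <stdio.h>

    #define SBQ_WAIT_QUEUES 8

    struct sbq_wait_state { int id; };

    static struct sbq_wait_state ws[SBQ_WAIT_QUEUES];

    /* pick a wait queue and advance the owner's index, like sbq_wait_ptr() */
    static struct sbq_wait_state *wait_ptr(atomic_int *wait_index)
    {
        struct sbq_wait_state *w = &ws[atomic_load(wait_index) % SBQ_WAIT_QUEUES];

        atomic_fetch_add(wait_index, 1);
        return w;
    }

    int main(void)
    {
        /* each owner (a hctx in blk-mq) embeds its own wait_index */
        atomic_int owner_a = 0, owner_b = 3;
        int i;

        for (i = 0; i < SBQ_WAIT_QUEUES; i++)
            ws[i].id = i;

        for (i = 0; i < 4; i++)
            printf("owner A sleeps on ws[%d], owner B sleeps on ws[%d]\n",
                   wait_ptr(&owner_a)->id, wait_ptr(&owner_b)->id);
        return 0;
    }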
    
    
    batch
    static void sbq_wake_up(struct sbitmap_queue *sbq)
    {
        ...
        ws = sbq_wake_ptr(sbq);
        if (!ws)
            return;
    
        wait_cnt = atomic_dec_return(&ws->wait_cnt);
        if (wait_cnt <= 0) {
            wake_batch = READ_ONCE(sbq->wake_batch);
            smp_mb__before_atomic();
            atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wait_cnt + wake_batch);
            sbq_index_atomic_inc(&sbq->wake_index);
            wake_up_nr(&ws->wait, wake_batch);
        }
    }
    
    wake_index=0
    ws={
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},
        {.wait_cnt=1, .wait=inactive},
        {.wait_cnt=1, .wait=active},         only one wait queue is woken up each time a wait_cnt is exhausted
    }
    
    Does the wake_batch introduce delay on a high speed device ?
    
    There is an interesting bug about wake_batch.
    The wake_batch is calculated based on the sbitmap_queue depth, which is actually
    the tagset depth.
    But the runtime depth can be reduced by shallow_depth and the
    .limit_depth callback.
    
    BFQ can end up limiting shallow_depth to something lower than the wake batch
    sizing for sbitmap; then we can run into cases where we never wake up
    folks waiting for a tag. The end result is an idle system with no IO pending,
    but with tasks waiting for a tag and no one to wake them up, because wait_cnt
    never reaches zero (see the sketch below).
    
    Kyber could run into the same issue, if the async depth is limited low enough.
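
    A user-space sketch of that failure mode (a simplified model with made-up numbers,
    not the sbitmap code): with shallow_depth smaller than wake_batch, the completions
    of the few in-flight requests can never bring wait_cnt down to 0, so wake_up_nr()
    is never called and the sleepers are stranded.

    /* Userspace model: the batch only fires if wake_batch tags can be completed. */
    #include <stdio.h>

    int main(void)
    {
        int wake_batch = 8;            /* derived from the full sbitmap depth      */
        int shallow_depth = 4;         /* runtime limit imposed by BFQ/Kyber       */

        int wait_cnt = wake_batch;     /* ws->wait_cnt starts at wake_batch        */
        int in_flight = shallow_depth; /* allocation stopped at shallow_depth      */

        /* every completion decrements wait_cnt, exactly like sbq_wake_up() */
        while (in_flight > 0) {
            in_flight--;
            wait_cnt--;
        }

        if (wait_cnt > 0)
            printf("wait_cnt stuck at %d: nobody calls wake_up_nr(), waiters hang\n",
                   wait_cnt);
        else
            printf("batch reached, waiters are woken\n");
        return 0;
    }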
    

    tag


    There are two types of tags: scheduler tags and driver tags.

    In the commit message of the commit which added the MQ-capable IO scheduler framework (bd166ef), Jens Axboe said:
    We split driver and scheduler tags, so we can run the scheduling independently of device queue depth.
    
      sched tags  sched tags  sched tags  sched tags
      
      Queue0      Queue1      Queue2      Queue3
    
                   shared driver tags
    
               HBA cmd queue [C][C][C][C]
    
       LUN0        LUN1       LUN2        LUN3
    
    

    tag allocation

    blk_mq_get_tag is used to allocate a tag.
    The following points need to be noted:

    If the tags are used up, there are mainly two ways to wait for a tag.

    tag sharing

    One HBA can connect to multiple LUs; every LU has a request_queue, and all of these request_queues share the tagset of the HBA.
    From the view of the scsi source code:

    scsi_alloc_sdev
      -> scsi_mq_alloc_queue
    ---
        sdev->request_queue = blk_mq_init_queue(&sdev->host->tag_set);
    
        /* all of the scsi devs (LUs) share the same tagset of the host (HBA) */
    
        if (IS_ERR(sdev->request_queue))
            return NULL;
    
        sdev->request_queue->queuedata = sdev;
        __scsi_init_queue(sdev->host, sdev->request_queue);
        blk_queue_flag_set(QUEUE_FLAG_SCSI_PASSTHROUGH, sdev->request_queue);
        return sdev->request_queue;
    ---
    
    For shared tag users, we track the number of currently active users and attempt to provide a fair share of the tag depth for each of them.
    blk_mq_get_request/blk_mq_get_driver_tag
      -> blk_mq_get_tag
        -> __blk_mq_get_tag
          -> hctx_may_queue
    static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
                      struct sbitmap_queue *bt)
    {
        unsigned int depth, users;
    
        if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
            return true;
        if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
            return true;
    
        /*
         * Don't try dividing an ant
         */
        if (bt->sb.depth == 1)
            return true;
    
        users = atomic_read(&hctx->tags->active_queues);
        if (!users)
            return true;
    
        /*
         * Allow at least some tags
         */
    
        depth = max((bt->sb.depth + users - 1) / users, 4U);
    
        return atomic_read(&hctx->nr_active) < depth;
    }
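
    A quick user-space check of the fair-share formula above (pure arithmetic; the
    depth of 62 tags is just an example): each active queue gets roughly depth/users
    tags, but never fewer than 4.

    /* depth = max((bt->sb.depth + users - 1) / users, 4U) from hctx_may_queue() */
    #include <stdio.h>

    static unsigned int fair_share(unsigned int depth, unsigned int users)
    {
        unsigned int d = (depth + users - 1) / users;

        return d < 4 ? 4 : d;
    }

    int main(void)
    {
        unsigned int depth = 62;    /* e.g. a SCSI host with 62 tags */
        unsigned int users;

        for (users = 1; users <= 32; users *= 2)
            printf("%2u active queues -> %2u tags each\n",
                   users, fair_share(depth, users));
        return 0;
    }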
    
    There are two key points here. The first: where are the queues marked active ?
    blk_mq_rq_ctx_init
    ---
        if (data->flags & BLK_MQ_REQ_INTERNAL) {
            rq->tag = -1;
            rq->internal_tag = tag;
        } else {
    
            if (blk_mq_tag_busy(data->hctx)) {
                rq_flags = RQF_MQ_INFLIGHT;
                atomic_inc(&data->hctx->nr_active);
            }
    
            rq->tag = tag;
            rq->internal_tag = -1;
            data->hctx->tags->rqs[rq->tag] = rq;
        }
    ---
    blk_mq_get_driver_tag
    ---
        rq->tag = blk_mq_get_tag(&data);
        if (rq->tag >= 0) {
            if (blk_mq_tag_busy(data.hctx)) {
                rq->rq_flags |= RQF_MQ_INFLIGHT;
                atomic_inc(&data.hctx->nr_active);
            }
            data.hctx->tags->rqs[rq->tag] = rq;
        }
    ---
    blk_mq_tag_busy
      -> __blk_mq_tag_busy
      ---
        if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
            !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
            atomic_inc(&hctx->tags->active_queues);
      ---
    
    The second: when are they deactivated ? An interesting question:
    BLK-MQ
            q of LUN0  q of LUN1   q of LUN2   q of LUN3
                                             
            hctx       hctx        hctx        hctx
    
            active     active      active      inactive
    
                          driver tags
    ------------------------------------------------------
    LLDD    
                             HBA
    All the driver tags have been used up by the 3 active queues.
    At this moment we submit a bio to the inactive queue of LUN3; it cannot get a driver tag,
    so the req is queued on the hctx->dispatch list.
    When will this hctx of LUN3 be woken up ?
    
    blk_mq_mark_tag_wait will put this hctx of LUN3 on the shared tags' wait queue.
    When a driver tag is freed, the waiters on the tags' wait queues are woken up
    in round-robin fashion.
    The active_queues count of the shared tags has changed, so reqs to LUN0/1/2 have
    to wait for their budget even if the hctxs of LUN0/1/2 are woken up prior to LUN3's.
    
    

    blk-mq io scheduler


    Here is part of the discussion of IO scheduling for blk-mq from the paper [Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems]:

    While global sequential re-ordering is still possible across the multiple
    software queues, it is only necessary for HDD based devices, where the additional
    latency and locking overhead required to achieve total ordering does not hurt IOPS
    performance. It can be argued that, for many users, it is no longer necessary to
    employ advanced fairness scheduling as the speed of the devices are often
    exceeding the ability of even multiple applications to saturate their performance.
    If fairness is essential, it is possible to design a scheduler that exploits the 
    characteristics of SSDs at coarser granularity to achieve lower performance overhead.
    Whether the scheduler should reside in the block layer or on the SSD controller
    is an open issue. If the SSD is responsible for fair IO scheduling, it can leverage
    internal device parallelism, and lower latency, at the cost of additional interface
    complexity between disk and OS
    
    We can take a few points from the quote above.
    [blk-mq io scheduler framework]
            [scheduler init]
            elevator_switch_mq
                -> blk_mq_init_sched //freezed and quiesced
                  -> [.init_sched]
                  -> [.init_hctx]
    
            [bio submit]
            blk_mq_make_request
              -> blk_mq_sched_bio_merge
                -> __blk_mq_sched_bio_merge
                  -> [.bio_merge]
                    -> blk_mq_sched_try_merge //bfq and mq-deadline use it to merge a bio into an existing request
                      elv_merge // get the merge decision and req
                        -> [.request_merge]
                      if ELEVATOR_BACK_MERGE
                         blk_mq_sched_allow_merge
                           -> [.allow_merge]
                         bio_attempt_back_merge // merge the bio to the tail of req
                     attempt_back_merge // the new bio may have filled the hole between req and the latter req
                           -> elv_latter_request
                             -> [.next_request]
                           -> attempt_merge
                             -> [.requests_merged] // notify the io  scheduler that the two reqs have been merged
                     elv_merged_request // if attempt_back_merge does nothing
                           -> [.request_merged] // one bio is merged into this req
                      else if ELEVATOR_FRONT_MERGE
                         blk_mq_sched_allow_merge
                           -> [.allow_merge]
                         bio_attempt_front_merge // merge the bio to the head of req
                     attempt_front_merge // the new bio may have filled the hole between req and the former req
                           -> elv_former_request
                             -> [.former_request]
                           -> attempt_merge
                             -> [.requests_merged] // notify the io  scheduler that the two reqs have been merged
                     elv_merged_request // if attempt_front_merge does nothing
                           -> [.request_merged]
                -> if request merging happened, invoke blk_mq_free_request to free the merged request
                      blk_mq_free_request
                        -> [.finish_request]
    
            [request allocation]
              blk_mq_get_request
                -> [.limit_depth] //update the blk_mq_alloc_data->shallow_depth
                -> blk_mq_get_tag
                  -> shallow_depth? __sbitmap_queue_get_shallow : __sbitmap_queue_get
                -> blk_mq_rq_ctx_init
                -> blk_mq_sched_assign_ioc
                  -> ioc_create_icq
                    -> [.init_icq] // only bfq use it
                -> [.prepare_request]
    
            [request enqueue]
              blk_mq_sched_insert_request
                -> [.insert_requests]
                  -> blk_mq_sched_try_merge
                    -> elv_attempt_insert_merge
                    try blk_attempt_req_merge on q->last_merge or req from elv_rqhash tree
                      -> attempt_merge
                        -> [.requests_merged] // notify the io  scheduler that the two reqs have been merged
                  //if request merging happened, invoke blk_mq_free_request to free the merged request
                      -> blk_mq_free_request
                        -> [.finish_request]
    
            [dispatch request]
            blk_mq_sched_dispatch_requests
              -> blk_mq_do_dispatch_sched
                -> [.has_work] // blk_mq_sched_has_work
                -> [.dispatch_request]
            blk_mq_start_request
              -> blk_mq_sched_started_request
                -> [.started_request]
    
            [requeue request]
            blk_mq_requeue_request
              -> __blk_mq_requeue_request
                -> blk_mq_put_driver_tag // very important
              -> blk_mq_sched_requeue_request
                -> [.requeue_request]
            blk_mq_requeue_work
              -> blk_mq_sched_insert_request
    
    
        Note: in blk-mq, a requeued request will be inserted into the io scheduler
        again; this is very different from blk-legacy. For the io schedulers of
        blk-mq, .requeue_request is the same as .finish_request (bfq and kyber).
    
            [complete request]
            __blk_mq_complete_request
              -> blk_mq_sched_completed_request
                -> [.completed_request]
            blk_mq_free_request
              -> [.finish_request]
    
        We should note: an LLDD does not always complete a request with blk_mq_complete_request;
        it may also use blk_mq_end_request. In that case, .completed_request will not be invoked.
    
    

    hctx


    issue directly

    This is a special path for high speed devices.

    blk_mq_make_request
      -> blk_mq_try_issue_directly
        -> __blk_mq_try_issue_directly
    ---
        if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
            run_queue = false;
            bypass_insert = false;
            goto insert;
        }
    
        // No io scheduler
    
        if (q->elevator && !bypass_insert)
            goto insert;
    
        // No .get_budget
    
        if (!blk_mq_get_dispatch_budget(hctx))
            goto insert;
    
    // No io scheduler, so the driver tag has already been acquired
    
        if (!blk_mq_get_driver_tag(rq, NULL, false)) {
            blk_mq_put_dispatch_budget(hctx);
            goto insert;
        }
    
        return __blk_mq_issue_directly(hctx, rq, cookie);
    
        // invoke .queue_rq directly here
    
    insert:
        if (bypass_insert)
            return BLK_STS_RESOURCE;
    
        // if io scheduler is set, fallback to normal path
    
        blk_mq_sched_insert_request(rq, false, run_queue, false);
        return BLK_STS_OK;
    ---
    
    W/o an io scheduler attached, sync io can nearly bypass the whole blk-mq stack.
    
                submit_bio
    ----------------|---------------------
    BLK-MQ          v
                blk_mq_make_request
                    |
                ----^---- insert to ctx
                    |
                ----^---- run hctx
    ----------------|--------------------
    LLDD            v
                .queue_rq
    

    Where to run hctx

    Where do we run the hctx ? In other words, will a hctx be run on a cpu which is not mapped to it ?
    Let's see the two basic scenarios in which the hctx is run.

    Will the hctx be executed on different mapped cpus concurrently ?
      cpu0    cpu1    cpu2    cpu3  
       .      flush   i_d     run_work
       .       .       .       .
       v       .       .       v
               v hctx0 .
    -------------------.---------------
                       v
                 HBA
    
    i_d  issue directly
    
    
    The possible concurrent path:

    hctx restart

    There are some cases where the requests cannot be dispatched immediately.

    hctx restart is a supplement to the tag wakeup hook, because not all dispatch deferring is due to a lack of driver tags.

    Let's look into the hctx restart next.
    Mark restart
    Currently, blk_mq_sched_mark_restart_hctx is only invoked by blk_mq_sched_dispatch_requests when there are requests on the hctx->dispatch list. Requests can be inserted into the hctx->dispatch list in the following cases:
    static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
    {
        if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
            return;
    
        if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
            struct request_queue *q = hctx->queue;
    
            if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                atomic_inc(&q->shared_hctx_restart);
    
            //if not set, increase the q->shared_hctx_restart
            // shared_hctx_restart counts the number of hctxs that need to be restarted.
    
        } else
            set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
    }
    


    Restart
    For the non-shared tag case it is very simple: just invoke blk_mq_run_hw_queue(hctx, true).
    But the shared tag case is a bit more complicated.
    We do the hctx restart across all the hctxs that share the same tags in round-robin fashion.
    
    Why do we need this ?
    
    To share the resources of the LLDD fairly:
    if we always restart the hctx which the freed request points to,
    other hctxs that share the same tagset will be starved.
    
                        restart
                /'---------------------------------------,
    BLK-MQ     V                                          \
            q of LUN0  q of LUN1   q of LUN2   q of LUN3   |
                                                           |
            hctx       hctx        hctx        hctx        |
                                                ^          |
                          driver tags           | blk_mq_free_request
    ------------------------------------------------------
    LLDD    
                             HBA
    
    We needn't worry about fair sharing of the driver tags:
    the sbitmap wakeup hook and tag sharing (hctx_may_queue) already work well for that.
    
    Looping over every q and hctx sharing the same tagset causes a massive performance regression if you have a lot of
    shared devices. 8e8320c (blk-mq: fix performance regression with shared tags) fixes this.
    
    An atomic counter shared_hctx_restart is added to the request_queue to mark that there are hctxs needing restart in this
    request_queue. Then blk_mq_sched_restart_hctx doesn't need to loop every time.
    
    There is a question here.
    The round-robin hctx restart check only happens when:
     - there is a hctx marked as needing restart
     - a req is freed on the request_queue
    
    What if there is no other req in flight when the hctx restart is marked ?
    Who restarts the hctx ?  The others sharing the same tagset will not do it, because no restart is
    marked in their q->shared_hctx_restart.
    
    This is a general issue whether the tags are shared or not.
    If there is no in-flight request and .queue_rq needs to requeue the request, either:
     - it returns BLK_STS_RESOURCE, or
     - the LLDD reruns the hw queue itself
    
    In fact, it looks like we don't always need to restart the hctxs in round-robin fashion:
     - if we fail to get a driver tag, the tags wakeup hook can save us
     - if we have reqs on hctx->dispatch which were inserted directly, it doesn't matter to other hctxs
    
    
    There are also some special cases, look at the code segment in blk_mq_dispatch_rq_list:
    if (!list_empty(list)) {
            bool needs_restart;
    
        // we reach here, because the .queue_rq returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE
    
            spin_lock(&hctx->lock);
            list_splice_init(list, &hctx->dispatch);
            spin_unlock(&hctx->lock);
    
            needs_restart = blk_mq_sched_needs_restart(hctx);
            if (!needs_restart ||
                (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
                blk_mq_run_hw_queue(hctx, true);
            else if (needs_restart && (ret == BLK_STS_RESOURCE))
                blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
        }
    
    When there are requests left on the hctx->dispatch list, some cases need to be handled:

    requeue

    __blk_mq_requeue_request is used to prepare for a requeue.

    ---
    
        //w/ io scheduler attached, there will be no in-queue req that
        //holds driver tag.
    
        blk_mq_put_driver_tag(rq);
    
        trace_block_rq_requeue(q, rq);
        wbt_requeue(q->rq_wb, &rq->issue_stat);
    
        if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
    
        // switch to IDLE state
    
            blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
        ...
        }
    ---
    
    Where will the req be requeued ? Question: why is blk_mq_sched_requeue_request only invoked from blk_mq_requeue_request ?
    Look at bfq and kyber: the callbacks for .requeue_request and .finish_request are the same one.
    
    For blk_mq_dispatch_rq_list, the request is not queued back to the io scheduler; we can say the request
    is still being dispatched, so there is no need to invoke the .requeue_request callback.
    
    For __blk_mq_try_issue_directly, the direct issue path only works w/o an io scheduler attached.
    
    Only in the blk_mq_requeue_request case is the request dequeued from the io scheduler and requeued
    back to the io scheduler.
    
    In fact, there is a big difference between block legacy and blk-mq in requeue.
    blk_requeue_request
      -> elv_requeue_request
        -> __elv_add_request //ELEVATOR_INSERT_REQUEUE
          -> list_add(&rq->queuelist, &q->queue_head);
    The request is requeued to q->queue_head, which is similar to hctx->dispatch.
    
    

    Block legacy


    Tag

    There is also a tag mechanism in block legacy. To quote a comment from blk-mq about tagging:

    Device command tagging was first introduced with hardware supporting native command queuing. A tag is an integer value that uniquely identifies the position of the block IO in the driver submission queue, so when completed the tag is passed back from the device indicating which IO has been completed. This eliminates the need to perform a linear search of the in-flight window to determine which IO has completed.
    
    We won't look into how it is implemented, just how it is employed in block legacy, with some comparison to tagging in blk-mq.
    How is it used at the driver level ?
    static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
            int tag)
    {
        struct request *req = NULL;
    
        if (tag == SCSI_NO_TAG)
            return NULL;
    
        if (shost_use_blk_mq(shost)) {
            u16 hwq = blk_mq_unique_tag_to_hwq(tag);
    
            if (hwq < shost->tag_set.nr_hw_queues) {
                req = blk_mq_tag_to_rq(shost->tag_set.tags[hwq],
                    blk_mq_unique_tag_to_tag(tag));
            }
        } else {
            req = blk_map_queue_find_tag(shost->bqt, tag);
        }
    
        if (!req)
            return NULL;
        return blk_mq_rq_to_pdu(req);
    }
    
    A reverse mapping: tag -> req -> driver pdu.
    How is a tag assigned to a req ?
    scsi_request_fn()
    >>>>
            /*
             * Remove the request from the request list.
             */
            if (!(blk_queue_tagged(q) && !blk_queue_start_tag(q, req)))
                blk_start_request(req);
            /*
         blk_queue_tagged() checks QUEUE_FLAG_QUEUED in q->queue_flags, which means the hardware supports native command queuing.
         blk_queue_start_tag() tries to assign a tag to this rq; if the tags have been used up, it returns 1.
             otherwise,
             bqt->next_tag = (tag + 1) % bqt->max_depth;
             rq->rq_flags |= RQF_QUEUED; //indicates tag has been assigned
             rq->tag = tag;
             bqt->tag_index[tag] = rq;
             blk_start_request(rq);
             list_add(&rq->queuelist, &q->tag_busy_list);
             */
    >>>>
            /*
             * We hit this when the driver is using a host wide
             * tag map. For device level tag maps the queue_depth check
             * in the device ready fn would prevent us from trying
             * to allocate a tag. Since the map is a shared host resource
             * we add the dev to the starved list so it eventually gets
             * a run when a tag is freed.
             */
            if (blk_queue_tagged(q) && !(req->rq_flags & RQF_QUEUED)) {
                spin_lock_irq(shost->host_lock);
                if (list_empty(&sdev->starved_entry))
                    list_add_tail(&sdev->starved_entry,
                              &shost->starved_list);
                spin_unlock_irq(shost->host_lock);
                goto not_ready;
            }
    >>>>
     not_ready:
        /*
         * The tag here looks like the driver tag in blk-mq.
         * In block legacy, the req is requeued and inserted to the head of q->queue_head directly.
         * In blk-mq, the action is similar, refer to blk_mq_dispatch_rq_list (but __blk_mq_try_issue_directly does not seem to follow this).
         */
        spin_lock_irq(q->queue_lock);
        blk_requeue_request(q, req);
        atomic_dec(&sdev->device_busy);
    >>>>
    

    plug

    There are mainly two aspects to the blk plug's benefit.
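
    A sketch of the caller-side pattern, for reference (bio1/bio2 stand in for bios the
    caller has already built; error handling is omitted):

    struct blk_plug plug;

    blk_start_plug(&plug);
    /*
     * Requests built from bios submitted by this task are now collected on the
     * per-task plug list (current->plug) instead of being sent down one by one.
     */
    submit_bio(bio1);
    submit_bio(bio2);
    blk_finish_plug(&plug);    /* flush the plugged requests to the lower layer */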

    Where is the plug list flushed from schedule ?
    
    schedule
      -> sched_submit_work
        -> blk_schedule_flush_plug
    
    io_schedule_timeout/io_schedule
      -> io_schedule_prepare
        -> blk_schedule_flush_plug
    
    
    However, the preempt schedule path doesn't flush the plug list:
    
    asmlinkage __visible void __sched preempt_schedule_irq(void)
    {
        enum ctx_state prev_state;
    
        /* Catch callers which need to be fixed */
        BUG_ON(preempt_count() || !irqs_disabled());
    
        prev_state = exception_enter();
    
        do {
            preempt_disable();
            local_irq_enable();
            __schedule(true);
            local_irq_disable();
            sched_preempt_enable_no_resched();
        } while (need_resched());
    
        exception_exit(prev_state);
    }
    
    
    

    BIO


    Let's look into the _basic unit_ of the block layer, the bio.
    We could say there is a bio layer between the fs and the block layer.

                             
                    FS LAYER
         ------------------------------------------------
                              | submit_bio 
                              |
                              V generic_make_request <-------+
         ------------------------------------------------    |
                                 blk-throttl                 |
                    BIO LAYER    bio remap +--> partition    |
                                           |                 |
                                           +--> bio based device mapper (stackable)
        -------------------------------------------------    |
                              |                              |
                              V  blk_queue_bio/blk_mq_make_request
    
                    BLOCK LEGACY/BLK-MQ
    
    The basic architecture of a bio.
    request->bio __                    
                   \                  
                    \     bio        
                     \   ________    
                      ->| bi_next        next bio in one request, the blocks in these bios should be contiguous on disk
                        |
                        | bi_disk        gendisk->request_queue 
                        |
                        | bi_partno      partition NO.
                        |
                        | bi_opf         bio_op, req_flag_bits, same with req->cmd_flags
                        |
                        | bi_phys_segments  Number of segments in this BIO after physical address coalescing is performed.
                        |
                        | bi_end_io   blk_update_request->req_bio_endio->bio_endio
                        |
                        | bi_vcnt        how many bio_vec's
                        | bi_max_vecs    max bio_vecs can hold
                        | bi_io_vec      pointer to bio_io_vec list    
                        |         \      ________    
                        |          --->  | bv_page       
                        |                | bv_len        
                        |                | bv_offset     
                        |                 ________       
                        |                | bv_page       
                        |                | bv_len        
                    |                | bv_offset    These two pages may not be physically contiguous,
                    |                               but the corresponding blocks on the storage disk should be contiguous.
                        | bi_pool        as its name
                        | 
                        | bi_iter        the current iterating status in bio_vec list
                                          ___________
                                         | bi_sector    device address in 512 byte sectors
                                         | bi_size      residual I/O count
                                         | bi_idx       current index into bvl_vec
                                         | bi_done      number of bytes completed
                                         | bi_bvec_done number of bytes completed in current bvec
    
    
    (Some members associated with cgroup,blk-throttle,merge-assistant are ignored here.)
    

    Setup and complete a bio

    Let's take submit_bh_wbc() as an example to show how to set up a bio:

    static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
                 enum rw_hint write_hint, struct writeback_control *wbc)
    {
        struct bio *bio;
        >>>>
        bio = bio_alloc(GFP_NOIO, 1); // the second parameter is the count of bvec
    
        if (wbc) {
            wbc_init_bio(wbc, bio);
            wbc_account_io(wbc, bh->b_page, bh->b_size);
        }
    
        bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
        bio_set_dev(bio, bh->b_bdev);
        //(bio)->bi_disk = (bdev)->bd_disk;
        //(bio)->bi_partno = (bdev)->bd_partno;
        bio->bi_write_hint = write_hint;
    
        bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
        >>>>//FSes with a blocksize smaller than the pagesize can reach here.
            if (bio->bi_vcnt > 0) {
                bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
    
                if (page == bv->bv_page &&
                    offset == bv->bv_offset + bv->bv_len) {
                    bv->bv_len += len;
                    goto done;
                } 
            } //merged with previous one 
    
            if (bio->bi_vcnt >= bio->bi_max_vecs)
                return 0;
    
            bv        = &bio->bi_io_vec[bio->bi_vcnt];
            bv->bv_page    = page;
            bv->bv_len    = len;
            bv->bv_offset    = offset;
    
            bio->bi_vcnt++;
        done:
            bio->bi_iter.bi_size += len;
        >>>>
        BUG_ON(bio->bi_iter.bi_size != bh->b_size);
    
        bio->bi_end_io = end_bio_bh_io_sync;
        bio->bi_private = bh; //reverse mapping to the bh
    
        /* Take care of bh's that straddle the end of the device */
        guard_bio_eod(op, bio);
    
        if (buffer_meta(bh))
            op_flags |= REQ_META;
        if (buffer_prio(bh))
            op_flags |= REQ_PRIO;
        bio_set_op_attrs(bio, op, op_flags);
        
        submit_bio(bio);
        return 0;
    }
    
    Most of the information used to construct a bio comes from the bh. If we want to dig deeper, we have to look into how to set up a bh.
    static int
    grow_dev_page(struct block_device *bdev, sector_t block,
              pgoff_t index, int size, int sizebits, gfp_t gfp)
    {
        >>>>
        page = find_or_create_page(inode->i_mapping, index, gfp_mask);
            -> pagecache_get_page()
                -> __page_cache_alloc() //no_page case
                    -> __alloc_pages_node(n, gfp, 0);
        /*
         The pages of the page cache are allocated one by one. That makes them more flexible to
         map and unmap, page in and swap out. Also, in the past memory was limited; there were not
         enough contiguous pages to take advantage of.
         */
        BUG_ON(!PageLocked(page));
        >>>>
        /*
         * Allocate some buffers for this page
         */
        bh = alloc_page_buffers(page, size, true);
    
        /*
         * Link the page to the buffers and initialise them.  Take the
         * lock to be atomic wrt __find_get_block(), which does not
         * run under the page lock.
         */
        spin_lock(&inode->i_mapping->private_lock);
        link_dev_buffers(page, bh);
        end_block = init_page_buffers(page, bdev, (sector_t)index << sizebits,
                size);
        >>>>
        do {
            if (!buffer_mapped(bh)) {
                init_buffer(bh, NULL, NULL);
                bh->b_bdev = bdev;
                bh->b_blocknr = block;
                if (uptodate)
                    set_buffer_uptodate(bh);
                if (block < end_block)
                    set_buffer_mapped(bh);
            }
            block++;
            bh = bh->b_this_page;
        } while (bh != head);
        >>>>
        spin_unlock(&inode->i_mapping->private_lock);
    done:
        ret = (block < end_block) ? 1 : -ENXIO;
    failed:
        unlock_page(page);
        put_page(page);
        return ret;
    }
    
    One page from the pagecache can be broken up into several bh's based on the blocksize of the associated filesystem (sb->s_blocksize). One bh corresponds to one block on disk. Each bh is then used to construct a bio which is submitted to the block layer. At this point, the bio only contains one bio_vec pointing to the page of the bh. This is the classical path to set up a bio. Nowadays, some filesystems prefer to create bios themselves; during that procedure, a bio containing multiple bio_vecs may be created. For example:
    static int io_submit_add_bh(struct ext4_io_submit *io,
                    struct inode *inode,
                    struct page *page,
                    struct buffer_head *bh)
    {
        int ret;
    
        if (io->io_bio && bh->b_blocknr != io->io_next_block) {
    submit_and_retry:
            ext4_io_submit(io);
        }
        if (io->io_bio == NULL) {
            ret = io_submit_init_bio(io, bh);
            if (ret)
                return ret;
            io->io_bio->bi_write_hint = inode->i_write_hint;
        }
        ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
        if (ret != bh->b_size)
            goto submit_and_retry;
        wbc_account_io(io->io_wbc, page, bh->b_size);
        io->io_next_block++;
        return 0;
    }
    
    We can see that one bio_vec can correspond to part of a page or the whole page.

    Bio operations

    bio advance

    static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
                        unsigned bytes)
    {
        iter->bi_sector += bytes >> 9;
        /* This is why bi_sector is located in bio->bi_iter: it can be
         * advanced. */
        if (bio_no_advance_iter(bio)) {
            /* REQ_OP_DISCARD/SECURE_ERASE/WRITE_SAME/WRITE_ZEROES */
            iter->bi_size -= bytes;
            iter->bi_done += bytes;
        } else {
            bvec_iter_advance(bio->bi_io_vec, iter, bytes);
            /* TODO: It is reasonable to complete bio with error here. */
        }
    }
    
    static inline bool bvec_iter_advance(const struct bio_vec *bv,
            struct bvec_iter *iter, unsigned bytes)
    {
        >>>>
        while (bytes) {
            unsigned iter_len = bvec_iter_len(bv, *iter);
            unsigned len = min(bytes, iter_len);
    
            bytes -= len;
            iter->bi_size -= len; // remaining length
            iter->bi_bvec_done += len; //completed length of current bvec
            iter->bi_done += len; //completed length of this bio
    
            if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
                iter->bi_bvec_done = 0;
                iter->bi_idx++; //push forward the bvec table here
            }
        }
        return true;
    }
    
    After invoking this function, we can tell that a bio has been finished via (bio->bi_iter.bi_size == 0). For example, in blk_update_request():
    blk_mq_end_request()
        -> blk_update_request()
            -> req_bio_endio()
    >>>>
        bio_advance(bio, nbytes);
    
        /* don't actually finish bio if it's part of flush sequence */
    // when RQF_FLUSH_SEQ is set, the req->end_io will be invoked instead of
    // bio_endio.
        if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
            bio_endio(bio);
    >>>>
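
    A small user-space walk-through of the iterator bookkeeping (a standalone model of
    the fields, not the kernel structs): advancing in 512-byte steps across two
    1024-byte bvecs shows how bi_bvec_done wraps and bi_idx moves to the next bvec.

    /* Userspace model of bvec_iter_advance() bookkeeping. */
    #include <stdio.h>

    struct bvec  { unsigned bv_len; };
    struct biter { unsigned bi_size, bi_idx, bi_done, bi_bvec_done; };

    static void iter_advance(const struct bvec *bv, struct biter *it, unsigned bytes)
    {
        while (bytes) {
            unsigned left = bv[it->bi_idx].bv_len - it->bi_bvec_done;
            unsigned len = bytes < left ? bytes : left;

            bytes            -= len;
            it->bi_size      -= len;   /* remaining bytes of the whole bio */
            it->bi_done      += len;   /* completed bytes of the whole bio */
            it->bi_bvec_done += len;   /* completed bytes of current bvec  */

            if (it->bi_bvec_done == bv[it->bi_idx].bv_len) {
                it->bi_bvec_done = 0;
                it->bi_idx++;          /* step to the next bvec            */
            }
        }
    }

    int main(void)
    {
        struct bvec  vec[2] = { { 1024 }, { 1024 } };
        struct biter it = { .bi_size = 2048 };

        while (it.bi_size) {
            iter_advance(vec, &it, 512);
            printf("bi_size=%4u bi_idx=%u bi_bvec_done=%4u\n",
                   it.bi_size, it.bi_idx, it.bi_bvec_done);
        }
        return 0;
    }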
    
    bio clone
    In the device mapper stack, the bio will be cloned. Let's look at how that is done. clone_bio() clones a new bio containing the sector ~ (sector+len) range of the original one.
    static int clone_bio(struct dm_target_io *tio, struct bio *bio,
                 sector_t sector, unsigned len)
    {
        struct bio *clone = &tio->clone;
    
        __bio_clone_fast(clone, bio);
        >>>>
            bio->bi_disk = bio_src->bi_disk;
            bio->bi_partno = bio_src->bi_partno;
            bio_set_flag(bio, BIO_CLONED); // a cloned bio
            bio->bi_opf = bio_src->bi_opf;
            bio->bi_write_hint = bio_src->bi_write_hint;
            bio->bi_iter = bio_src->bi_iter;
            bio->bi_io_vec = bio_src->bi_io_vec;
            //The cloned bio shares the same bvec table as the original one.
            bio_clone_blkcg_association(bio, bio_src);
        >>>>
        if (bio_op(bio) != REQ_OP_ZONE_REPORT)
            bio_advance(clone, to_bytes(sector - clone->bi_iter.bi_sector));
        clone->bi_iter.bi_size = to_bytes(len);
        //cut out the sector ~ (sector+len) part of original one here
        if (unlikely(bio_integrity(bio) != NULL))
            bio_integrity_trim(clone);
    
        return 0;
    }
    

    Bio split

    A bio may be split in blk_mq_make_request. Why ?
    The associated commit is:
    54efd50b ( block: make generic_make_request handle arbitrarily sized bios)
    
    ---
        The way the block layer is currently written, it goes to great lengths
        to avoid having to split bios; upper layer code (such as bio_add_page())
        checks what the underlying device can handle and tries to always create
        bios that don't need to be split.
        
        But this approach becomes unwieldy and eventually breaks down with
        stacked devices and devices with dynamic limits, and it adds a lot of
        complexity.
    ---
    
    Then the FS layer can submit arbitrarily sized bios.
    
    How is it done ?
    
    blk_queue_split
      -> blk_bio_segment_split
        -> bio_split
    ---
        split = bio_clone_fast(bio, gfp, bs);
          -> __bio_clone_fast
          ---
            bio->bi_disk = bio_src->bi_disk;
            bio->bi_partno = bio_src->bi_partno;
            bio_set_flag(bio, BIO_CLONED);
            if (bio_flagged(bio_src, BIO_THROTTLED))
                bio_set_flag(bio, BIO_THROTTLED);
            bio->bi_opf = bio_src->bi_opf;
            bio->bi_write_hint = bio_src->bi_write_hint;
            bio->bi_iter = bio_src->bi_iter;
    
            bio->bi_io_vec = bio_src->bi_io_vec;
    
            ...
          ---
        split->bi_iter.bi_size = sectors << 9;
    
        if (bio_integrity(split))
            bio_integrity_trim(split);
    
        bio_advance(bio, split->bi_iter.bi_size);
    ---
                  |  sectors  |
       bi_io_vec  [  bv  ] [  bv  ] [  bv  ] [  bv  ]
                  \____  _____/\________  __________/
                        V                V
              split->bi_iter         bio->bi_iter
    
    blk_queue_split
    ---
        if (split) {
            /* there is no chance to merge the split bio */
            split->bi_opf |= REQ_NOMERGE;
    
            /*
             * Since we're recursing into make_request here, ensure
             * that we mark this bio as already having entered the queue.
             * If not, and the queue is going away, we can get stuck
             * forever on waiting for the queue reference to drop. But
             * that will never happen, as we're already holding a
             * reference to it.
             */
            bio_set_flag(*bio, BIO_QUEUE_ENTERED);
    
            bio_chain(split, *bio);
            trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
    
                    a big bio
            |  max  |
            |__________________________|
            \___ ___/\________ ________/
                v             v
              submit      go back to
                         generic_make_request
    
    
            generic_make_request(*bio);
            *bio = split;
        }
    ---
    

    stacked bio layer

    bios from stacked devices

    How does the generic_make_request handle bios from stacked devices ?

    Two important code fragments:
    
    #1
    ---
        if (current->bio_list) {
            bio_list_add(&current->bio_list[0], bio);
            goto out;
        }
    
    ---
    
    #2
    ---
        do {
            bool enter_succeeded = true;
    
            if (unlikely(q != bio->bi_disk->queue)) {
                if (q)
                    blk_queue_exit(q);
                q = bio->bi_disk->queue;
                flags = 0;
                if (bio->bi_opf & REQ_NOWAIT)
                    flags = BLK_MQ_REQ_NOWAIT;
                if (blk_queue_enter(q, flags) < 0) {
                    enter_succeeded = false;
                    q = NULL;
                }
            }
    
            if (enter_succeeded) {
                struct bio_list lower, same;
    
                /* Create a fresh bio_list for all subordinate requests */
                bio_list_on_stack[1] = bio_list_on_stack[0];
                bio_list_init(&bio_list_on_stack[0]);
                ret = q->make_request_fn(q, bio);
    
                /* sort new bios into those for a lower level
                 * and those for the same level
                 */
                bio_list_init(&lower);
                bio_list_init(&same);
                while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
                    if (q == bio->bi_disk->queue)
                        bio_list_add(&same, bio);
                    else
                        bio_list_add(&lower, bio);
                /* now assemble so we handle the lowest level first */
                bio_list_merge(&bio_list_on_stack[0], &lower);
                bio_list_merge(&bio_list_on_stack[0], &same);
                bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
            } else {
                if (unlikely(!blk_queue_dying(q) &&
                        (bio->bi_opf & REQ_NOWAIT)))
                    bio_wouldblock_error(bio);
                else
                    bio_io_error(bio);
            }
            bio = bio_list_pop(&bio_list_on_stack[0]);
        } while (bio);
    ---
    
    Let's take a stripe device as an example:
    
    
           stripe_dev
    
           bio 0 ~ 31
      |--------------------|
      +--+  +--+  +--+  +--+
      |  |  |  |  |  |  |  | } 4K (8 sectors)
      +--+  +--+  +--+  +--+
      |  |  |  |  |  |  |  |
      +--+  +--+  +--+  +--+
      |  |  |  |  |  |  |  |
      +--+  +--+  +--+  +--+
    
      dev0  dev1  dev2  dev3
    
    Round #1
    
    bio[0, 31].stripe_dev
    q->make_request_fn
    then,
    bio_list_on_stack[0] -> bio[0, 7].dev0 -> bio[8, 31].stripe_dev
    then,
    lower -> bio[0, 7].dev0
    same -> bio[8, 31].stripe_dev
    then
    bio_list_on_stack[0] ->  bio[0, 7].dev0 ->  bio[8, 31].stripe_dev
    
    Round #2
    
    bio[0, 7].dev0 is picked up to handle
    bio_list_on_stack[1] -> bio[8, 31].stripe_dev
    q->make_request_fn
    bio_list_on_stack[0] is NULL
    then
    bio_list_on_stack[1] is merged into bio_list_on_stack[0]
    bio_list_on_stack[0] -> bio[8, 31].stripe_dev
    
    Round #3
    
    bio[8, 31].stripe_dev is picked up to handle
    q->make_request_fn
    then
    bio_list_on_stack[0] -> bio[8, 15].dev1 -> bio[16, 31].stripe_dev
    then
    lower ->  bio[8, 15].dev1
    same -> bio[16, 31].stripe_dev
    then
    bio_list_on_stack[0] -> bio[8, 15].dev1 -> bio[16, 31].stripe_dev
    
    Round #4
    
    bio[8, 15].dev1 is picked up to handle
    bio_list_on_stack[1] ->bio[16, 31].stripe_dev
    ....
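
    A user-space model of the generic_make_request() loop above applied to this stripe
    example (the 8-sector chunking, the stripe-to-member mapping and the list helpers
    are assumptions of the sketch): running it prints the member-device bios in the
    same order as Rounds #1..#4.

    #include <stdio.h>
    #include <stdlib.h>

    struct bio { int dev; int start, len; struct bio *next; };   /* dev -1 == stripe */
    struct bio_list { struct bio *head, *tail; };

    static void bl_init(struct bio_list *l) { l->head = l->tail = NULL; }

    static void bl_add(struct bio_list *l, struct bio *b)
    {
        b->next = NULL;
        if (l->tail) l->tail->next = b; else l->head = b;
        l->tail = b;
    }

    static struct bio *bl_pop(struct bio_list *l)
    {
        struct bio *b = l->head;
        if (b) { l->head = b->next; if (!l->head) l->tail = NULL; b->next = NULL; }
        return b;
    }

    static void bl_merge(struct bio_list *dst, struct bio_list *src)
    {
        struct bio *b;
        while ((b = bl_pop(src))) bl_add(dst, b);
    }

    static struct bio *mkbio(int dev, int start, int len)
    {
        struct bio *b = calloc(1, sizeof(*b));
        b->dev = dev; b->start = start; b->len = len;
        return b;
    }

    /* ->make_request_fn: the stripe dev remaps one 8-sector chunk to a member dev
     * and resubmits the rest to itself; both children go on bio_list_on_stack[0].
     * A member dev just "issues" the bio. */
    static void make_request(struct bio *b, struct bio_list *onstack0)
    {
        if (b->dev == -1) {
            int chunk = b->len < 8 ? b->len : 8;
            bl_add(onstack0, mkbio((b->start / 8) % 4, b->start, chunk));
            if (b->len > chunk)
                bl_add(onstack0, mkbio(-1, b->start + chunk, b->len - chunk));
        } else {
            printf("issue dev%d bio[%d, %d]\n", b->dev, b->start, b->start + b->len - 1);
        }
        free(b);
    }

    int main(void)
    {
        struct bio_list onstack[2], lower, same;
        struct bio *bio = mkbio(-1, 0, 32);          /* bio[0, 31].stripe_dev */

        bl_init(&onstack[0]); bl_init(&onstack[1]);
        do {
            /* create a fresh bio_list for all subordinate requests */
            onstack[1] = onstack[0];
            bl_init(&onstack[0]);
            make_request(bio, &onstack[0]);

            /* sort new bios: lower level first, then same level, then older ones */
            struct bio *b;
            bl_init(&lower); bl_init(&same);
            while ((b = bl_pop(&onstack[0])))
                bl_add(b->dev == -1 ? &same : &lower, b);
            bl_merge(&onstack[0], &lower);
            bl_merge(&onstack[0], &same);
            bl_merge(&onstack[0], &onstack[1]);
        } while ((bio = bl_pop(&onstack[0])));
        return 0;
    }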
    
    
    
    

    Merge

    The main merging points:

    blk_mq_sched_try_merge
    This is used to merge a bio with a req.
    It is usually used in the bio submitting path.
    elv_merge chooses a rq which could merge with the new bio
    and returns how to merge.
    (bio) / (req) indicates the new one.
    
    if ELEVATOR_BACK_MERGE
        req -> bio -> (bio)
        then try to merge this req with latter one.
        (req) -?-> req
    
    if ELEVATOR_FRONT_MERGE
        req -> (bio) -> bio
        then try to merge this req with former one.
        req -?-> (req)
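    
    A tiny user-space illustration of the merge decision itself (sector arithmetic only;
    the real elv_merge also consults q->last_merge and the elv rqhash): a bio that starts
    right after the request's last sector is a back merge candidate, and a bio that ends
    right at the request's first sector is a front merge candidate.

    /* Userspace sketch: classify how a new bio could merge into an existing request. */
    #include <stdio.h>

    enum { NO_MERGE, BACK_MERGE, FRONT_MERGE };

    struct rq  { unsigned long pos;  unsigned sectors; };   /* like blk_rq_pos/sectors */
    struct bio { unsigned long sect; unsigned sectors; };

    static int classify(const struct rq *rq, const struct bio *bio)
    {
        if (rq->pos + rq->sectors == bio->sect)
            return BACK_MERGE;                  /* req -> bio -> (bio) */
        if (bio->sect + bio->sectors == rq->pos)
            return FRONT_MERGE;                 /* req -> (bio) -> bio */
        return NO_MERGE;
    }

    int main(void)
    {
        struct rq req = { .pos = 1000, .sectors = 8 };
        struct bio back  = { .sect = 1008, .sectors = 8 };
        struct bio front = { .sect =  992, .sectors = 8 };
        struct bio far   = { .sect = 2048, .sectors = 8 };

        const char *name[] = { "no merge", "back merge", "front merge" };
        printf("bio@1008: %s\n", name[classify(&req, &back)]);
        printf("bio@992 : %s\n", name[classify(&req, &front)]);
        printf("bio@2048: %s\n", name[classify(&req, &far)]);
        return 0;
    }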
    
    elv_attempt_insert_merge
    This is used to merge a req with another req.
    It is usually used in the req inserting path.
    
    Both elv_merge and elv_attempt_insert_merge employ q->last_merge
    and the request_queue's elv rqhash to find contiguous reqs.
    
    
    Note: a req is just a package; the real payload is the bios inside it.

    attempt_merge is used to merge two reqs (req, next).
    The main work it does:
    if the two requests can be merged with each other:
        req->biotail->bi_next = next->bio;
        req->biotail = next->biotail;
    
        req->__data_len += blk_rq_bytes(next);
    
        elv_merge_requests(q, req, next);
    
        /*
         * 'next' is going away, so update stats accordingly
         */
        blk_account_io_merge(next);
    
        req->ioprio = ioprio_best(req->ioprio, next->ioprio);
        if (blk_rq_cpu_valid(next))
            req->cpu = next->cpu;
    
        /*
         * ownership of bio passed from next to req, return 'next' for
         * the caller to free
         */
        next->bio = NULL;
    
    Then the 'next' request will be freed through __blk_put_request().

    FLUSH and FUA


    First, we need to know the volatile write cache.
    Quote from Documentation/block/writeback_cache_control.txt

    Many storage devices, especially in the consumer market, come with volatile
    write back caches.  That means the devices signal I/O completion to the
    operating system before data actually has hit the non-volatile storage.  This
    behavior obviously speeds up various workloads, but it means the operating
    system needs to force data out to the non-volatile storage when it performs
    a data integrity operation like fsync, sync or an unmount.
    
    There are two flags set in a bio or req to indicate which operation on the volatile write cache will be carried out. The block device driver needs to notify the queue whether it supports REQ_FLUSH and REQ_FUA through blk_queue_write_cache(), and the flags will be set in queue->queue_flags.
    void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
    {
        spin_lock_irq(q->queue_lock);
        if (wc)
            queue_flag_set(QUEUE_FLAG_WC, q);
        else
            queue_flag_clear(QUEUE_FLAG_WC, q);
        if (fua)
            queue_flag_set(QUEUE_FLAG_FUA, q);
        else
            queue_flag_clear(QUEUE_FLAG_FUA, q);
        spin_unlock_irq(q->queue_lock);
    
        wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
    }
    
    How is the flush operation implemented ?
    There are 4 flush sequence flags. A flush request's life cycle can include any of them, and the blk core will execute them in sequence. blk_flush_policy() is used to construct this sequence. Let's see it.
    static unsigned int blk_flush_policy(unsigned long fflags, struct request *rq)
    {
        unsigned int policy = 0;
    
        if (blk_rq_sectors(rq))
            policy |= REQ_FSEQ_DATA;
    
        if (fflags & (1UL << QUEUE_FLAG_WC)) {
            if (rq->cmd_flags & REQ_PREFLUSH)
                policy |= REQ_FSEQ_PREFLUSH;
            if (!(fflags & (1UL << QUEUE_FLAG_FUA)) &&
                (rq->cmd_flags & REQ_FUA))
                policy |= REQ_FSEQ_POSTFLUSH;
        }
        return policy;
    }
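    
    A user-space rendering of this policy (the flag values below are local stand-ins,
    not the kernel's REQ_*/QUEUE_FLAG_* bits): it prints which flush sequence steps a
    PREFLUSH+FUA write with data needs on queues with different write cache capabilities.

    /* Userspace model of blk_flush_policy(): which flush steps does a request need? */
    #include <stdio.h>

    #define QF_WC   (1u << 0)   /* queue has a volatile write cache */
    #define QF_FUA  (1u << 1)   /* queue supports FUA natively      */

    #define R_PREFLUSH (1u << 0)
    #define R_FUA      (1u << 1)

    #define FSEQ_PREFLUSH  (1u << 0)
    #define FSEQ_DATA      (1u << 1)
    #define FSEQ_POSTFLUSH (1u << 2)

    static unsigned policy(unsigned qflags, unsigned cmd_flags, unsigned sectors)
    {
        unsigned p = 0;

        if (sectors)
            p |= FSEQ_DATA;
        if (qflags & QF_WC) {
            if (cmd_flags & R_PREFLUSH)
                p |= FSEQ_PREFLUSH;
            if (!(qflags & QF_FUA) && (cmd_flags & R_FUA))
                p |= FSEQ_POSTFLUSH;    /* emulate FUA with a post flush */
        }
        return p;
    }

    static void show(const char *queue, unsigned qflags)
    {
        unsigned p = policy(qflags, R_PREFLUSH | R_FUA, 8 /* sectors of data */);

        printf("%-22s:%s%s%s\n", queue,
               p & FSEQ_PREFLUSH  ? " PREFLUSH"  : "",
               p & FSEQ_DATA      ? " DATA"      : "",
               p & FSEQ_POSTFLUSH ? " POSTFLUSH" : "");
    }

    int main(void)
    {
        show("no write cache", 0);                 /* DATA only, skips flush machinery */
        show("write cache, no FUA", QF_WC);        /* PREFLUSH DATA POSTFLUSH          */
        show("write cache + FUA", QF_WC | QF_FUA); /* PREFLUSH DATA (FUA in hardware)  */
        return 0;
    }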
    
    Two things need to be emphasized here. If blk_flush_policy() returns just REQ_FSEQ_DATA, the request can be processed directly without going through the flush machinery; for blk-mq, it will be inserted at the tail of hctx->dispatch.
    Otherwise, a flush sequence will be started.
    The flush sequence is carried out based on blk_flush_queue->flush_queue[2]. In addition, there are two indexes indicating the current state of the flush_queue; both of them only take the values 0/1. In the initial state, pending == running. After a flush sequence is kicked, the pending_idx is toggled; the pending_idx then differs from the running_idx, which means a flush is in flight. While the flush is in flight, new flushes are queued on the pending_idx, which is different from the running_idx. After the flush completes, the running_idx is toggled, so the running_idx equals the pending_idx again.
    A preallocated request, the flush_rq, does the actual flush work on behalf of the FLUSH requests. When it completes, all the FLUSH requests on the running queue are pushed forward to the next step.
    
    blk_flush_queue->flush_queue[2]
                     running 0
                     pending 0
    rq0 (PREFLUSH + DATA)
    rq1 (DATA + POSTFLUSH)
    rq2 (PREFLUSH + DATA)
    
    Time 0: running 0, pending 0
    
                     (seq = PREFLUSH)   
    flush_queue[0] - rq0
    
    blk_kick_flush toggle the pending_idx and send out
    the flush_rq.
    Time 1: running 0, pending 1
    
                     (seq = PREFLUSH)   
    flush_queue[0] - rq0
    
    hctx->dispatch - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
                     requeue -> bypass insert
    
    rq1 is inserted by blk_insert_flush
    Time 2: running 0, pending 1
    
                     (seq = PREFLUSH)   
    flush_queue[0] - rq0
                           (seq = DATA)
    flush_data_in_flight - rq1
    
    hctx->dispatch - rq1 (RQF_FLUSH_SEQ) - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
                     both requeue -> bypass insert
    
    rq2 is inserted by blk_insert_flush
    Time 3: running 0, pending 1
    
                     (seq = PREFLUSH)   
    flush_queue[1] - rq2
                     (seq = PREFLUSH)   
    flush_queue[0] - rq0
                           (seq = DATA)
    flush_data_in_flight - rq1
    
    hctx->dispatch - rq1 (RQF_FLUSH_SEQ) - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
                     both requeue -> bypass insert
    
    rq1 is completed firstly, due to POSTFLUSH, it is inserted to pending
    Time 4: running 0, pending 1
    
                     (seq = PREFLUSH)   (seq = POSTFLUSH) 
    flush_queue[1] - rq2              - rq1 
                     (seq = PREFLUSH)   
    flush_queue[0] - rq0
    
    hctx->dispatch - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
                     
    
    flush_rq is completed
    get running list flush_queue[0]
    toggle running running = 1
    iterate running_list flush_queue[0] to invoke blk_flush_complete_seq
    rq0 is inserted into flush_data_in_flight and requeued, finally added at the head of hctx->dispatch
    another flush is issued by blk_kick_flush due to rq1 and rq2
    Time 5: running 1, pending 1
    
                     (seq = PREFLUSH)   (seq = POSTFLUSH) 
    flush_queue[1] - rq2              - rq1 
                           (seq = DATA)
    flush_data_in_flight - rq1
    
    hctx->dispatch -  rq0 (RQF_FLUSH_SEQ) - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
    
    Question:
    The flush_rq can pass through the io scheduler with RQF_FLUSH_SEQ, but why does
    the original rq do the same ?
    Does that mean all rqs with FLUSH or FUA pass through the io scheduler ?
    
    
    A sequenced PREFLUSH/FUA request with DATA is completed twice.
    Once while executing DATA and again after the whole sequence is complete.
    The first completion updates the contained bio but doesn't finish it so that the 
    bio submitter is notified only after the whole sequence is complete.
    This is implemented by testing RQF_FLUSH_SEQ in req_bio_endio().
    
    Talking about the borrowed tag
    ('FLUSH reqs' below means requests with FLUSH or FUA operations.)
    Why does the flush_rq borrow tags from the FLUSH requests ?
    
    The flush_rq is allocated separately, so it is not in the tag_set of blk-mq.
    
    For the non-scheduler case, a FLUSH req occupies a driver tag and
    depends on the completion of the flush_rq. Assume the scenario where all the driver tags
    are held by FLUSH reqs; consequently, the flush_rq cannot get a driver tag
    any more and cannot move the flush sequence forward. An IO hang comes up. To
    avoid this, the flush_rq should borrow a driver tag from the FLUSH reqs.
    
    Recently,
    a commit 923218f (blk-mq: don't allocate driver tag upfront for flush rq)
    was introduced; it changes the way tag borrowing is handled in blk-mq.
    
    Before this patch, with an io scheduler attached, blk-mq would allocate a driver tag before
    delivering the request to blk-flush. blk-flush could then lend this driver tag to the proxy
    flush_rq, and this flush_rq would be queued to hctx->dispatch.
    
    blk_mq_make_request()
    ---
        if (unlikely(is_flush_fua)) {
            blk_mq_put_ctx(data.ctx);
            blk_mq_bio_to_request(rq, bio);
            if (q->elevator) {
                blk_mq_sched_insert_request(rq, false, true, true,
                        true);
            } 
    ---
    
    blk_mq_sched_insert_request()
    ---
        if (rq->tag == -1 && op_is_flush(rq->cmd_flags)) {
            blk_mq_sched_insert_flush(hctx, rq, can_block);
            return;
        }
    ---
    static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx,
                          struct request *rq, bool can_block)
    {
    
        if (blk_mq_get_driver_tag(rq, &hctx, can_block)) {
    
            blk_insert_flush(rq);
            blk_mq_run_hw_queue(hctx, true);
        } else
            blk_mq_add_to_requeue_list(rq, false, true);
    }
    
    And this can cause an issue. Look at the comment of reorder_tags_to_front():
    ---
    If we fail getting a driver tag because all the driver tags are already
    assigned and on the dispatch list, BUT the first entry does not have a
    tag, then we could deadlock. For that case, move entries with assigned
    driver tags to the front, leaving the set of tagged requests in the
    same order, and the untagged set in the same order.
    ---
    If the driver tags are all occupied by FLUSH reqs, and other reqs have to be
    queued on hctx->dispatch because of the shortage of driver tags,
    the flush_rq with its borrowed driver tag will be queued at the tail of hctx->dispatch.
    
    Then we get the scenario described above.
    
    The patch changes the way this case is handled: let the flush_rq get a driver tag
    just before .queue_rq() in blk_mq_dispatch_rq_list().
    This will not cause the IO hang described above, because the FLUSH requests only
    occupy sched tags. But the flush_rq still needs to borrow a sched tag to cheat
    blk-mq.
    
    blk_kick_flush()
    >>>>
        if (q->mq_ops) {
            struct blk_mq_hw_ctx *hctx;
    
            flush_rq->mq_ctx = first_rq->mq_ctx;
    
            if (!q->elevator) {
                fq->orig_rq = first_rq;
                flush_rq->tag = first_rq->tag;
                hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
                blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
            } else {
                flush_rq->internal_tag = first_rq->internal_tag;
    >>>>
    

    Queue state flags


    Let's look at 3 similar state flags of the request_queue.

    WBT


    WBT = Writeback Throttling
    Why do we need wbt ?
    Let's quote some comments from Jens, the developer of this feature:

    When we do background buffered writeback, it should have little impact
    on foreground activity. That's the definition of background activity...
    But for as long as I can remember, heavy buffered writers have not
    behaved like that. For instance, if I do something like this:
    
    $ dd if=/dev/zero of=foo bs=1M count=10k
    
    on my laptop, and then try and start chrome, it basically won't start
    before the buffered writeback is done. Or, for server oriented
    workloads, where installation of a big RPM (or similar) adversely
    impacts database reads or sync writes. When that happens, I get people
    yelling at me.
    
    In conclusion, foreground IOs should be prioritized over background ones.
    Who will be throttled ?
    wbt_should_throttle() gives the answer.
    static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
    {
        const int op = bio_op(bio);
    
        /*
         * If not a WRITE, do nothing
         */
        if (op != REQ_OP_WRITE)
            return false;
    
        /*
         * Don't throttle WRITE_ODIRECT
         */
        if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) == (REQ_SYNC | REQ_IDLE))
            return false;
    
        return true;
    }
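    As a hedged userspace illustration of the exemption above: direct writes are
    submitted as WRITE_ODIRECT (REQ_SYNC | REQ_IDLE), so wbt_should_throttle() lets
    them pass, whereas the same data written through the page cache would be throttled
    once background writeback kicks in. The file path and size below are arbitrary.
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;

        /* O_DIRECT requires alignment; 4096 covers typical logical block sizes */
        if (posix_memalign(&buf, 4096, 4096))
            return 1;
        memset(buf, 0xab, 4096);

        int fd = open("./wbt-test.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return 1;

        /* bypasses the page cache; treated as foreground I/O by wbt */
        ssize_t ret = pwrite(fd, buf, 4096, 0);

        close(fd);
        free(buf);
        return ret == 4096 ? 0 : 1;
    }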
    
    The remaining question is: what about synchronous writes,
    for example filesystem metadata updates ?
    How is it implemented ?
    Let's first look at the hooks across the blk-mq layer.
              blk_mq_make_request()
                    wbt_wait()
                        if !may_queue()
                            sleep
    
                    wbt_track()
                        save track info 
                        on rq->issue_stat
    
              blk_mq_start_request()                        wb_timer_fn()
                    wbt_issue()                                 account the latency of sync IO
                        sync issue time                         and adjust the limits of different IO type
    
              blk_mq_free_request()/__blk_mq_end_request()
                    wbt_done()
                        dec inflight
                        wake up
    
              __blk_mq_requeue_request()
                    wbt_requeue()
                        clear sync issue time
    
    Yes, it looks like the kyber IO scheduler,
    but there is a big difference regarding the action taken when the limit is reached.

    blkdev gendisk hd

    When we access a block device directly, for example /dev/sda1, we do not go through the bdev fs first: /dev/ is devtmpfs, not the bdev fs. Refer to init_special_inode to see this.

            sda1    sda2    sda3    sda4              devtmpfs
                         | [1]
                         V
        blkdev1 blkdev2 blkdev3 blkdev4           blkdev fs
    
    
    
    blkdev - block_device
    disk   - gendisk
    hd     - hd_struct
    [1]    - bdget get blkdev with inode->i_rdev (block devt) from blkdev fs
             get_gendisk get gendisk and partno with block devt and install
             them on blkdev->bd_disk and blkdev->bd_partno
             
    
    In a real workload, the flow is as follows:
    mount_bdev
      sget
        set_bdev_super       xxx_get_block
          set sb->s_bdev       map_bh
                                 bh->bdev = sb->s_bdev
                                 |
                                 V
                             submit_bh_wbc
                               bio_set_dev(bio, bh->b_bdev)
                                 bio->bi_disk = bdev->bd_disk 
                                 bio->bi_partno = bdev->bd_partno
                                 |
                                 V
                             generic_make_request
                               generic_make_request_checks
                                 blk_partition_remap
                                   bio->bi_iter.bi_sector += hd->start_sect |
                                      bio->bi_partno = 0;
                               queue->make_request_fn
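    A hedged sketch of what blk_partition_remap does at step [1] above: the
    partition-relative sector in the bio is rebased onto the whole disk by adding the
    partition's start_sect, and bi_partno is cleared so the request now addresses the
    gendisk itself. The struct and function names below are illustrative only.
    #include <stdio.h>

    typedef unsigned long long sector_t;

    struct fake_hd_struct { sector_t start_sect; };               /* like hd_struct  */
    struct fake_bio       { sector_t bi_sector; int bi_partno; }; /* like bio fields */

    static void partition_remap(struct fake_bio *bio, const struct fake_hd_struct *hd)
    {
        bio->bi_sector += hd->start_sect;   /* partition offset -> disk offset */
        bio->bi_partno  = 0;                /* now addressed against the disk  */
    }

    int main(void)
    {
        /* e.g. sda1 starting at sector 2048, a bio for sector 100 inside sda1 */
        struct fake_hd_struct sda1 = { .start_sect = 2048 };
        struct fake_bio bio = { .bi_sector = 100, .bi_partno = 1 };

        partition_remap(&bio, &sda1);
        printf("absolute sector = %llu, partno = %d\n", bio.bi_sector, bio.bi_partno);
        return 0;
    }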
    
    

    blk sysfs

    Let's look at how the following sysfs interfaces are added.

        /sys/block/nvme/queue/
    	      ^     ^     ^
    		 [1]   [2]   [3]
    
        /sys/block/nvme/mq
    	                ^
    				   [4]
    

    request_queue cleanup and release

    The first thing blk_cleanup_queue needs to do is prevent others from entering the blk path again. This is achieved by invoking blk_set_queue_dying.

    void blk_set_queue_dying(struct request_queue *q)
    {
        blk_queue_flag_set(QUEUE_FLAG_DYING, q);
    
        /*
         * When queue DYING flag is set, we need to block new req
         * entering queue, so we call blk_freeze_queue_start() to
         * prevent I/O from crossing blk_queue_enter().
         */
        blk_freeze_queue_start(q);
    
        if (q->mq_ops)
            blk_mq_wake_waiters(q);
    
        wake up the tag waiters; the hw queues will be run.
        The DYING flag is not the same as QUIESCED; the latter prevents requests from
        entering the lldd.
    
        else {
        ...
        }
    
        /* Make blk_queue_enter() reexamine the DYING flag. */
    
        wake_up_all(&q->mq_freeze_wq);
    }
    
    blk_queue_dying and blk_queue_enter gate the other contexts out of the blk path
    (see the sketch after this list). blk_queue_dying gates:

    • sysfs interface
    • blk_execute_rq_nowait (blk-mq does not appear to do this)
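    A hedged userspace sketch of the blk_queue_enter()/blk_queue_exit() gate. The
    kernel uses a percpu_ref (q->q_usage_counter) plus the mq_freeze_wq waitqueue;
    here a plain atomic counter and two flags stand in for them, only to show the
    ordering: a started freeze makes new entries back off, DYING refuses them, and
    every exit drops the reference the freezer waits to reach zero.
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct fake_queue {
        atomic_int  usage;    /* stands in for q_usage_counter   */
        atomic_bool dying;    /* QUEUE_FLAG_DYING                */
        atomic_bool frozen;   /* set by blk_freeze_queue_start() */
    };

    static bool queue_enter(struct fake_queue *q)
    {
        if (!atomic_load(&q->frozen)) {
            atomic_fetch_add(&q->usage, 1);
            if (!atomic_load(&q->frozen))     /* re-check after taking a ref */
                return true;
            atomic_fetch_sub(&q->usage, 1);   /* raced with a freeze, back off */
        }
        if (atomic_load(&q->dying))
            return false;                     /* like -ENODEV from blk_queue_enter() */
        /* the real code sleeps on mq_freeze_wq and retries once unfrozen;
         * this sketch simply gives up */
        return false;
    }

    static void queue_exit(struct fake_queue *q)
    {
        atomic_fetch_sub(&q->usage, 1);       /* the last exit lets the drain finish */
    }

    int main(void)
    {
        struct fake_queue q;

        atomic_init(&q.usage, 0);
        atomic_init(&q.dying, false);
        atomic_init(&q.frozen, false);

        if (queue_enter(&q)) {
            printf("I/O admitted, usage=%d\n", atomic_load(&q.usage));
            queue_exit(&q);
        }

        /* blk_set_queue_dying() sets DYING and starts the freeze */
        atomic_store(&q.dying, true);
        atomic_store(&q.frozen, true);
        printf("after dying: enter %s\n", queue_enter(&q) ? "succeeds" : "fails");
        return 0;
    }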

    Then blk_cleanup_queue will invoke blk_freeze_queue. It fences off any new requests and also drains all existing requests, whether pending or outstanding.
    Even after the queue has been drained, there may still be contexts that access the request_queue resources, such as the blk-mq run work and the requeue work. blk_sync_queue is used to flush them.
    void blk_sync_queue(struct request_queue *q)
    {
        del_timer_sync(&q->timeout);
        cancel_work_sync(&q->timeout_work);
    
        if (q->mq_ops) {
            struct blk_mq_hw_ctx *hctx;
            int i;
    
            cancel_delayed_work_sync(&q->requeue_work);
            queue_for_each_hw_ctx(q, hctx, i)
                cancel_delayed_work_sync(&hctx->run_work);
        } else {
            cancel_delayed_work_sync(&q->delay_work);
        }
    }
    
    Finally, blk_put_queue puts a reference of q->kobj.
    When the reference reaches zero, blk_queue_ktype's release callback, blk_release_queue, is invoked. It queues __blk_release_queue, which does the final release.

    What needs to be noted is that the gendisk takes an extra ref on its request_queue in __device_add_disk and puts it in disk_release. So the request_queue sticks around as long as the gendisk does.

    blk_integrity

    What is blk_integrity for ?

    
           [ system memory ]
                   |   
                   | D  
                   | M   path1
                   | A
                   V        sas/fc/iscsi
             [ HBA memory]- - - - - - - - ->[ storage volume ]
                                 path2
    
    The data integrity on path2 can be ensured by the transport protocol. Path1 is protected by blk_integrity, which is what we will talk about next.


    How is blk_integrity implemented ?
    Quote from Documentation/block/data-integrity.txt
    Because the format of the protection data is tied to the physical
    disk, each block device has been extended with a block integrity
    profile (struct blk_integrity).  This optional profile is registered
    with the block layer using blk_integrity_register().
    
    The profile contains callback functions for generating and verifying
    the protection data, as well as getting and setting application tags.
    The profile also contains a few constants to aid in completing,
    merging and splitting the integrity metadata.
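    As a hedged illustration of what a generate_fn computes per interval: the
    t10_pi_type*_crc profiles use the T10-DIF CRC16 (polynomial 0x8BB7) as the guard
    tag of each 512-byte sector. The bitwise version below is for illustration only;
    the kernel uses the optimized crc_t10dif() helper.
    #include <stddef.h>
    #include <stdio.h>

    /* CRC16 T10-DIF: poly 0x8BB7, init 0, no reflection, no final xor */
    static unsigned short t10_crc16(const unsigned char *buf, size_t len)
    {
        unsigned short crc = 0;

        for (size_t i = 0; i < len; i++) {
            crc ^= (unsigned short)(buf[i] << 8);
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x8000) ? (crc << 1) ^ 0x8BB7 : crc << 1;
        }
        return crc;
    }

    int main(void)
    {
        unsigned char sector[512];

        for (size_t i = 0; i < sizeof(sector); i++)
            sector[i] = (unsigned char)i;        /* dummy sector payload */

        /* this is what would be stored in the guard field of the PI tuple */
        printf("guard tag = 0x%04x\n", t10_crc16(sector, sizeof(sector)));
        return 0;
    }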
    
    Let's look at how scsi sd implements this.
    sd_probe_async
      -> sd_dif_config_host
    --
        /* Enable DMA of protection information */
        if (scsi_host_get_guard(sdkp->device->host) & SHOST_DIX_GUARD_IP) {
            if (type == T10_PI_TYPE3_PROTECTION)
                bi.profile = &t10_pi_type3_ip;
            else
                bi.profile = &t10_pi_type1_ip;
    
            bi.flags |= BLK_INTEGRITY_IP_CHECKSUM;
        } else
            if (type == T10_PI_TYPE3_PROTECTION)
                bi.profile = &t10_pi_type3_crc;
            else
                bi.profile = &t10_pi_type1_crc;
    
        bi.tuple_size = sizeof(struct t10_pi_tuple);
        sd_printk(KERN_NOTICE, sdkp,
              "Enabling DIX %s protection\n", bi.profile->name);
    
        if (dif && type) {
            bi.flags |= BLK_INTEGRITY_DEVICE_CAPABLE;
    
            if (!sdkp->ATO)
                goto out;
    
            if (type == T10_PI_TYPE3_PROTECTION)
                bi.tag_size = sizeof(u16) + sizeof(u32);
            else
                bi.tag_size = sizeof(u16);
    
            sd_printk(KERN_NOTICE, sdkp, "DIF application tag size %u\n",
                  bi.tag_size);
        }
    
    out:
        blk_integrity_register(disk, &bi);
    --
    


    The process of blk_integrity
    blk_mq_make_request
      -> bio_integrity_prep
        -> bio_integrity_add_page  //bio->bi_integrity
        -> bio_integrity_process(bio, &bio->bi_iter, bi->profile->generate_fn); //bio_data_dir(bio) == WRITE)
    
    bio_endio
      -> bio_integrity_endio
        -> __bio_integrity_endio
    --
        if (bio_op(bio) == REQ_OP_READ && !bio->bi_status &&
            (bip->bip_flags & BIP_BLOCK_INTEGRITY) && bi->profile->verify_fn) {
            INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
            queue_work(kintegrityd_wq, &bip->bip_work);
            return false;
        }
    --
    
    static void bio_integrity_verify_fn(struct work_struct *work)
    {
        struct bio_integrity_payload *bip =
            container_of(work, struct bio_integrity_payload, bip_work);
        struct bio *bio = bip->bip_bio;
        struct blk_integrity *bi = blk_get_integrity(bio->bi_disk);
        struct bvec_iter iter = bio->bi_iter;
    
        /*
         * At the moment verify is called bio's iterator was advanced
         * during split and completion, we need to rewind iterator to
         * it's original position.
         */
        if (bio_rewind_iter(bio, &iter, iter.bi_done)) {
            bio->bi_status = bio_integrity_process(bio, &iter,
                                   bi->profile->verify_fn);
        } else {
            bio->bi_status = BLK_STS_IOERR;
        }
    
        bio_integrity_free(bio);
        bio_endio(bio);
    }
    
    


    blk_integrity and fs
    After the request is issued to the HBA, the data is transferred to the HBA's internal buffer through DMA and then verified against the protection metadata. During the DMA transfer, the data in the sglist (the page cache pages) must not be modified. This is guaranteed by the fs.
    Steps of writing data to a file:
    1. writing into the page cache
    aops.write_begin
      -> lock page
      -> wait_for_stable_page
        -> if bdi_cap_stable_pages_required //BDI_CAP_STABLE_WRITES
             wait_on_page_writeback
    copy from user buffer to page cache
    aops.write_end
    
    2. writeback the pagecache to disk
    lock page
    set page writeback
    submit_bio
    unlock page
    
    3. io completion
    end bio
      -> end_page_writeback
        -> test_clear_page_writeback
        -> wake_up_page(page, PG_writeback)
    
    BDI_CAP_STABLE_WRITES is set in blk_integrity_register.

    blk loop

    What's blk-loop for ?

    
        /dev/loopX     /home/ubuntu-16.04.4-desktop-amd64.iso
             |         ^         |              |
             v         |         v              v
        +-------------C-------------------+  +-------+
        |     vfs cache|                  |  |  DIO  |
        +-------------C-------------------+  +-------+
             |         |         |              |
             v         |         v              v
        +-------------C------------------------------+
        |  block layer |                             |
        +-------------C------------------------------+
             |         |         |
             v         |         v
            blk-loop driver    SCSI layer
    
    The backend of a block device could be an HDD, an SSD or a storage subsystem attached via FC or iSCSI, and it could also be a local file.

    There is another concept here: direct IO.
    Data from applications goes directly to the block layer, bypassing the system
    file cache.
    

    How to create

    Step 1

    /dev/loop-control 
    loop_ctl_fops
      -> loop_control_ioctl //LOOP_CTL_ADD
        -> loop_add
    There are a lot of interesting things in loop_add; let's look at it.
    static int loop_add(struct loop_device **l, int i)
    {
        struct loop_device *lo;
        struct gendisk *disk;
        int err;
    
        err = -ENOMEM;
        lo = kzalloc(sizeof(*lo), GFP_KERNEL);
        if (!lo)
            goto out;
    
        lo->lo_state = Lo_unbound; //This means no file is bound on this device
    
        /* allocate id, if @id >= 0, we're requesting that specific id */
        if (i >= 0) {
            err = idr_alloc(&loop_index_idr, lo, i, i + 1, GFP_KERNEL);
            if (err == -ENOSPC)
                err = -EEXIST;
        } else {
            err = idr_alloc(&loop_index_idr, lo, 0, 0, GFP_KERNEL);
        }
        if (err < 0)
            goto out_free_dev;
        i = err;
    
        err = -ENOMEM;
        lo->tag_set.ops = &loop_mq_ops;
        lo->tag_set.nr_hw_queues = 1;
        /*
        It should be an interesting theme to find out how many hw_queues to be
        required to get better performance.
        The real work is done in loop kthread, what .queue_rq does is just to insert
        a work or wakeup the kthread.
         */
        lo->tag_set.queue_depth = 128;
        lo->tag_set.numa_node = NUMA_NO_NODE;
        lo->tag_set.cmd_size = sizeof(struct loop_cmd);
        lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
        lo->tag_set.driver_data = lo;
    
        err = blk_mq_alloc_tag_set(&lo->tag_set);
        if (err)
            goto out_free_idr;
    
        lo->lo_queue = blk_mq_init_queue(&lo->tag_set);
        if (IS_ERR_OR_NULL(lo->lo_queue)) {
            err = PTR_ERR(lo->lo_queue);
            goto out_cleanup_tags;
        }
        lo->lo_queue->queuedata = lo;
    
        blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
    
        /*
         * By default, we do buffer IO, so it doesn't make sense to enable
         * merge because the I/O submitted to backing file is handled page by
         * page. For directio mode, merge does help to dispatch bigger request
         * to underlayer disk. We will enable merge once directio is enabled.
         */
        queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
    
        err = -ENOMEM;
        disk = lo->lo_disk = alloc_disk(1 << part_shift);
        ...
        disk->fops        = &lo_fops; //this the fops for /dev/loopX
        disk->private_data    = lo;
        disk->queue        = lo->lo_queue;
        sprintf(disk->disk_name, "loop%d", i);
        add_disk(disk);
        *l = lo;
        return lo->lo_number;
        ...
    }
    
    
    Step 2
    /dev/loopX
    lo_fops
      -> lo_ioctl //LOOP_SET_FD
        -> loop_set_fd
    static int loop_set_fd(struct loop_device *lo, fmode_t mode,
                   struct block_device *bdev, unsigned int arg)
    {
        ...
        file = fget(arg);
        if (!file)
            goto out;
        ...
        mapping = file->f_mapping;
        inode = mapping->host;
        //regular file or block file
        if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
            goto out_putf;
    
        if (!(file->f_mode & FMODE_WRITE) || !(mode & FMODE_WRITE) ||
            !file->f_op->write_iter)
            lo_flags |= LO_FLAGS_READ_ONLY;
    
        error = -EFBIG;
        size = get_loop_size(lo, file);
        if ((loff_t)(sector_t)size != size)
            goto out_putf;
        error = loop_prepare_queue(lo);
        
                kthread_init_worker(&lo->worker);
                lo->worker_task = kthread_run(loop_kthread_worker_fn,
                        &lo->worker, "loop%d", lo->lo_number);
                if (IS_ERR(lo->worker_task))
                return -ENOMEM;
                set_user_nice(lo->worker_task, MIN_NICE);
        
    
        set_device_ro(bdev, (lo_flags & LO_FLAGS_READ_ONLY) != 0);
    
        lo->use_dio = false;
        lo->lo_device = bdev;
        lo->lo_flags = lo_flags;
        lo->lo_backing_file = file;
        lo->transfer = NULL;
        lo->ioctl = NULL;
        lo->lo_sizelimit = 0;
        lo->old_gfp_mask = mapping_gfp_mask(mapping);
        mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
    
        if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
            blk_queue_write_cache(lo->lo_queue, true, false);
    
        loop_update_dio(lo);
        set_capacity(lo->lo_disk, size);
        bd_set_size(bdev, size << 9);
        loop_sysfs_init(lo);
        /* let user-space know about the new size */
        kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
    
        set_blocksize(bdev, S_ISBLK(inode->i_mode) ?
                  block_size(inode->i_bdev) : PAGE_SIZE);
    
        lo->lo_state = Lo_bound;
        ...
    }
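    A hedged userspace sketch of the two steps above, using the real loop ioctls:
    ask /dev/loop-control for a free minor (LOOP_CTL_GET_FREE is the convenience
    variant of LOOP_CTL_ADD), then bind a backing file to /dev/loopN with
    LOOP_SET_FD. The backing file path is just the example used earlier; run as root.
    #include <fcntl.h>
    #include <linux/loop.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int ctl = open("/dev/loop-control", O_RDWR);
        if (ctl < 0)
            return 1;

        int nr = ioctl(ctl, LOOP_CTL_GET_FREE);   /* ends up in loop_add() */
        if (nr < 0)
            return 1;

        char path[32];
        snprintf(path, sizeof(path), "/dev/loop%d", nr);

        int loopfd  = open(path, O_RDWR);
        int backing = open("/home/ubuntu-16.04.4-desktop-amd64.iso", O_RDWR);
        if (loopfd < 0 || backing < 0)
            return 1;

        /* loop_set_fd(): Lo_unbound -> Lo_bound, capacity taken from the file */
        if (ioctl(loopfd, LOOP_SET_FD, backing) < 0)
            return 1;

        printf("%s is now backed by the iso file\n", path);

        ioctl(loopfd, LOOP_CLR_FD, 0);            /* tear it down again */
        close(backing);
        close(loopfd);
        close(ctl);
        return 0;
    }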
    

    Kthread or Workqueue ?

    When a request enters .queue_rq, how is it handled next ?
    It needs to be handled in another context, because we already own a deep stack from vfs_read/write down to the driver's .queue_rq. This context could be a kworker or a standalone kthread. But which one should we use ?
    commit e03a3d7 ( block: loop: use kthread_work ) changed block loop from work to kthread context. Let's look at what block loop does before and after this patch.

    Work based.
    
               Concurrently                   Sequentially                         
        Read   Read   Read   Read      Write<->Write<->Write<->Write
        +---+  +---+  +---+  +---+     +---+
        | W |  | W |  | W |  | W |     | W |
        +---+  +---+  +---+  +---+     +---+
          |      |      |      |         |
       + -v- - - v - - -v- - - v - - - - v - - +
       |          Unbound worker pool          |
       + - - - - - - - - - - - - - - - - - - - +
    
    +---+
    | W |  work instance
    +---+
    
    
    For reads, block loop issues them concurrently as far as possible. This is because read operations often need to wait for the page cache to be filled, i.e. a read is usually synchronous. Issuing reads concurrently is good for random reads, but it is not as useful for sequential reads, which often hit the page cache.
    For writes, block loop issues them sequentially, because writes usually only reach the page cache and are therefore fast enough.
                 Write<->Write<->Read<->Read<->Write ....
              +- - - - -+
              | kthread |
              +- - - - -+
    
    When DIO/AIO is introduced, reads and writes on the backing file are no longer blocking operations.

    DIO & AIO on backing file

    In Linux, read operations are almost synchronous unless the required data is already in the page cache; otherwise the reader has to wait for the page cache to be filled by the block device through the block layer and the blk driver. Even with the readahead mechanism, the page cache is often missed with random reads.
    Consequently, the loop driver execution context (kworker or standalone kthread) has to wait, and this delays the other requests whose page cache may already be populated.
    On the other hand, two layers of page cache are involved: one for the file over the loop device, and one for the backing file. This is unnecessary and wastes memory.

    Ming Lei introduced backing-file DIO and AIO support in block loop.

    commit bc07c10a3603a5ab3ef01ba42b3d41f9ac63d1b6
    Author: Ming Lei 
    Date:   Mon Aug 17 10:31:51 2015 +0800
    
        block: loop: support DIO & AIO
        
        There are at least 3 advantages to use direct I/O and AIO on
        read/write loop's backing file:
        
        1) double cache can be avoided, then memory usage gets
        decreased a lot
        
        2) not like user space direct I/O, there isn't cost of
        pinning pages
        
        3) avoid context switch for obtaining good throughput
        - in buffered file read, random I/O top throughput is often obtained
        only if they are submitted concurrently from lots of tasks; but for
        sequential I/O, most of times they can be hit from page cache, so
        concurrent submissions often introduce unnecessary context switch
        and can't improve throughput much. There was such discussion[1]
        to use non-blocking I/O to improve the problem for application.
        - with direct I/O and AIO, concurrent submissions can be
        avoided and random read throughput can't be affected meantime
        
        xfstests(-g auto, ext4) is basically passed when running with
        direct I/O(aio), one exception is generic/232, but it failed in
        loop buffered I/O(4.2-rc6-next-20150814) too.
        
        Follows the fio test result for performance purpose:
            4 jobs fio test inside ext4 file system over loop block
        
        1) How to run
            - KVM: 4 VCPUs, 2G RAM
            - linux kernel: 4.2-rc6-next-20150814(base) with the patchset
            - the loop block is over one image on SSD.
            - linux psync, 4 jobs, size 1500M, ext4 over loop block
            - test result: IOPS from fio output
        
        2) Throughput(IOPS) becomes a bit better with direct I/O(aio)
                -------------------------------------------------------------
                test cases          |randread   |read   |randwrite  |write  |
                -------------------------------------------------------------
                base                |8015       |113811 |67442      |106978
                -------------------------------------------------------------
                base+loop aio       |8136       |125040 |67811      |111376
                -------------------------------------------------------------
        
        - somehow, it should be caused by more page cache avaiable for
        application or one extra page copy is avoided in case of direct I/O
        
        3) context switch
                - context switch decreased by ~50% with loop direct I/O(aio)
            compared with loop buffered I/O(4.2-rc6-next-20150814)
        
        4) memory usage from /proc/meminfo
                -------------------------------------------------------------
                                           | Buffers       | Cached
                -------------------------------------------------------------
                base                       | > 760MB       | ~950MB
                -------------------------------------------------------------
                base+loop direct I/O(aio)  | < 5MB         | ~1.6GB
                -------------------------------------------------------------
        
        - so there are much more page caches available for application with
        direct I/O
        
        [1] https://lwn.net/Articles/612483/
        
        Signed-off-by: Ming Lei 
        Reviewed-by: Christoph Hellwig 
        Signed-off-by: Jens Axboe 
    
    After that, we get the following diagram.
        /dev/loopX            > /home/ubuntu-16.04.4-desktop-amd64.iso
             |               /         |
             v              /          v
        +-------------+    /       +-------+
        | vfs cache|  |   /        |  DIO  |
        +-------------+  /         +-------+
             |          /              |
             v         /               v
        +-------------C-----------------------------+
        | block layer  |                            |
        +-------------C-----------------------------+
             |         |               |
             v         |               v
            blk-loop driver        SCSI layer
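    A hedged usage sketch: with the AIO/DIO support above in place, userspace can ask
    the loop driver to switch the backing file to direct I/O via the
    LOOP_SET_DIRECT_IO ioctl (losetup --direct-io does the same). The device name is
    illustrative, and the ioctl may fail if the backing file or its block size does
    not allow direct I/O.
    #include <fcntl.h>
    #include <linux/loop.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/loop0", O_RDWR);
        if (fd < 0)
            return 1;

        /* 1 = do direct I/O (and AIO) against the backing file, 0 = buffered */
        int ret = ioctl(fd, LOOP_SET_DIRECT_IO, 1);

        close(fd);
        return ret ? 1 : 0;
    }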
    

    blk-stats

    Before looking into the implementation of blk-stat in the kernel, let's first look at how the information provided by blk-stats is utilized by iostat.

    #iostat -c -d -x /dev/sda2 2 100
    Linux 4.16.0-rc3+ (will-ThinkPad-L470)     03/20/2018     _x86_64_    (4 CPU)
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              12.61    0.03    2.23    0.82    0.00   84.31
    
    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sda2              0.20     5.86    2.46    4.04    23.54    56.83    24.72     0.14   20.56    6.17   29.31   5.67   3.69
    
    
    rrqm/s      The number of read requests merged per second queued to the device.
    wrqm/s      The number of write requests merged per second queued to the device.
    r/s         The number of read requests issued to the device per second.
    w/s         The number of write requests issued to the device per second.
    avgrq-sz    The average size (in sectors) of the requests issued to the device.
    avgqu-sz    The average queue length of the requests issued to the device.
    await       The average time (milliseconds) for I/O requests issued to the device to be served.
                This includes the time spent by the requests in queue and the time spent servicing them.
    r_await     The average time (in milliseconds) for read requests issued to the device to be served.
                This includes the time spent by the requests in queue and the time spent servicing them.
    w_await     The average time (in milliseconds) for write requests issued to the device to be served.
                This includes the time spent by the requests in queue and the time spent servicing them.
    svctm       The average service time (in milliseconds) for I/O requests issued to the device.
                Warning! Do not trust this field; it will be removed in a future version of sysstat.
    %util       Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device).
                Device saturation occurs when this values is close to 100%.
    
    How are they calculated ? Based on write_ext_stat:
    ioj, ioi     two samples, j = i + 1 (ioj is the later one)
    itv          interval between the two samples

    rrqm/s    (ioj->rd_merges - ioi->rd_merges)/itv
    wrqm/s    (ioj->wr_merges - ioi->wr_merges)/itv
    r/s       (ioj->rd_ios - ioi->rd_ios)/itv
    w/s       (ioj->wr_ios - ioi->wr_ios)/itv
    avgrq-sz  ((ioj->rd_sect - ioi->rd_sect) + (ioj->wr_sect - ioi->wr_sect))/
              (ioj->nr_ios - ioi->nr_ios)
    avgqu-sz  (ioj->rq_ticks - ioi->rq_ticks)/itv
    await     ((ioj->rd_ticks - ioi->rd_ticks) + (ioj->wr_ticks - ioi->wr_ticks))/
              (ioj->nr_ios - ioi->nr_ios)

    r_await    similar to await, reads only
    w_await    similar to await, writes only

    %util      (ioj->tot_ticks - ioi->tot_ticks)/itv
    
    We can refer to read_diskstats_stat to see where this data comes from.
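    A small sketch of the arithmetic above: sample /proc/diskstats twice and derive
    r/s, w/s, await and %util for one device. The field layout follows the sscanf()
    in read_diskstats_stat below; the device name and the 2-second interval are
    assumptions.
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    struct ds { unsigned long rd_ios, rd_merges, rd_sec, rd_ticks,
                              wr_ios, wr_merges, wr_sec, wr_ticks,
                              ios_pgr, tot_ticks, rq_ticks; };

    static int sample(const char *dev, struct ds *d)
    {
        char line[256], name[32];
        unsigned int major, minor;
        FILE *fp = fopen("/proc/diskstats", "r");

        if (!fp)
            return -1;
        while (fgets(line, sizeof(line), fp)) {
            sscanf(line, "%u %u %31s %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
                   &major, &minor, name,
                   &d->rd_ios, &d->rd_merges, &d->rd_sec, &d->rd_ticks,
                   &d->wr_ios, &d->wr_merges, &d->wr_sec, &d->wr_ticks,
                   &d->ios_pgr, &d->tot_ticks, &d->rq_ticks);
            if (!strcmp(name, dev)) {
                fclose(fp);
                return 0;
            }
        }
        fclose(fp);
        return -1;
    }

    int main(void)
    {
        const char *dev = "sda2";          /* assumed device name */
        double itv = 2.0;                  /* seconds between samples */
        struct ds a, b;

        if (sample(dev, &a))
            return 1;
        sleep((unsigned int)itv);
        if (sample(dev, &b))
            return 1;

        unsigned long nr_ios = (b.rd_ios - a.rd_ios) + (b.wr_ios - a.wr_ios);

        printf("r/s=%.2f w/s=%.2f await=%.2fms util=%.2f%%\n",
               (b.rd_ios - a.rd_ios) / itv,
               (b.wr_ios - a.wr_ios) / itv,
               nr_ios ? ((b.rd_ticks - a.rd_ticks) +
                         (b.wr_ticks - a.wr_ticks)) / (double)nr_ios : 0.0,
               /* tot_ticks is in ms: ms spent doing I/O over the interval */
               (b.tot_ticks - a.tot_ticks) / (itv * 1000.0) * 100.0);
        return 0;
    }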

    Next, let's find out how this statistics data is generated in the kernel.
    Based on diskstats_show, the counters are members of hd_struct.dkstats (a percpu variable). For reference:
    read_diskstats_stat
    void read_diskstats_stat(int curr)
    {
        ...
        if ((fp = fopen(DISKSTATS, "r")) == NULL) //  proc/diskstats
            return;
    
        while (fgets(line, 256, fp) != NULL) {
    
            /* major minor name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq */
            i = sscanf(line, "%u %u %s %lu %lu %lu %lu %lu %lu %lu %u %u %u %u",
                   &major, &minor, dev_name,
                   &rd_ios, &rd_merges_or_rd_sec, &rd_sec_or_wr_ios, &rd_ticks_or_wr_sec,
                   &wr_ios, &wr_merges, &wr_sec, &wr_ticks, &ios_pgr, &tot_ticks, &rq_ticks);
    
            if (i == 14) {
                /* Device or partition */
                if (!dlist_idx && !DISPLAY_PARTITIONS(flags) &&
                    !is_device(dev_name, ACCEPT_VIRTUAL_DEVICES))
                    continue;
                sdev.rd_ios     = rd_ios;
                sdev.rd_merges  = rd_merges_or_rd_sec;
                sdev.rd_sectors = rd_sec_or_wr_ios;
                sdev.rd_ticks   = (unsigned int) rd_ticks_or_wr_sec;
                sdev.wr_ios     = wr_ios;
                sdev.wr_merges  = wr_merges;
                sdev.wr_sectors = wr_sec;
                sdev.wr_ticks   = wr_ticks;
                sdev.ios_pgr    = ios_pgr;
                sdev.tot_ticks  = tot_ticks;
                sdev.rq_ticks   = rq_ticks;
            }
            ...
            save_stats(dev_name, curr, &sdev, iodev_nr, st_hdr_iodev);
        }
        ...
    }
    
    diskstats_show
    static int diskstats_show(struct seq_file *seqf, void *v)
    {
        struct gendisk *gp = v;
        struct disk_part_iter piter;
        struct hd_struct *hd;
        char buf[BDEVNAME_SIZE];
        unsigned int inflight[2];
        int cpu;
    
        /*
        if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
            seq_puts(seqf,    "major minor name"
                    "     rio rmerge rsect ruse wio wmerge "
                    "wsect wuse running use aveq"
                    "\n\n");
        */
    
        disk_part_iter_init(&piter, gp, DISK_PITER_INCL_EMPTY_PART0);
        while ((hd = disk_part_iter_next(&piter))) {
            cpu = part_stat_lock();
            part_round_stats(gp->queue, cpu, hd);
            part_stat_unlock();
            part_in_flight(gp->queue, hd, inflight);
            seq_printf(seqf, "%4d %7d %s %lu %lu %lu "
                   "%u %lu %lu %lu %u %u %u %u\n",
                   MAJOR(part_devt(hd)), MINOR(part_devt(hd)),
                   disk_name(gp, hd->partno, buf),
                   part_stat_read(hd, ios[READ]),
                   part_stat_read(hd, merges[READ]),
                   part_stat_read(hd, sectors[READ]),
                   jiffies_to_msecs(part_stat_read(hd, ticks[READ])),
                   part_stat_read(hd, ios[WRITE]),
                   part_stat_read(hd, merges[WRITE]),
                   part_stat_read(hd, sectors[WRITE]),
                   jiffies_to_msecs(part_stat_read(hd, ticks[WRITE])),
                   inflight[0],
                   jiffies_to_msecs(part_stat_read(hd, io_ticks)),
                   jiffies_to_msecs(part_stat_read(hd, time_in_queue))
                );
        }
        disk_part_iter_exit(&piter);
    
        return 0;
    }
    
    write_ext_stat
    void write_ext_stat(int curr, unsigned long long itv, int fctr,
                struct io_hdr_stats *shi, struct io_stats *ioi,
                struct io_stats *ioj)
    {
        char *devname = NULL;
        struct stats_disk sdc, sdp;
        struct ext_disk_stats xds;
        double r_await, w_await;
        
        /*
         * Counters overflows are possible, but don't need to be handled in
         * a special way: The difference is still properly calculated if the
         * result is of the same type as the two values.
         * Exception is field rq_ticks which is incremented by the number of
         * I/O in progress times the number of milliseconds spent doing I/O.
         * But the number of I/O in progress (field ios_pgr) happens to be
         * sometimes negative...
         */
        sdc.nr_ios    = ioi->rd_ios + ioi->wr_ios;
        sdp.nr_ios    = ioj->rd_ios + ioj->wr_ios;
    
        sdc.tot_ticks = ioi->tot_ticks;
        sdp.tot_ticks = ioj->tot_ticks;
    
        sdc.rd_ticks  = ioi->rd_ticks;
        sdp.rd_ticks  = ioj->rd_ticks;
        sdc.wr_ticks  = ioi->wr_ticks;
        sdp.wr_ticks  = ioj->wr_ticks;
    
        sdc.rd_sect   = ioi->rd_sectors;
        sdp.rd_sect   = ioj->rd_sectors;
        sdc.wr_sect   = ioi->wr_sectors;
        sdp.wr_sect   = ioj->wr_sectors;
        
        compute_ext_disk_stats(&sdc, &sdp, itv, &xds);
        
        r_await = (ioi->rd_ios - ioj->rd_ios) ?
              (ioi->rd_ticks - ioj->rd_ticks) /
              ((double) (ioi->rd_ios - ioj->rd_ios)) : 0.0;
        w_await = (ioi->wr_ios - ioj->wr_ios) ?
              (ioi->wr_ticks - ioj->wr_ticks) /
              ((double) (ioi->wr_ios - ioj->wr_ios)) : 0.0;
    
        /* Print device name */
        if (DISPLAY_PERSIST_NAME_I(flags)) {
            devname = get_persistent_name_from_pretty(shi->name);
        }
        if (!devname) {
            devname = shi->name;
        }
        if (DISPLAY_HUMAN_READ(flags)) {
            printf("%s\n%13s", devname, "");
        }
        else {
            printf("%-13s", devname);
        }
    
        /*       rrq/s wrq/s   r/s   w/s  rsec  wsec  rqsz  qusz await r_await w_await svctm %util */
        printf(" %8.2f %8.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %7.2f %7.2f %6.2f %6.2f\n",
               S_VALUE(ioj->rd_merges, ioi->rd_merges, itv),
               S_VALUE(ioj->wr_merges, ioi->wr_merges, itv),
               S_VALUE(ioj->rd_ios, ioi->rd_ios, itv),
               S_VALUE(ioj->wr_ios, ioi->wr_ios, itv),
               ll_s_value(ioj->rd_sectors, ioi->rd_sectors, itv) / fctr,
               ll_s_value(ioj->wr_sectors, ioi->wr_sectors, itv) / fctr,
               xds.arqsz,
               S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0,
               xds.await,
               r_await,
               w_await,
               /* The ticks output is biased to output 1000 ticks per second */
               xds.svctm,
               /*
                * Again: Ticks in milliseconds.
            * In the case of a device group (option -g), shi->used is the number of
            * devices in the group. Else shi->used equals 1.
            */
               shi->used ? xds.util / 10.0 / (double) shi->used
                         : xds.util / 10.0);    /* shi->used should never be null here */
    }
    
    

    blk-timeout

    There is a per-request_queue timer to defend against a block device that stops responding.

    The timer is armed by blk_add_timer.
    The timer is request_queue.timeout and timeout fn is blk_rq_timed_out_timer.
    static void blk_rq_timed_out_timer(struct timer_list *t)
    {
        struct request_queue *q = from_timer(q, t, timeout);
    
        kblockd_schedule_work(&q->timeout_work);
    }
    
    The main timeout work is executed in kworker context.
    There is a difference between blk-legacy and blk-mq.
    In blk-legacy, when the timer is armed, the request is added to request_queue.timeout_list,
    and when the request is completed, it is removed from that list:
    blk_requeue_request/blk_finish_request
      -> blk_delete_timer
    The blk_timeout_work will check the requests on request_queue.timeout_list.
    
    In blk-mq, the request_queue.timeout_list is not used any more; instead, it
    employs blk_mq_queue_tag_busy_iter, which uses the occupied
    driver tags to track the requests.
    
    static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
    {
        struct bt_iter_data *iter_data = data;
        struct blk_mq_hw_ctx *hctx = iter_data->hctx;
    
        struct blk_mq_tags *tags = hctx->tags;
    
        bool reserved = iter_data->reserved;
        struct request *rq;
    
        if (!reserved)
            bitnr += tags->nr_reserved_tags;
        rq = tags->rqs[bitnr];
    
        /*
         * We can hit rq == NULL here, because the tagging functions
         * test and set the bit before assining ->rqs[].
         */
        if (rq && rq->q == hctx->queue)
            iter_data->fn(hctx, rq, iter_data->data, reserved);
        return true;
    }
    
    When there is no io scheduler, a request always occupies a driver tag. If the lldd prevents new requests from entering, through blk_mq_quiesce_queue or other means, and request_queue.timeout has been armed, will the requests sitting in the blk-mq queues be expired ?
    Also, when a request is completed, we don't see a blk_delete_timer counterpart in __blk_mq_complete_request or __blk_mq_end_request.

    Another difference is the method used to handle the race between timeout completion and regular completion.
    blk-legacy employs blk_mark_rq_complete.
    void blk_complete_request(struct request *req)
    {
        if (unlikely(blk_should_fake_timeout(req->q)))
            return;
        if (!blk_mark_rq_complete(req))
            __blk_complete_request(req);
    }
    static void blk_rq_check_expired(struct request *rq, unsigned long *next_timeout,
                  unsigned int *next_set)
    {
        const unsigned long deadline = blk_rq_deadline(rq);
    
        if (time_after_eq(jiffies, deadline)) {
            list_del_init(&rq->timeout_list);
    
            /*
             * Check if we raced with end io completion
             */
            if (!blk_mark_rq_complete(rq))
                blk_rq_timed_out(rq);
        } else if (!*next_set || time_after(*next_timeout, deadline)) {
            *next_timeout = deadline;
            *next_set = 1;
        }
    }
    
    In blk-mq, after Tejun's "blk-mq: reimplement timeout handling" (https://lkml.org/lkml/2018/1/9/761), blk_mark_rq_complete has been discarded.
    rcu/srcu is employed to synchronize the timeout path and the regular completion path instead of atomic operations. In addition, it avoids the following scenario.
    blk_mq_check_expired
    ---
        deadline = READ_ONCE(rq->deadline);
    
    A delay may be introduced here by preemption, an interrupt or something else; during that window the rq is
    completed and freed, then allocated and reinitialized again by someone else.
    And we could time out the new instance here.
    
        if (time_after_eq(jiffies, deadline)) {
            if (!blk_mark_rq_complete(rq)) {
                blk_mq_rq_timed_out(rq, reserved);
            }
    ---
    
    After Tejun's commit, things look like this:
    blk_mq_check_expired
    ---
        /* read coherent snapshots of @rq->state_gen and @rq->deadline */
        while (true) {
            start = read_seqcount_begin(&rq->gstate_seq);
            gstate = READ_ONCE(rq->gstate);
            deadline = blk_rq_deadline(rq);
            if (!read_seqcount_retry(&rq->gstate_seq, start))
                break;
            cond_resched();
        }
    
    A delay may be introduced here by preemption, an interrupt or something else; during that window the rq is
    completed and freed, then allocated and reinitialized again by someone else.
    
        /* if in-flight && overdue, mark for abortion */
        if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
            time_after_eq(jiffies, deadline)) {
            blk_mq_rq_update_aborted_gstate(rq, gstate);
            data->nr_expired++;
            hctx->nr_expired++;
        } 
    ---
    static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
            struct request *rq, void *priv, bool reserved)
    {
    
        /*
         * We marked @rq->aborted_gstate and waited for RCU.  If there were
         * completions that we lost to, they would have finished and
         * updated @rq->gstate by now; otherwise, the completion path is
         * now guaranteed to see @rq->aborted_gstate and yield.  If
         * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
         */
    Note: the rcu/srcu synchronization happens between blk_mq_check_expired and
    blk_mq_terminate_expired.
    
        if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
            READ_ONCE(rq->gstate) == rq->aborted_gstate)
    
    There are two parts in the gstate: generation and state.
    When we save the gstate into aborted_gstate, the state is MQ_RQ_IN_FLIGHT.
    If the recycled new instance has not been started yet, the state will not match,
    because it is MQ_RQ_IDLE; if it has started, the generation will not match, because the
    generation part of gstate is incremented when the state switches to
    MQ_RQ_IN_FLIGHT.
    
            blk_mq_rq_timed_out(rq, reserved);
    }
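    A hedged userspace sketch of the gstate_seq idea: the writer bumps a sequence
    counter around updates of (gstate, deadline); readers retry until they observe an
    even, unchanged counter, which guarantees the two fields were read as one coherent
    snapshot. Plain C11 atomics stand in for the kernel's seqcount API, and the names
    are illustrative.
    #include <stdatomic.h>
    #include <stdio.h>

    struct fake_rq {
        atomic_uint   seq;        /* gstate_seq        */
        unsigned long gstate;     /* generation|state  */
        unsigned long deadline;
    };

    static void writer_update(struct fake_rq *rq, unsigned long g, unsigned long d)
    {
        atomic_fetch_add_explicit(&rq->seq, 1, memory_order_release); /* odd: update in progress */
        rq->gstate   = g;
        rq->deadline = d;
        atomic_fetch_add_explicit(&rq->seq, 1, memory_order_release); /* even: update done */
    }

    static void reader_snapshot(struct fake_rq *rq,
                                unsigned long *g, unsigned long *d)
    {
        unsigned int start;

        do {
            /* wait until no update is in flight (even counter) */
            do {
                start = atomic_load_explicit(&rq->seq, memory_order_acquire);
            } while (start & 1);

            *g = rq->gstate;
            *d = rq->deadline;
            /* retry if the writer ran while we were reading */
        } while (atomic_load_explicit(&rq->seq, memory_order_acquire) != start);
    }

    int main(void)
    {
        struct fake_rq rq;
        unsigned long g, d;

        atomic_init(&rq.seq, 0);
        rq.gstate = 0;
        rq.deadline = 0;

        writer_update(&rq, 0x11 /* pretend gen|IN_FLIGHT */, 12345);
        reader_snapshot(&rq, &g, &d);
        printf("gstate=%#lx deadline=%lu\n", g, d);
        return 0;
    }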
    

    blk-throttle

    Basis

                                   generic_make_request
                                           |
                                           V
       tg_A->sq->queued (qn_A_r_self (bio, bio, bio))    tg_B->sq->queued (qn_B_r_self (bio, bio, bio))
                                           |   
                                           V
                     tg_ABg->sq->queued (qn_ABg_r_self(bio, bio) qn_A_r_parent (bio), qn_B_r_parent (bio bio))
                                            |
                                            V
                                 td->sq->queued (qn_ABg_r_parent(bio))
                                            | 
                                            V                                 
                                    generic_make_request (td->dispatch_work context)
    
                                    bio (w/ BIO_THROTTLED) will not pass
                                    through blk-throttle again.
    
    qn  per-tg, contains throttled bios.
        Bios are dispatched qn by qn rather than bio by bio; otherwise one tg could
        fill up the budget and starve the others (throtl_pop_queued).
        There are two dimensions of qn:
        r/w: when dispatching, 75% reads, 25% writes (throtl_dispatch_tg)
        self/parent: during dispatching, some bios may be queued upwards to the parent's
        sq while others are not. The parent qn is used to contain the bios
        queued to the parent's sq; the self qn contains the others.
    
    sq  throtl_service_queue, per-tg or td
        construct the hierarchy, td->sq is the root node
        queued throl_qnode
        first_pending_disptime
        pending_timer, dispatch bios upwards to parent sq until td->sq, queue td dispatch_work
    
    tg  throtl_grp, per (blk-throt cgroup - request_queue)
        bps,iops limits, bytes, ios dispatched number
    
    td  throtl_data, per-request_queue
        queued[r/w] qn list; only the bios that have reached here can be issued.
        dispatch_work, generic_make_request
        limit_index (LOW/MAX)
    
    
    
    How to account the bps and iops ? 
                                current
                                   |
     tg->slice_start               v         tg->slice_end
              |-------|------|-------|------| ....
              |< - - - -   - - - - ->|
                         V
                    elapsed_rnd
    
    
              limit = tg_bps/iops_limit(tg, rw) * elapsed_rnd
    
    
    
    | - - - |  td->throtl_slice
    Refer to tg_with_in_bps_limit / tg_with_in_iops_limit
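    A hedged sketch of the accounting above: given the bytes already dispatched since
    slice_start, check whether one more bio fits into the budget that the rounded-up
    elapsed slice allows, and if not, estimate how long it has to wait. Jiffies are
    modeled as milliseconds and all names are illustrative; the real logic lives in
    tg_with_in_bps_limit.
    #include <stdbool.h>
    #include <stdio.h>

    #define THROTL_SLICE_MS 100              /* stands in for td->throtl_slice */

    static bool within_bps_limit(unsigned long long bps_limit,
                                 unsigned long long bytes_disp,
                                 unsigned long long bio_bytes,
                                 unsigned long slice_start_ms,
                                 unsigned long now_ms,
                                 unsigned long *wait_ms)
    {
        /* round the elapsed time up to a whole number of throtl slices */
        unsigned long elapsed = now_ms - slice_start_ms;
        unsigned long elapsed_rnd =
            ((elapsed + THROTL_SLICE_MS) / THROTL_SLICE_MS) * THROTL_SLICE_MS;

        unsigned long long budget = bps_limit * elapsed_rnd / 1000;

        if (bytes_disp + bio_bytes <= budget) {
            *wait_ms = 0;
            return true;                      /* charge and dispatch directly */
        }

        /* extra time needed for the budget to cover the overshoot */
        unsigned long long extra = bytes_disp + bio_bytes - budget;
        *wait_ms = (unsigned long)(extra * 1000 / bps_limit)
                   + (elapsed_rnd - elapsed);
        return false;
    }

    int main(void)
    {
        unsigned long wait;

        /* 1 MB/s limit, 80 KB already sent 50 ms into the slice, 64 KB bio */
        bool ok = within_bps_limit(1024 * 1024, 80 * 1024, 64 * 1024, 0, 50, &wait);
        printf("dispatch now: %s, wait %lums\n", ok ? "yes" : "no", wait);
        return 0;
    }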
    
    When the tg->bytes/io_disp is over the limit:
    blk_throtl_bio
      -> throtl_add_bio_tg
        -> set THROTL_TG_WAS_EMPTY when sq->nr_queued == 0
        -> throtl_qnode_add_bio(bio, qn, &sq->queued[rw]);
          -> add bio to qn, add qn to sq
          -> blkg_get(tg_to_blkg(qn->tg))
             throttled bio dispatching is an asynchronous event;
             we need a reference on the blkg to prevent it from being freed
             
        -> add tg to parent sq pending rb tree with tg->disptime as key
      if THROTL_TG_WAS_EMPTY is set
      -> tg_update_disptime
      
      next dispatch time will be calculated here through tg_may_dispatch
      
      -> throtl_schedule_next_dispatch(tg->service_queue.parent_sq, true);
        -> update_min_dispatch_time
          -> pick up the leftest node from the parent sq pending rb tree
             and update parent_sq->first_pending_disptime
          -> throtl_schedule_pending_timer
            -> schedule parent_sq pending_timer on first_pending_disptime
    
    Think of a case here:
    A bio is throttled and its dispatch time is 5 jiffies. What if a new bio comes in with a 3 jiffies dispatch time ?
    Why does every tg need a dispatch time ?
    
    bio size
            ^
            |  o - bio
            |
            |                o2
            |                |     o3
            |  o0   o1       |     |
            |   |   |        |     |
            +-----------------------------------------> time
                    t0       t1
    If we issue o2 at t0, the bps limit would be exceeded; we have to delay it to
    t1 so that the bps limit is complied with.
    
    
    However, what about the following case:
    
    bio size
            ^
            |  o - bio
            |
            |                o2 (planed)
            |                |     
            |  o0   o1o3     |    
            |   |   | |      |     
            +-----------------------------------------> time
                    t0       t1
    
    We have scheduled the parent_sq pending timer to t1 to dispatch o2. When o3 arrives
    at t0, the pending_timer would need to expire earlier to dispatch o3, otherwise o3
    is delayed. How is this case handled in blk-throtl ?

    There is no such issue,
    unless o3 has a higher priority than o2. What blk-throtl does here is just to
    limit the bps.
    In fact, blk-throtl maintains the queued bio lists of reads and writes separately, so write
    bios will not block read bios. And blk-throtl will try to dispatch 75% READS and
    25% WRITES; refer to throtl_dispatch_tg.
    
    We have illustrated the hierarchical structure of blk-throtl. Let's walk through the source code.
    submit path
    generic_make_request
      -> generic_make_request_checks
        -> blkcg_bio_issue_check
          -> blk_throtl_bio
    ---
        while (true) {
            if (tg->last_low_overflow_time[rw] == 0)
                tg->last_low_overflow_time[rw] = jiffies;
            throtl_downgrade_check(tg);
            throtl_upgrade_check(tg);
    /* throtl is FIFO - if bios are already queued, should queue */
            if (sq->nr_queued[rw])
                break;
    
            /* if above limits, break to queue */
            if (!tg_may_dispatch(tg, bio, NULL)) {
                tg->last_low_overflow_time[rw] = jiffies;
                if (throtl_can_upgrade(td, tg)) {
                    throtl_upgrade_state(td);
                    goto again;
                }
                break;
            }
    
            /* within limits, let's charge and dispatch directly */
            throtl_charge_bio(tg, bio);
    
            /*
             * We need to trim slice even when bios are not being queued
             * otherwise it might happen that a bio is not queued for
             * a long time and slice keeps on extending and trim is not
             * called for a long time. Now if limits are reduced suddenly
             * we take into account all the IO dispatched so far at new
             * low rate and * newly queued IO gets a really long dispatch
             * time.
             *
             * So keep on trimming slice even if bio is not queued.
             */ 
            throtl_trim_slice(tg, rw);
    
            /*
             * @bio passed through this layer without being throttled.
             * Climb up the ladder.  If we''re already at the top, it
             * can be executed directly.
             */
            qn = &tg->qnode_on_parent[rw];
            sq = sq->parent_sq;    // check limit upward
            tg = sq_to_tg(sq);
            if (!tg)
                goto out_unlock;
        }
    ---
    
    Dispatch path:
    
    static void throtl_pending_timer_fn(struct timer_list *t)
    {
        ...
    again:
        parent_sq = sq->parent_sq;
        dispatched = false;
    
        while (true) {
            throtl_log(sq, "dispatch nr_queued=%u read=%u write=%u",
                   sq->nr_queued[READ] + sq->nr_queued[WRITE],
                   sq->nr_queued[READ], sq->nr_queued[WRITE]);
    
            ret = throtl_select_dispatch(sq);
              -> throtl_dispatch_tg // if tg_may_dispatch
                -> tg_dispatch_one_bio
                  -> throtl_pop_queued
                  -> throtl_charge_bio
                  -> add to sq of parent tg or td
            if (ret) {
                throtl_log(sq, "bios disp=%u", ret);
                dispatched = true;
            }
    
        there may still be queued bios in the tg
            if (throtl_schedule_next_dispatch(sq, false))
                break;
    
            /* this dispatch windows is still open, relax and repeat */
            spin_unlock_irq(q->queue_lock);
        cpu_relax(); //give others a chance to get in.
        the queued spinlock ensures the waiters get this lock in turn.
            spin_lock_irq(q->queue_lock);
        }
    
        if (!dispatched)
            goto out_unlock;
    
        if (parent_sq) {
            /* @parent_sq is another throl_grp, propagate dispatch */
            if (tg->flags & THROTL_TG_WAS_EMPTY) {
                tg_update_disptime(tg);
                if (!throtl_schedule_next_dispatch(parent_sq, false)) {
                    /* window is already open, repeat dispatching */
                    sq = parent_sq;
                    tg = sq_to_tg(sq);
                    goto again;
                }
            }
        } else {
            /* reached the top-level, queue issueing */
            queue_work(kthrotld_workqueue, &td->dispatch_work);
        }
    out_unlock:
        spin_unlock_irq(q->queue_lock);
    }
    
    
    


    low limit

    The io.low limit is only available in cgroup2. A cgroup with an io.max limit will never dispatch more IO than its max limit, but that cannot ensure the cgroup always gets an appropriate bps or iops. For example:

    Tasks in cgroup_read have a very high read workload, and tasks in cgroup_write
    have a very high write workload. They both issue requests to the same disk with wbt
    enabled. The write operations will be limited due to wbt, and IO performance in
    cgroup_write will be very poor while cgroup_read keeps issuing read operations.

    Neither cgroup exceeds its io.max, but cgroup_write gets very poor
    performance. This is not fair for cgroup_write.
    
    Or another example from https://lwn.net/Articles/709474/
    
    An example usage is we have a high prio cgroup with high 'low' limit and a low
    prio cgroup with low 'low' limit. If the high prio cgroup isn't running, the low
    prio can run above its 'low' limit, so we don't waste the bandwidth. When the
    high prio cgroup runs and is below its 'low' limit, low prio cgroup will run
    under its 'low' limit. This will protect high prio cgroup to get more
    resources.
    
    The final goal is to keep the bps/iops between io.low and io.max.
    There are two questions that need to be figured out.
    When to switch to the io.low limit
    Related variables in tg:
    
    • last_check_time
    • last_bytes/io_disp[R/W] (throtl_charge_bio)
    • last_low_overflow_time[R/W]
    Check the bps or iops through last_bytes/io_disp / (jiffies - last_check_time).
    If the result > io.low limit, set last_low_overflow_time, which means the bps/iops
    is higher than io.low during the last period. If jiffies >= tg->last_low_overflow_time
    + td->throtl_slice, we say the io.low limit is reached. This is done by
    throtl_downgrade_check. throtl_downgrade_state switches the limit to LOW.
    static void throtl_downgrade_state(struct throtl_data *td, int new)
    {
        td->scale /= 2;

        throtl_log(&td->service_queue, "downgrade, scale %d", td->scale);
        if (td->scale) {
            td->low_upgrade_time = jiffies - td->scale * td->throtl_slice;
            return;
        }

        td->limit_index = new;
        td->low_downgrade_time = jiffies;
    }
    After switching to the io.low limit, when do we get back to io.max ?
    When switched to the io.low limit,
    blk_throtl_bio -> tg_may_dispatch -> tg_with_in_bps_limit -> tg_bps_limit will
    return the io.low limit through tg->bps[rw][td->limit_index],
    and then more bios will be throttled and queued.
    
    
    last_low_overflow_time (bps/iops is higher than the limit) is updated in the following places:
    
                  if limit_index == MAX
     ^                throttled and queued,   blk_throtl_bio updates last_low_overflow_time
     |                
     |           ----------------------------------------------   LIMIT_MAX
     |
     |           if limit_index == MAX
     |              charge and dispatch,    throtl_downgrade_check updates last_low_overflow_time
     |disp       if limit_index == LOW
     |bps/          throttled and queue,    blk_throtl_bio updates last_low_overflow_time
     |iops       
     |
     |
     |            ----------------------------------------------  LIMIT_LOW
     |             if limit_index == MAX && time_after(now, last_low_overflow_time + throtl_slice)
     |               downgrade
     |             if limit_index == LOW
     |                 charge and dispatch
     |             if limit_index == LOW && time_after(now, last_low_overflow_time + throtl_slice)
                    upgrade
    position:
    
  • throtl_downgrade_check (only makes sense when LIMIT_MAX)
  • tg_may_dispatch returns false, which indicates the bps/iops is above the limit, no matter MAX or LOW
  • before queueing a throttled bio, tg_may_dispatch may be skipped because sq->nr_queued > 0

    Are the 2nd and 3rd cases necessary ?
    last_low_overflow_time indicates the bps/iops is above the low limit during the past
    period of time. For the 2nd and 3rd cases, if the limit_index is MAX, beyond question,
    the bps/iops is above the low limit, because the blk-throtl pending timer ensures the
    dispatching bps/iops is equal to the max limit. However, if the limit_index is LOW, a bio
    being throttled and queued indicates that the submit bps/iops is above the low limit,
    not the dispatch bps/iops, which is the one ensured to be equal to the low limit.

             submit bps/iops is                       dispatch bps/iops is
             above limit                              equal to limit

        vfs                 push                 pop    sq->pending_timer
        blk_throtl_bio     ----->  sq->queued[] ----->  throtl_dispatch_tg
        sq->nr_queued > 0                                 tg_dispatch_one_bio
        throtl_add_bio_tg                                 (charge, queue up, trim)

    The condition to switch to MAX:
    throtl_upgrade_check
    --
        if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
            return;

        tg->last_check_time = now;
    --
    ...
      -> throtl_tg_can_upgrade
        -> time_after_eq(jiffies,
               tg_last_low_overflow_time(tg) + tg->td->throtl_slice) &&
           throtl_tg_is_idle(tg))
           ^^^^
           Should it be a '||' ?

    throtl_upgrade_state does the real work.
    static void throtl_upgrade_state(struct throtl_data *td)
    {
        struct cgroup_subsys_state *pos_css;
        struct blkcg_gq *blkg;

        throtl_log(&td->service_queue, "upgrade to max");
        td->limit_index = LIMIT_MAX;
        td->low_upgrade_time = jiffies;
        td->scale = 0;
        rcu_read_lock();
        blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
            struct throtl_grp *tg = blkg_to_tg(blkg);
            struct throtl_service_queue *sq = &tg->service_queue;

            tg->disptime = jiffies - 1;      //force this tg to be dispatched
            throtl_select_dispatch(sq);      //Move the bios of child tgs upward
            throtl_schedule_next_dispatch(sq, true);
        }
        rcu_read_unlock();
        //Dispatch !!!
        throtl_select_dispatch(&td->service_queue);
        throtl_schedule_next_dispatch(&td->service_queue, true);
        queue_work(kthrotld_workqueue, &td->dispatch_work);
    }

    After the io limit upgrades, blk-throtl tries to ramp dispatching up smoothly.
    Let's look at tg_bps_limit and throtl_adjusted_limit.
    ---
        if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_LOW] &&
            tg->bps[rw][LIMIT_LOW] != tg->bps[rw][LIMIT_MAX]) {
            uint64_t adjusted;

            adjusted = throtl_adjusted_limit(tg->bps[rw][LIMIT_LOW], td);
            ret = min(tg->bps[rw][LIMIT_MAX], adjusted);
        }
    ---
    static uint64_t throtl_adjusted_limit(uint64_t low, struct throtl_data *td)
    {
        /* arbitrary value to avoid too big scale */
        if (td->scale < 4096 &&
            time_after_eq(jiffies,
                          td->low_upgrade_time + td->scale * td->throtl_slice))
            td->scale = (jiffies - td->low_upgrade_time) / td->throtl_slice;

        return low + (low >> 1) * td->scale;
    }

    throtl_adjusted_limit will re-balance the bandwidth between tgs.
    throtl_upgrade_state has updated td->scale and td->low_upgrade_time,
    so the limit will not reach io.max immediately after throtl_upgrade_state.
    The actual limit is:

        limit = low + (low >> 1) * (now - td->low_upgrade_time)/td->throtl_slice

    The tg that has the higher low limit will get more bandwidth because its limit
    grows faster; this should be the core idea of io.low.
  • When a cgroup is free or even idle, it does stay below its io.low limit, but that should not count as being starved. How do we tell?
    Quote from comment of throtl_tg_is_idle:
    
    cgroup is idle if:
    - single idle is too long, longer than a fixed value (in case user
      configure a too big threshold) or 4 times of idletime threshold
    - average think time is more than threshold
    - IO latency is largely below threshold
    
    
    Think time
    The interval between the completion of the previous IO and the submission of the
    next IO. blk_throtl_bio_endio records the completion time in tg->last_finish_time.
    Then blk_throtl_bio -> blk_throtl_update_idletime calculates the average think time.

    static void blk_throtl_update_idletime(struct throtl_grp *tg)
    {
        unsigned long now = ktime_get_ns() >> 10;
        unsigned long last_finish_time = tg->last_finish_time;

        if (now <= last_finish_time || last_finish_time == 0 ||
            last_finish_time == tg->checked_last_finish_time)
            return;

        tg->avg_idletime = (tg->avg_idletime * 7 + now - last_finish_time) >> 3;
        tg->checked_last_finish_time = last_finish_time;
    }

    Latency
    The latency here is the interval between issuing a request to the device and its
    completion, so it reflects the processing capability of the storage device. If a
    cgroup's IO latency is below the IO latency threshold, the cgroup is being handled
    fairly by the device.

    My question is: if a cgroup is below its low limit but its IO latency is acceptable,
    we could say this cgroup is served fairly by the device, but not served fairly by
    the block layer, right?

    Commit comment of b9147dd (blk-throttle: add a mechanism to estimate IO latency):

    User configures latency target, but the latency threshold for each request size
    isn't fixed. For a SSD, the IO latency highly depends on request size. To calculate
    latency threshold, we sample some data, eg, average latency for request size 4k,
    8k, 16k, 32k .. 1M. The latency threshold of each request size will be the sample
    latency (I'll call it base latency) plus latency target. For example, the base
    latency for request size 4k is 80us and user configures latency target 60us. The
    4k latency threshold will be 80 + 60 = 140us.

    To sample data, we calculate the order base 2 of rounded up IO sectors. If the IO
    size is bigger than 1M, it will be accounted as 1M. Since the calculation does
    round up, the base latency will be slightly smaller than actual value. Also if
    there isn't any IO dispatched for a specific IO size, we will use the base latency
    of smaller IO size for this IO size.

    But we shouldn't sample data at any time. The base latency is supposed to be
    latency where disk isn't congested, because we use latency threshold to schedule
    IOs between cgroups. If disk is congested, the latency is higher, using it for
    scheduling is meaningless. Hence we only do the sampling when block throttling is
    in the LOW limit, with assumption disk isn't congested in such state. If the
    assumption isn't true, eg, low limit is too high, calculated latency threshold
    will be higher.

    Hard disk is completely different. Latency depends on spindle seek instead of
    request size. Currently this feature is SSD only, we probably can use a fixed
    threshold like 4ms for hard disk though.

    td keeps an average latency for each request size separately; every tg has its own
    latency_target, IOW, a tolerance. For an SSD, when td's average latency is low, we
    can say the device is relatively relaxed. This explains the '&&' with
    throtl_tg_is_idle: the upgrade also requires the device itself to be idle enough.

    The sample collection is hooked in blk_stat_add.

    blk_stat_add  // the latency here is the interval between blk_mq_start_request
                  // and __blk_mq_complete_request
      -> blk_throtl_stat_add
        -> throtl_track_latency

    static void throtl_track_latency(struct throtl_data *td, sector_t size,
                     int op, unsigned long time)
    {
        struct latency_bucket *latency;
        int index;

        if (!td || td->limit_index != LIMIT_LOW ||
            !(op == REQ_OP_READ || op == REQ_OP_WRITE) ||
            !blk_queue_nonrot(td->queue))
            // We assume there is no congestion when LIMIT_LOW,
            // and the latency makes sense only when the device is not congested.
            return;

        index = request_bucket_index(size);

        latency = get_cpu_ptr(td->latency_buckets[op]);
        latency[index].total_latency += time;
        latency[index].samples++;
        put_cpu_ptr(td->latency_buckets[op]);
    }
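
    As promised above, here is a standalone userspace sketch (not kernel code) of how
    the io.low limit ramps up after throtl_upgrade_state. The numbers are assumptions:
    low = 100 MB/s, max = 400 MB/s, throtl_slice = 100 ms.
    ---
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t low = 100, max = 400;      /* MB/s, hypothetical limits */
        unsigned int throtl_slice = 100;    /* ms, hypothetical slice */

        for (unsigned int elapsed = 0; elapsed <= 700; elapsed += 100) {
            /* td->scale = (jiffies - td->low_upgrade_time) / td->throtl_slice */
            unsigned int scale = elapsed / throtl_slice;
            /* throtl_adjusted_limit(): low + (low >> 1) * scale, capped at io.max */
            uint64_t adjusted = low + (low >> 1) * scale;
            uint64_t limit = adjusted < max ? adjusted : max;

            printf("%4u ms after upgrade: limit = %3llu MB/s\n",
                   elapsed, (unsigned long long)limit);
        }
        return 0;
    }
    ---
    The output climbs 100, 150, 200, ... until it hits io.max, which is exactly the
    "grow faster if your low limit is higher" behaviour described above.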

    bsg

    The Linux sg driver is an upper-level SCSI subsystem device driver that is used primarily to handle devices _not_ covered by the other upper
    level drivers: sd (disks), st (tapes) and sr (CDROMs and DVDs). The sg driver is used for enclosure management, cd writers,
    applications that read cd audio digitally and scanners. Sg can also be used for less usual tasks performed on disks, tapes and cdroms.
    Sg is a character device driver which, in some contexts, gives it advantages over block device drivers such as sd and sr. The interface of sg
    is at the level of SCSI command requests and their associated responses.
    
    From about Linux kernel 2.6.24, there is an alternate SCSI pass-through driver called "bsg" (block SCSI generic driver). The bsg driver has
    device names of the form /dev/bsg/0:1:2:3 and supports the SG_IO ioctl with the sg version 3 interface. The bsg driver also supports the sg
    version 4 interface which at this time the sg driver does not. Amongst other improvements the sg version 4 interface supports SCSI bidirectional commands.
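    
    For reference, a hedged userspace sketch of driving bsg through the sg v4 SG_IO
    ioctl (a 6-byte INQUIRY). The /dev/bsg/0:0:0:0 node name is an assumption and
    error handling is minimal.
    ---
    #include <fcntl.h>
    #include <linux/bsg.h>
    #include <scsi/sg.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 };  /* INQUIRY, 96 bytes */
        unsigned char inq[96], sense[32];
        struct sg_io_v4 hdr;

        int fd = open("/dev/bsg/0:0:0:0", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        memset(&hdr, 0, sizeof(hdr));
        hdr.guard = 'Q';                         /* marks the sg v4 interface */
        hdr.protocol = BSG_PROTOCOL_SCSI;
        hdr.subprotocol = BSG_SUB_PROTOCOL_SCSI_CMD;
        hdr.request_len = sizeof(cdb);
        hdr.request = (uintptr_t)cdb;
        hdr.din_xfer_len = sizeof(inq);          /* data-in: device -> host */
        hdr.din_xferp = (uintptr_t)inq;          /* mapped by blk_rq_map_user */
        hdr.max_response_len = sizeof(sense);
        hdr.response = (uintptr_t)sense;
        hdr.timeout = 10000;                     /* ms */

        if (ioctl(fd, SG_IO, &hdr) < 0)
            perror("SG_IO");
        else
            printf("vendor: %.8s\n", (const char *)inq + 8);

        close(fd);
        return 0;
    }
    ---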
    
    How does it work ?
    • setup
      bsg_setup_queue
      ---
      
          // A new request_queue
      
          q = blk_alloc_queue(GFP_KERNEL);
          if (!q)
              return ERR_PTR(-ENOMEM);
          q->cmd_size = sizeof(struct bsg_job) + dd_job_size;
          q->init_rq_fn = bsg_init_rq;
          q->exit_rq_fn = bsg_exit_rq;
          q->initialize_rq_fn = bsg_initialize_rq;
      
          q->request_fn = bsg_request_fn;
      
      
          ret = blk_init_allocated_queue(q);
          if (ret)
              goto out_cleanup_queue;
      
          q->queuedata = dev;
          q->bsg_job_fn = job_fn;
          blk_queue_flag_set(QUEUE_FLAG_BIDI, q);
          blk_queue_softirq_done(q, bsg_softirq_done);
          blk_queue_rq_timeout(q, BLK_DEFAULT_SG_TIMEOUT);
      
          ret = bsg_register_queue(q, dev, name, &bsg_transport_ops, release);
      ---
      
    • issue request
      take write as an example:
      bsg_write
        -> __bsg_write
          -> bsg_map_hdr
            -> blk_get_request
            -> q->bsg_dev.ops->fill_hdr
            -> blk_rq_map_user //hdr->dout_xferp points to userland buffer
              -> blk_rq_map_user_iov // userland buffer will be mapped directly for zero copy I/O
         -> bsg_add_command
           -> blk_execute_rq_nowait
      
      bsg_request_fn
        -> blk_fetch_request
          -> blk_peek_request
          -> blk_start_request
        -> bsg_prepare_job // kref_init(&job->kref)
        -> q->bsg_job_fn
      
    • complete request
      bsg_softirq_done
        -> bsg_job_put
          -> kref_put(&job->kref, bsg_teardown_job)
      bsg_teardown_job
        -> blk_end_request_all
       
      This is a very interesting method: the bsg request is not completed until
      job->kref drops to zero, which closes the race between the blk-timeout path
      and the normal completion path.
      Look at the following code:
      fc_bsg_job_timeout
      ---
          inflight = bsg_job_get(job);
      
          if (inflight && i->f->bsg_timeout) {
              /* call LLDD to abort the i/o as it has timed out */
              err = i->f->bsg_timeout(job);
              if (err == -EAGAIN) {
                  bsg_job_put(job);
                  return BLK_EH_RESET_TIMER;
              } else if (err)
                  printk(KERN_ERR "ERROR: FC BSG request timeout - LLD "
                      "abort failed with status %d\n", err);
          }
      
          /* the blk_end_sync_io() doesn't check the error */
          if (!inflight)
              return BLK_EH_NOT_HANDLED;
          else
              return BLK_EH_HANDLED;
      ---
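      
      Below is a minimal userspace sketch (my own illustration, not kernel code) of
      this refcount-guarded completion pattern: both the completion path and the
      timeout path drop a reference, and only the final put performs the teardown,
      so the request cannot be completed twice.
      ---
      #include <stdatomic.h>
      #include <stdio.h>

      struct job {
          atomic_int ref;                   /* plays the role of job->kref */
      };

      static void job_teardown(struct job *j)   /* plays the role of bsg_teardown_job() */
      {
          (void)j;
          printf("last reference dropped: complete the request\n");
      }

      static void job_get(struct job *j)
      {
          atomic_fetch_add(&j->ref, 1);
      }

      static void job_put(struct job *j)        /* plays the role of bsg_job_put() */
      {
          if (atomic_fetch_sub(&j->ref, 1) == 1)
              job_teardown(j);
      }

      int main(void)
      {
          struct job j;

          atomic_init(&j.ref, 1);   /* bsg_prepare_job(): kref_init() */

          job_get(&j);              /* timeout path takes a reference before aborting */
          job_put(&j);              /* normal completion races in: no teardown yet */
          job_put(&j);              /* timeout path finishes: teardown happens once */
          return 0;
      }
      ---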
      
    bidi request
    bidi aka bidirectional commands: such a command carries both a data-out and a
    data-in transfer at the same time.
    Look at bsg_map_hdr
    ---
        if (hdr->dout_xfer_len && hdr->din_xfer_len) {
            if (!test_bit(QUEUE_FLAG_BIDI, &q->queue_flags)) {
                ret = -EOPNOTSUPP;
                goto out;
            }
    
            next_rq = blk_get_request(q, REQ_OP_SCSI_IN, GFP_KERNEL);
            if (IS_ERR(next_rq)) {
                ret = PTR_ERR(next_rq);
                goto out;
            }
    
            rq->next_rq = next_rq;
    
            ret = blk_rq_map_user(q, next_rq, NULL, uptr64(hdr->din_xferp),
                           hdr->din_xfer_len, GFP_KERNEL);
            if (ret)
                goto out_free_nextrq;
        }
    ---
    
    

    direct_IO

    What happens when we do direct IO on a block device?

    __generic_file_write_iter
    ---
        if (iocb->ki_flags & IOCB_DIRECT) {
            loff_t pos, endbyte;
    
            written = generic_file_direct_write(iocb, from);
            if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
                goto out;
    
            // if direct_IO doesn't complete all of the IO, fallback to buffered IO.
    
            status = generic_perform_write(file, from, pos = iocb->ki_pos);
            ...
    
            /*
             * We need to ensure that the page cache pages are written to
             * disk and invalidated to preserve the expected O_DIRECT
             * semantics.
             */
    
            endbyte = pos + status - 1;
            err = filemap_write_and_wait_range(mapping, pos, endbyte);
            if (err == 0) {
                iocb->ki_pos = endbyte + 1;
                written += status;
                invalidate_mapping_pages(mapping,
                             pos >> PAGE_SHIFT,
                             endbyte >> PAGE_SHIFT);
            } else {
                /*
                 * We don't know how much we wrote, so just return
                 * the number of bytes which were direct-written
                 */
            }
        }
    ---
    
    generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
    {
        ...
        if (iocb->ki_flags & IOCB_NOWAIT) {
            /* If there are pages to writeback, return */
            if (filemap_range_has_page(inode->i_mapping, pos,
                           pos + iov_iter_count(from)))
                return -EAGAIN;
        } else {
    
            written = filemap_write_and_wait_range(mapping, pos,
                                pos + write_len - 1);
    
            if (written)
                goto out;
        }
    
        /*
         * After a write we want buffered reads to be sure to go to disk to get
         * the new data.  We invalidate clean cached page from the region we're
         * about to write.  We do this *before* the write so that we can return
         * without clobbering -EIOCBQUEUED from ->direct_IO().
         */
    
        written = invalidate_inode_pages2_range(mapping,
                        pos >> PAGE_SHIFT, end);
        ...
        written = mapping->a_ops->direct_IO(iocb, from);
        ...
        if (written > 0) {
            pos += written;
            write_len -= written;
    
            // Interesting thing here: if the file is expanded by the direct IO,
            // we have to update the inode size ourselves.
            if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
                i_size_write(inode, pos);
                mark_inode_dirty(inode);
            }
            iocb->ki_pos = pos;
        }
    
        iov_iter_revert(from, write_len - iov_iter_count(from));
    out:
        return written;
    }
    
    blkdev_direct_IO
      -> __blkdev_direct_IO_simple // Let's look at the simpler case.
    ---
        ...
        struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs, *bvec;
        ...
        if (nr_pages <= DIO_INLINE_BIO_VECS)
            vecs = inline_vecs;
        else {
            vecs = kmalloc_array(nr_pages, sizeof(struct bio_vec),
                         GFP_KERNEL);
            if (!vecs)
                return -ENOMEM;
        }
    
        bio_init(&bio, vecs, nr_pages);
        bio_set_dev(&bio, bdev);
        bio.bi_iter.bi_sector = pos >> 9;
        bio.bi_write_hint = iocb->ki_hint;
        bio.bi_private = current;
        bio.bi_end_io = blkdev_bio_end_io_simple;
        bio.bi_ioprio = iocb->ki_ioprio;
    
        // The most important thing here is to fill the bi_io_vec
                                    /
                                    | bv_page
        bio->bi_io_vec [ bio_vec ] <  bv_len
                       [ bio_vec ]  | bv_offset
                       [ bio_vec ]  \
                       ...
        bio_iov_iter_get_pages
          -> iov_iter_get_pages
            -> get_user_pages_fast
        It gets and pins the pages behind the userland buffers.
        These pages are then handed to the block layer directly,
        so we can say this is zero-copy.
        Note: get_user_pages_fast does not guarantee that all of the requested pages
              are grabbed and pinned.
    
        ret = bio_iov_iter_get_pages(&bio, iter);
        if (unlikely(ret))
            return ret;
        ret = bio.bi_iter.bi_size;
    
        if (iov_iter_rw(iter) == READ) {
            bio.bi_opf = REQ_OP_READ;
            if (iter_is_iovec(iter))
                should_dirty = true;
        } else {
            bio.bi_opf = dio_bio_write_op(iocb);
            task_io_account_write(ret);
        }
    
        qc = submit_bio(&bio);
        for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            if (!READ_ONCE(bio.bi_private))
                break;
            if (!(iocb->ki_flags & IOCB_HIPRI) ||
                !blk_poll(bdev_get_queue(bdev), qc))
                io_schedule();
        }
    
        // We sleep in the loop above waiting for the completion;
        // blkdev_bio_end_io_simple wakes us up.
    
        __set_current_state(TASK_RUNNING);
    
        bio_for_each_segment_all(bvec, &bio, i) {
            if (should_dirty && !PageCompound(bvec->bv_page))
                set_page_dirty_lock(bvec->bv_page);
            put_page(bvec->bv_page);
        }
    
        if (vecs != inline_vecs)
            kfree(vecs);
    
        if (unlikely(bio.bi_status))
            ret = blk_status_to_errno(bio.bi_status);
    
        bio_uninit(&bio);
    ---
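    
    A userspace sketch of issuing direct IO against a block device, which ends up in
    blkdev_direct_IO via the path above. /dev/nvme0n1 and the 4096-byte alignment are
    assumptions; O_DIRECT generally requires the buffer, offset and length to be
    aligned to the logical block size.
    ---
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        void *buf;
        int fd;

        /* O_DIRECT needs an aligned buffer; plain malloc() is usually not enough. */
        if (posix_memalign(&buf, 4096, len))
            return 1;

        fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* This read bypasses the page cache and goes through blkdev_direct_IO. */
        if (pread(fd, buf, len, 0) != (ssize_t)len)
            perror("pread");

        close(fd);
        free(buf);
        return 0;
    }
    ---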
    

    blk RPM

    RPM

    Traditional suspend/resume

    Runtime suspend/resume
    * Once the subsystem-level suspend callback (or the driver suspend callback, 
      if invoked directly) has completed successfully for the given device, the PM 
      core regards the device as suspended, which need not mean that it has been 
      put into a low power state.  It is supposed to mean, however, that the 
      device will not process data and will not communicate with the CPU(s) and 
      RAM until the appropriate resume callback is executed for it.  The runtime 
      PM status of a device after successful execution of the suspend callback is 
      'suspended'.
    

    Hooks in BLK

    Hooks in blk-legacy

    __elv_add_request
      -> blk_pm_add_request
    ---
        if    q->dev // support RPM
       && !(rq->rq_flags & RQF_PM) // not a PM command
           && q->nr_pending++ == 0
           && (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
    
           pm_request_resume(q->dev) // start resume
    ---
    elv_requeue_request
      -> blk_pm_requeue_request
        ---
        if (rq->q->dev && !(rq->rq_flags & RQF_PM))
            rq->q->nr_pending--;
        ---
      -> __elv_add_request()//ELEVATOR_INSERT_REQUEUE
    
    __blk_put_request
      -> blk_pm_put_request
    ---
        if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
            pm_runtime_mark_last_busy(rq->q->dev);
    ---
    
    blk_peek_request
      -> elv_next_request
        -> iterate q->queue_head
           if blk_pm_allow_request
             return it
        ---
        switch (rq->q->rpm_status) {
        case RPM_RESUMING:
        case RPM_SUSPENDING:
            return rq->rq_flags & RQF_PM;
        case RPM_SUSPENDED:
            return false;
        }
    
        return true;
        ---
    
    Don't process normal requests when queue is suspended
    or in the process of suspending/resuming
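    
    These hooks only take effect when q->dev is set. A hedged sketch of how a driver
    could opt its request_queue into runtime PM, roughly what the SCSI sd driver does
    at probe time; the 5 s autosuspend delay is an arbitrary example, and in practice
    userspace still has to write "auto" to the device's power/control attribute.
    ---
    #include <linux/blkdev.h>
    #include <linux/pm_runtime.h>

    static void example_enable_rpm(struct request_queue *q, struct device *dev)
    {
        /* Associate the queue with the device: sets q->dev and q->rpm_status,
         * which is what the blk_pm_* hooks above key off. */
        blk_pm_runtime_init(q, dev);

        /* Suspend only after the queue has been idle for a while; the timer
         * is re-armed from dev->power.last_busy (see blk_pm_put_request). */
        pm_runtime_set_autosuspend_delay(dev, 5000);
        pm_runtime_use_autosuspend(dev);
    }
    ---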
    
    

    Work process

    The normal runtime PM flow in the block layer is:

            blk_pre_runtime_suspend
              if q->nr_pending is zero
                 set q->rpm_status to RPM_SUSPENDING
                   |
                   v
            sdev_runtime_suspend
              -> pm->runtime_suspend
                   |
                   v
            blk_post_runtime_suspend
              -> set state to RPM_SUSPENDED
    
    When a new request is added:
            __elv_add_request
              -> blk_pm_add_request
              ---
        if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
            (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
            pm_request_resume(q->dev);
              ---
    The resume process will be started here.
    
    Before the resume is completed, requests will not be issued to the LLDD.
    
            blk_peek_request
              -> elv_next_request
              ---
            list_for_each_entry(rq, &q->queue_head, queuelist) {
                if (blk_pm_allow_request(rq))
                    return rq;
              ---
    
    During the process of pm runtime resuming:
            blk_pre_runtime_resume
              -> set rpm_status to RPM_RESUMING
            pm->runtime_resume
            blk_post_runtime_resume
            ---
            q->rpm_status = RPM_ACTIVE;
            __blk_run_queue(q);
            pm_runtime_mark_last_busy(q->dev);
            pm_request_autosuspend(q->dev);
            ---
    
    rpm_suspend // if RPM_AUTO
      -> pm_runtime_autosuspend_expiration
        -> last_busy = READ_ONCE(dev->power.last_busy);
    
        It checks whether the device has been idle for some time. If yes,
        the suspend process proceeds; otherwise the suspend_timer is set up.

        The check depends on dev->power.last_busy, which is updated in several
        places in the blk-legacy layer; the most important one is
        blk_pm_put_request.
    
    
    pm_suspend_timer_fn
    ---
        if (expires > 0 && !time_after(expires, jiffies)) {
            dev->power.timer_expires = 0;
            rpm_suspend(dev, dev->power.timer_autosuspends ?
                (RPM_ASYNC | RPM_AUTO) : RPM_ASYNC);
        }
    ---
    

    RPM Core

    pm_runtime_put
      -> __pm_runtime_idle //RPM_GET_PUT | RPM_ASYNC
        ---
        if (rpmflags & RPM_GET_PUT) {
            if (!atomic_dec_and_test(&dev->power.usage_count))
                return 0;
        }
    
        might_sleep_if(!(rpmflags & RPM_ASYNC) && !dev->power.irq_safe);
    
        spin_lock_irqsave(&dev->power.lock, flags); 
        //This spinlock will serialize all the things
        retval = rpm_idle(dev, rpmflags);
        spin_unlock_irqrestore(&dev->power.lock, flags);
        ---
    
    rpm_idle
    ---
        ...
        callback = RPM_GET_CALLBACK(dev, runtime_idle);
    
        if (callback)
            retval = __rpm_callback(callback, dev);
    
        // __rpm_callback will unlock the dev->power.lock before invokes the
        // driver's callback.
    
        ...
        return retval ? retval : rpm_suspend(dev, rpmflags | RPM_AUTO);
    ---
    scsi_runtime_idle always returns -EBUSY.
    Let's then look at rpm_suspend.
    ---
     repeat:
        retval = rpm_check_suspend_allowed(dev);
          -> if dev->power.runtime_status == RPM_SUSPENDED, return 1
        ...
        if (retval)
            goto out;
    
        ...
        /* Other scheduled or pending requests need to be canceled. */
        pm_runtime_cancel_pending(dev);
    
        if (dev->power.runtime_status == RPM_SUSPENDING) {
            DEFINE_WAIT(wait);
            ...
    
            /* Wait for the other suspend running in parallel with us. */
    
            for (;;) {
                prepare_to_wait(&dev->power.wait_queue, &wait,
                        TASK_UNINTERRUPTIBLE);
                if (dev->power.runtime_status != RPM_SUSPENDING)
                    break;
    
                spin_unlock_irq(&dev->power.lock);
    
                schedule();
    
                spin_lock_irq(&dev->power.lock);
            }
            finish_wait(&dev->power.wait_queue, &wait);
            goto repeat;
        }
    
        __update_runtime_status(dev, RPM_SUSPENDING);
    
        callback = RPM_GET_CALLBACK(dev, runtime_suspend);
    
        dev_pm_enable_wake_irq_check(dev, true);
        retval = rpm_callback(callback, dev);
        if (retval)
            goto fail;
    
     no_callback:
    
        __update_runtime_status(dev, RPM_SUSPENDED);
    
        pm_runtime_deactivate_timer(dev);
    
        if (dev->parent) {
            parent = dev->parent;
            atomic_add_unless(&parent->power.child_count, -1, 0);
        }
        wake_up_all(&dev->power.wait_queue);
    
    
    ---
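    
    For reference, a hedged sketch of the usual driver-side get/put pattern that
    drives the rpm_idle/rpm_suspend paths above; example_do_io() is a placeholder
    for the real work.
    ---
    #include <linux/pm_runtime.h>

    static int example_do_io(struct device *dev)
    {
        /* placeholder for the real work submitted to the device */
        return 0;
    }

    static int example_issue(struct device *dev)
    {
        int ret;

        /* Resume the device (or bump the usage count if already active). */
        ret = pm_runtime_get_sync(dev);
        if (ret < 0) {
            pm_runtime_put_noidle(dev);
            return ret;
        }

        ret = example_do_io(dev);

        /* Drop the reference; the autosuspend timer restarts from last_busy. */
        pm_runtime_mark_last_busy(dev);
        pm_runtime_put_autosuspend(dev);
        return ret;
    }
    ---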
    

    blk and hardware

    dma alignment

    Some storage controllers have DMA alignment requirements, which are often set through blk_queue_dma_alignment, for example 512 bytes.

    One of the usages of dma_alignment of request_queue.

    blk_rq_map_kern
    ---
        do_copy = !blk_rq_aligned(q, addr, len) || object_is_on_stack(kbuf);
    
        //unsigned int alignment = queue_dma_alignment(q) | q->dma_pad_mask;
        //return !(addr & alignment) && !(len & alignment);
    
        if (do_copy)
            bio = bio_copy_kern(q, kbuf, len, gfp_mask, reading);
    
        //New page will be allocated and copy data in it.
        //When bio is done, the data will be copied back to the original buffer.
        //Refer to bio_copy_kern_endio_read
    
        else
            bio = bio_map_kern(q, kbuf, len, gfp_mask);
    
        //Add the page associated with the buffer into bio.
    
    ---
    
    The callers of blk_rq_map_kern:
     - __scsi_execute
     - __nvme_submit_sync_cmd
    
    Another similar interface is blk_rq_map_user_iov.
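    
    For illustration, a standalone sketch (not kernel code) of the alignment test
    that blk_rq_aligned performs. The mask of 511 is the usual way to express a
    512-byte alignment requirement (blk_queue_dma_alignment takes the mask).
    ---
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool rq_aligned(uintptr_t addr, size_t len, unsigned int dma_alignment)
    {
        /* in the kernel, the mask is also OR-ed with q->dma_pad_mask */
        unsigned int alignment = dma_alignment;

        return !(addr & alignment) && !(len & alignment);
    }

    int main(void)
    {
        unsigned int align = 511;   /* blk_queue_dma_alignment(q, 511): 512B */

        printf("%d\n", rq_aligned(0x1000, 4096, align)); /* 1: mapped directly      */
        printf("%d\n", rq_aligned(0x1003, 4096, align)); /* 0: bounce, bio_copy_kern */
        printf("%d\n", rq_aligned(0x1000, 4100, align)); /* 0: length not aligned   */
        return 0;
    }
    ---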

    block size

    The blocksize of filesystem and block device.
    Block: The smallest unit writable by a disk or file system. Everything a file system does is
    composed of operations done on blocks. A file system block is always the same size as or larger
    (in integer multiples) than the disk block size.

    
    
    The bdev_logical_block_size is q->limits.logical_block_size. Let's look at how nvme sets it.
    __nvme_revalidate_disk
    ---
        ns->lba_shift = id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ds;
        ...
        nvme_update_disk_info
        ---
            unsigned short bs = 1 << ns->lba_shift;
    
            blk_mq_freeze_queue(disk->queue);
            blk_integrity_unregister(disk);
    
            blk_queue_logical_block_size(disk->queue, bs);
            blk_queue_physical_block_size(disk->queue, bs);
            blk_queue_io_min(disk->queue, bs);
        ---
    ---
    
    The most important point here is that the filesystem blocksize is chosen at mkfs time.
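    
    A userspace sketch of how to read the sizes discussed above: the device's
    logical/physical block size and the filesystem's block size. /dev/nvme0n1 and
    /mnt are assumptions.
    ---
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/statvfs.h>
    #include <unistd.h>

    int main(void)
    {
        int lbs = 0;
        unsigned int pbs = 0;
        struct statvfs st;

        int fd = open("/dev/nvme0n1", O_RDONLY);
        if (fd >= 0) {
            ioctl(fd, BLKSSZGET, &lbs);   /* q->limits.logical_block_size  */
            ioctl(fd, BLKPBSZGET, &pbs);  /* q->limits.physical_block_size */
            printf("logical %d, physical %u\n", lbs, pbs);
            close(fd);
        }

        /* The fs block size was chosen at mkfs time (e.g. mkfs.ext4 -b 4096). */
        if (statvfs("/mnt", &st) == 0)
            printf("fs block size %lu\n", st.f_bsize);

        return 0;
    }
    ---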

    virt boundary

    What is the gap ?
    It is indicated by queue_virt_boundary.

    The NVMe PRP descriptor requires PAGE_SIZE-aligned segments:
                                             
      page A+-----+                page A+-----+            
            |     | \ PAGE_SIZE          |     | \ PAGE_SIZE
            |     | /                    |     | /              
      page B+-----+                page B+-----+                
            |     | \ PAGE_SIZE          |_ _ _| > PAGE_SIZE/2
            |     | /                    | GAP |
      page C+-----+                page C+-----+                
            |     | \ PAGE_SIZE          |     | \ PAGE_SIZE
            |     | /                    |     | /              
            +-----+                      +-----+ 
    
    So if we want to handle IO that is not PAGE_SIZE aligned, we need to
    split it into 3 parts as follows:
    
    page A+-----+              page B+-----+                page C+-----+                
          |     | \ PAGE_SIZE        |_ _ _| > PAGE_SIZE/2        |     | \ PAGE_SIZE
          |     | /                                               |     | /              
          +-----+                                                 +-----+ 
    
    This is done by blk_queue_split.
    
    blk_queue_split
      -> blk_bio_segment_split
      ---
        bio_for_each_segment(bv, bio, iter) {
            /*
             * If the queue doesn't support SG gaps and adding this
             * offset would create a gap, disallow it.
             */
            if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
                goto split;
            ....
        }
    split:
        *segs = nsegs;
    
        if (do_split) {
            new = bio_split(bio, sectors, GFP_NOIO, bs);
            if (new)
                bio = new;
        }
      ---
    
    Other places that need to check for gaps:
    // the buffer may come from userspace and may not be aligned
    blk_rq_map_user_iov
    // don't merge bios or requests if it would create a gap
    bio_will_gap <- req_gap_back_merge <- ll_back_merge_fn
                                       <- ll_merge_requests_fn
    
    bvec_gap_to_prev <- bio_integrity_add_page
             <- bio_add_pc_page
                     <- integrity_req_gap_back_merge
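    
    For reference, a standalone sketch (not kernel code, paraphrasing the helper) of
    the virt-boundary gap test these callers rely on: two adjacent bvecs can only
    stay in one segment if the previous one ends on the boundary and the next one
    starts on it.
    ---
    #include <stdbool.h>
    #include <stdio.h>

    struct bvec { unsigned int bv_offset, bv_len; };

    static bool gap_to_prev(unsigned long boundary_mask,
                            const struct bvec *prev, unsigned int next_offset)
    {
        if (!boundary_mask)
            return false;   /* queue has no virt boundary restriction */

        return (next_offset & boundary_mask) ||
               ((prev->bv_offset + prev->bv_len) & boundary_mask);
    }

    int main(void)
    {
        unsigned long mask = 4096 - 1;                    /* PAGE_SIZE-aligned PRPs */
        struct bvec full = { .bv_offset = 0, .bv_len = 4096 };
        struct bvec half = { .bv_offset = 0, .bv_len = 2048 };

        printf("%d\n", gap_to_prev(mask, &full, 0));  /* 0: no gap, can merge     */
        printf("%d\n", gap_to_prev(mask, &half, 0));  /* 1: gap, must split here  */
        return 0;
    }
    ---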
    
    Before queue_virt_boundary was introduced, QUEUE_FLAG_SG_GAPS was used instead.
    That flag was checked in the following places:
    
    __bio_add_page
    ll_merge_requests_fn
    blk_rq_merge_ok
    

    DISCARD

    What is discard

    write amplification

     
        |----|                         Write granularity (e.g 32K)
        |----------------------------| Erase granularity (e.g 128K)
     
        These are contiguous user data blocks.
        If we want to write a 32K block within them, we have to
         - read in the 128K of data and update part of it
         - erase the 128K block
         - write the 128K back
    
    Wear leveling
        A write can only occur to pages that are erased, therefore host write commands
        invoke flash erase cycles prior to writing to the flash. This write/erase cycling
        causes cell wear, which imposes the limited write-life. Host write accesses can hit
        any location, which can cause hot-spots, and hot-spots cause premature wear in
        those locations. Wear-leveling is used to prevent the hot-spots.
    
        Mapping
    
        In most cases, the controller maintains a lookup table to translate the memory array physical
        block address (PBA) to the logical block address (LBA) used by the host system. The controller's
        wear-leveling algorithm determines which physical block to use each time data is programmed,
        eliminating the relevance of the physical location of data and enabling data to be stored
        anywhere within the memory array.
    
        Selecting
    
        The controller typically either writes to the available erased block with the lowest erase count
        (dynamic wear leveling); or it selects an available target block with the lowest overall erase
        count, erases the block if necessary
    
        Garbage collection
    
        Given that previously written-to blocks must be erased before they are able to receive data again,
        the SSD controller must, for performance, actively pre-erase blocks so new write commands can always
        get an empty block. 
    
    What is the discard command for ?
    If the user or operating system erases a file (not just removes parts of it), the file
    will typically be marked for deletion, but the actual contents on the disk are never
    actually erased. Because of this, the SSD does not know that it can erase the LBAs
    previously occupied by the file, so the SSD will keep including such LBAs in the
    garbage collection.
    
    Enables the operating system to tell an SSD which blocks of previously saved data are
    no longer needed as a result of file deletions or volume formatting. When an LBA is
    replaced by the OS, as with an overwrite of a file, the SSD knows that the original
    LBA can be marked as stale or invalid and it will not save those blocks during Garbage
    collection.
    
    
    A simple example of SSD writes
    (assume the application writes only in whole erase blocks):
    
    |----| erase block
      -    free
      o    used
      i    invalid 
    
       |ooooo|-----|-----|-----|-----|
       \__ __/                 \__ __/
          v                       v
    	File1                  Reserved
    
    
    When we write to File1,
    
    	    RMW
          .-----.
         /      v
       |iiiii|ooooo|-----|-----|-----|
             \__ __/           \__ __/
                v                 v
              File1            Reserved
    
    The original position of File1 will be reclaimed then.
    
    If we delete File1 in filesystem layer,
    
       |-----|ooooo|-----|-----|-----|
                               \__ __/
                                  v
                              Reserved
       
    The SSD controller doesn't know that File1 has been deleted,
    so it still thinks there is valid data in the block. If
    this happens multiple times, we end up with:
    
       |ooooo|ooooo|ooooo|ooooo|-----|
       \__ __/     \__ __/     \__ __/
          v           v           v
        File2       File3      Reserved
    
    And only two of them hold a valid file. (The filesystem knows
    which blocks are free.)
    
    When we write data to File2 and File3 in parallel, the
    SSD controller has to use the Reserved block. However, there
    is only one in our case, so while one write is ongoing the other
    has to wait.
    
    This is why SSDs become slower as they fill up.
    
    With DISCARD support in the filesystem, when a file is deleted the filesystem
    tells the SSD controller that the associated blocks are invalid
    and can be reclaimed. Then we would have:
    
       |ooooo|-----|ooooo|-----|-----|
       \__ __/     \__ __/     \__ __/
          v           v           v
        File2       File3      Reserved
    
    

    Another useful link about this: "Block layer discard requests".

    Linux calls this DISCARD.
    Different storage protocols use different names, e.g. TRIM (ATA), UNMAP (SBC), Deallocate (NVMe).
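    
    From userspace, a discard can be issued directly to a block device. A hedged
    sketch using the BLKDISCARD ioctl (what blkdiscard(8) does); the device path and
    the 1 MiB range at offset 0 are assumptions, and this destroys data in that range.
    ---
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        uint64_t range[2] = { 0, 1024 * 1024 };  /* { start, length } in bytes */

        int fd = open("/dev/nvme0n1", O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Turns into a REQ_OP_DISCARD bio in the block layer. */
        if (ioctl(fd, BLKDISCARD, &range) < 0)
            perror("BLKDISCARD");

        close(fd);
        return 0;
    }
    ---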

    discard in blk

    discard in fs

    There is a danger with discard in the fs:

    the filesystem may well discard a set of sectors, then write new data to them once they are allocated to
    a new file. It would be a serious mistake to reorder the new writes ahead of the discard operation,
    causing the newly-written data to be lost.
    
    Let's look at how individual filesystems handle this.

    The trouble with discard

    https://lwn.net/Articles/347511/
    
    At the ATA protocol level, a discard request is implemented by a "TRIM" command sent to the device.
    For reasons unknown to your editor, the protocol committee designed TRIM as a non-queued command.
    That means that, before sending a TRIM command to the device, the block layer must first wait for
    all outstanding I/O operations on that device to complete; no further operations can be started
    until the TRIM command completes. So every TRIM operation stalls the request queue. Even if TRIM 
    were completely free, its non-queued nature would impose a significant I/O performance cost. (It's
    worth noting that the SCSI equivalent to TRIM is a tagged command which doesn't suffer from this
    problem).
    
    With current SSDs, TRIM appears to be anything but free. Mark Lord has measured regular delays of
    hundreds of milliseconds. Delays on that scale would be most unwelcome on a rotating storage device.
    On an SSD, hundred-millisecond latencies are simply intolerable.
    
    In short, discard is not free.

    Someone complained that
    XFS has had async discard support, but it has been problematic for our
    fleet. We were seeing bursts of large discard requests caused by async
    discard in XFS. This resulted in degraded drive performance increasing
    latency for dependent services.
    
    And proposed an alternative: the filesystem layer could reuse blocks that have just been freed.
       |ooooo|-----|-----|-----|-----|
       \__ __/                 \__ __/
          v                       v
    	File1                  Reserved
    
    Deleted File1 and then create File2,
    
       |ooooo|-----|-----|-----|-----|
       \__ __/                 \__ __/
          v                       v
    	File2                  Reserved