Blk Multi-queue understanding

Will be migrated to blk_basis
Background of introducing blk-mq
Design of blk-mq

Source code of blk-mq



     +----+      +----+       +----+       +----+
     |cpu0|      |cpu1|       |cpu2|       |cpu3|
     +----+      +----+       +----+       +----+

                          ||
                          \/
            +-+-+-+          +-+-+-+         +-+-+-+
    default | | | |   read   | | | |   poll  | | | |
            +-+-+-+          +-+-+-+         +-+-+-+
              ||               ||              ||
              \/               \/              \/

     +-+  +-+  +-+  +-+  +-+  +-+  +-+  +-+  +-+  +-+
     |D|  |D|  |R|  |R|  |R|  |R|  |P|  |P|  |P|  |P|
     +-+  +-+  +-+  +-+  +-+  +-+  +-+  +-+  +-+  +-+

                    .----->-----.
                   /             \
                   \ Round-Robin /
                    '-----<-----'

Background of introducing blk-mq


On systems with many cores (especially NUMA) and high-IOPS NAND-flash SSDs, the legacy block layer itself becomes the bottleneck: it simply does not scale.

The request queue lock, which serializes access to the shared state of one request_queue, is really a devil. It causes heavy contention and continuous cache-line bouncing on multi-core systems, and the situation is even worse on NUMA.
Both effects hurt the performance of the block layer on high-IOPS SSDs.

(This seems to be acceptable on slow storage devices such as HDDs.)
Nearly all operations in the block layer need this request queue lock.

The perf top result on null_blk:

sudo modprobe null_blk queue_mode=2 completion_nsec=0 irqmode=0

The request queue lock is also a big challenge for the cache line. Not only must the request submitters acquire the lock, but so must the completion path, possibly from a different cpu. When multiple cores contend for the lock, cache-line invalidations become very frequent and costly. The situation is even worse on NUMA.
Regarding the completion path, __blk_complete_request() identifies the cpu that submitted the request and uses an IPI to raise BLK_SOFTIRQ on it. This is friendly to the cache line, but it introduces extra overhead from the IPI and the context switch.

Data on an HDD is stored in concentric tracks on platters (the recording media). An actuator arm with a read/write head moves over the platter, from track to track, to perform the actual read or write. Moving the actuator arm is costly, and many heuristics and optimizations in the block layer exist to reduce these movements. They are good for HDDs, but maybe not for SSDs.

The legacy block layer is not designed for _high-speed_ SSDs.

Design of blk-mq


Oct 24 2013  blk-mq: new multi-queue block IO queueing mechanism           3.12.0-rc5      320ae51
Jan 17 2014  scsi: add support for a blk-mq based I/O path                 3.16.0-rc5      d285203
Jan 17 2017  blk-mq-sched: add framework for MQ capable IO schedulers      4.10.0-rc3      bd166ef
Jun 16 2017  scsi: default to scsi-mq                                      4.12.0-rc5      5c279bd
Aug 13 2017  Revert "scsi: default to scsi-mq"                             4.13            cbe7dfa
The performance regression which caused the revert (and its later fix):
"In Red Hat internal storage test wrt. blk-mq scheduler, we found that I/O
performance is much bad with mq-deadline, especially about sequential I/O
on some multi-queue SCSI devcies(lpfc, qla2xxx, SRP...)
Turns out one big issue causes the performance regression: requests are
still dequeued from sw queue/scheduler queue even when ldd's queue is
busy, so I/O merge becomes quite difficult to make, then sequential IO
performance degrades a lot.
This issue becomes one of mains reasons for reverting default SCSI_MQ
in V4.13."
https://lkml.org/lkml/2017/10/14/65


The tags

Device command tagging was first introduced with hardware supporting native command queuing. A tag is an integer value that uniquely identifies the position of the block IO in the driver submission queue, so when completed the tag is passed back from the device indicating which IO has been completed. This eliminates the need to perform a linear search of the in-flight window to determine which IO has completed.
In addition, the number of tags reflects the capacity of the hardware submission queue.
The nvme driver exploits blk-mq tags fully:

nvme_queue_rq()
    -> nvme_setup_cmd()
        ->  cmd->common.command_id = req->tag;
nvme_irq()
    -> nvme_process_cq()
        -> nvme_handle_cqe()
            -> blk_mq_tag_to_rq()
            -> nvme_end_request() // cqe->command_id
                -> blk_mq_complete_request()
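To make the "no linear search" point concrete, here is a minimal userspace sketch (purely illustrative, not kernel code): the completion carries the tag back, and the tag indexes directly into an in-flight array, just as cqe->command_id indexes into the blk-mq tag map above.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: a tag is an index into the in-flight array, so a
 * completion identified by its tag is found in O(1) instead of scanning
 * every outstanding IO. */
struct io {
    int tag;
    long lba;
};

#define QUEUE_DEPTH 8

int main(void)
{
    struct io *inflight[QUEUE_DEPTH] = { NULL };

    /* submit: allocate a free tag and remember the IO under it */
    struct io *io = malloc(sizeof(*io));
    io->tag = 3;            /* pretend tag 3 was allocated */
    io->lba = 4096;
    inflight[io->tag] = io;

    /* completion: the device hands the tag back (cf. cqe->command_id) */
    int completed_tag = 3;
    struct io *done = inflight[completed_tag];
    printf("completed IO at lba %ld (tag %d)\n", done->lba, done->tag);

    inflight[completed_tag] = NULL;
    free(done);
    return 0;
}
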
There are two kinds of tags in blk-mq: sched tags and driver tags.
The driver tags are the tags described above; the sched tags are associated with the I/O scheduler.
In the message of the commit that added the MQ-capable I/O scheduler framework (bd166ef), Jens Axboe said:
"We split driver and scheduler tags, so we can run the scheduling independently of device queue depth."
In the blk-mq code:
blk_mq_init_sched()
>>>>
    /*
     * Default to double of smaller one between hw queue_depth and 128,
     * since we don't split into sync/async like the old code did.
     * Additionally, this is a per-hw queue depth.
     */
    q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
                   BLKDEV_MAX_RQ); //BLKDEV_MAX_RQ 128

    queue_for_each_hw_ctx(q, hctx, i) {
        ret = blk_mq_sched_alloc_tags(q, hctx, i);
        if (ret)
            goto err;
    }
>>>>
The maximum sched tag depth is therefore 256 per hw queue. For qla2xxx, whose queue depth is 4096, is that too small?
Actually no. The requests in the I/O scheduler's queue are continuously dispatched to the hctx->dispatch list and the hardware dispatch queue. If too many requests linger in the I/O scheduler queue, something is likely wrong at the hardware level.
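A tiny sketch of the arithmetic above, assuming BLKDEV_MAX_RQ is 128 as in the excerpt (the depth-64 device in the second call is just a made-up example):

#include <stdio.h>

#define BLKDEV_MAX_RQ 128

/* mirrors q->nr_requests = 2 * min(hw queue_depth, BLKDEV_MAX_RQ) */
static unsigned int sched_nr_requests(unsigned int hw_queue_depth)
{
    unsigned int m = hw_queue_depth < BLKDEV_MAX_RQ ? hw_queue_depth : BLKDEV_MAX_RQ;
    return 2 * m;
}

int main(void)
{
    printf("qla2xxx (depth 4096): %u sched tags per hw queue\n",
           sched_nr_requests(4096));   /* -> 256 */
    printf("hypothetical device (depth 64): %u sched tags per hw queue\n",
           sched_nr_requests(64));     /* -> 128 */
    return 0;
}
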
Both kinds of tags have their own statically allocated request array.
blk_mq_alloc_tag_set()
    -> blk_mq_alloc_rq_maps() // the actual depth may be lower than the set->queue_depth
        -> __blk_mq_alloc_rq_maps()
            -> __blk_mq_alloc_rq_map() //per-hctx-idx set->queue_depth
OR
        blk_mq_init_sched()
            -> blk_mq_sched_alloc_tags() // per-hctx q->nr_requests
                
                -> blk_mq_alloc_rq_map() // alloc the tags->rqs and tags->static_rqs arrays
                -> blk_mq_alloc_rqs() // alloc pages to carry the (request + driver payload)
                    -> blk_mq_ops->init_request()

blk_mq_get_request()
    -> blk_mq_rq_ctx_init()
        -> blk_mq_tags_from_data() // hctx->tags or sched_tags
        -> rq = tags->static_rqs[tag] // when w/o io scheduler
When an I/O scheduler is used, the reverse mapping is set up in blk_mq_get_driver_tag().
The request instance can come from either driver tags or sched tags, but no matter whether an I/O scheduler is used, the request must be saved in hctx->tags->rqs[] as the reverse mapping from tag to request.
request_queue->queue_hw_ctx[] -> sched_tags -> static_rqs[] 
                   |                              /       
                   V                   ___ ___ __/____ ___
                  tags                |   |   |   |   |   |  w/ scheduler
                   |                  |___|___|___|___|___| <------------+
                   |                            /\                        \
                   |                           /  \                        \
                  /                  _________/    \___________             \
                 /                  | request + driver payload |             \
------------------------------------------------                              \  
               /                                                               |   
              |   maybe shared (Multiple LUs SCSI HBA, or Multiple ns NVMe)    |
              V                                                                |  
 tag_set->tags[] ---> rqs  [ ][ ][ ][ ][ ][ ] a reverse mapping from tag to request 
                 |                                                          /   
                 +--> static_rqs[]                                         /          
                                /                                         /
                     ___ ___ __/____ ___                                 /
                    |   |   |   |   |   |  w/o  scheduler               /
                    |___|___|___|___|___|  <---------------------------
                              /\
                             /  \
                   _________/    \___________
                  | request + driver payload |
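A minimal userspace sketch of the two arrays in the diagram (illustrative structures only, not the kernel ones): static_rqs[] holds the pre-allocated requests, and rqs[] is filled when a tag is assigned so that the tag can later be mapped back to its request.

#include <stdio.h>
#include <stdlib.h>

#define DEPTH 4

struct request { int id; };

struct tags {
    struct request *static_rqs[DEPTH]; /* pre-allocated request pool   */
    struct request *rqs[DEPTH];        /* tag -> in-flight request map */
};

int main(void)
{
    struct tags t = { { NULL }, { NULL } };

    /* cf. blk_mq_alloc_rqs(): pre-allocate one request per tag */
    for (int i = 0; i < DEPTH; i++) {
        t.static_rqs[i] = malloc(sizeof(struct request));
        t.static_rqs[i]->id = i;
    }

    /* cf. blk_mq_rq_ctx_init()/blk_mq_get_driver_tag(): once a tag is
     * assigned, record the request so a tag-to-request lookup works */
    int tag = 2;
    struct request *rq = t.static_rqs[tag];
    t.rqs[tag] = rq;

    printf("tag %d maps back to request %d\n", tag, t.rqs[tag]->id);

    for (int i = 0; i < DEPTH; i++)
        free(t.static_rqs[i]);
    return 0;
}
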

IO scheduler in blk-mq

Quotes from Jens Axboe's paper Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems:

 
"While global sequential re-ordering is still possible across the multiple software
queues, it is only necessary for HDD based devices, where the additional latency
and locking overhead required to achieve total ordering does not hurt IOPS
performance."
"for many users, it is no longer necessary to employ advanced fairness scheduling
as the speed of the devices are often exceeding the ability of even multiple
applications to saturate their performance. If fairness is essential, it is possible to
design a scheduler that exploits the characteristics of SSDs at coarser granularity to
achieve lower performance overhead"
The three I/O schedulers currently available for blk-mq seem to accord with what Jens said.
Let's look at kyber, the lightweight I/O scheduler for fast SSDs.
Quote from https://lwn.net/Articles/720675/
BFQ is a complex scheduler that is designed to provide good interactive
response, especially on those slower devices. It has a relatively high 
per-operation overhead, which is justified when the I/O operations 
themselves are slow and expensive. This complexity may not make sense,
though, in situations where I/O operations are cheap and throughput is 
a primary concern. When running a server workload using solid-state 
devices, it may be better to run a much simpler scheduler that allows 
for request merging and perhaps some simple policies, but which mostly 
stays out of the way.

Besides being lightweight, kyber's intention is to keep the latency of prioritized requests low.
There are three domains: READ, SYNC_WRITE, and OTHER (async writes, discards).
Note: some requests, such as flush, bypass the I/O scheduler entirely.
How does it work?
kyber_dispatch_request()
	-> kyber_dispatch_cur_domain()

>>>>
	rqs = &khd->rqs[khd->cur_domain];
	rq = list_first_entry_or_null(rqs, struct request, queuelist);

	/*
	 * If there wasn't already a pending request and we haven't flushed the
	 * software queues yet, flush the software queues and check again.
	 */
	if (!rq && !*flushed) {
		kyber_flush_busy_ctxs(khd, hctx);
		// flush the requests from the ctx software queue to kyber queue.
		*flushed = true;
		rq = list_first_entry_or_null(rqs, struct request, queuelist);
	}

	if (rq) {
		nr = kyber_get_domain_token(kqd, khd, hctx);
			-> __sbitmap_queue_get()
		if (nr >= 0) {
			khd->batching++;
			rq_set_domain_token(rq, nr);
			list_del_init(&rq->queuelist);
			return rq;
		}
	}
>>>>
Every domain has its own queue depth; look at the defaults:
static const unsigned int kyber_depth[] = {
	[KYBER_READ] = 256,
	[KYBER_SYNC_WRITE] = 128,
	[KYBER_OTHER] = 64,
};
It is similar to the blk-mq tags, but here it is called a token.
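A rough userspace sketch of the token idea, assuming the fixed default depths above and ignoring the sbitmap, batching and async wakeups: each domain has its own pool of tokens, and a request is only dispatched if a token of its domain is available.

#include <stdio.h>

enum { KYBER_READ, KYBER_SYNC_WRITE, KYBER_OTHER, KYBER_NUM_DOMAINS };

/* default depths as quoted above */
static unsigned int depth[KYBER_NUM_DOMAINS] = { 256, 128, 64 };
static unsigned int in_use[KYBER_NUM_DOMAINS];

/* cf. kyber_get_domain_token(): fail if the domain is out of tokens */
static int get_domain_token(int domain)
{
    if (in_use[domain] >= depth[domain])
        return -1;
    return (int)in_use[domain]++;
}

static void put_domain_token(int domain)
{
    if (in_use[domain])
        in_use[domain]--;
}

int main(void)
{
    /* OTHER (async writes, discards) runs out of tokens first, so it
     * cannot crowd out reads and sync writes */
    int got = 0;
    while (get_domain_token(KYBER_OTHER) >= 0)
        got++;
    printf("OTHER dispatched %d requests before being throttled\n", got);
    printf("READ still has %u free tokens\n",
           depth[KYBER_READ] - in_use[KYBER_READ]);
    put_domain_token(KYBER_OTHER);
    return 0;
}
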

The queue depth of every domain is resized based on observed request latency.

kyber exploits the blk_stat mechanism to collect latency statistics and adjusts
the domain queue depths in the blk_stat timer callback.

static int kyber_lat_status(struct blk_stat_callback *cb,
			    unsigned int sched_domain, u64 target)
{
	u64 latency;

	if (!cb->stat[sched_domain].nr_samples)
		return NONE;

	latency = cb->stat[sched_domain].mean;
	if (latency >= 2 * target)
		return AWFUL;
	else if (latency > target)
		return BAD;
	else if (latency <= target / 2)
		return GREAT;
	else /* (latency <= target) */
		return GOOD;
}

static void kyber_stat_timer_fn(struct blk_stat_callback *cb)
{
	struct kyber_queue_data *kqd = cb->data;
	int read_status, write_status;

	read_status = kyber_lat_status(cb, KYBER_READ, kqd->read_lat_nsec);
	write_status = kyber_lat_status(cb, KYBER_SYNC_WRITE, kqd->write_lat_nsec);

	kyber_adjust_rw_depth(kqd, KYBER_READ, read_status, write_status);
	kyber_adjust_rw_depth(kqd, KYBER_SYNC_WRITE, write_status, read_status);
	kyber_adjust_other_depth(kqd, read_status, write_status,
				 cb->stat[KYBER_OTHER].nr_samples != 0);

	/*
	 * Continue monitoring latencies if we aren't hitting the targets or
	 * we're still throttling other requests.
	 */
	if (!blk_stat_is_active(kqd->cb) &&
	    ((IS_BAD(read_status) || IS_BAD(write_status) ||
	      kqd->domain_tokens[KYBER_OTHER].sb.depth < kyber_depth[KYBER_OTHER])))
		blk_stat_activate_msecs(kqd->cb, 100);
}
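To make the thresholds concrete, here is a userspace restatement of kyber_lat_status() applied to made-up mean latencies, assuming a read latency target of 2 ms:

#include <stdio.h>

enum lat_status { NONE, GOOD, GREAT, BAD, AWFUL };

/* same classification as kyber_lat_status(), fed with sample data */
static enum lat_status classify(unsigned long long latency,
                                unsigned long long target)
{
    if (latency >= 2 * target)
        return AWFUL;
    else if (latency > target)
        return BAD;
    else if (latency <= target / 2)
        return GREAT;
    else
        return GOOD;
}

int main(void)
{
    const char *name[] = { "NONE", "GOOD", "GREAT", "BAD", "AWFUL" };
    unsigned long long read_target = 2000000ULL;   /* 2 ms in ns */

    printf("mean 0.8 ms -> %s\n", name[classify(800000ULL, read_target)]);
    printf("mean 1.5 ms -> %s\n", name[classify(1500000ULL, read_target)]);
    printf("mean 3.0 ms -> %s\n", name[classify(3000000ULL, read_target)]);
    printf("mean 5.0 ms -> %s\n", name[classify(5000000ULL, read_target)]);
    return 0;
}
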


There is no insert_requests callback in kyber, so merging/sorting is handled by
the ctx software queues.

Source code of blk-mq


Quiesce and Freeze

Quiescing and freezing a blk-mq queue are different operations. Let's check the source code.

blk_mq_quiesce_queue()
>>>>
    blk_mq_quiesce_queue_nowait(q);
    >>>>
    spin_lock_irqsave(q->queue_lock, flags);
    queue_flag_set(QUEUE_FLAG_QUIESCED, q);
    spin_unlock_irqrestore(q->queue_lock, flags);
    >>>>

    queue_for_each_hw_ctx(q, hctx, i) {
        if (hctx->flags & BLK_MQ_F_BLOCKING)
            synchronize_srcu(hctx->queue_rq_srcu);
        else
            rcu = true;
    }
    if (rcu)
        synchronize_rcu();
>>>>

__blk_mq_run_hw_queue()
>>>>
    if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
        rcu_read_lock();
        run_queue = blk_mq_sched_dispatch_requests(hctx);
        rcu_read_unlock();
    } else {
        might_sleep();

            srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
            run_queue = blk_mq_sched_dispatch_requests(hctx);
            srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
        }
>>>>
Look at the rcu/srcu lock in __blk_mq_run_hw_queue().
static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
{
	int srcu_idx;

	/*
	 * We should be running this queue from one of the CPUs that
	 * are mapped to it.
	 */
	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
		cpu_online(hctx->next_cpu));

	/*
	 * We can't run the queue inline with ints disabled. Ensure that
	 * we catch bad users of this early.
	 */
	WARN_ON_ONCE(in_interrupt());

	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
		rcu_read_lock();
		blk_mq_sched_dispatch_requests(hctx);
		rcu_read_unlock();
	} else {
		might_sleep();

		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
		blk_mq_sched_dispatch_requests(hctx);
		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
	}
}

So when blk_mq_quiesce_queue() returns, we can be sure that every in-flight __blk_mq_run_hw_queue()
has returned and any subsequent blk_mq_sched_dispatch_requests() will see QUEUE_FLAG_QUIESCED.
Consequently, no requests will be issued after blk_mq_quiesce_queue() returns.
While the queue is quiesced, requests are not dispatched, but they can still be allocated and queued.
blk_mq_freeze_queue() increases q->mq_freeze_depth and waits until q->q_usage_counter drops to zero.

These two counters play a very important role in the blk-mq path.
q_usage_counter covers nearly every blk-mq path; more importantly, every request
holds one reference on q_usage_counter until it is freed.

blk_mq_make_request()
    -> blk_mq_get_request()
        -> blk_queue_enter_live()
	// add 1 here

blk_mq_free_request()
    -> blk_queue_exit()
On the other hand, mq_freeze_depth gates the blk-mq submission path.

generic_make_request()
    -> blk_queue_enter()
    >>>>
    ret = wait_event_interruptible(q->mq_freeze_wq,
                !atomic_read(&q->mq_freeze_depth) ||
                    blk_queue_dying(q));
    >>>>
Therefore, when blk_mq_freeze_queue() returns, all requests have been drained
and no new request can enter the I/O path.
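The freeze/drain behaviour can be mimicked in userspace with a plain counter and a condition variable. This is only a rough sketch under simplifying assumptions: the kernel uses a percpu refcount plus the mq_freeze_wq waitqueue, and blocked submitters wait rather than bail out as they do here.

#include <pthread.h>
#include <stdio.h>

/* crude stand-ins for q->q_usage_counter, q->mq_freeze_depth, mq_freeze_wq */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static int usage_counter;      /* one reference per in-flight request */
static int freeze_depth;       /* gate: no new entries while frozen   */

static int queue_enter(void)   /* cf. blk_queue_enter() */
{
    pthread_mutex_lock(&lock);
    if (freeze_depth) {        /* the kernel waits here instead of failing */
        pthread_mutex_unlock(&lock);
        return -1;
    }
    usage_counter++;
    pthread_mutex_unlock(&lock);
    return 0;
}

static void queue_exit(void)   /* cf. blk_queue_exit() on request free */
{
    pthread_mutex_lock(&lock);
    if (--usage_counter == 0)
        pthread_cond_signal(&drained);
    pthread_mutex_unlock(&lock);
}

static void freeze_queue(void) /* cf. blk_mq_freeze_queue() */
{
    pthread_mutex_lock(&lock);
    freeze_depth++;
    while (usage_counter > 0)  /* wait until every request is freed */
        pthread_cond_wait(&drained, &lock);
    pthread_mutex_unlock(&lock);
}

static void *completion(void *arg)
{
    (void)arg;
    queue_exit();              /* the in-flight request completes */
    return NULL;
}

int main(void)
{
    pthread_t t;

    queue_enter();             /* one request in flight */
    pthread_create(&t, NULL, completion, NULL);
    freeze_queue();            /* returns only after the request exits */
    printf("frozen: all requests drained, new submitters are gated\n");
    pthread_join(t, NULL);
    return 0;
}
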
An issue about the queue quiesced flag.
https://lkml.org/lkml/2017/10/3/548

Mapping between ctx and hctx

Look at the source code

blk_mq_alloc_tag_set()
    -> blk_mq_update_queue_map()
        -> blk_mq_map_queues()
>>>>
    for_each_possible_cpu(cpu) {   /* note: possible CPUs, not just online ones */
    /*
     * First do sequential mapping between CPUs and queues.
     * In case we still have CPUs to map, and we have some number of
     * threads per cores then map sibling threads to the same queue for
     * performace optimizations.
     */
    if (cpu < nr_queues) {
        map[cpu] = cpu_to_queue_index(nr_queues, cpu);
    } else {
        first_sibling = get_first_sibling(cpu);
        if (first_sibling == cpu)
            map[cpu] = cpu_to_queue_index(nr_queues, cpu);
        else
            map[cpu] = map[first_sibling];
    }
>>>>
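A userspace sketch of that mapping policy, assuming 8 possible CPUs where cpu N and cpu N+4 are hyper-thread siblings (the topology is invented purely for the demo):

#include <stdio.h>

#define NR_CPUS 8

/* demo topology assumption: cpu i and cpu i+4 are siblings */
static int first_sibling(int cpu)
{
    return cpu % 4;
}

/* cf. blk_mq_map_queues(): sequential mapping first, then fold sibling
 * threads onto the same queue (cpu_to_queue_index is roughly cpu % nr_queues) */
static void map_queues(int map[], int nr_queues)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu < nr_queues) {
            map[cpu] = cpu % nr_queues;
        } else {
            int fs = first_sibling(cpu);
            map[cpu] = (fs == cpu) ? cpu % nr_queues : map[fs];
        }
    }
}

int main(void)
{
    int map[NR_CPUS];

    map_queues(map, 4);   /* 4 hardware queues, 8 possible CPUs */
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        printf("cpu%d -> hctx%d\n", cpu, map[cpu]);
    return 0;
}
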
Do the mapping from cpu to hctx queue.
blk_mq_init_allocated_queue() // q->mq_map = set->mq_map
    -> blk_mq_map_swqueue()
>>>>
    for_each_present_cpu(i) {
        hctx_idx = q->mq_map[i]; // the hctx to which this cpu is mapped
        >>>>
        ctx = per_cpu_ptr(q->queue_ctx, i);
        hctx = blk_mq_map_queue(q, i);

        cpumask_set_cpu(i, hctx->cpumask); // mark this cpu on the hctx
        ctx->index_hw = hctx->nr_ctx;
        hctx->ctxs[hctx->nr_ctx++] = ctx;
    }
>>>>
Gather the cpus that are mapped to one hctx into that hctx.
This ctx->index_hw is very important: it is how the hardware queue learns that there are pending requests on a given software queue.
__blk_mq_insert_request()
    -> blk_mq_hctx_mark_pending()
        -> sbitmap_set_bit(&hctx->ctx_map, ctx->index_hw);

blk_mq_flush_busy_ctxs()
    -> sbitmap_for_each_set(&hctx->ctx_map, flush_busy_ctx, &data);
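A minimal sketch of the ctx_map idea, with a plain bitmask standing in for the sbitmap: inserting into a software queue sets the bit at ctx->index_hw, and the hardware queue only walks the software queues whose bits are set.

#include <stdio.h>

#define MAX_CTX 8

static unsigned int ctx_map;          /* one bit per mapped software queue */
static int pending[MAX_CTX];          /* pretend per-ctx request counts    */

/* cf. blk_mq_hctx_mark_pending(): remember which ctx got a request */
static void insert_request(int index_hw)
{
    pending[index_hw]++;
    ctx_map |= 1u << index_hw;
}

/* cf. blk_mq_flush_busy_ctxs(): only visit software queues with work */
static void flush_busy_ctxs(void)
{
    for (int i = 0; i < MAX_CTX; i++) {
        if (!(ctx_map & (1u << i)))
            continue;
        printf("flushing %d request(s) from ctx with index_hw=%d\n",
               pending[i], i);
        pending[i] = 0;
        ctx_map &= ~(1u << i);
    }
}

int main(void)
{
    insert_request(1);
    insert_request(3);
    insert_request(3);
    flush_busy_ctxs();
    return 0;
}
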
Currently, blk-mq only sets up the mapping for present CPUs. This is due to commit:
4b855ad (blk-mq: Create hctx for each present CPU)

It used to do the mapping for online CPUs. Then, on every cpu hotplug event, blk-mq had to adjust the mapping; during that process it would freeze the queue and wait for all requests to complete (every request holds a reference on q->q_usage_counter). However, this could lead to deadlock scenarios, for example the one addressed by the following commit:
302ad8c (nvme: Complete all stuck requests)
When we suspend an nvme device, requests that have already been started are cancelled and requeued, but they still hold q->q_usage_counter. Because the queue has been quiesced, these requests cannot be processed and q->q_usage_counter cannot drop, so the cpu-hotplug handling in blk-mq cannot make progress and gets stuck.
That is why 4b855ad (blk-mq: Create hctx for each present CPU) was introduced.
But this introduced another regression; see the following mail thread: https://lkml.org/lkml/2017/11/21/583
As Christian said:
"I am not doing a hot unplug and then replug, I use KVM and add a previously
not available CPU."
The cpu_present_mask can be modified at that moment, so the WARN_ON in __blk_mq_run_hw_queue() is triggered:
    WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
        cpu_online(hctx->next_cpu));
On the other hand, the newly present cpu's ctx->index_hw is zero, so when there are pending requests on its software queue, the hctx will go to cpu0's ctx to fetch requests, and the requests on the newly present cpu can never be handled.

Will an hctx be executed on a cpu to which it is not mapped?
For example, an nvme card in an 8-core PC creates 8 sq/cq pairs as well as 8 blk_mq_hw_ctx structures. Let's look at the mapping between hctx and cpu first.
blk_mq_alloc_tag_set()
    -> blk_mq_update_queue_map()
        -> nvme_pci_map_queues()
            -> blk_mq_pci_map_queues()
>>>>
    for (queue = 0; queue < set->nr_hw_queues; queue++) {
        mask = pci_irq_get_affinity(pdev, queue);
        if (!mask)
            goto fallback;

        for_each_cpu(cpu, mask)
            set->mq_map[cpu] = queue;
    }

    return 0;
>>>>
Then in 
blk_mq_init_allocated_queue()
    -> blk_mq_map_swqueue()
>>>>
    for_each_present_cpu(i) {
        hctx_idx = q->mq_map[i];
        >>>>
        ctx = per_cpu_ptr(q->queue_ctx, i);
        hctx = blk_mq_map_queue(q, i);

        cpumask_set_cpu(i, hctx->cpumask);
        ctx->index_hw = hctx->nr_ctx;
        hctx->ctxs[hctx->nr_ctx++] = ctx;
    }
>>>>

nvme_alloc_ns()
    -> device_add_disk()
        -> blk_register_queue()
            -> __blk_mq_register_dev()
                -> for each hctx: blk_mq_register_hctx()
>>>>
    ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", hctx->queue_num);
    if (ret)
        return ret;

    hctx_for_each_ctx(hctx, ctx, i) {
        ret = kobject_add(&ctx->kobj, &hctx->kobj, "cpu%u", ctx->cpu);
        if (ret)
            break;
    }

>>>>
We can inspect the mapping between hctx and cpu through sysfs:
will@will-ThinkCentre-M910s:/sys/block/nvme0n1/mq$ tree
.
├── 0
│   ├── cpu0
│   ├── cpu_list
│   ├── nr_reserved_tags
│   └── nr_tags
├── 1
│   ├── cpu1
│   ├── cpu_list
│   ├── nr_reserved_tags
│   └── nr_tags
├── 2
│   ├── cpu2
│   ├── cpu_list
│   ├── nr_reserved_tags
│   └── nr_tags
├── 3
│   ├── cpu3
│   ├── cpu_list
│   ├── nr_reserved_tags
│   └── nr_tags
├── 4
│   ├── cpu4
│   ├── cpu_list
│   ├── nr_reserved_tags
│   └── nr_tags
├── 5
│   ├── cpu5
│   ├── cpu_list
│   ├── nr_reserved_tags
│   └── nr_tags
├── 6
│   ├── cpu6
│   ├── cpu_list
│   ├── nr_reserved_tags
│   └── nr_tags
└── 7
    ├── cpu7
    ├── cpu_list
    ├── nr_reserved_tags
    └── nr_tags

16 directories, 24 files
Now we know that every hctx is mapped to specific cpus. Back to our question: will an hctx be executed on a cpu to which it is not mapped?
There are two paths that execute an hctx: sync and async.
Let's look at one of the sync paths.
blk_mq_make_request()
>>>>
    rq = blk_mq_get_request(q, bio, bio->bi_opf, &data);
    // blk_mq_get_ctx() will invoke get_cpu() to disable preemption
>>>>
    } else if (q->nr_hw_queues > 1 && is_sync) {
        blk_mq_put_ctx(data.ctx); // preempt enabled again.
        blk_mq_bio_to_request(rq, bio);
        // There seems to be a gap here: preemption is enabled again, so the
        // current task may be migrated to another cpu.
        blk_mq_try_issue_directly(data.hctx, rq, &cookie);
    }
>>>>
Let's look into blk_mq_try_issue_directly():

>>>>
    if (!(hctx->flags & BLK_MQ_F_BLOCKING)) { 
        // nvme doesn't have this BLK_MQ_F_BLOCKING
        rcu_read_lock(); // preempt disable here
        __blk_mq_try_issue_directly(hctx, rq, cookie, false);
        rcu_read_unlock();
    }
>>>>
Another sync path.
blk_mq_requeue_work()
    -> blk_mq_run_hw_queues()  // async = false
        -> for each hctx: blk_mq_run_hw_queue()

blk_freeze_queue_start()
    -> blk_mq_run_hw_queues() // async = false
The two paths above can execute the hctxs on cpus to which they are not mapped.
For the async path, the scenario is different.
__blk_mq_delay_run_hw_queue() invokes blk_mq_hctx_next_cpu() to decide on which cpu to queue hctx->run_work, and blk_mq_run_work_fn() is then executed by a kworker pinned to that specific cpu.
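A rough userspace sketch of that "pick the next mapped cpu" behaviour (the cpumask is a plain bitmask, and the next_cpu_batch/offline handling of the real blk_mq_hctx_next_cpu() is ignored):

#include <stdio.h>

#define NR_CPUS 8

/* cf. blk_mq_hctx_next_cpu(): walk the hctx cpumask round-robin so the
 * async run_work is always queued on a cpu this hctx is mapped to */
static int next_cpu(unsigned int cpumask, int *cursor)
{
    for (int i = 0; i < NR_CPUS; i++) {
        int cpu = (*cursor + i) % NR_CPUS;
        if (cpumask & (1u << cpu)) {
            *cursor = (cpu + 1) % NR_CPUS;
            return cpu;
        }
    }
    return -1;
}

int main(void)
{
    unsigned int hctx_cpumask = (1u << 2) | (1u << 6); /* mapped to cpu2, cpu6 */
    int cursor = 0;

    for (int i = 0; i < 4; i++)
        printf("queue run_work on cpu%d\n", next_cpu(hctx_cpumask, &cursor));
    return 0;   /* prints cpu2, cpu6, cpu2, cpu6 */
}
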

Tags sharing

The driver tags can be shared between different request_queues. Consider the following scenario.

SCSI HBA ------> SCSI TARGET +------> LU 0
                             |
                             +------> LU 1
                             |
                             +------> LU 2
                             |
                             +------> LU 3
The number of LUs can be very large in a SAN setup.
How does blk-mq handle fairness between different request_queues on the same tag_set?
First, the request_queue and the blk-mq tag_set have to be told that the tag_set is shared.
blk_mq_init_allocated_queue()
    -> blk_mq_add_queue_tag_set()

static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
                     struct request_queue *q)
{
    q->tag_set = set;

    mutex_lock(&set->tag_list_lock);

    /*
     * Check to see if we're transitioning to shared (from 1 to 2 queues).
     */
    if (!list_empty(&set->tag_list) &&
        !(set->flags & BLK_MQ_F_TAG_SHARED)) {
        set->flags |= BLK_MQ_F_TAG_SHARED;
        /* update existing queue */
        blk_mq_update_tag_set_depth(set, true);
        /*
         Actually, only the first request_queue on this tag_set will be set here.
         The blk-mq will be frozen and the hctxs associated with request_queue
         will be set BLK_MQ_F_TAG_SHARED.
         */
    }
    if (set->flags & BLK_MQ_F_TAG_SHARED)
        queue_set_hctx_shared(q, true); // Set the BLK_MQ_F_TAG_SHARED
    list_add_tail_rcu(&q->tag_set_list, &set->tag_list);

    mutex_unlock(&set->tag_list_lock);
}
In addition, sharing a tag_set means sharing the driver tags. Look at the following code:
blk_mq_init_allocated_queue()
    -> blk_mq_add_queue_tag_set()
    -> blk_mq_map_swqueue()
>>>>
    queue_for_each_hw_ctx(q, hctx, i) {
    >>>>
        hctx->tags = set->tags[i];
        WARN_ON(!hctx->tags);
    >>>>
    }
>>>>
Everything then revolves around this BLK_MQ_F_TAG_SHARED flag.
Driver tag allocation:
blk_mq_get_request()
    -> blk_mq_rq_ctx_init()
Or 
blk_mq_get_driver_tag()
>>>>
        if (blk_mq_tag_busy(data.hctx)) {
            rq->rq_flags |= RQF_MQ_INFLIGHT;
            atomic_inc(&data.hctx->nr_active);
        }
>>>>
blk_mq_tag_busy() only returns true when BLK_MQ_F_TAG_SHARED is set.
In __blk_mq_tag_busy(), BLK_MQ_S_TAG_ACTIVE is set and
hctx->tags->active_queues is increased.

In addition, hctx->nr_active is increased.

hctx->tags->active_queues is the count of request_queues actively using this driver tag set.
hctx->nr_active is the count of driver tags used by this hctx itself.

Where are these two values used?

blk_mq_get_request()/blk_mq_get_driver_tag()
    -> blk_mq_get_tag()
        -> __blk_mq_get_tag() // if !BLK_MQ_REQ_INTERNAL
            -> hctx_may_queue()

static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
                  struct sbitmap_queue *bt)
{
    unsigned int depth, users;

    if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
        return true;
    if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
        return true;

    /*
     * Don't try dividing an ant
     */
    if (bt->sb.depth == 1)
        return true;

    users = atomic_read(&hctx->tags->active_queues);
    if (!users)
        return true;

    /*
     * Allow at least some tags
     */
    depth = max((bt->sb.depth + users - 1) / users, 4U);
    return atomic_read(&hctx->nr_active) < depth;
}
The code above is easy to understand: it simply divides the tag depth evenly
among the active request_queues to ensure fairness. But this does not look fair
when there is a heavy workload on LU0 and a relatively light one on LU1. Another
unfortunate detail is that active_queues is only decreased when the queue goes
idle via the timeout path or exits (look at blk_mq_tag_idle()). This means the tags
remain split between the LUs whether or not the LUs actually have any workload.
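A worked example of the depth calculation in hctx_may_queue(), with made-up numbers for the shared tag depth and the number of active queues:

#include <stdio.h>

/* same arithmetic as hctx_may_queue(): divide the shared tag depth among
 * the active request_queues, but never allow fewer than 4 tags each */
static unsigned int fair_depth(unsigned int sb_depth, unsigned int users)
{
    unsigned int depth = (sb_depth + users - 1) / users;
    return depth < 4 ? 4 : depth;
}

int main(void)
{
    /* e.g. an HBA with 256 driver tags shared by a varying number of LUs */
    for (unsigned int users = 1; users <= 128; users *= 2)
        printf("%3u active queues -> at most %3u tags each\n",
               users, fair_depth(256, users));
    return 0;
}
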

driver tag release
__blk_mq_put_driver_tag()/blk_mq_free_request()
>>>>
    if (rq->rq_flags & RQF_MQ_INFLIGHT) {
        rq->rq_flags &= ~RQF_MQ_INFLIGHT;
        atomic_dec(&hctx->nr_active);
    }
>>>>

hctx->nr_active will be decreased here.
restart hctxs
If we submit a request to blk-mq, when will it not be issued?

Regarding the case where no driver tag is available when dispatching:
blk_mq_mark_tag_wait() is invoked by blk_mq_dispatch_rq_list().
(A commit that optimizes this path: f906a6a blk-mq: improve tag waiting setup for non-shared tags)
For the non-shared-tags case, the BLK_MQ_S_SCHED_RESTART flag is enough.
For the shared-tags case, it relies on the sbitmap-queue wakeup (sbq_wake_up()) instead.

Regarding restarting an hctx after a driver tag has been freed:

Look at the comment of blk_mq_sched_restart()
Called after a driver tag has been freed to check whether a hctx needs to
be restarted. Restarts @hctx if its tag set is not shared. Restarts hardware
queues in a round-robin fashion if the tag set of @hctx is shared with other
hardware queues.
Look at the round-robin part:
>>>>
        list_for_each_entry_rcu_rr(q, queue, &set->tag_list,
                       tag_set_list) {
            queue_for_each_hw_ctx(q, hctx2, i)
                if (hctx2->tags == tags &&
                    blk_mq_sched_restart_hctx(hctx2))
                    goto done;
        }
>>>>
blk_mq_sched_restart_hctx() will run the hctx asynchronously if it has work
pending on it.
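A minimal sketch of the round-robin restart across request_queues sharing a tag set; the list handling, RCU and hctx details are omitted, the point is only that each restart resumes after the queue served last time, so no queue is starved.

#include <stdio.h>

#define NR_QUEUES 4

static int has_pending[NR_QUEUES] = { 0, 1, 0, 1 };
static int from;   /* cf. the round-robin starting point in set->tag_list */

/* cf. blk_mq_sched_restart() for shared tag sets: kick one hctx that has
 * pending work, starting after the queue restarted last time */
static void restart_shared(void)
{
    for (int i = 0; i < NR_QUEUES; i++) {
        int q = (from + i) % NR_QUEUES;
        if (has_pending[q]) {
            printf("restart hctx of request_queue %d\n", q);
            has_pending[q] = 0;
            from = (q + 1) % NR_QUEUES;
            return;
        }
    }
}

int main(void)
{
    restart_shared();   /* restarts queue 1 (first with pending work) */
    restart_shared();   /* next round starts at queue 2, restarts queue 3 */
    return 0;
}
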

Polling

Regarding the blk-mq polling, please refer to Block-layer I/O polling

blk_mq_poll()
    -> rq = blk_mq_tag_to_rq() //hctx->tags/hctx->sched_tags based on the cookie
    -> __blk_mq_poll()
    >>>>
    while (!need_resched()) {
        int ret;

        hctx->poll_invoked++;

        ret = q->mq_ops->poll(hctx, rq->tag);
        if (ret > 0) {
            hctx->poll_success++;
            set_current_state(TASK_RUNNING);
            return true;
        }
        >>>>
        if (ret < 0)
            break;
        cpu_relax();
    }

    >>>>

static int __nvme_poll(struct nvme_queue *nvmeq, unsigned int tag)
{
    struct nvme_completion cqe;
    int found = 0, consumed = 0;

    if (!nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase))
        return 0;
    //First check whether there is a valid cqe; if not, return.
    //This runs with interrupts enabled, so before __nvme_poll() sees a valid
    //cqe, the interrupt may already have arrived and processed it. In that case
    //nvme_poll() gets no cqe here, so how does it get out of the loop in __blk_mq_poll()?
    spin_lock_irq(&nvmeq->q_lock);
    while (nvme_read_cqe(nvmeq, &cqe)) {
        nvme_handle_cqe(nvmeq, &cqe);
        consumed++;

        if (tag == cqe.command_id) {
            found = 1;
            break;
        }
       }

    if (consumed)
        nvme_ring_cq_doorbell(nvmeq);
    spin_unlock_irq(&nvmeq->q_lock);

    return found;
}
(Note: disabling local irqs here does not keep the completion interrupt away, because the polling context may have been migrated to a cpu other than the one that receives the cq interrupt.)
The answer is task state.
__blkdev_direct_IO_simple()
>>>>
qc = submit_bio(&bio);
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);
        if (!READ_ONCE(bio.bi_private))
            break;
        if (!(iocb->ki_flags & IOCB_HIPRI) ||
            !blk_mq_poll(bdev_get_queue(bdev), qc))
            io_schedule();
    }
>>>>
The task state is set to TASK_UNINTERRUPTIBLE. The block completion path will try to wake up this process in the bi_end_io callback, blkdev_bio_end_io_simple():
try_to_wake_up()
    -> ttwu_queue()
        -> ttwu_do_activate()
            -> ttwu_do_wakeup()
                -> p->state = TASK_RUNNING
In the loop of __blk_mq_poll(), the task state is checked:
>>>>
    while (!need_resched()) {
        int ret;

        hctx->poll_invoked++;

        ret = q->mq_ops->poll(hctx, rq->tag);
        >>>>
        if (current->state == TASK_RUNNING)
            return true;
        if (ret < 0)
            break;
        cpu_relax();
    }
>>>>
If the task state has been set to TASK_RUNNING, it returns true.
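A userspace analogue of that exit condition, with an atomic flag standing in for current->state and a thread playing the interrupt handler (again only a sketch of the idea, not the kernel mechanism):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* 1 stands in for TASK_UNINTERRUPTIBLE, 0 for TASK_RUNNING */
static atomic_int task_state;

/* the "poll" callback finds nothing because, in this scenario, the
 * interrupt handler has already consumed the cqe */
static int mq_ops_poll(void)
{
    return 0;
}

static void *irq_handler(void *arg)
{
    (void)arg;
    usleep(1000);                  /* the device completes the IO */
    /* cf. blkdev_bio_end_io_simple() -> try_to_wake_up(): the waker
     * flips the submitter back to TASK_RUNNING */
    atomic_store(&task_state, 0);
    return NULL;
}

int main(void)
{
    pthread_t irq;
    unsigned long spins = 0;

    /* cf. __blkdev_direct_IO_simple(): set the state before checking for
     * completion, so a wake-up cannot be missed */
    atomic_store(&task_state, 1);
    pthread_create(&irq, NULL, irq_handler, NULL);

    /* cf. __blk_mq_poll(): even though polling never sees the cqe,
     * the task-state check lets the loop terminate */
    for (;;) {
        spins++;
        if (mq_ops_poll() > 0) {
            printf("completion found by polling\n");
            break;
        }
        if (atomic_load(&task_state) == 0) {
            printf("woken via task state after %lu spins\n", spins);
            break;
        }
    }

    pthread_join(irq, NULL);
    return 0;
}
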
The following are test results w/ and w/o blk-mq polling.
w/o blk-mq polling
  read : io=131072KB, bw=79007KB/s, iops=19751, runt=  1659msec
    clat (usec): min=27, max=253, avg=49.46, stdev=33.61

w/ blk-mq polling
  read : io=131072KB, bw=97019KB/s, iops=24254, runt=  1351msec
    clat (usec): min=19, max=302, avg=40.90, stdev=34.31
The difference in average completion latency is about 9 us, roughly the cost of the two scheduling operations (sleep and wake-up) that polling avoids.