concepts
blk-mq
Block legacy
plug
BIO
Merge
FLUSH and FUA
Queue state flags
WBT
blkdev gendisk hd
blk sysfs
request_queue cleanup and release
blk_integrity
blk loop
blk-stats
blk-timeout
blk-throttle
bsg
direct_IO
blk RPM
EIO is fatal for fs
Whether EIO is fatal depends on the component that receives it, and each
behaves accordingly. If a file system encounters an EIO during normal I/O
(no metadata updates involved), the error is bubbled back to user space, and
even the userspace application can choose how to behave: it can resubmit the
I/O if possible, or abort if the I/O is part of recovery.
If the file system gets an EIO during a journal (metadata) update, as in this
case, it has two choices: 1) remount the FS read-only, or 2) crash the node.
A single-node FS can go read-only, but a clustered FS has to evict itself,
hoping that at least the other nodes can continue.
So avoid IO errors as much as possible.
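As a hedged illustration of the userspace side (not taken from any kernel or application source; the file name and the abort policy are assumptions), an application might react to EIO from fsync() like this:
---
/* Minimal sketch: how an application might react to EIO from fsync().
 * The path and the abort policy are illustrative assumptions. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "record\n", 7) != 7)
		perror("write");
	if (fsync(fd) < 0) {
		if (errno == EIO) {
			/* The data may or may not have reached stable storage.
			 * Simply retrying fsync() is not enough on modern
			 * kernels (the error has already been reported and the
			 * dirty state may be dropped), so treat it as loss. */
			fprintf(stderr, "fsync: EIO, aborting\n");
			close(fd);
			return 1;
		}
		perror("fsync");
	}
	close(fd);
	return 0;
}
---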
There are two parts in sbitmap: the bitmap itself (struct sbitmap) and the wait-queue machinery on top of it (struct sbitmap_queue).
The core idea of sbitmap_queue is 'batch' and 'scatter'
scatter
Each caller of sbq_wait_ptr has its own wait_index.
static inline struct sbq_wait_state *sbq_wait_ptr(struct sbitmap_queue *sbq,
atomic_t *wait_index)
{
struct sbq_wait_state *ws;
ws = &sbq->ws[atomic_read(wait_index)];
sbq_index_atomic_inc(wait_index); /* the wait_index is advanced on every call */
return ws;
}
Every time the caller gets an sbq_wait_state, its wait_index is incremented by 1.
Take blk_mq_get_request as an example: when multiple tasks try to allocate a tag
and all of them fail, each of them gets a wait queue and sleeps on it.
sbq_wait_ptr ensures they get different wait queues, so there is no contention
when the wait entries are added to the wait queues.
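A minimal userspace model of this 'scatter' idea, assuming 8 wait queues like SBQ_WAIT_QUEUES; it only demonstrates the per-caller round-robin index, not the real sbitmap_queue:
---
/* Userspace model of the 'scatter' selection in sbq_wait_ptr(): each caller
 * keeps its own wait_index and advances it on every call, so concurrent
 * callers spread over the 8 wait queues. Names mirror the kernel, but this
 * is only a sketch. */
#include <stdatomic.h>
#include <stdio.h>

#define SBQ_WAIT_QUEUES 8

struct sbq_wait_state { int id; };

static struct sbq_wait_state ws[SBQ_WAIT_QUEUES];

static struct sbq_wait_state *sbq_wait_ptr(atomic_int *wait_index)
{
	/* pick the queue this caller's index points at ... */
	struct sbq_wait_state *w = &ws[atomic_load(wait_index) % SBQ_WAIT_QUEUES];

	/* ... and advance it so the next sleep lands on a different queue */
	atomic_fetch_add(wait_index, 1);
	return w;
}

int main(void)
{
	atomic_int task_a = 0, task_b = 3;	/* two tasks, each with its own index */

	for (int i = 0; i < SBQ_WAIT_QUEUES; i++)
		ws[i].id = i;
	for (int i = 0; i < 4; i++)
		printf("task_a -> ws[%d], task_b -> ws[%d]\n",
		       sbq_wait_ptr(&task_a)->id, sbq_wait_ptr(&task_b)->id);
	return 0;
}
---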
We can check this in /sys/kernel/debug/block/nvme0n1/hctx0/tags (driver tags):
wake_index=0
ws={
{.wait_cnt=1, .wait=inactive},
{.wait_cnt=1, .wait=active},
{.wait_cnt=1, .wait=inactive},
{.wait_cnt=1, .wait=active},
{.wait_cnt=1, .wait=inactive},
{.wait_cnt=1, .wait=active},
{.wait_cnt=1, .wait=inactive},
{.wait_cnt=1, .wait=active},
}
batch
static void sbq_wake_up(struct sbitmap_queue *sbq)
{
...
ws = sbq_wake_ptr(sbq);
if (!ws)
return;
wait_cnt = atomic_dec_return(&ws->wait_cnt);
if (wait_cnt <= 0) {
wake_batch = READ_ONCE(sbq->wake_batch);
smp_mb__before_atomic();
atomic_cmpxchg(&ws->wait_cnt, wait_cnt, wait_cnt + wake_batch);
sbq_index_atomic_inc(&sbq->wake_index);
wake_up_nr(&ws->wait, wake_batch);
}
}
wake_index=0
ws={
{.wait_cnt=1, .wait=inactive},
{.wait_cnt=1, .wait=active},
{.wait_cnt=1, .wait=inactive},
{.wait_cnt=1, .wait=active},
{.wait_cnt=1, .wait=inactive},
{.wait_cnt=1, .wait=active},
{.wait_cnt=1, .wait=inactive},
{.wait_cnt=1, .wait=active}, only one wait queue is woken up each time a wait_cnt drains to zero
}
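A hedged userspace model of the 'batch' side (the values are illustrative): every freed tag decrements wait_cnt on the current wake queue, and only when it hits zero does that one queue get wake_batch wakeups and wake_index move on:
---
/* Userspace model of sbq_wake_up() batching: wait_cnt is decremented once per
 * freed tag; only when it reaches zero is a single wait queue woken
 * (wake_batch waiters) and wake_index advanced. Pure sketch, no real waiters. */
#include <stdio.h>

#define SBQ_WAIT_QUEUES 8

static int wait_cnt[SBQ_WAIT_QUEUES];
static int wake_index;
static const int wake_batch = 4;	/* assumed batch size */

static void sbq_wake_up(void)
{
	if (--wait_cnt[wake_index] > 0)
		return;			/* batch not reached, nobody is woken */

	wait_cnt[wake_index] += wake_batch;	/* re-arm this queue */
	printf("wake %d waiters on ws[%d]\n", wake_batch, wake_index);
	wake_index = (wake_index + 1) % SBQ_WAIT_QUEUES;
}

int main(void)
{
	for (int i = 0; i < SBQ_WAIT_QUEUES; i++)
		wait_cnt[i] = wake_batch;
	for (int freed = 0; freed < 12; freed++)	/* 12 tags freed */
		sbq_wake_up();
	return 0;
}
---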
Does the wake_batch introduce delay on high-speed devices ?
There is an interesting bug related to wake_batch.
The wake_batch is calculated based on the sbitmap_queue depth which is actually
the tagset depth.
But the runtime tagset depth could be changed due to shallow_depth and
.limit_depth callback.
BFQ could end up limiting shallow_depth to something smaller than the wake
batch sizing of the sbitmap; then we can run into cases where we never wake up
the tasks waiting for a tag. The end result is an idle system with no IO pending,
but with tasks waiting for a tag and nobody to wake them up, because wait_cnt
never drains down past the wake_batch.
Kyber could run into the same issue, if the async depth is limited low enough.
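A rough numeric sketch of that hang (the numbers are assumptions, not taken from any real configuration):
---
/* Numeric sketch of the shallow_depth vs wake_batch hang: at most 'shallow'
 * tags can be in flight, so at most 'shallow' frees happen before every task
 * sleeps, and wait_cnt can never drop from wake_batch to zero. */
#include <stdio.h>

int main(void)
{
	int wake_batch = 8;	/* derived from the full sbitmap depth */
	int shallow = 4;	/* runtime cap from .limit_depth */
	int wait_cnt = wake_batch;

	for (int freed = 0; freed < shallow; freed++)
		wait_cnt--;	/* one decrement per completed request */

	printf("wait_cnt after all in-flight IO completes: %d\n", wait_cnt);
	if (wait_cnt > 0)
		printf("no wakeup ever happens -> idle system, tasks stuck\n");
	return 0;
}
---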
There are two types of tags: sched tags and driver tags.
In the commit that added the MQ-capable IO scheduler framework (bd166ef), Jens Axboe said:
We split driver and scheduler tags, so we can run the scheduling independently of device queue depth.
sched tags sched tags sched tags sched tags
Queue0 Queue1 Queue2 Queue3
shared driver tags
HBA cmd queue [C][C][C][C]
LUN0 LUN1 LUN2 LUN3
blk_mq_get_tag is used to allocate tag.
The following points need to be noted:
If the tags are used up, there are mainly two ways of waiting for a tag.
There are two interfaces that invoke blk_mq_get_tag to get a tag.
blk_mq_get_tag -> blk_mq_tags_from_data
decides which tagset the tag is allocated from, based on BLK_MQ_REQ_INTERNAL.
Whether it is the sched tagset or the driver tagset, every tag corresponds to a static request entry.
blk_mq_get_request
-> blk_mq_get_tag // get tag
-> blk_mq_rq_ctx_init
---
struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
struct request *rq = tags->static_rqs[tag];
---
The static_rqs is filled in blk_mq_alloc_rqs
So we know, when we get a request, it may belong to sched tagset or driver
tagset.
The driver tag reflects the capacity of the HBA cmd queue.
Even if a request is from the sched tagset and already has a sched tag in
rq->internal_tag, it has to be assigned a driver tag before being issued.
blk_mq_dispatch_rq_list -> blk_mq_get_driver_tag
After getting a driver tag, the request is installed in the driver tags->rqs[].
blk_mq_get_request
-> blk_mq_rq_ctx_init
---
if (data->flags & BLK_MQ_REQ_INTERNAL) {
rq->tag = -1;
rq->internal_tag = tag;
} else {
if (blk_mq_tag_busy(data->hctx)) {
rq_flags = RQF_MQ_INFLIGHT;
atomic_inc(&data->hctx->nr_active);
}
rq->tag = tag;
rq->internal_tag = -1;
data->hctx->tags->rqs[rq->tag] = rq;
}
---
blk_mq_get_driver_tag
---
rq->tag = blk_mq_get_tag(&data);
if (rq->tag >= 0) {
...
data.hctx->tags->rqs[rq->tag] = rq;
}
---
The driver uses the driver tag to look up the associated request entry:
struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
{
if (tag < tags->nr_tags) {
prefetch(tags->rqs[tag]);
return tags->rqs[tag];
}
return NULL;
}
Note that requests seem to be installed only in the driver tagset's rqs[].
shallow depth is a low-overhead way to limit the tag depth compared with resizing the tagset.
It usually cooperates with the elevator's .limit_depth callback.
blk_mq_get_request
-> .limit_depth // change the shallow_depth
-> blk_mq_get_tag
-> __blk_mq_get_tag
---
if (data->shallow_depth)
return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
else
return __sbitmap_queue_get(bt);
---
blk_freeze_queue waits for all requests that have been allocated to complete.
blk_mq_get_request
-> blk_queue_enter_live // get q->q_usage_counter
blk_mq_free_request
-> blk_queue_exit // release q->q_usage_counter
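A hedged userspace model of the enter/exit/freeze idea, using a plain atomic counter and polling instead of the kernel's percpu-refcount and wait queue; it only shows why freezing must wait for every outstanding enter to exit:
---
/* Userspace model of blk_queue_enter()/blk_queue_exit()/freeze: a single
 * atomic stands in for q->q_usage_counter. Freezing flips a flag and waits
 * for the count to drain. Sketch only. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int usage;		/* in-flight request allocations */
static atomic_bool frozen;

static bool queue_enter(void)
{
	if (atomic_load(&frozen))
		return false;		/* new IO is blocked while frozen */
	atomic_fetch_add(&usage, 1);
	return true;
}

static void queue_exit(void)
{
	atomic_fetch_sub(&usage, 1);
}

static void *io_thread(void *arg)
{
	(void)arg;
	if (queue_enter()) {
		usleep(100 * 1000);	/* pretend the request is in flight */
		queue_exit();
	}
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, io_thread, NULL);
	usleep(10 * 1000);

	atomic_store(&frozen, true);	/* blk_freeze_queue_start() */
	while (atomic_load(&usage))	/* blk_mq_freeze_queue_wait() */
		usleep(1000);
	printf("queue frozen: all allocated requests have completed\n");

	pthread_join(t, NULL);
	return 0;
}
---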
blk_mq_get_tag
---
DEFINE_WAIT(wait); /* the wake up func is autoremove_wake_function */
...
ws = bt_wait_ptr(bt, data->hctx);
drop_ctx = data->ctx == NULL;
do {
...
tag = __blk_mq_get_tag(data, bt);
if (tag != -1)
break;
prepare_to_wait_exclusive(&ws->wait, &wait,
TASK_UNINTERRUPTIBLE);
tag = __blk_mq_get_tag(data, bt);
if (tag != -1)
break;
if (data->ctx)
blk_mq_put_ctx(data->ctx);
io_schedule();
/* after the task is scheduled back, it may have been migrated to another
 * cpu, so the ctx and hctx have to be reassigned */
data->ctx = blk_mq_get_ctx(data->q);
data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
tags = blk_mq_tags_from_data(data);
if (data->flags & BLK_MQ_REQ_RESERVED)
bt = &tags->breserved_tags;
else
bt = &tags->bitmap_tags;
finish_wait(&ws->wait, &wait);
ws = bt_wait_ptr(bt, data->hctx);
} while (1);
---
There is an interesting issue here.
Consider the following scenario:
Time0: task0, task1, task2 are all waiting for a tag on hctx0
hctx0 tags ws={
{.wait_cnt=1, .wait=active}, - task0
{.wait_cnt=1, .wait=active}, - task1
{.wait_cnt=1, .wait=active}, - task2
{.wait_cnt=1, .wait=inactive},
}
Time2: tags are released and task0 is woken up, but it is migrated to another cpu
while waking up, so task0 runs and allocates a tag on another hctx.
Consequently, even though there are free tags in hctx0's tagset, nobody allocates them.
Worse, task1 and task2 are still sleeping and nobody will ever wake them up.
hctx0 tags ws={
{.wait_cnt=1, .wait=inactive},
{.wait_cnt=1, .wait=active}, - task1
{.wait_cnt=1, .wait=active}, - task2
{.wait_cnt=1, .wait=inactive},
}
This happens when blk_mq_dispatch_rq_list tries to allocate driver tags for
requests from the sched tagset.
When the driver tagset is shared, this relies on the sbitmap_queue wakeup mechanism;
otherwise, the blk-mq restart mechanism reruns the hw queue.
blk_mq_dispatch_rq_list
-> blk_mq_get_driver_tag
-> blk_mq_mark_tag_wait
---
if (!(this_hctx->flags & BLK_MQ_F_TAG_SHARED)) {
if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
set_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state);
return blk_mq_get_driver_tag(rq, hctx, false);
}
wait = &this_hctx->dispatch_wait;
/* the wake up func is blk_mq_dispatch_wake: it removes the entry from the
 * wait list and runs the hw queue asynchronously */
if (!list_empty_careful(&wait->entry))
return false;
spin_lock(&this_hctx->lock);
if (!list_empty(&wait->entry)) {
spin_unlock(&this_hctx->lock);
return false;
}
ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
add_wait_queue(&ws->wait, wait);
---
One HBA can connect to multiple LUs; every LU has a request_queue, and all of these
request_queues share the tagset of the HBA.
From the view of scsi source code:
scsi_alloc_sdev
-> scsi_mq_alloc_queue
---
sdev->request_queue = blk_mq_init_queue(&sdev->host->tag_set);
/* all of the scsi devs (LUs) share the same tagset of the host (HBA) */
if (IS_ERR(sdev->request_queue))
return NULL;
sdev->request_queue->queuedata = sdev;
__scsi_init_queue(sdev->host, sdev->request_queue);
blk_queue_flag_set(QUEUE_FLAG_SCSI_PASSTHROUGH, sdev->request_queue);
return sdev->request_queue;
---
For shared tag users, we track the number of currently active users
and attempt to provide a fair share of the tag depth for each of them.
blk_mq_get_request/blk_mq_get_driver_tag
-> blk_mq_get_tag
-> __blk_mq_get_tag
-> hctx_may_queue
static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
struct sbitmap_queue *bt)
{
unsigned int depth, users;
if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
return true;
if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
return true;
/*
* Don't try dividing an ant
*/
if (bt->sb.depth == 1)
return true;
users = atomic_read(&hctx->tags->active_queues);
if (!users)
return true;
/*
* Allow at least some tags
*/
depth = max((bt->sb.depth + users - 1) / users, 4U);
return atomic_read(&hctx->nr_active) < depth;
}
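A small worked example of the fair-share formula above, with assumed numbers; the helper name is made up:
---
/* Worked example of the fair share computed in hctx_may_queue():
 * depth = max((bt_depth + users - 1) / users, 4). The numbers are assumed. */
#include <stdio.h>

static unsigned int fair_share(unsigned int bt_depth, unsigned int users)
{
	unsigned int depth = (bt_depth + users - 1) / users;	/* ceiling division */

	return depth > 4 ? depth : 4;	/* allow at least some tags */
}

int main(void)
{
	printf("62 tags, 4 active queues -> %u tags per hctx\n", fair_share(62, 4));
	printf("62 tags, 30 active queues -> %u tags per hctx (floor of 4)\n",
	       fair_share(62, 30));
	return 0;
}
---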
There are two key points here:
Where to activate them ?
blk_mq_rq_ctx_init
---
if (data->flags & BLK_MQ_REQ_INTERNAL) {
rq->tag = -1;
rq->internal_tag = tag;
} else {
if (blk_mq_tag_busy(data->hctx)) {
rq_flags = RQF_MQ_INFLIGHT;
atomic_inc(&data->hctx->nr_active);
}
rq->tag = tag;
rq->internal_tag = -1;
data->hctx->tags->rqs[rq->tag] = rq;
}
---
blk_mq_get_driver_tag
---
rq->tag = blk_mq_get_tag(&data);
if (rq->tag >= 0) {
if (blk_mq_tag_busy(data.hctx)) {
rq->rq_flags |= RQF_MQ_INFLIGHT;
atomic_inc(&data.hctx->nr_active);
}
data.hctx->tags->rqs[rq->tag] = rq;
}
---
blk_mq_tag_busy
-> __blk_mq_tag_busy
---
if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
!test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
atomic_inc(&hctx->tags->active_queues);
---
When to deactivate them ?
An interesting question:
blk_mq_exit_hctx/blk_mq_timeout_work
-> blk_mq_tag_idle
-> __blk_mq_tag_idle
---
struct blk_mq_tags *tags = hctx->tags;
if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
return;
atomic_dec(&tags->active_queues);
blk_mq_tag_wakeup_all(tags, false);
---
Let's look into the case of blk_mq_timeout_work
---
if (data.next_set) {
data.next = blk_rq_timeout(round_jiffies_up(data.next));
mod_timer(&q->timeout, data.next);
} else {
/*
* Request timeouts are handled as a forward rolling timer. If
* we end up here it means that no requests are pending and
* also that no request has been pending for a while. Mark
* each hctx as idle.
*/
queue_for_each_hw_ctx(q, hctx, i) {
/* the hctx may be unmapped, so check it here */
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_idle(hctx);
}
}
---
When to set the next_set ?
blk_mq_check_expired
---
if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
time_after_eq(jiffies, deadline)) {
blk_mq_rq_update_aborted_gstate(rq, gstate);
data->nr_expired++;
hctx->nr_expired++;
} else if (!data->next_set || time_after(data->next, deadline)) {
data->next = deadline;
data->next_set = 1;
}
---
If any pending, non-timeout request exists, we set next_set.
blk_mq_free_request
---
if (rq->rq_flags & RQF_MQ_INFLIGHT)
atomic_dec(&hctx->nr_active);
---
__blk_mq_put_driver_tag
---
blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
rq->tag = -1;
if (rq->rq_flags & RQF_MQ_INFLIGHT) {
rq->rq_flags &= ~RQF_MQ_INFLIGHT;
atomic_dec(&hctx->nr_active);
}
---
BLK-MQ
q of LUN0 q of LUN1 q of LUN2 q of LUN3
hctx hctx hctx hctx
active active active inactive
driver tags
------------------------------------------------------
LLDD
HBA
All the driver tags have been used up by the 3 active queues.
At this moment, we submit a bio to the inactive queue of LUN3; it cannot get a driver tag
and queues the req on the hctx->dispatch list.
When will this hctx of LUN3 be woken up ?
blk_mq_mark_tag_wait will put this hctx of LUN3 on the shared-tag's wait queue.
When a driver tag is freed, it will wake up the waiters on the tag's wait queue
in round-robin fashion.
The active_queues count of the shared tags has changed, so reqs to LUN0/1/2 have
to wait for their budget even if the hctxs of LUN0/1/2 are woken up prior to LUN3's.
Here is part of the comment about io scheduler for blk-mq from the paper
[Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems]
While global sequential re-ordering is still possible across the multiple
software queues, it is only necessary for HDD based devices, where the additional
latency and locking overhead required to achieve total ordering does not hurt IOPS
performance. It can be argued that, for many users, it is no longer necessary to
employ advanced fairness scheduling as the speed of the devices are often
exceeding the ability of even multiple applications to saturate their performance.
If fairness is essential, it is possible to design a scheduler that exploits the
characteristics of SSDs at coarser granularity to achieve lower performance overhead.
Whether the scheduler should reside in the block layer or on the SSD controller
is an open issue. If the SSD is responsible for fair IO scheduling, it can leverage
internal device parallelism, and lower latency, at the cost of additional interface
complexity between disk and OS
We can take the following points from the comment above:
blk-mq needs an io scheduler to get this ability. bfq is an io scheduler for
HDDs, but it does not seem to be widely used yet, so if we want a stable io scheduler
with advanced fairness scheduling, we have to go back to blk-legacy and use
cfq.
So we don't need that kind of scheduler, such as cfq or bfq, for nvme devices.
Kyber is a coarser-granularity, low-overhead io scheduler for fast devices.
An example on the device side is nvme weighted round robin with urgent priority class arbitration.
[blk-mq io scheduler framework]
[scheduler init]
elevator_switch_mq
-> blk_mq_init_sched //freezed and quiesced
-> [.init_sched]
-> [.init_hctx]
[bio submit]
blk_mq_make_request
-> blk_mq_sched_bio_merge
-> __blk_mq_sched_bio_merge
-> [.bio_merge]
-> blk_mq_sched_try_merge //bfq and mq-deadline use it to merge a bio into an existing request
elv_merge // get the merge decision and req
-> [.request_merge]
if ELEVATOR_BACK_MERGE
blk_mq_sched_allow_merge
-> [.allow_merge]
bio_attempt_back_merge // merge the bio to the tail of req
attempt_back_merge // the new bio may have filled the hole between req and the latter req
-> elv_latter_request
-> [.next_request]
-> attempt_merge
-> [.requests_merged] // notify the io scheduler that the two reqs have been merged
elv_merged_request // if attempt_back_merge did nothing
-> [.request_merged] // one bio is merged into this req
else if ELEVATOR_FRONT_MERGE
blk_mq_sched_allow_merge
-> [.allow_merge]
bio_attempt_front_merge // merge the bio to the head of req
attempt_front_merge // the new bio may have filled the hole between req and the former req
-> elv_former_request
-> [.former_request]
-> attempt_merge
-> [.requests_merged] // notify the io scheduler that the two reqs have been merged
elv_merged_request // if attempt_front_merge did nothing
-> [.request_merged]
-> if a request merge happened, invoke blk_mq_free_request to free the merged request
blk_mq_free_request
-> [.finish_request]
[request allocation]
blk_mq_get_request
-> [.limit_depth] //update the blk_mq_alloc_data->shallow_depth
-> blk_mq_get_tag
-> shallow_depth? __sbitmap_queue_get_shallow : __sbitmap_queue_get
-> blk_mq_rq_ctx_init
-> blk_mq_sched_assign_ioc
-> ioc_create_icq
-> [.init_icq] // only bfq use it
-> [.prepare_request]
[request enqueue]
blk_mq_sched_insert_request
-> [.insert_requests]
-> blk_mq_sched_try_merge
-> elv_attempt_insert_merge
try blk_attempt_req_merge on q->last_merge or req from elv_rqhash tree
-> attempt_merge
-> [.requests_merged] // notify the io scheduler that the two reqs have been merged
//if a request merge happened, invoke blk_mq_free_request to free the merged request
-> blk_mq_free_request
-> [.finish_request]
[dispatch request]
blk_mq_sched_dispatch_requests
-> blk_mq_do_dispatch_sched
-> [.has_work] // blk_mq_sched_has_work
-> [.dispatch_request]
blk_mq_start_request
-> blk_mq_sched_started_request
-> [.started_request]
[requeue request]
blk_mq_requeue_request
-> __blk_mq_requeue_request
-> blk_mq_put_driver_tag // very important
-> blk_mq_sched_requeue_request
-> [.requeue_request]
blk_mq_requeue_work
-> blk_mq_sched_insert_request
Note: in blk-mq, a requeued request is inserted into the io scheduler
again; this is very different from blk-legacy. For the io schedulers of
blk-mq, .requeue_request is the same as .finish_request (bfq and kyber)
[complete request]
__blk_mq_complete_request
-> blk_mq_sched_completed_request
-> [.completed_request]
blk_mq_free_request
-> [.finish_request]
Note: the LLDD does not always complete a request with blk_mq_complete_request;
it may also use blk_mq_end_request, in which case .completed_request is not invoked.
This is a special path for high-speed devices.
blk_mq_make_request
-> blk_mq_try_issue_directly
-> __blk_mq_try_issue_directly
---
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
run_queue = false;
bypass_insert = false;
goto insert;
}
// No io scheduler
if (q->elevator && !bypass_insert)
goto insert;
// No .get_budget
if (!blk_mq_get_dispatch_budget(hctx))
goto insert;
// No io scheduler, so driver tag has been got
if (!blk_mq_get_driver_tag(rq, NULL, false)) {
blk_mq_put_dispatch_budget(hctx);
goto insert;
}
return __blk_mq_issue_directly(hctx, rq, cookie);
// invoke .queue_rq directly here
insert:
if (bypass_insert)
return BLK_STS_RESOURCE;
// if io scheduler is set, fallback to normal path
blk_mq_sched_insert_request(rq, false, run_queue, false);
return BLK_STS_OK;
---
w/o io scheduler attached, the sync io could nearly bypass the whole blk-mq stack.
submit_bio
----------------|---------------------
BLK-MQ v
blk_mq_make_request
|
----^---- insert to ctx
|
----^---- run hctx
----------------|--------------------
LLDD v
.queue_rq
Where is the hctx run ? In other words, can a hctx be run on a cpu
that is not mapped to it ?
Let's see the two basic scenarios in which the hctx is run.
Will the hctx be executed on different mapped cpus concurrently ?
map_request
-> dm_dispatch_clone_request
-> blk_insert_cloned_request
-> blk_mq_request_issue_directly
-> __blk_mq_try_issue_directly // under hctx_lock interface
-> __blk_mq_issue_directly
blk_mq_make_request
-> blk_mq_try_issue_directly
-> __blk_mq_try_issue_directly // under hctx_lock interface
-> __blk_mq_issue_directly
There are many points where the task could be preempted and migrated away.
__blk_mq_run_hw_queue can be run synchronously or asynchronously.
It will be invoked by __blk_mq_delay_run_hw_queue
---
if (!async && !(hctx->flags & BLK_MQ_F_BLOCKING)) {
int cpu = get_cpu(); /* preempt is disabled here */
if (cpumask_test_cpu(cpu, hctx->cpumask)) {
__blk_mq_run_hw_queue(hctx);
put_cpu();
return;
}
put_cpu();
}
kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
msecs_to_jiffies(msecs));
---
The basic condition is:
1. parameter async is false
2. not BLK_MQ_F_BLOCKING
3. current cpu is mapped to the hctx
At the same time, __blk_mq_run_hw_queue will be run with preempt disabled.
So in this case the hctx will not be run on a cpu that is not mapped to it.
The asynchronous case is more obvious: __blk_mq_run_hw_queue is run by a
workqueue worker kthread which is pinned on its cpu. But if the only cpu
the hctx is mapped to is offlined, the hctx has to be run on another
cpu. Except for this, the hctx will not be run on a cpu it is not mapped
to.
cpu0 cpu1 cpu2 cpu3
. flush i_d run_work
. . . .
v . . v
v hctx0 .
-------------------.---------------
v
HBA
i_d issue directly
The possible concurrent path:
A common case is:
blk_mq_make_request
-> blk_mq_sched_insert_request
-> blk_mq_run_hw_queue
-> __blk_mq_delay_run_hw_queue
-> __blk_mq_run_hw_queue
-> blk_mq_sched_dispatch_requests
blk_mq_sched_insert_request
-> blk_mq_sched_bypass_insert // RQF_FLUSH_SEQ
refer to issue directly
blk_mq_run_work_fn
-> __blk_mq_run_hw_queue
-> blk_mq_sched_dispatch_requests
There are some cases where the requests cannot be dispatched immediately.
hctx restart is a supplement to the tag wakeup hook, because not all dispatch
deferring is due to a lack of driver tags;
in that case, the io scheduler path itself is responsible for dispatching the
deferred requests.
For shared tags, the tag wakeup hook is in charge of this; otherwise, hctx restart is.
blk_mq_mark_tag_wait marks BLK_MQ_S_SCHED_RESTART on the hctx in the non-shared-tag
case.
After blk_mq_dispatch_rq_list queues the reqs on the hctx->dispatch list, it tries
to rerun the hctx; then the next blk_mq_sched_dispatch_requests will mark
restart.
We check the restart mark after enqueueing reqs on hctx->dispatch:
blk_mq_run_hw_queue
-> blk_mq_hctx_has_pending
-> !list_empty_careful(&hctx->dispatch)
Let's look into the hctx restart next.
Mark restart
Currently, blk_mq_sched_mark_restart_hctx is only invoked by blk_mq_sched_dispatch_requests
when there are requests in the hctx->dispatch list. Requests can be inserted into the
hctx->dispatch list in the following cases:
refer to blk_mq_hctx_notify_dead
refer to blk_mq_dispatch_rq_list
blk_insert_flush
---
/*
* If there's data but flush is not necessary, the request can be
* processed directly without going through flush machinery. Queue
* for normal execution.
*/
if ((policy & REQ_FSEQ_DATA) &&
!(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
if (q->mq_ops)
blk_mq_request_bypass_insert(rq, false);
else
list_add_tail(&rq->queuelist, &q->queue_head);
return;
}
---
blk_mq_sched_insert_request
-> blk_mq_sched_bypass_insert
---
/* dispatch flush rq directly */
if (rq->rq_flags & RQF_FLUSH_SEQ) {
spin_lock(&hctx->lock);
list_add(&rq->queuelist, &hctx->dispatch);
spin_unlock(&hctx->lock);
return true;
}
---
who will own this flag ?
static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
return;
if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
struct request_queue *q = hctx->queue;
if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
atomic_inc(&q->shared_hctx_restart);
//if not set, increase the q->shared_hctx_restart
// shared_hctx_restart counts the number of hctx need to be restarted.
} else
set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
}
Restart
For the non-shared-tag case it is very simple: just invoke blk_mq_run_hw_queue(hctx, true).
But the shared-tag case is a bit more complicated.
We do the hctx restart across all the hctxs that share the same tags, in round-robin fashion.
Why do we need this ?
To share the resources of the LLDD fairly:
if we always restarted the hctx that the freed request points to,
the other hctxs that share the same tagset would be starved.
restart
/'---------------------------------------,
BLK-MQ V \
q of LUN0 q of LUN1 q of LUN2 q of LUN3 |
|
hctx hctx hctx hctx |
^ |
driver tags | blk_mq_free_request
------------------------------------------------------
LLDD
HBA
We needn't worry about fair sharing of the driver tags themselves;
the sbitmap wakeup hook and tag sharing (hctx_may_queue) handle that well.
Looping over every q and hctx sharing the same tagset causes a massive performance regression if you have a lot of
shared devices. 8e8320c (blk-mq: fix performance regression with shared tags) fixes this.
An atomic shared_hctx_restart is added to the request_queue to mark that some hctx in this
request_queue needs to be restarted. Then the restart path doesn't have to loop over every queue and hctx every time.
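A hedged sketch of the round-robin walk, assuming one restart per freed request; the real code also iterates the request_queues sharing the tag set and checks BLK_MQ_S_SCHED_RESTART per hctx:
---
/* Sketch of round-robin restart over hctxs sharing one tagset: start from the
 * hctx after the one that freed the request, so no hctx is starved. Pure
 * model; in this sketch a freed request triggers at most one restart. */
#include <stdbool.h>
#include <stdio.h>

#define NR_HCTX 4

static bool needs_restart[NR_HCTX] = { false, true, false, true };

static void restart_round_robin(int freeing_hctx)
{
	for (int i = 1; i <= NR_HCTX; i++) {
		int idx = (freeing_hctx + i) % NR_HCTX;

		if (needs_restart[idx]) {
			needs_restart[idx] = false;
			printf("run hw queue %d\n", idx);  /* blk_mq_run_hw_queue() */
			return;
		}
	}
}

int main(void)
{
	restart_round_robin(0);		/* a request freed on hctx0 restarts hctx1 */
	restart_round_robin(0);		/* the next free restarts hctx3 */
	return 0;
}
---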
There is a question here:
The round-robin hctx restart check only happens when:
- some hctx is marked as needing restart
- a req is freed on the request_queue
What if there is no other req in flight when the hctx restart is marked ?
Who restarts the hctx ? The others sharing the same tagset will not do it, because they are not marked for
restart in q->shared_hctx_restart.
This is a general issue whether the tags are shared or not.
If there is no in-flight request and .queue_rq needs to requeue the request, either:
- it returns BLK_STS_RESOURCE, or
- the LLDD reruns the hw queue itself
In fact, it looks like we don't always need to restart the hctxs in round-robin fashion:
- if we fail to get a driver tag, the tag wakeup hook could save us
- if we have reqs on hctx->dispatch that were inserted directly, it doesn't matter to other hctxs
There are also some special cases, look at the code segment in blk_mq_dispatch_rq_list:
if (!list_empty(list)) {
bool needs_restart;
// we reach here, because the .queue_rq returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
needs_restart = blk_mq_sched_needs_restart(hctx);
if (!needs_restart ||
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
blk_mq_run_hw_queue(hctx, true);
else if (needs_restart && (ret == BLK_STS_RESOURCE))
blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
}
When there are requests left on the hctx->dispatch list, some cases need to be handled:
Run the hctx asynchronously; SCHED_RESTART will be marked in blk_mq_sched_dispatch_requests.
Why not invoke blk_mq_sched_mark_restart_hctx directly?
Look at the scenario below:
blk_mq_dispatch_rq_list blk_mq_free_request
-> .queue_rq return BLK_STS_DEV_RESOURCE -> blk_mq_sched_restart
-> queue rq on hctx->dispatch -> blk_mq_sched_restart_hctx
-> test BLK_MQ_S_SCHED_RESTART
-> blk_mq_sched_mark_restart_hctx
If blk_mq_free_request were invoked for the last in-flight req at that point, the
restart mark would be missed and an IO hang would result.
If we run the hw queue again instead, we may get the resource when invoking .queue_rq; even if we
still don't get the resource, the restart mark will not be missed.
there could be a narrow window as below:
blk_mq_dispatch_rq_list
blk_mq_dispatch_wake -> blk_mq_mark_tag_wait
-> add_wait_queue
-> list_del_init(&wait->entry)
-> blk_mq_run_hw_queue
-> blk_mq_hctx_has_pending
-> list_splice_init(list, &hctx->dispatch);
The LLDD returns BLK_STS_DEV_RESOURCE when it is lacking resources that pending
requests will free up; otherwise, it returns BLK_STS_RESOURCE.
When BLK_STS_RESOURCE is returned, there may be no pending requests, so the hctx-restart
mechanism cannot be relied on (there may be no blk_mq_free_request to trigger it);
at this moment, rerun the hctx with a delay to avoid getting stuck.
__blk_mq_requeue_request is used to prepare for a requeue.
---
//w/ io scheduler attached, there will be no in-queue req that
//holds driver tag.
blk_mq_put_driver_tag(rq);
trace_block_rq_requeue(q, rq);
wbt_requeue(q->rq_wb, &rq->issue_stat);
if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
// switch to IDLE state
blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
...
}
---
Where will the req be requeued ?
Question: why is blk_mq_sched_requeue_request only invoked from blk_mq_requeue_request ?
blk_mq_dispatch_rq_list
---
ret = q->mq_ops->queue_rq(hctx, &bd);
if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
...
list_add(&rq->queuelist, list);
__blk_mq_requeue_request(rq);
break;
}
...
} while (!list_empty(list));
hctx->dispatched[queued_to_index(queued)]++;
/*
* Any items that need requeuing? Stuff them into hctx->dispatch,
* that is where we will continue on next queue run.
*/
if (!list_empty(list)) {
bool needs_restart;
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
...
}
...
---
The request is requeued through blk_mq_sched_insert_request
There are two paths:
Looking at bfq and kyber, the .requeue_request and .finish_request callbacks are the same function.
For blk_mq_dispatch_rq_list, the request is not queued back to the io scheduler; we can say the request
is still being dispatched, so there is no need to invoke the .requeue_request callback.
For __blk_mq_try_issue_directly, the direct issue path only works without an io scheduler attached.
Only in the blk_mq_requeue_request case has the request been dequeued from the io scheduler, so it has to be requeued
back to the io scheduler.
In fact, there is a big difference between block legacy and blk-mq in requeue.
blk_requeue_request
-> elv_requeue_request
-> __elv_add_request //ELEVATOR_INSERT_REQUEUE
-> list_add(&rq->queuelist, &q->queue_head);
The request is requeued to q->queue_head, which is similar to hctx->dispatch.
There is also a tag mechanism in block legacy. Quoting the blk-mq paper about tagging:
Device command tagging was first introduced with hardware supporting native command queuing. A tag is an integer value that uniquely identifies the position of the block IO in the driver submission queue, so when completed the tag is passed back from the device indicating which IO has been completed. This eliminates the need to perform a linear search of the in-flight window to determine which IO has completed.
We won't look into how it is implemented, just how it is employed in block legacy, with some comparison with tagging in blk-mq.
How is it used at the driver level ?
static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
int tag)
{
struct request *req = NULL;
if (tag == SCSI_NO_TAG)
return NULL;
if (shost_use_blk_mq(shost)) {
u16 hwq = blk_mq_unique_tag_to_hwq(tag);
if (hwq < shost->tag_set.nr_hw_queues) {
req = blk_mq_tag_to_rq(shost->tag_set.tags[hwq],
blk_mq_unique_tag_to_tag(tag));
}
} else {
req = blk_map_queue_find_tag(shost->bqt, tag);
}
if (!req)
return NULL;
return blk_mq_rq_to_pdu(req);
}
A reverse mapping tag -> req -> driver pdu
How to assign tag to a req ?
scsi_request_fn()
>>>>
/*
* Remove the request from the request list.
*/
if (!(blk_queue_tagged(q) && !blk_queue_start_tag(q, req)))
blk_start_request(req);
/*
blk_queue_tagged() checks QUEUE_FLAG_QUEUED in q->flags, which means the hardware supports native command queuing.
blk_queue_start_tag() tries to assign a tag for this rq; if the tags are used up, it returns 1,
otherwise:
bqt->next_tag = (tag + 1) % bqt->max_depth;
rq->rq_flags |= RQF_QUEUED; //indicates tag has been assigned
rq->tag = tag;
bqt->tag_index[tag] = rq;
blk_start_request(rq);
list_add(&rq->queuelist, &q->tag_busy_list);
*/
>>>>
/*
* We hit this when the driver is using a host wide
* tag map. For device level tag maps the queue_depth check
* in the device ready fn would prevent us from trying
* to allocate a tag. Since the map is a shared host resource
* we add the dev to the starved list so it eventually gets
* a run when a tag is freed.
*/
if (blk_queue_tagged(q) && !(req->rq_flags & RQF_QUEUED)) {
spin_lock_irq(shost->host_lock);
if (list_empty(&sdev->starved_entry))
list_add_tail(&sdev->starved_entry,
&shost->starved_list);
spin_unlock_irq(shost->host_lock);
goto not_ready;
}
>>>>
not_ready:
/*
* The tag here looks like the driver tag in blk-mq.
* In block legacy, the req is requeued and inserted to the head of q->queue_head directly.
* In blk-mq, the action is similar, refer to blk_mq_dispatch_rq_list (though __blk_mq_try_issue_directly does not appear to do the same).
*/
spin_lock_irq(q->queue_lock);
blk_requeue_request(q, req);
atomic_dec(&sdev->device_busy);
>>>>
There are mainly two aspects to the blk plug's benefit.
Where is the plug list flushed from schedule ?
schedule
-> sched_submit_work
-> blk_schedule_flush_plug
io_schedule_timeout/io_schedule
-> io_schedule_prepare
-> blk_schedule_flush_plug
However, the preempt schedule path doesn't flush plug list
asmlinkage __visible void __sched preempt_schedule_irq(void)
{
enum ctx_state prev_state;
/* Catch callers which need to be fixed */
BUG_ON(preempt_count() || !irqs_disabled());
prev_state = exception_enter();
do {
preempt_disable();
local_irq_enable();
__schedule(true);
local_irq_disable();
sched_preempt_enable_no_resched();
} while (need_resched());
exception_exit(prev_state);
}
Let's look into the _basic unit_ of the block layer, the bio.
We could say there is a bio layer between the fs and the block layer.
FS LAYER
------------------------------------------------
| submit_bio
|
V generic_make_request <-------+
------------------------------------------------ |
blk-throttl |
BIO LAYER bio remap +--> partition |
| |
+--> bio based device mapper (stackable)
------------------------------------------------- |
| |
V blk_queue_bio/blk_mq_make_request
BLOCK LEGACY/BLK-MQ
The basic architecture of a bio.
request->bio __
\
\ bio
\ ________
->| bi_next next bio in one request, the blocks in these bios should be contiguous on disk
|
| bi_disk gendisk->request_queue
|
| bi_partno partition NO.
|
| bi_opf bio_op, req_flag_bits, same with req->cmd_flags
|
| bi_phys_segments Number of segments in this BIO after physical address coalescing is performed.
|
| bi_end_io blk_update_request->req_bio_endio->bio_endio
|
| bi_vcnt how many bio_vec's
| bi_max_vecs max bio_vecs can hold
| bi_io_vec pointer to bio_io_vec list
| \ ________
| ---> | bv_page
| | bv_len
| | bv_offset
| ________
| | bv_page
| | bv_len
| bv_offset These two pages need not be physically contiguous,
| but the corresponding blocks on the storage disk should be contiguous.
| bi_pool as its name
|
| bi_iter the current iterating status in bio_vec list
___________
| bi_sector device address in 512 byte sectors
| bi_size residual I/O count
| bi_idx current index into bvl_vec
| bi_done number of bytes completed
| bi_bvec_done number of bytes completed in current bvec
(Some members associated with cgroup,blk-throttle,merge-assistant are ignored here.)
Let's take the submit_bh_wbc() as example to show how to setup a bio
static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
enum rw_hint write_hint, struct writeback_control *wbc)
{
struct bio *bio;
>>>>
bio = bio_alloc(GFP_NOIO, 1); // the second parameter is the count of bvec
if (wbc) {
wbc_init_bio(wbc, bio);
wbc_account_io(wbc, bh->b_page, bh->b_size);
}
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio_set_dev(bio, bh->b_bdev);
//(bio)->bi_disk = (bdev)->bd_disk;
//(bio)->bi_partno = (bdev)->bd_partno;
bio->bi_write_hint = write_hint;
bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
>>>> //FS with a blocksize smaller than the pagesize could reach here.
if (bio->bi_vcnt > 0) {
bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
if (page == bv->bv_page &&
offset == bv->bv_offset + bv->bv_len) {
bv->bv_len += len;
goto done;
}
} //merged with previous one
if (bio->bi_vcnt >= bio->bi_max_vecs)
return 0;
bv = &bio->bi_io_vec[bio->bi_vcnt];
bv->bv_page = page;
bv->bv_len = len;
bv->bv_offset = offset;
bio->bi_vcnt++;
done:
bio->bi_iter.bi_size += len;
>>>>
BUG_ON(bio->bi_iter.bi_size != bh->b_size);
bio->bi_end_io = end_bio_bh_io_sync;
bio->bi_private = bh; //reverse mapping to the bh
/* Take care of bh's that straddle the end of the device */
guard_bio_eod(op, bio);
if (buffer_meta(bh))
op_flags |= REQ_META;
if (buffer_prio(bh))
op_flags |= REQ_PRIO;
bio_set_op_attrs(bio, op, op_flags);
submit_bio(bio);
return 0;
}
Most of the information to construct a bio is from the bh. If we want to dig deeper, we have to look into how to setup a bh.
static int
grow_dev_page(struct block_device *bdev, sector_t block,
pgoff_t index, int size, int sizebits, gfp_t gfp)
{
>>>>
page = find_or_create_page(inode->i_mapping, index, gfp_mask);
-> pagecache_get_page()
-> __page_cache_alloc() //no_page case
-> __alloc_pages_node(n, gfp, 0);
/*
The pages of the page cache are allocated one by one. That is more flexible for
mapping and unmapping, paging in and swapping out. And in the past memory was limited, so there were not
enough contiguous pages to take advantage of.
*/
BUG_ON(!PageLocked(page));
>>>>
/*
* Allocate some buffers for this page
*/
bh = alloc_page_buffers(page, size, true);
/*
* Link the page to the buffers and initialise them. Take the
* lock to be atomic wrt __find_get_block(), which does not
* run under the page lock.
*/
spin_lock(&inode->i_mapping->private_lock);
link_dev_buffers(page, bh);
end_block = init_page_buffers(page, bdev, (sector_t)index << sizebits,
size);
>>>>
do {
if (!buffer_mapped(bh)) {
init_buffer(bh, NULL, NULL);
bh->b_bdev = bdev;
bh->b_blocknr = block;
if (uptodate)
set_buffer_uptodate(bh);
if (block < end_block)
set_buffer_mapped(bh);
}
block++;
bh = bh->b_this_page;
} while (bh != head);
>>>>
spin_unlock(&inode->i_mapping->private_lock);
done:
ret = (block < end_block) ? 1 : -ENXIO;
failed:
unlock_page(page);
put_page(page);
return ret;
}
One page from the page cache can be broken up into several bh's based on the blocksize of the associated filesystem (sb->s_blocksize). One bh corresponds to one block on disk. Each bh is then used to construct a bio that is submitted to the block layer; at that point the bio contains only one bio_vec, pointing to the page of the bh. This is the classical path to set up a bio. Nowadays, some filesystems prefer to build bios themselves, and during that procedure a bio containing multiple bio_vecs may be created. For example:
static int io_submit_add_bh(struct ext4_io_submit *io,
struct inode *inode,
struct page *page,
struct buffer_head *bh)
{
int ret;
if (io->io_bio && bh->b_blocknr != io->io_next_block) {
submit_and_retry:
ext4_io_submit(io);
}
if (io->io_bio == NULL) {
ret = io_submit_init_bio(io, bh);
if (ret)
return ret;
io->io_bio->bi_write_hint = inode->i_write_hint;
}
ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
if (ret != bh->b_size)
goto submit_and_retry;
wbc_account_io(io->io_wbc, page, bh->b_size);
io->io_next_block++;
return 0;
}
We can see that one bio_vec may correspond to part of a page or a whole page.
bio advance
static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
unsigned bytes)
{
iter->bi_sector += bytes >> 9;
/* So this is why the bi_sector is located in bio->bi_iter, it could be
* put forward */
if (bio_no_advance_iter(bio)) {
/* REQ_OP_DISCARD/SECURE_ERASE/WRITE_SAME/WRITE_ZEROES */
iter->bi_size -= bytes;
iter->bi_done += bytes;
} else {
bvec_iter_advance(bio->bi_io_vec, iter, bytes);
/* TODO: It is reasonable to complete bio with error here. */
}
}
static inline bool bvec_iter_advance(const struct bio_vec *bv,
struct bvec_iter *iter, unsigned bytes)
{
>>>>
while (bytes) {
unsigned iter_len = bvec_iter_len(bv, *iter);
unsigned len = min(bytes, iter_len);
bytes -= len;
iter->bi_size -= len; // remaining length
iter->bi_bvec_done += len; //completed length of current bvec
iter->bi_done += len; //completed length of this bio
if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
iter->bi_bvec_done = 0;
iter->bi_idx++; //push forward the bvec table here
}
}
return true;
}
After invoking this function, we can tell that a bio has been finished through
(bio->bi_iter.bi_size == 0). For example, in blk_update_request():
blk_mq_end_request()
-> blk_update_request()
-> req_bio_endio()
>>>>
bio_advance(bio, nbytes);
/* don't actually finish bio if it's part of flush sequence */
// when RQF_FLUSH_SEQ is set, the req->end_io would be invoked instead of
// bio_end.
if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
bio_endio(bio);
>>>>
bio clone
In the device mapper stack, the bio will be cloned. Let's look at how that is done.
clone_bio() clones a new bio containing the sector ~ (sector+len) range of the original one.
static int clone_bio(struct dm_target_io *tio, struct bio *bio,
sector_t sector, unsigned len)
{
struct bio *clone = &tio->clone;
__bio_clone_fast(clone, bio);
>>>>
bio->bi_disk = bio_src->bi_disk;
bio->bi_partno = bio_src->bi_partno;
bio_set_flag(bio, BIO_CLONED); // a cloned bio
bio->bi_opf = bio_src->bi_opf;
bio->bi_write_hint = bio_src->bi_write_hint;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
//The cloned bio shares the same bvec table with the original one.
bio_clone_blkcg_association(bio, bio_src);
>>>>
if (bio_op(bio) != REQ_OP_ZONE_REPORT)
bio_advance(clone, to_bytes(sector - clone->bi_iter.bi_sector));
clone->bi_iter.bi_size = to_bytes(len);
//cut out the sector ~ (sector+len) part of original one here
if (unlikely(bio_integrity(bio) != NULL))
bio_integrity_trim(clone);
return 0;
}
bio will be split in blk_mq_make_request, why ?
The associated commit is:
54efd50b ( block: make generic_make_request handle arbitrarily sized bios)
---
The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.
But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity.
---
Then FS layer could submit arbitrary size bios.
How to do it ?
blk_queue_split
-> blk_bio_segment_split
-> bio_split
---
split = bio_clone_fast(bio, gfp, bs);
-> __bio_clone_fast
---
bio->bi_disk = bio_src->bi_disk;
bio->bi_partno = bio_src->bi_partno;
bio_set_flag(bio, BIO_CLONED);
if (bio_flagged(bio_src, BIO_THROTTLED))
bio_set_flag(bio, BIO_THROTTLED);
bio->bi_opf = bio_src->bi_opf;
bio->bi_write_hint = bio_src->bi_write_hint;
bio->bi_iter = bio_src->bi_iter;
bio->bi_io_vec = bio_src->bi_io_vec;
...
---
split->bi_iter.bi_size = sectors << 9;
if (bio_integrity(split))
bio_integrity_trim(split);
bio_advance(bio, split->bi_iter.bi_size);
---
| sectors |
bi_io_vec [ bv ] [ bv ] [ bv ] [ bv ]
\____ _____/\________ __________/
V V
split->bi_iter bio->bi_iter
blk_queue_split
---
if (split) {
/* there is no chance to merge the split bio */
split->bi_opf |= REQ_NOMERGE;
/*
* Since we're recursing into make_request here, ensure
* that we mark this bio as already having entered the queue.
* If not, and the queue is going away, we can get stuck
* forever on waiting for the queue reference to drop. But
* that will never happen, as we're already holding a
* reference to it.
*/
bio_set_flag(*bio, BIO_QUEUE_ENTERED);
bio_chain(split, *bio);
trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
a big bio
| max |
|__________________________|
\___ ___/\________ ________/
v v
submit go back to
generic_make_request
generic_make_request(*bio);
*bio = split;
}
---
How does the generic_make_request handle bios from stacked devices ?
Two important code fragment,
#1
---
if (current->bio_list) {
bio_list_add(&current->bio_list[0], bio);
goto out;
}
---
#2
---
do {
bool enter_succeeded = true;
if (unlikely(q != bio->bi_disk->queue)) {
if (q)
blk_queue_exit(q);
q = bio->bi_disk->queue;
flags = 0;
if (bio->bi_opf & REQ_NOWAIT)
flags = BLK_MQ_REQ_NOWAIT;
if (blk_queue_enter(q, flags) < 0) {
enter_succeeded = false;
q = NULL;
}
}
if (enter_succeeded) {
struct bio_list lower, same;
/* Create a fresh bio_list for all subordinate requests */
bio_list_on_stack[1] = bio_list_on_stack[0];
bio_list_init(&bio_list_on_stack[0]);
ret = q->make_request_fn(q, bio);
/* sort new bios into those for a lower level
* and those for the same level
*/
bio_list_init(&lower);
bio_list_init(&same);
while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
if (q == bio->bi_disk->queue)
bio_list_add(&same, bio);
else
bio_list_add(&lower, bio);
/* now assemble so we handle the lowest level first */
bio_list_merge(&bio_list_on_stack[0], &lower);
bio_list_merge(&bio_list_on_stack[0], &same);
bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
} else {
if (unlikely(!blk_queue_dying(q) &&
(bio->bi_opf & REQ_NOWAIT)))
bio_wouldblock_error(bio);
else
bio_io_error(bio);
}
bio = bio_list_pop(&bio_list_on_stack[0]);
} while (bio);
---
Let's take the stripe as an example,
stripe_dev
bio 0 ~ 31
|--------------------|
+--+ +--+ +--+ +--+
| | | | | | | | } 4K (8 sectors)
+--+ +--+ +--+ +--+
| | | | | | | |
+--+ +--+ +--+ +--+
| | | | | | | |
+--+ +--+ +--+ +--+
dev0 dev1 dev2 dev3
Round #1
bio[0, 31].stripe_dev
q->make_request_fn
then,
bio_list_on_stack[0] -> bio[0, 7].dev0 -> bio[8, 31].stripe_dev
then,
lower -> bio[0, 7].dev0
same -> bio[8, 31].stripe_dev
then
bio_list_on_stack[0] -> bio[0, 7].dev0 -> bio[8, 31].stripe_dev
Round #2
bio[0, 7].dev0 is picked up to handle
bio_list_on_stack[1] -> bio[8, 31].stripe_dev
q->make_request_fn
bio_list_on_stack[0] is NULL
then
bio_list_on_stack[1] is merged into bio_list_on_stack[0]
bio_list_on_stack[0] -> bio[8, 31].stripe_dev
Round #3
bio[8, 31].stripe_dev is picked up to handle
q->make_request_fn
then
bio_list_on_stack[0] -> bio[8, 15].dev1 -> bio[16, 31].stripe_dev
then
lower -> bio[8, 15].dev1
same -> bio[16, 31].stripe_dev
then
bio_list_on_stack
bio_list_on_stack[0] -> bio[8, 15].dev1 -> bio[16, 31].stripe_dev
Round #4
bio[8, 15].dev1 is picked up to handle
bio_list_on_stack[1] ->bio[16, 31].stripe_dev
....
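A hedged userspace simulation of the bio_list_on_stack reshuffling above, using the same stripe layout; the struct, the chunking, and the 'lower vs same' test are simplifications of the real generic_make_request:
---
/* Userspace simulation of generic_make_request()'s on-stack bio lists, using
 * the stripe example: each make_request() call on the stripe device emits one
 * 8-sector chunk for a member disk plus the remainder for the stripe device.
 * Lower-level bios are handled before same-level ones, so the processing
 * stays depth-first without recursing on the stack. */
#include <stdio.h>

struct bio { int start, end, dev; struct bio *next; };	/* dev == -1: stripe dev */

static struct bio pool[64];
static int npool;

static struct bio *new_bio(int s, int e, int dev)
{
	struct bio *b = &pool[npool++];

	b->start = s; b->end = e; b->dev = dev; b->next = NULL;
	return b;
}

static void list_add_tail(struct bio **l, struct bio *b)
{
	while (*l)
		l = &(*l)->next;
	*l = b;
}

/* the stripe driver's make_request_fn: split off one 8-sector chunk */
static void stripe_make_request(struct bio *b, struct bio **out)
{
	if (b->dev != -1) {	/* already targets a member disk: "issue" it */
		printf("issue bio[%d,%d] to dev%d\n", b->start, b->end, b->dev);
		return;
	}
	int chunk_end = b->start + 7;
	list_add_tail(out, new_bio(b->start, chunk_end, (b->start / 8) % 4));
	if (chunk_end < b->end)
		list_add_tail(out, new_bio(chunk_end + 1, b->end, -1));
}

int main(void)
{
	struct bio *todo = new_bio(0, 31, -1);	/* bio[0,31].stripe_dev */

	while (todo) {
		struct bio *bio = todo, *emitted = NULL, *lower = NULL, *same = NULL;

		todo = todo->next;
		bio->next = NULL;
		stripe_make_request(bio, &emitted);

		/* sort the emitted bios: lower-device ones first, then same
		 * level, then whatever was already pending (bio_list_on_stack[1]) */
		for (struct bio *b = emitted, *n; b; b = n) {
			n = b->next;
			b->next = NULL;
			list_add_tail(b->dev != -1 ? &lower : &same, b);
		}
		list_add_tail(&lower, same);
		list_add_tail(&lower, todo);
		todo = lower;
	}
	return 0;
}
---
The output issues bio[0,7] to dev0, bio[8,15] to dev1, and so on, showing that the lowest-level bio is always handled before the remaining stripe-level bio, as in the rounds traced above.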
The main merging point.
blk_mq_sched_try_merge
This is used to merge bio with req.
It is usually in bio submitting path.
elv_merge chooses a rq that could merge with a new bio
and returns how to merge.
(bio) (req) indicates the new one
if ELEVATOR_BACK_MERGE
req -> bio -> (bio)
then try to merge this req with latter one.
(req) -?-> req
if ELEVATOR_FRONT_MERGE
req -> (bio) -> bio
then try to merge this req with former one.
req -?-> (req)
elv_attempt_insert_merge
This is used to merge req with req.
It is usually in req inserting path.
Both elv_merge and elv_attempt_insert_merge employ q->last_merge
and request_queue elv rqhash to find out contiguous reqs.
Note: a req is just a container; the real payload is the bios in it.
attempt_merge is used to merge two reqs (req, next).
If the two requests can be merged with each other, the main work it does is:
req->biotail->bi_next = next->bio;
req->biotail = next->biotail;
req->__data_len += blk_rq_bytes(next);
elv_merge_requests(q, req, next);
/*
* 'next' is going away, so update stats accordingly
*/
blk_account_io_merge(next);
req->ioprio = ioprio_best(req->ioprio, next->ioprio);
if (blk_rq_cpu_valid(next))
req->cpu = next->cpu;
/*
* ownership of bio passed from next to req, return 'next' for
* the caller to free
*/
next->bio = NULL;
Then the next one will be freed through __blk_put_request().
First, we need to know the volatile write cache.
Quote from Documentation/block/writeback_cache_control.txt
Many storage devices, especially in the consumer market, come with volatile
write back caches. That means the devices signal I/O completion to the
operating system before data actually has hit the non-volatile storage. This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the non-volatile storage when it performs
a data integrity operation like fsync, sync or an unmount.
There are two flags set in a bio or req to indicate which operation on the
volatile write cache will be carried out.
The block device driver needs to notify the queue whether it supports
REQ_FLUSH and REQ_FUA through blk_queue_write_cache(), and the flags will
be set in queue->queue_flags.
The REQ_FLUSH flag can be OR ed into the r/w flags of a bio submitted from
the filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started. This explicitly
guarantees that previously completed write requests are on non-volatile
storage before the flagged bio starts. In addition the REQ_FLUSH flag can be
set on an otherwise empty bio structure, which causes only an explicit cache
flush without any dependent I/O.
The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this request is only
signaled after the data has been committed to non-volatile storage.
void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
{
spin_lock_irq(q->queue_lock);
if (wc)
queue_flag_set(QUEUE_FLAG_WC, q);
else
queue_flag_clear(QUEUE_FLAG_WC, q);
if (fua)
queue_flag_set(QUEUE_FLAG_FUA, q);
else
queue_flag_clear(QUEUE_FLAG_FUA, q);
spin_unlock_irq(q->queue_lock);
wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
}
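From userspace, these flags are normally produced indirectly by a data-integrity call; a minimal hedged example (the filename is an assumption):
---
/* How a userspace data-integrity operation ends up generating PREFLUSH/FUA:
 * the filesystem tags its journal/data writes with REQ_PREFLUSH/REQ_FUA when
 * we call fsync()/fdatasync(). Sketch only; the filename is made up. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("important.dat", O_WRONLY | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "payload", 7) != 7)
		perror("write");
	/* force dirty data (and required metadata) to stable storage; on a
	 * device with a volatile write cache this is where the block layer
	 * ends up seeing REQ_PREFLUSH / REQ_FUA requests */
	if (fdatasync(fd) < 0)
		perror("fdatasync");
	close(fd);
	return 0;
}
---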
How is the flush operation implemented ?
There are 4 flush sequence flags:
A flush operation's life cycle can include any of them, and the blk core
executes them in sequence. blk_flush_policy() is used to construct this sequence.
Let's see it.
static unsigned int blk_flush_policy(unsigned long fflags, struct request *rq)
{
unsigned int policy = 0;
if (blk_rq_sectors(rq))
policy |= REQ_FSEQ_DATA;
if (fflags & (1UL << QUEUE_FLAG_WC)) {
if (rq->cmd_flags & REQ_PREFLUSH)
policy |= REQ_FSEQ_PREFLUSH;
if (!(fflags & (1UL << QUEUE_FLAG_FUA)) &&
(rq->cmd_flags & REQ_FUA))
policy |= REQ_FSEQ_POSTFLUSH;
}
return policy;
}
Two things need to be emphasized here.
If blk_flush_policy() returns just REQ_FSEQ_DATA, the request can be processed
directly without going through the flush machinery. For blk-mq, it will be inserted
at the tail of hctx->dispatch.
Otherwise, a flush sequence will be started.
The flush sequence is carried out based on blk_flush_queue->flush_queue[2].
In addition, there are two indexes that indicate the current state of the flush_queue.
Both of them only take the values 0/1. In the initial state, pending == running.
After kicking a flush, the pending_idx is toggled, so the pending_idx becomes
different from the running_idx, which means a flush is in flight. While a flush
is in flight, new flushes are queued on the pending_idx, which is
different from the running_idx. After the flush completes, the running_idx
is toggled so that the running_idx equals the pending_idx again.
A preallocated request, flush_rq, does the actual flush work on behalf of the
FLUSH requests. When it completes, all the FLUSH requests on the running queue
are pushed forward to the next step.
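Before walking through the timeline below, here is a hedged userspace model of this double buffering; it only covers the PREFLUSH bookkeeping (the queue size, helper names and example requests are assumptions) and ignores the DATA/POSTFLUSH steps and the real request lifecycle:
---
/* Model of blk_flush_queue's double buffering: new PREFLUSH requests queue on
 * flush_queue[pending_idx]; kicking a flush toggles pending_idx; completing
 * the flush_rq toggles running_idx and releases everything queued on the old
 * index. Sketch only, no DATA/POSTFLUSH steps. */
#include <stdio.h>

#define NR 8
static int flush_queue[2][NR];
static int count[2];
static int pending_idx, running_idx;
static int flush_in_flight;

static void kick_flush(void)	/* blk_kick_flush() */
{
	if (flush_in_flight || !count[pending_idx])
		return;
	printf("kick flush_rq for queue %d\n", pending_idx);
	pending_idx ^= 1;	/* now pending != running: flush in flight */
	flush_in_flight = 1;
}

static void queue_preflush(int rq)	/* a new rq needing PREFLUSH arrives */
{
	flush_queue[pending_idx][count[pending_idx]++] = rq;
	kick_flush();
}

static void flush_rq_done(void)		/* the flush_rq completed */
{
	int idx = running_idx;

	running_idx ^= 1;	/* running == pending again */
	flush_in_flight = 0;
	for (int i = 0; i < count[idx]; i++)
		printf("rq%d: preflush step done\n", flush_queue[idx][i]);
	count[idx] = 0;
	kick_flush();		/* requests that queued up meanwhile */
}

int main(void)
{
	queue_preflush(0);	/* rq0 kicks a flush immediately */
	queue_preflush(2);	/* rq2 waits on the pending index */
	flush_rq_done();	/* rq0 advances; rq2's flush is kicked next */
	flush_rq_done();
	return 0;
}
---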
blk_flush_queue->flush_queue[2]
running 0
pending 0
rq0 (PREFLUSH + DATA)
rq1 (DATA + POSTFLUSH)
rq2 (PREFLUSH + DATA)
Time 0: running 0, pending 0
(seq = PREFLUSH)
flush_queue[0] - rq0
blk_kick_flush toggle the pending_idx and send out
the flush_rq.
Time 1: running 0, pending 1
(seq = PREFLUSH)
flush_queue[0] - rq0
hctx->dispatch - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
requeue -> bypass insert
rq1 is inserted by blk_insert_flush
Time 2: running 0, pending 1
(seq = PREFLUSH)
flush_queue[0] - rq0
(seq = DATA)
flush_data_in_flight - rq1
hctx->dispatch - rq1 (RQF_FLUSH_SEQ) - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
both requeue -> bypass insert
rq2 is inserted by blk_insert_flush
Time 3: running 0, pending 1
(seq = PREFLUSH)
flush_queue[1] - rq2
(seq = PREFLUSH)
flush_queue[0] - rq0
(seq = DATA)
flush_data_in_flight - rq1
hctx->dispatch - rq1 (RQF_FLUSH_SEQ) - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
both requeue -> bypass insert
rq1 is completed firstly, due to POSTFLUSH, it is inserted to pending
Time 4: running 0, pending 1
(seq = PREFLUSH) (seq = POSTFLUSH)
flush_queue[1] - rq2 - rq1
(seq = PREFLUSH)
flush_queue[0] - rq0
hctx->dispatch - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
flush_rq is completed
get running list flush_queue[0]
toggle running running = 1
iterate running_list flush_queue[0] to invoke blk_flush_complete_seq
rq0 is inserted to flush_data_in_flight and requeue, finally add head of hctx->dispatch
another flush is issued by blk_kick_flush due to rq1 and rq2
Time 5: running 1, pending 1
(seq = PREFLUSH) (seq = POSTFLUSH)
flush_queue[1] - rq2 - rq1
(seq = DATA)
flush_data_in_flight - rq1
hctx->dispatch - rq0 (RQF_FLUSH_SEQ) - flush_rq (w/ tag from rq0, RQF_FLUSH_SEQ)
Question
flush_rq can bypass the io scheduler because of RQF_FLUSH_SEQ, but why does
the original rq get to do the same ?
Does that mean every rq with FLUSH or FUA bypasses the io scheduler ?
A sequenced PREFLUSH/FUA request with DATA is completed twice.
Once while executing DATA and again after the whole sequence is complete.
The first completion updates the contained bio but doesn't finish it so that the
bio submitter is notified only after the whole sequence is complete.
This is implemented by testing RQF_FLUSH_SEQ in req_bio_endio().
Talking about the borrowed tag
FLUSH reqs below means requests with FLUSH or FUA operations.
Why does the flush_rq borrow tags from the FLUSH request ?
flush_rq is allocated separately, so it is not in the tag_set of blk-mq.
For the non-scheduler case, a FLUSH req occupies a driver tag and it
depends on the completion of flush_rq. Imagine the scenario where all the driver tags
are held by FLUSH reqs: consequently, the flush_rq cannot get a driver tag
any more and cannot move the flush sequence forward. An IO hang comes up. To
avoid this, flush_rq borrows a driver tag from the FLUSH reqs.
Recently,
commit 923218f (blk-mq: don't allocate driver tag upfront for flush rq)
changed the way tag borrowing is handled in blk-mq.
Before this patch, with an io scheduler, blk-mq would allocate a driver tag before
delivering the request to blk-flush. blk-flush may then lend this driver tag to the proxy
flush_rq, and this flush_rq will be queued to hctx->dispatch.
blk_mq_make_request()
---
if (unlikely(is_flush_fua)) {
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
if (q->elevator) {
blk_mq_sched_insert_request(rq, false, true, true,
true);
}
---
blk_mq_sched_insert_request()
---
if (rq->tag == -1 && op_is_flush(rq->cmd_flags)) {
blk_mq_sched_insert_flush(hctx, rq, can_block);
return;
}
---
static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx,
struct request *rq, bool can_block)
{
if (blk_mq_get_driver_tag(rq, &hctx, can_block)) {
blk_insert_flush(rq);
blk_mq_run_hw_queue(hctx, true);
} else
blk_mq_add_to_requeue_list(rq, false, true);
}
And this can cause an issue. Look at the comment of reorder_tags_to_front():
---
If we fail getting a driver tag because all the driver tags are already
assigned and on the dispatch list, BUT the first entry does not have a
tag, then we could deadlock. For that case, move entries with assigned
driver tags to the front, leaving the set of tagged requests in the
same order, and the untagged set in the same order.
---
If the driver tags are all occupied by FLUSH reqs, and other reqs have to be
queued on hctx->dispatch because of the driver tag shortage,
the flush_rq with a driver tag will be queued at the tail of hctx->dispatch,
and we get the scenario described above.
The patch changes the way this case is handled: flush_rq gets a driver tag
just before .queue_rq() in blk_mq_dispatch_rq_list().
This does not cause the IO hang described above, because the FLUSH requests only
occupy sched tags. But the flush_rq still needs to borrow the sched tag to cheat
blk-mq.
blk_kick_flush()
>>>>
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
flush_rq->mq_ctx = first_rq->mq_ctx;
if (!q->elevator) {
fq->orig_rq = first_rq;
flush_rq->tag = first_rq->tag;
hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
} else {
flush_rq->internal_tag = first_rq->internal_tag;
>>>>
Let's look at 3 similar flags of the request_queue.
QUEUE_FLAG_STOPPED does not stop requests from being submitted, only from being dispatched.
It is only used in block legacy and looks like BLK_MQ_S_STOPPED.
The quiescing mechanism has big advantages over it, so BLK_MQ_S_STOPPED is rarely
used now.
QUEUE_FLAG_DYING indicates that no request can enter the request_queue anymore.
Queue dying is different from queue freeze, which blocks new IO until it is unfrozen;
for a dying queue, blk_queue_enter returns -ENODEV.
Look at the check points of QUEUE_FLAG_DYING
[1] blk_queue_enter (-ENODEV)
[2] blk_get_queue
[3] get_request (blk-legacy)
[4] generic_make_request/direct_make_request
[5] blk_insert_cloned_request (blk-legacy)
[6] blk_flush_plug_list (blk-legacy)
[7] blk_execute_rq_nowait (blk-legacy)
[8] sysfs interfaces
void blk_set_queue_dying(struct request_queue *q)
{
spin_lock_irq(q->queue_lock);
queue_flag_set(QUEUE_FLAG_DYING, q);
spin_unlock_irq(q->queue_lock);
blk_freeze_queue_start(q); // kill the percpu-ref q_usage_counter, then blk_queue_dying will be
// checked in slow path in blk_queue_enter
if (q->mq_ops)
blk_mq_wake_waiters(q); //wake the ones waiting on driver tag
...
/* Make blk_queue_enter() reexamine the DYING flag. */
wake_up_all(&q->mq_freeze_wq); //after this, no one could cross blk_queue_enter() in generic_make_request()
}
QUEUE_FLAG_QUIESCED is checked through blk_queue_quiesced() in the following paths.
__blk_mq_run_hw_queue()
-> blk_mq_sched_dispatch_requests() // under rcu or src lock
-> if blk_queue_quiesced()
return // will not dequeue from io scheduler or ctx queue
blk_mq_try_issue_directly()
-> __blk_mq_try_issue_directly() // under rcu or src lock
-> if blk_queue_quiesced
blk_mq_sched_insert_request() // to io scheduler or ctx queue
When the queue is quiesced, requests will not enter the lldd but only stay in the
blk-mq layer queues. In other words, bios can still be submitted, but they will not
be issued.
In blk_mq_quiesce_queue, synchronize_srcu/rcu ensures that QUEUE_FLAG_QUIESCED
is visible when it returns.
WBT = Writeback Throttling
Why we need wbt ?
Let's quote some comment from the developer of this feature Jens.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background activity...
But for as long as I can remember, heavy buffered writers have not
behaved like that. For instance, if I do something like this:
$ dd if=/dev/zero of=foo bs=1M count=10k
on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.
In conclusion, foreground IOs should be prioritized
over background ones.
Who will be throttled
wbt_should_throttle() gives the answer.
static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
{
const int op = bio_op(bio);
/*
* If not a WRITE, do nothing
*/
if (op != REQ_OP_WRITE)
return false;
/*
* Don't throttle WRITE_ODIRECT
*/
if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) == (REQ_SYNC | REQ_IDLE))
return false;
return true;
}
The open question is: what about synchronous writes,
for example filesystem metadata updates?
How to implement it
Let's first look at the hooks across the blk-mq layer.
blk_mq_make_request()
    wbt_wait()      - sleep if may_queue() fails
    wbt_track()     - save track info on rq->issue_stat
blk_mq_start_request()
    wbt_issue()     - record the sync issue time
blk_mq_free_request()/__blk_mq_end_request()
    wbt_done()      - dec inflight, wake up waiters
__blk_mq_requeue_request()
    wbt_requeue()   - clear the sync issue time
wb_timer_fn()       - accounts the latency of sync IO and adjusts the limits of
                      the different IO types
Yeah, it looks like the kyber IO scheduler.
But there is a big difference regarding the action taken when the limit is reached.
When we access a block device directly, for example /dev/sda1, we do not go through
the bdev fs first; /dev/ is devtmpfs, not the bdev fs. We can refer to init_special_inode
to see this.
sda1 sda2 sda3 sda4 devtmpfs
| [1]
V
blkdev1 blkdev3 blkdev3 blkdev4 blkdev fs
blkdev - block_device
disk - gendisk
hd - hd_struct
[1] - bdget gets the blkdev with inode->i_rdev (the block devt) from the blkdev fs;
get_gendisk gets the gendisk and partno with the block devt and installs
them on blkdev->bd_disk and blkdev->bd_partno
(a small userspace sketch of the block devt follows the flow below)
In a real workload, the flow is as follows:
mount_bdev
sget
set_bdev_super xxx_get_block
set sb->s_bdev map_bh
bh->bdev = sb->s_bdev
|
V
submit_bh_wbc
bio_set_dev(bio, bh->b_bdev)
bio->bi_disk = bdev->bd_disk
bio->bi_partno = bdev->bd_partno
|
V
generic_make_request
generic_make_request_checks
blk_partition_remap
bio->bi_iter.bi_sector += hd->start_sect |
bio->bi_partno = 0;
queue->make_request_fn
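As a small illustration of the block devt used in step [1], the sketch below (plain
userspace C, not kernel code) stats a device node and prints the major:minor that
bdget/get_gendisk key on; /dev/sda1 is just an example path.
---
/* Minimal userspace sketch: print the block devt (major:minor) that bdget/
 * get_gendisk use to look up the block_device/gendisk for a node under /dev.
 * The default path is just an example. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/dev/sda1";
	struct stat st;

	if (stat(path, &st) < 0) {
		perror("stat");
		return 1;
	}
	if (!S_ISBLK(st.st_mode)) {
		fprintf(stderr, "%s is not a block device\n", path);
		return 1;
	}
	/* st_rdev is the devt stored in inode->i_rdev by init_special_inode */
	printf("%s devt = %u:%u\n", path, major(st.st_rdev), minor(st.st_rdev));
	return 0;
}
---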
Let's look at how the following sysfs interfaces are added.
/sys/block/nvme/queue/
^ ^ ^
[1] [2] [3]
/sys/block/nvme/mq
^
[4]
genhd_device_init
---
/* create top-level block dir */
if (!sysfs_deprecated)
block_depr = kobject_create_and_add("block", NULL);
---
device_add_disk
-> __device_add_disk
-> register_disk
-> sysfs_create_link(block_depr, &ddev->kobj,
kobject_name(&ddev->kobj));
device_add_disk
-> __device_add_disk
-> blk_register_queue
-> kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue")
The ktype of q->kobj is blk_queue_ktype
device_add_disk
-> __device_add_disk
-> blk_register_queue
-> __blk_mq_register_dev parent dev gendisk->part0.__dev
-> kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq")
The first thing blk_cleanup_queue needs to do is prevent others from
entering the blk path again. This is achieved by invoking blk_set_queue_dying.
void blk_set_queue_dying(struct request_queue *q)
{
blk_queue_flag_set(QUEUE_FLAG_DYING, q);
/*
* When queue DYING flag is set, we need to block new req
* entering queue, so we call blk_freeze_queue_start() to
* prevent I/O from crossing blk_queue_enter().
*/
blk_freeze_queue_start(q);
if (q->mq_ops)
blk_mq_wake_waiters(q);
wake up the tag waiters; the hw queues will be run.
The DYING flag is not the same as QUIESCED: the latter prevents requests from
entering the lldd.
else {
...
}
/* Make blk_queue_enter() reexamine the DYING flag. */
wake_up_all(&q->mq_freeze_wq);
}
blk_queue_dying and blk_queue_enter gate the other contexts out of the blk path
(see the QUEUE_FLAG_DYING check points listed above).
Then blk_cleanup_queue invokes blk_freeze_queue. It fends off any new requests and
drains all existing requests, whether pending or outstanding.
Even after the queue has been drained, there may still be contexts that access
request_queue resources, such as the blk-mq run work and the requeue work.
blk_sync_queue is used to flush them.
void blk_sync_queue(struct request_queue *q)
{
del_timer_sync(&q->timeout);
cancel_work_sync(&q->timeout_work);
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
int i;
cancel_delayed_work_sync(&q->requeue_work);
queue_for_each_hw_ctx(q, hctx, i)
cancel_delayed_work_sync(&hctx->run_work);
} else {
cancel_delayed_work_sync(&q->delay_work);
}
}
Finally, blk_put_queue puts a reference of q->kobj.
When the reference count reaches zero, blk_queue_ktype's blk_release_queue will be invoked.
It queues __blk_release_queue, which does the final release.
Note that the gendisk takes an extra reference on its request_queue
in __device_add_disk and puts it in disk_release, so the request_queue sticks around
as long as the gendisk does.
What is blk_integrity for ?
[ system memory ]
|
| D
| M path1
| A
V sas/fc/iscsi
[ HBA memory]- - - - - - - - ->[ storage volume ]
path2
The data integrity on path2 can be ensured by the transport protocol.
Path1 is protected by blk_integrity, which we will discuss next.
How is blk_integrity implemented ?
Quote from Documentation/block/data-integrity.txt
Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity). This optional profile is registered
with the block layer using blk_integrity_register().
The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.
Let's look at how the scsi sd driver implements this.
sd_probe_async
-> sd_dif_config_host
--
/* Enable DMA of protection information */
if (scsi_host_get_guard(sdkp->device->host) & SHOST_DIX_GUARD_IP) {
if (type == T10_PI_TYPE3_PROTECTION)
bi.profile = &t10_pi_type3_ip;
else
bi.profile = &t10_pi_type1_ip;
bi.flags |= BLK_INTEGRITY_IP_CHECKSUM;
} else
if (type == T10_PI_TYPE3_PROTECTION)
bi.profile = &t10_pi_type3_crc;
else
bi.profile = &t10_pi_type1_crc;
bi.tuple_size = sizeof(struct t10_pi_tuple);
sd_printk(KERN_NOTICE, sdkp,
"Enabling DIX %s protection\n", bi.profile->name);
if (dif && type) {
bi.flags |= BLK_INTEGRITY_DEVICE_CAPABLE;
if (!sdkp->ATO)
goto out;
if (type == T10_PI_TYPE3_PROTECTION)
bi.tag_size = sizeof(u16) + sizeof(u32);
else
bi.tag_size = sizeof(u16);
sd_printk(KERN_NOTICE, sdkp, "DIF application tag size %u\n",
bi.tag_size);
}
out:
blk_integrity_register(disk, &bi);
--
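The tuple_size and tag_size arithmetic above becomes clearer with the 8-byte PI tuple
laid out explicitly. The sketch below mirrors the kernel's struct t10_pi_tuple field
order (guard, app tag, ref tag); it is an illustration, not the kernel definition itself.
---
/* Sketch of the 8-byte T10 PI tuple attached to each protected logical block,
 * matching sizeof(struct t10_pi_tuple) used for bi.tuple_size above. */
#include <stdint.h>
#include <assert.h>

struct t10_pi_tuple_sketch {
	uint16_t guard_tag;	/* CRC16 (or IP checksum) of the data block       */
	uint16_t app_tag;	/* application tag, owned by the initiator        */
	uint32_t ref_tag;	/* reference tag, usually the low 32 bits of LBA  */
};

int main(void)
{
	/* Type 1/2: only the 2-byte app_tag is free for the application.
	 * Type 3: both app_tag and ref_tag (2 + 4 bytes) are free, which
	 * matches bi.tag_size = sizeof(u16) + sizeof(u32) above. */
	assert(sizeof(struct t10_pi_tuple_sketch) == 8);
	return 0;
}
---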
The process of blk_integrity
blk_mq_make_request
-> bio_integrity_prep
-> bio_integrity_add_page //bio->bi_integrity
-> bio_integrity_process(bio, &bio->bi_iter, bi->profile->generate_fn); //bio_data_dir(bio) == WRITE)
bio_endio
-> bio_integrity_endio
-> __bio_integrity_endio
--
if (bio_op(bio) == REQ_OP_READ && !bio->bi_status &&
(bip->bip_flags & BIP_BLOCK_INTEGRITY) && bi->profile->verify_fn) {
INIT_WORK(&bip->bip_work, bio_integrity_verify_fn);
queue_work(kintegrityd_wq, &bip->bip_work);
return false;
}
--
static void bio_integrity_verify_fn(struct work_struct *work)
{
struct bio_integrity_payload *bip =
container_of(work, struct bio_integrity_payload, bip_work);
struct bio *bio = bip->bip_bio;
struct blk_integrity *bi = blk_get_integrity(bio->bi_disk);
struct bvec_iter iter = bio->bi_iter;
/*
* At the moment verify is called bio's iterator was advanced
* during split and completion, we need to rewind iterator to
* it's original position.
*/
if (bio_rewind_iter(bio, &iter, iter.bi_done)) {
bio->bi_status = bio_integrity_process(bio, &iter,
bi->profile->verify_fn);
} else {
bio->bi_status = BLK_STS_IOERR;
}
bio_integrity_free(bio);
bio_endio(bio);
}
blk_integrity and fs
After the request is issued to HBA, the data will be transported to HBA internal
buffer through DMA and then verify it based on protection meta data. During the
DMA transporting, the data in the sglist (page caches) cannot be be
modified. This is guaranteed by fs.
Steps of writing data to a file:
1. writing into the page cache
aops.write_begin
-> lock page
-> wait_for_stable_page
-> if bdi_cap_stable_pages_required //BDI_CAP_STABLE_WRITES
wait_on_page_writeback
copy from user buffer to page cache
aops.write_end
2. writeback the pagecache to disk
lock page
set page writeback
submit_bio
unlock page
3. io completion
end bio
-> end_page_writeback
-> test_clear_page_writeback
-> wake_up_page(page, PG_writeback)
BDI_CAP_STABLE_WRITES is set in blk_integrity_register.
What's blk-loop for ?
/dev/loopX /home/ubuntu-16.04.4-desktop-amd64.iso
| ^ | |
v | v v
+-------------C-------------------+ +-------+
| vfs cache| | | DIO |
+-------------C-------------------+ +-------+
| | | |
v | v v
+-------------C------------------------------+
| block layer | |
+-------------C------------------------------+
| | |
v | v
blk-loop driver SCSI layer
The backend of a block device could be an HDD, an SSD, or a storage subsystem attached
via FC or iSCSI, and it could also be a local file.
There is another concept here: direct IO.
With direct IO, data from applications goes directly to the block layer, bypassing the
page cache.
Step 1
/dev/loop-control
loop_ctl_fops
-> loop_control_ioctl //LOOP_CTL_ADD
-> loop_add
There are a lot of interesting things in loop_add, let's look at it.
static int loop_add(struct loop_device **l, int i)
{
struct loop_device *lo;
struct gendisk *disk;
int err;
err = -ENOMEM;
lo = kzalloc(sizeof(*lo), GFP_KERNEL);
if (!lo)
goto out;
lo->lo_state = Lo_unbound; //This means no file is bound on this device
/* allocate id, if @id >= 0, we're requesting that specific id */
if (i >= 0) {
err = idr_alloc(&loop_index_idr, lo, i, i + 1, GFP_KERNEL);
if (err == -ENOSPC)
err = -EEXIST;
} else {
err = idr_alloc(&loop_index_idr, lo, 0, 0, GFP_KERNEL);
}
if (err < 0)
goto out_free_dev;
i = err;
err = -ENOMEM;
lo->tag_set.ops = &loop_mq_ops;
lo->tag_set.nr_hw_queues = 1;
/*
It would be an interesting exercise to figure out how many hw_queues are
needed to get better performance.
The real work is done in the loop kthread; all .queue_rq does is insert a
work item or wake up the kthread.
*/
lo->tag_set.queue_depth = 128;
lo->tag_set.numa_node = NUMA_NO_NODE;
lo->tag_set.cmd_size = sizeof(struct loop_cmd);
lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
lo->tag_set.driver_data = lo;
err = blk_mq_alloc_tag_set(&lo->tag_set);
if (err)
goto out_free_idr;
lo->lo_queue = blk_mq_init_queue(&lo->tag_set);
if (IS_ERR_OR_NULL(lo->lo_queue)) {
err = PTR_ERR(lo->lo_queue);
goto out_cleanup_tags;
}
lo->lo_queue->queuedata = lo;
blk_queue_max_hw_sectors(lo->lo_queue, BLK_DEF_MAX_SECTORS);
/*
* By default, we do buffer IO, so it doesn't make sense to enable
* merge because the I/O submitted to backing file is handled page by
* page. For directio mode, merge does help to dispatch bigger request
* to underlayer disk. We will enable merge once directio is enabled.
*/
queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
err = -ENOMEM;
disk = lo->lo_disk = alloc_disk(1 << part_shift);
...
disk->fops = &lo_fops; //this the fops for /dev/loopX
disk->private_data = lo;
disk->queue = lo->lo_queue;
sprintf(disk->disk_name, "loop%d", i);
add_disk(disk);
*l = lo;
return lo->lo_number;
...
}
Step 2
/dev/loopX
lo_fops
-> lo_ioctl //LOOP_SET_FD
-> loop_set_fd
static int loop_set_fd(struct loop_device *lo, fmode_t mode,
struct block_device *bdev, unsigned int arg)
{
...
file = fget(arg);
if (!file)
goto out;
...
mapping = file->f_mapping;
inode = mapping->host;
//regular file or block file
if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
goto out_putf;
if (!(file->f_mode & FMODE_WRITE) || !(mode & FMODE_WRITE) ||
!file->f_op->write_iter)
lo_flags |= LO_FLAGS_READ_ONLY;
error = -EFBIG;
size = get_loop_size(lo, file);
if ((loff_t)(sector_t)size != size)
goto out_putf;
error = loop_prepare_queue(lo);
kthread_init_worker(&lo->worker);
lo->worker_task = kthread_run(loop_kthread_worker_fn,
&lo->worker, "loop%d", lo->lo_number);
if (IS_ERR(lo->worker_task))
return -ENOMEM;
set_user_nice(lo->worker_task, MIN_NICE);
set_device_ro(bdev, (lo_flags & LO_FLAGS_READ_ONLY) != 0);
lo->use_dio = false;
lo->lo_device = bdev;
lo->lo_flags = lo_flags;
lo->lo_backing_file = file;
lo->transfer = NULL;
lo->ioctl = NULL;
lo->lo_sizelimit = 0;
lo->old_gfp_mask = mapping_gfp_mask(mapping);
mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
blk_queue_write_cache(lo->lo_queue, true, false);
loop_update_dio(lo);
set_capacity(lo->lo_disk, size);
bd_set_size(bdev, size << 9);
loop_sysfs_init(lo);
/* let user-space know about the new size */
kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE);
set_blocksize(bdev, S_ISBLK(inode->i_mode) ?
block_size(inode->i_bdev) : PAGE_SIZE);
lo->lo_state = Lo_bound;
...
}
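Both steps can be driven from userspace with the ioctls named above (this is essentially
what losetup does). A minimal sketch, with the backing file path taken from the diagram
earlier and error handling mostly omitted:
---
/* Minimal userspace sketch: ask /dev/loop-control for a free loop device,
 * then bind a backing file to it with LOOP_SET_FD and detach it again.
 * The backing file path is just an example. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/loop.h>

int main(void)
{
	char loopname[64];
	int ctl_fd, loop_fd, backing_fd, devnr;

	ctl_fd = open("/dev/loop-control", O_RDWR);
	if (ctl_fd < 0) {
		perror("open /dev/loop-control");
		return 1;
	}
	devnr = ioctl(ctl_fd, LOOP_CTL_GET_FREE);	/* ends up in loop_add() */
	if (devnr < 0) {
		perror("LOOP_CTL_GET_FREE");
		return 1;
	}
	snprintf(loopname, sizeof(loopname), "/dev/loop%d", devnr);

	loop_fd = open(loopname, O_RDWR);
	backing_fd = open("/home/ubuntu-16.04.4-desktop-amd64.iso", O_RDWR);

	/* ends up in loop_set_fd() */
	if (ioctl(loop_fd, LOOP_SET_FD, backing_fd) < 0) {
		perror("LOOP_SET_FD");
		return 1;
	}
	printf("%s is now backed by the iso file\n", loopname);

	ioctl(loop_fd, LOOP_CLR_FD, 0);		/* detach; ends up in loop_clr_fd() */
	close(backing_fd);
	close(loop_fd);
	close(ctl_fd);
	return 0;
}
---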
When a request enters .queue_rq, how should it be handled next?
It needs to be handled in another context, because we already own a deep stack
from vfs_read/write down to the driver's .queue_rq. This context could be a
kworker or a standalone kthread. But which one should we use?
commit e03a3d7 ( block: loop: use kthread_work ) changed the block loop from work
to kthread context. Let's look at what block loop does before and after this patch.
Work based.
Concurrently Sequentially
Read Read Read Read Write<->Write<->Write<->Write
+---+ +---+ +---+ +---+ +---+
| W | | W | | W | | W | | W |
+---+ +---+ +---+ +---+ +---+
| | | | |
+ -v- - - v - - -v- - - v - - - - v - - +
| Unbound worker pool |
+ - - - - - - - - - - - - - - - - - - - +
+---+
| W | work instance
+---+
For reads, block loop issues them concurrently as far as possible.
This is because read operations often need to wait for the page cache to be
filled, i.e. they are usually synchronous. Issuing reads concurrently is good for
random reads, but it is not so beneficial for sequential reads, which often
hit the page cache.
For writes, block loop issues them sequentially, because writes usually land
in the page cache and are therefore fast enough.
Write<->Write<->Read<->Read<->Write ....
+- - - - -+
| kthread |
+- - - - -+
Once DIO/AIO is introduced, reads/writes on the backing file are no longer blocking
operations.
In Linux, read operations are almost always synchronous unless the required data is
already in the page cache; otherwise, the reader has to wait for the page cache to be
filled by the block device through the block layer and the blk driver. Even with the
readahead mechanism, the page cache is rarely hit for random reads.
Consequently, the loop driver's execution context (kworker or standalone kthread)
has to wait, and this delays the other requests whose page cache may already be
populated.
On the other hand, two layers of page cache are involved: one for the file over the
loop device, one for the backing file. This is unnecessary and wastes memory.
Ming Lei introduced backing-file DIO and AIO support in block loop.
commit bc07c10a3603a5ab3ef01ba42b3d41f9ac63d1b6
Author: Ming Lei
After that, we get following diagram.
/dev/loopX > /home/ubuntu-16.04.4-desktop-amd64.iso
| / |
v / v
+-------------+ / +-------+
| vfs cache| | / | DIO |
+-------------+ / +-------+
| / |
v / v
+-------------C-----------------------------+
| block layer | |
+-------------C-----------------------------+
| | |
v | v
blk-loop driver SCSI layer
Before looking into the implementation of blk-stats in the kernel, let's first look at
how the information it provides is used by iostat.
#iostat -c -d -x /dev/sda2 2 100
Linux 4.16.0-rc3+ (will-ThinkPad-L470) 03/20/2018 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
12.61 0.03 2.23 0.82 0.00 84.31
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda2 0.20 5.86 2.46 4.04 23.54 56.83 24.72 0.14 20.56 6.17 29.31 5.67 3.69
rrqm/s The number of read requests merged per second queued to the device.
wrqm/s The number of write requests merged per second queued to the device.
r/s The number of read requests issued to the device per second.
w/s The number of write requests issued to the device per second.
avgrq-sz The average size (in sectors) of the requests issued to the device.
avgqu-sz The average queue length of the requests issued to the device.
await The average time (milliseconds) for I/O requests issued to the device to be served.
This includes the time spent by the requests in queue and the time spent servicing them.
r_await The average time (in milliseconds) for read requests issued to the device to be served.
This includes the time spent by the requests in queue and the time spent servicing them.
w_await The average time (in milliseconds) for write requests issued to the device to be served.
This includes the time spent by the requests in queue and the time spent servicing them.
svctm The average service time (in milliseconds) for I/O requests issued to the device.
Warning! Do not trust this field; it will be removed in a future version of sysstat.
%util Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device).
Device saturation occurs when this values is close to 100%.
How to calculate them ?
Based on write_ext_stat
ioi, ioj    two samples: ioi is the current one, ioj the previous one
itv         interval between the two samples
rrqm/s      (ioi->rd_merges - ioj->rd_merges)/itv
wrqm/s      (ioi->wr_merges - ioj->wr_merges)/itv
r/s         (ioi->rd_ios - ioj->rd_ios)/itv
w/s         (ioi->wr_ios - ioj->wr_ios)/itv
avgrq-sz    ((ioi->rd_sect - ioj->rd_sect) + (ioi->wr_sect - ioj->wr_sect))/
            (ioi->nr_ios - ioj->nr_ios)
avgqu-sz    (ioi->rq_ticks - ioj->rq_ticks)/itv
await       ((ioi->rd_ticks - ioj->rd_ticks) + (ioi->wr_ticks - ioj->wr_ticks))/
            (ioi->nr_ios - ioj->nr_ios)
r_await     similar to await, reads only
w_await     similar to await, writes only
%util       (ioi->tot_ticks - ioj->tot_ticks)/itv
We can refer to read_diskstats_stat to see where this
data comes from.
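To make the formulas above concrete, here is a minimal userspace sketch that samples
/proc/diskstats twice and derives r/s, w/s and %util for one device; the device name
and the 2-second interval are arbitrary, and only the classic 14-field line format is
assumed.
---
/* Minimal sketch of what iostat does: sample /proc/diskstats twice and derive
 * r/s, w/s and %util for one device.  Assumes the classic 14-field format
 * (major minor name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct sample { unsigned long rd_ios, wr_ios, tot_ticks; };

static int read_sample(const char *dev, struct sample *s)
{
	char line[256], name[64];
	unsigned long rd_ios, rd_merges, rd_sec, rd_ticks;
	unsigned long wr_ios, wr_merges, wr_sec, wr_ticks;
	unsigned long ios_pgr, tot_ticks, rq_ticks;
	unsigned int major, minor;
	FILE *fp = fopen("/proc/diskstats", "r");

	if (!fp)
		return -1;
	while (fgets(line, sizeof(line), fp)) {
		if (sscanf(line, "%u %u %63s %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
			   &major, &minor, name,
			   &rd_ios, &rd_merges, &rd_sec, &rd_ticks,
			   &wr_ios, &wr_merges, &wr_sec, &wr_ticks,
			   &ios_pgr, &tot_ticks, &rq_ticks) == 14 &&
		    strcmp(name, dev) == 0) {
			s->rd_ios = rd_ios;
			s->wr_ios = wr_ios;
			s->tot_ticks = tot_ticks;	/* io_ticks, in ms */
			fclose(fp);
			return 0;
		}
	}
	fclose(fp);
	return -1;
}

int main(void)
{
	const char *dev = "sda";	/* example device name */
	double itv = 2.0;		/* sampling interval in seconds */
	struct sample a, b;

	if (read_sample(dev, &a))
		return 1;
	sleep((unsigned int)itv);
	if (read_sample(dev, &b))
		return 1;

	printf("r/s   = %.2f\n", (b.rd_ios - a.rd_ios) / itv);
	printf("w/s   = %.2f\n", (b.wr_ios - a.wr_ios) / itv);
	/* tot_ticks is in milliseconds, so %util = delta / (itv * 1000) * 100 */
	printf("%%util = %.2f%%\n", (b.tot_ticks - a.tot_ticks) / (itv * 10.0));
	return 0;
}
---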
Next, let's find out how these statistics are generated in the kernel.
Based on diskstats_show
The following members live in hd_struct.dkstats (a percpu variable); the
references below show where each one is updated.
__blk_mq_end_request
-> blk_account_io_done
-> part_stat_inc(cpu, part, ios[rw]);
bio_attempt_back/front/discard_merge
-> blk_account_io_start // new_io == false
-> part_stat_inc(cpu, part, merges[rw]);
blk_mq_end_request
-> blk_update_request
-> blk_account_io_completion
-> part_stat_add(cpu, part, sectors[rw], bytes >> 9);
__blk_mq_end_request
-> blk_account_io_done
-> part_stat_add(cpu, part, ticks[rw], (jiffies - req->start_time));
rq->start_time is set in blk_mq_rq_ctx_init and inherits the smaller start_time of the merged rqs.
What if the duration here is smaller than 1 jiffy?
This can happen on a machine with a high-speed storage device and a low HZ.
part_round_stats is called from blk_account_io_start/merge/done and from
diskstats_show/part_stat_show:
part_round_stats
-> part_in_flight // f299b7c (blk-mq: provide internal in-flight variant)
-> blk_mq_in_flight
-> part_round_stats_single
static void part_round_stats_single(struct request_queue *q, int cpu,
struct hd_struct *part, unsigned long now,
unsigned int inflight)
{
if (inflight) {
__part_stat_add(cpu, part, time_in_queue,
inflight * (now - part->stamp));
__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
}
part->stamp = now;
}
io_ticks accumulates the time during which there was in-flight IO in the request queue.
read_diskstats_stat
void read_diskstats_stat(int curr)
{
...
if ((fp = fopen(DISKSTATS, "r")) == NULL) // proc/diskstats
return;
while (fgets(line, 256, fp) != NULL) {
/* major minor name rio rmerge rsect ruse wio wmerge wsect wuse running use aveq */
i = sscanf(line, "%u %u %s %lu %lu %lu %lu %lu %lu %lu %u %u %u %u",
&major, &minor, dev_name,
&rd_ios, &rd_merges_or_rd_sec, &rd_sec_or_wr_ios, &rd_ticks_or_wr_sec,
&wr_ios, &wr_merges, &wr_sec, &wr_ticks, &ios_pgr, &tot_ticks, &rq_ticks);
if (i == 14) {
/* Device or partition */
if (!dlist_idx && !DISPLAY_PARTITIONS(flags) &&
!is_device(dev_name, ACCEPT_VIRTUAL_DEVICES))
continue;
sdev.rd_ios = rd_ios;
sdev.rd_merges = rd_merges_or_rd_sec;
sdev.rd_sectors = rd_sec_or_wr_ios;
sdev.rd_ticks = (unsigned int) rd_ticks_or_wr_sec;
sdev.wr_ios = wr_ios;
sdev.wr_merges = wr_merges;
sdev.wr_sectors = wr_sec;
sdev.wr_ticks = wr_ticks;
sdev.ios_pgr = ios_pgr;
sdev.tot_ticks = tot_ticks;
sdev.rq_ticks = rq_ticks;
}
...
save_stats(dev_name, curr, &sdev, iodev_nr, st_hdr_iodev);
}
...
}
diskstats_show
static int diskstats_show(struct seq_file *seqf, void *v)
{
struct gendisk *gp = v;
struct disk_part_iter piter;
struct hd_struct *hd;
char buf[BDEVNAME_SIZE];
unsigned int inflight[2];
int cpu;
/*
if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
seq_puts(seqf, "major minor name"
" rio rmerge rsect ruse wio wmerge "
"wsect wuse running use aveq"
"\n\n");
*/
disk_part_iter_init(&piter, gp, DISK_PITER_INCL_EMPTY_PART0);
while ((hd = disk_part_iter_next(&piter))) {
cpu = part_stat_lock();
part_round_stats(gp->queue, cpu, hd);
part_stat_unlock();
part_in_flight(gp->queue, hd, inflight);
seq_printf(seqf, "%4d %7d %s %lu %lu %lu "
"%u %lu %lu %lu %u %u %u %u\n",
MAJOR(part_devt(hd)), MINOR(part_devt(hd)),
disk_name(gp, hd->partno, buf),
part_stat_read(hd, ios[READ]),
part_stat_read(hd, merges[READ]),
part_stat_read(hd, sectors[READ]),
jiffies_to_msecs(part_stat_read(hd, ticks[READ])),
part_stat_read(hd, ios[WRITE]),
part_stat_read(hd, merges[WRITE]),
part_stat_read(hd, sectors[WRITE]),
jiffies_to_msecs(part_stat_read(hd, ticks[WRITE])),
inflight[0],
jiffies_to_msecs(part_stat_read(hd, io_ticks)),
jiffies_to_msecs(part_stat_read(hd, time_in_queue))
);
}
disk_part_iter_exit(&piter);
return 0;
}
write_ext_stat
void write_ext_stat(int curr, unsigned long long itv, int fctr,
struct io_hdr_stats *shi, struct io_stats *ioi,
struct io_stats *ioj)
{
char *devname = NULL;
struct stats_disk sdc, sdp;
struct ext_disk_stats xds;
double r_await, w_await;
/*
* Counters overflows are possible, but don't need to be handled in
* a special way: The difference is still properly calculated if the
* result is of the same type as the two values.
* Exception is field rq_ticks which is incremented by the number of
* I/O in progress times the number of milliseconds spent doing I/O.
* But the number of I/O in progress (field ios_pgr) happens to be
* sometimes negative...
*/
sdc.nr_ios = ioi->rd_ios + ioi->wr_ios;
sdp.nr_ios = ioj->rd_ios + ioj->wr_ios;
sdc.tot_ticks = ioi->tot_ticks;
sdp.tot_ticks = ioj->tot_ticks;
sdc.rd_ticks = ioi->rd_ticks;
sdp.rd_ticks = ioj->rd_ticks;
sdc.wr_ticks = ioi->wr_ticks;
sdp.wr_ticks = ioj->wr_ticks;
sdc.rd_sect = ioi->rd_sectors;
sdp.rd_sect = ioj->rd_sectors;
sdc.wr_sect = ioi->wr_sectors;
sdp.wr_sect = ioj->wr_sectors;
compute_ext_disk_stats(&sdc, &sdp, itv, &xds);
r_await = (ioi->rd_ios - ioj->rd_ios) ?
(ioi->rd_ticks - ioj->rd_ticks) /
((double) (ioi->rd_ios - ioj->rd_ios)) : 0.0;
w_await = (ioi->wr_ios - ioj->wr_ios) ?
(ioi->wr_ticks - ioj->wr_ticks) /
((double) (ioi->wr_ios - ioj->wr_ios)) : 0.0;
/* Print device name */
if (DISPLAY_PERSIST_NAME_I(flags)) {
devname = get_persistent_name_from_pretty(shi->name);
}
if (!devname) {
devname = shi->name;
}
if (DISPLAY_HUMAN_READ(flags)) {
printf("%s\n%13s", devname, "");
}
else {
printf("%-13s", devname);
}
/* rrq/s wrq/s r/s w/s rsec wsec rqsz qusz await r_await w_await svctm %util */
printf(" %8.2f %8.2f %7.2f %7.2f %8.2f %8.2f %8.2f %8.2f %7.2f %7.2f %7.2f %6.2f %6.2f\n",
S_VALUE(ioj->rd_merges, ioi->rd_merges, itv),
S_VALUE(ioj->wr_merges, ioi->wr_merges, itv),
S_VALUE(ioj->rd_ios, ioi->rd_ios, itv),
S_VALUE(ioj->wr_ios, ioi->wr_ios, itv),
ll_s_value(ioj->rd_sectors, ioi->rd_sectors, itv) / fctr,
ll_s_value(ioj->wr_sectors, ioi->wr_sectors, itv) / fctr,
xds.arqsz,
S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0,
xds.await,
r_await,
w_await,
/* The ticks output is biased to output 1000 ticks per second */
xds.svctm,
/*
* Again: Ticks in milliseconds.
* In the case of a device group (option -g), shi->used is the number of
* devices in the group. Else shi->used equals 1.
*/
shi->used ? xds.util / 10.0 / (double) shi->used
: xds.util / 10.0); /* shi->used should never be null here */
}
There is a per-request_queue timer to defend against a block device that stops responding.
The timer is armed by blk_add_timer.
The timer is request_queue.timeout and its timeout fn is blk_rq_timed_out_timer.
static void blk_rq_timed_out_timer(struct timer_list *t)
{
struct request_queue *q = from_timer(q, t, timeout);
kblockd_schedule_work(&q->timeout_work);
}
The main work of timeout handling is executed in kworker context.
There is a difference between blk-legacy and blk-mq.
In blk-legacy, when the timer is armed, the request is added to request_queue.timeout_list,
and when the request completes it is removed from that list.
blk_requeue_request/blk_finish_request
-> blk_delete_timer
blk_timeout_work then checks the requests on request_queue.timeout_list.
In blk-mq, request_queue.timeout_list is not used any more; instead,
blk_mq_queue_tag_busy_iter is employed, which uses the occupied
driver tags to track the requests.
static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
struct bt_iter_data *iter_data = data;
struct blk_mq_hw_ctx *hctx = iter_data->hctx;
struct blk_mq_tags *tags = hctx->tags;
bool reserved = iter_data->reserved;
struct request *rq;
if (!reserved)
bitnr += tags->nr_reserved_tags;
rq = tags->rqs[bitnr];
/*
* We can hit rq == NULL here, because the tagging functions
* test and set the bit before assining ->rqs[].
*/
if (rq && rq->q == hctx->queue)
iter_data->fn(hctx, rq, iter_data->data, reserved);
return true;
}
When there is no io scheduler, a request always occupies a driver tag.
If the lldd prevents new requests from entering through blk_mq_quiesce_queue or
some other way, and request_queue.timeout has been armed, will the requests in
the blk-mq queues be expired?
Note that when a request is completed in blk-mq, there is no blk_delete_timer in
__blk_mq_complete_request or __blk_mq_end_request.
Another difference is how the race between timeout handling and regular
completion is handled.
blk-legacy employs blk_mark_rq_complete.
void blk_complete_request(struct request *req)
{
if (unlikely(blk_should_fake_timeout(req->q)))
return;
if (!blk_mark_rq_complete(req))
__blk_complete_request(req);
}
static void blk_rq_check_expired(struct request *rq, unsigned long *next_timeout,
unsigned int *next_set)
{
const unsigned long deadline = blk_rq_deadline(rq);
if (time_after_eq(jiffies, deadline)) {
list_del_init(&rq->timeout_list);
/*
* Check if we raced with end io completion
*/
if (!blk_mark_rq_complete(rq))
blk_rq_timed_out(rq);
} else if (!*next_set || time_after(*next_timeout, deadline)) {
*next_timeout = deadline;
*next_set = 1;
}
}
In blk-mq, after tejun's blk-mq: reimplement timeout handling
(https://lkml.org/lkml/2018/1/9/761), blk_mark_rq_complete has been discarded.
rcu/srcu is employed to synchronize the timeout path and the regular completion
path instead of atomic operations. In addition, it avoids the following
scenario.
blk_mq_check_expired
---
deadline = READ_ONCE(rq->deadline);
A delay could be introduced here by preemption, an interrupt, or something else;
during this window the rq may be completed and freed, then allocated and
reinitialized by someone else, and we would time out the new instance here.
if (time_after_eq(jiffies, deadline)) {
if (!blk_mark_rq_complete(rq)) {
blk_mq_rq_timed_out(rq, reserved);
}
---
After tejun's commit, things become this:
blk_mq_check_expired
---
/* read coherent snapshots of @rq->state_gen and @rq->deadline */
while (true) {
start = read_seqcount_begin(&rq->gstate_seq);
gstate = READ_ONCE(rq->gstate);
deadline = blk_rq_deadline(rq);
if (!read_seqcount_retry(&rq->gstate_seq, start))
break;
cond_resched();
}
A delay could be introduced here by preemption, an interrupt, or something else;
during this window the rq may be completed and freed, then allocated and
reinitialized by someone else.
/* if in-flight && overdue, mark for abortion */
if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
time_after_eq(jiffies, deadline)) {
blk_mq_rq_update_aborted_gstate(rq, gstate);
data->nr_expired++;
hctx->nr_expired++;
}
---
static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
{
/*
* We marked @rq->aborted_gstate and waited for RCU. If there were
* completions that we lost to, they would have finished and
* updated @rq->gstate by now; otherwise, the completion path is
* now guaranteed to see @rq->aborted_gstate and yield. If
* @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
*/
Note: the rcu/srcu synchronization happens between blk_mq_check_expired and
blk_mq_terminate_expired.
if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
READ_ONCE(rq->gstate) == rq->aborted_gstate)
gstate has two parts: generation and state.
When we saved the gstate into aborted_gstate, its state was MQ_RQ_IN_FLIGHT.
If the recycled new instance has not been started yet, the state will not match,
because it is MQ_RQ_IDLE; if it has been started, the generation will not match,
because the generation part of gstate is increased when the state switches to
MQ_RQ_IN_FLIGHT.
blk_mq_rq_timed_out(rq, reserved);
}
generic_make_request
|
V
tg_A->sq->queued (qn_A_r_self (bio, bio, bio)) tg_B->sq->queued (qn_B_r_self (bio, bio, bio))
|
V
tg_ABg->sq->queued (qn_ABg_r_self(bio, bio) qn_A_r_parent (bio), qn_B_r_parent (bio bio))
|
V
td->sq->queued (qn_ABg_r_parent(bio))
|
V
generic_make_request (td->dispatch_work context)
bio (w/ BIO_THROTTLED) will not pass
through blk-throttle again.
qn per-tg throtl_qnode, contains throttled bios.
Bios are dispatched qn by qn rather than bio by bio; otherwise one tg could
fill up the budget and starve others (throtl_pop_queued).
There are two dimensions of qn:
r/w          when dispatching, 75% reads, 25% writes (throtl_dispatch_tg)
self/parent  during dispatching, some bios may be queued upwards to the parent's
             sq while others are not. The parent qn holds the bios queued to the
             parent's sq, the self qn holds the others.
sq throtl_service_queue, per-tg or td
construct the hierarchy, td->sq is the root node
queued throtl_qnode
first_pending_disptime
pending_timer, dispatch bios upwards to parent sq until td->sq, queue td dispatch_work
tg throtl_grp, per (blk-throt cgroup - request_queue)
bps,iops limits, bytes, ios dispatched number
td throtl_data, per-request_queue
queued[r/w] qn list, only the bios that have reached here can be issued.
dispatch_work, generic_make_request
limit_index (LOW/MAX)
How to account the bps and iops ?
current
|
tg->slice_start v tg->slice_end
|-------|------|-------|------| ....
|< - - - - - - - - ->|
V
elapsed_rnd
limit = tg_bps/iops_limit(tg, rw) * elapsed_rnd
| - - - | td->throtl_slice
Refer to tg_with_in_bps_limit/tg_with_in_iops_limit.
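A small sketch of this budget arithmetic, with HZ, the field names and the example
numbers standing in for the kernel's (the real tg_with_in_bps_limit also computes how
long to wait, which is omitted here):
---
/* Sketch of the budget check in the bps case: round the elapsed time up to a
 * whole number of throtl_slice periods (elapsed_rnd in the diagram), compute
 * how many bytes that allows at the configured bps limit, and compare with
 * what has already been dispatched. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define HZ 1000UL	/* stand-in value */

static bool tg_within_bps_limit(uint64_t bps_limit, unsigned long throtl_slice,
				unsigned long slice_start, unsigned long jiffies,
				uint64_t bytes_disp, uint64_t bio_size)
{
	unsigned long elapsed = jiffies - slice_start;
	/* round up to a multiple of throtl_slice */
	unsigned long elapsed_rnd = ((elapsed + throtl_slice - 1) / throtl_slice)
				    * throtl_slice;
	uint64_t bytes_allowed = bps_limit * elapsed_rnd / HZ;

	return bytes_disp + bio_size <= bytes_allowed;
}

int main(void)
{
	/* 10 MB/s limit, 100ms slice, 150ms into the slice, 1 MB already sent,
	 * 256 KB bio: allowed = 10 MB * 200ms / 1000ms = 2 MB, so it may go. */
	bool ok = tg_within_bps_limit(10ULL << 20, 100, 0, 150,
				      1ULL << 20, 256 << 10);
	printf("may dispatch: %s\n", ok ? "yes" : "no");
	return 0;
}
---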
When the tg->bytes/io_disp is over the limit:
blk_throtl_bio
-> throtl_add_bio_tg
-> set THROTL_TG_WAS_EMPTY when sq->nr_queued == 0
-> throtl_qnode_add_bio(bio, qn, &sq->queued[rw]);
-> add bio to qn, add qn to sq
-> blkg_get(tg_to_blkg(qn->tg))
throttled bio dispatching is an asynchronous event, so we need a
reference on the blkg to prevent it from being freed
-> add tg to parent sq pending rb tree with tg->disptime as key
if THROTL_TG_WAS_EMPTY is set
-> tg_update_disptime
next dispatch time will be calculated here through tg_may_dispatch
-> throtl_schedule_next_dispatch(tg->service_queue.parent_sq, true);
-> update_min_dispatch_time
-> pick up the leftmost node from the parent sq pending rb tree
and update parent_sq->first_pending_disptime
-> throtl_schedule_pending_timer
-> schedule parent_sq pending_timer on first_pending_disptime
Think of a case here:
A bio is throttled and its dispatch time is 5 jiffies. What if a new bio comes
in with a 3 jiffies dispatch time ?
Why does every tg need a dispatch time ?
bio size
^
| o - bio
|
| o2
| | o3
| o0 o1 | |
| | | | |
+-----------------------------------------> time
t0 t1
If we issued o2 at t0, the bps limit would be exceeded, so we have to delay it to
t1; then the bps limit is complied with.
However, what about the following case:
bio size
^
| o - bio
|
| o2 (planed)
| |
| o0 o1o3 |
| | | | |
+-----------------------------------------> time
t0 t1
We have scheduled the parent_sq pending timer for t1 to dispatch o2. When o3 arrives
at t0, the pending_timer would need to expire earlier to dispatch o3, otherwise o3
is delayed. How does blk-throtl handle this case?
There is no such issue,
unless o3 has a higher priority than o2; all blk-throtl does here is
limit the bps.
In fact, blk-throtl maintains the queues of reads and writes separately, so write
bios will not block read bios. And blk-throtl tries to dispatch 75% reads and
25% writes; refer to throtl_dispatch_tg.
We have illustrated the hierarchy structure of blk-throtl. Let's walk through
the source code here.
submit path
generic_make_request
-> generic_make_request_checks
-> blkcg_bio_issue_check
-> blk_throtl_bio
---
while (true) {
if (tg->last_low_overflow_time[rw] == 0)
tg->last_low_overflow_time[rw] = jiffies;
throtl_downgrade_check(tg);
throtl_upgrade_check(tg);
/* throtl is FIFO - if bios are already queued, should queue */
if (sq->nr_queued[rw])
break;
/* if above limits, break to queue */
if (!tg_may_dispatch(tg, bio, NULL)) {
tg->last_low_overflow_time[rw] = jiffies;
if (throtl_can_upgrade(td, tg)) {
throtl_upgrade_state(td);
goto again;
}
break;
}
/* within limits, let's charge and dispatch directly */
throtl_charge_bio(tg, bio);
/*
* We need to trim slice even when bios are not being queued
* otherwise it might happen that a bio is not queued for
* a long time and slice keeps on extending and trim is not
* called for a long time. Now if limits are reduced suddenly
* we take into account all the IO dispatched so far at new
* low rate and * newly queued IO gets a really long dispatch
* time.
*
* So keep on trimming slice even if bio is not queued.
*/
throtl_trim_slice(tg, rw);
/*
* @bio passed through this layer without being throttled.
* Climb up the ladder. If we''re already at the top, it
* can be executed directly.
*/
qn = &tg->qnode_on_parent[rw];
sq = sq->parent_sq; // check limit upward
tg = sq_to_tg(sq);
if (!tg)
goto out_unlock;
}
---
Dispatch path:
static void throtl_pending_timer_fn(struct timer_list *t)
{
...
again:
parent_sq = sq->parent_sq;
dispatched = false;
while (true) {
throtl_log(sq, "dispatch nr_queued=%u read=%u write=%u",
sq->nr_queued[READ] + sq->nr_queued[WRITE],
sq->nr_queued[READ], sq->nr_queued[WRITE]);
ret = throtl_select_dispatch(sq);
-> throtl_dispatch_tg // if tg_may_dispatch
-> tg_dispatch_one_bio
-> throtl_pop_queued
-> throtl_charge_bio
-> add to sq of parent tg or td
if (ret) {
throtl_log(sq, "bios disp=%u", ret);
dispatched = true;
}
there may still be queued bios in the tg
if (throtl_schedule_next_dispatch(sq, false))
break;
/* this dispatch windows is still open, relax and repeat */
spin_unlock_irq(q->queue_lock);
cpu_relax(); //give others a chance to get in; the queued spinlock
//ensures the waiters acquire this lock in turn.
spin_lock_irq(q->queue_lock);
}
if (!dispatched)
goto out_unlock;
if (parent_sq) {
/* @parent_sq is another throl_grp, propagate dispatch */
if (tg->flags & THROTL_TG_WAS_EMPTY) {
tg_update_disptime(tg);
if (!throtl_schedule_next_dispatch(parent_sq, false)) {
/* window is already open, repeat dispatching */
sq = parent_sq;
tg = sq_to_tg(sq);
goto again;
}
}
} else {
/* reached the top-level, queue issueing */
queue_work(kthrotld_workqueue, &td->dispatch_work);
}
out_unlock:
spin_unlock_irq(q->queue_lock);
}
io.low is only available in cgroup2. A cgroup with an io.max limit will never
dispatch more IO than its max limit, but that cannot ensure the cgroup always gets
appropriate bps or iops. For example:
tasks in cgroup_read have a very heavy read workload, and tasks in cgroup_write
have a very heavy write workload. Both issue requests to the same disk with wbt
enabled. The write operations will be limited due to wbt, and the IO performance of
cgroup_write will be very poor while cgroup_read keeps issuing reads.
Neither cgroup exceeds its io.max, but cgroup_write gets very poor
performance. This is not fair for cgroup_write.
Or another example from https://lwn.net/Articles/709474/
An example usage is we have a high prio cgroup with high 'low' limit and a low
prio cgroup with low 'low' limit. If the high prio cgroup isn't running, the low
prio can run above its 'low' limit, so we don't waste the bandwidth. When the
high prio cgroup runs and is below its 'low' limit, low prio cgroup will run
under its 'low' limit. This will protect high prio cgroup to get more
resources.
The ultimate goal is to keep the bps/iops between io.low and io.max.
There are two questions that need to be figured out.
When to switch to io.low limit
Related variables in tg:
Check the bps or iops through last_bytes/io_disp / (jiffies - last_check_time).
If the result is above the io.low limit, set last_low_overflow_time, which means the
bps/iops was higher than io.low during the last period.
If jiffies >= tg->last_low_overflow_time + td->throtl_slice, we say the io.low
limit is reached.
This is done by throtl_downgrade_check.
throtl_downgrade_state switches the limit to LOW.
static void throtl_downgrade_state(struct throtl_data *td, int new)
{
td->scale /= 2;
throtl_log(&td->service_queue, "downgrade, scale %d", td->scale);
if (td->scale) {
td->low_upgrade_time = jiffies - td->scale * td->throtl_slice;
return;
}
td->limit_index = new;
td->low_downgrade_time = jiffies;
}
After switching to the io.low limit, when do we get back to io.max?
After the switch to io.low,
blk_throtl_bio -> tg_may_dispatch -> tg_with_in_bps_limit -> tg_bps_limit will
return the io.low limit through tg->bps[rw][td->limit_index],
so more bios will be throttled and queued.
last_low_overflow_time (bps/iops is higher than the limit) is updated in the following cases:
if limit_index == MAX
^ throttled and queued, blk_throtl_bio updates last_low_overflow_time
|
| ---------------------------------------------- LIMIT_MAX
|
| if limit_index == MAX
| charge and dispatch, throtl_downgrade_check updates last_low_overflow_time
|disp if limit_index == LOW
|bps/ throttled and queue, blk_throtl_bio updates last_low_overflow_time
|iops
|
|
| ---------------------------------------------- LIMIT_LOW
| if limit_index == MAX && time_after(now, last_low_overflow_time + throtl_slice)
| downgrade
| if limit_index == LOW
| charge and dispatch
| if limit_index == LOW && time_after(now, last_low_overflow_time + throtl_slice)
upgrade
Quote from the comment of throtl_tg_is_idle:
cgroup is idle if:
- single idle is too long, longer than a fixed value (in case user
configure a too big threshold) or 4 times of idletime threshold
- average think time is more than threshold
- IO latency is largely below threshold
Think time
The interval between the completion of the previous IO and the submission of the next IO.
blk_throtl_bio_endio will record the time of completion in tg->last_finish_time.
Then in blk_throtl_bio -> blk_throtl_update_idletime, the average think time
will be calculated.
static void blk_throtl_update_idletime(struct throtl_grp *tg)
{
unsigned long now = ktime_get_ns() >> 10;
unsigned long last_finish_time = tg->last_finish_time;
if (now <= last_finish_time || last_finish_time == 0 ||
last_finish_time == tg->checked_last_finish_time)
return;
tg->avg_idletime = (tg->avg_idletime * 7 + now - last_finish_time) >> 3;
tg->checked_last_finish_time = last_finish_time;
}
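The update above is an exponentially weighted moving average: 7/8 of the weight stays
on the history and 1/8 goes to the newest idle interval. A tiny sketch with made-up
samples:
---
/* Sketch of the think-time EWMA: new_avg = (old_avg * 7 + sample) / 8. */
#include <stdio.h>

int main(void)
{
	unsigned long avg_idletime = 0;
	/* example idle intervals (us) between IO completion and next submission */
	unsigned long samples[] = { 800, 800, 800, 50, 50, 50, 50, 50 };

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		avg_idletime = (avg_idletime * 7 + samples[i]) >> 3;
		printf("sample %4lu us -> avg think time %4lu us\n",
		       samples[i], avg_idletime);
	}
	return 0;
}
---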
Latency
The latency here is the interval between issuing the request to the device and its completion.
It reflects the processing capability of the storage device.
If a cgroup's IO latency is below the IO latency threshold, the cgroup is being
handled fairly by the device.
My question is: if a cgroup is below its low limit but its
IO latency is acceptable, we could say this cgroup is served fairly by the device
but not by the block layer, right?
commit comment of b9147dd (blk-throttle: add a mechanism to estimate IO latency)
User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.
To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.
But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.
Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD
only, we probably
can use a fixed threshold like 4ms for hard disk though.
td keeps an average latency per request size separately, and every tg has its own
latency_target, IOW, a tolerance.
For an SSD, when td's average latency is low, we can say the device is
relatively relaxed.
This explains the '&&' with throtl_tg_is_idle, which checks whether the cgroup has
fallen idle.
The samples collection is hooked in blk_stat_add.
blk_stat_add //the latency here is the interval between blk_mq_start_request and __blk_mq_complete_request
-> blk_throtl_stat_add
-> throtl_track_latency
static void throtl_track_latency(struct throtl_data *td, sector_t size,
int op, unsigned long time)
{
struct latency_bucket *latency;
int index;
if (!td || td->limit_index != LIMIT_LOW ||
!(op == REQ_OP_READ || op == REQ_OP_WRITE) ||
!blk_queue_nonrot(td->queue))
//We assume there is no congestion when LIMIT_LOW,
//and the latency make sense only when there is no congestion in device
return;
index = request_bucket_index(size);
latency = get_cpu_ptr(td->latency_buckets[op]);
latency[index].total_latency += time;
latency[index].samples++;
put_cpu_ptr(td->latency_buckets[op]);
}
The Linux sg driver is a upper level SCSI subsystem device driver that is used primarily to handle devices _not_ covered by the other upper
level drivers: sd (disks), st (tapes) and sr (CDROMs and DVDs). The sg driver is used for enclosure management, cd writers,
applications that read cd audio digitally and scanners. Sg can also be used for less usual tasks performed on disks, tapes and cdroms.
Sg is a character device driver which, in some contexts, gives it advantages over block device drivers such as sd and sr. The interface of sg
is at the level of SCSI command requests and their associated responses.
From about Linux kernel 2.6.24, there is an alternate SCSI pass-through driver called "bsg" (block SCSI generic driver). The bsg driver has
device names of the form /dev/bsg/0:1:2:3 and supports the SG_IO ioctl with the sg version 3 interface. The bsg driver also supports the sg
version 4 interface which at this time the sg driver does not. Amongst other improvements the sg version 4 interface supports SCSI bidirectional commands.
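For reference, here is a minimal userspace sketch of the sg v3 SG_IO interface
mentioned above, sending a standard INQUIRY; /dev/sg0 is just an example node, and a
/dev/bsg/H:C:T:L node accepts the same ioctl.
---
/* Minimal sketch of the sg v3 SG_IO interface: send a 6-byte INQUIRY CDB and
 * print the vendor/product strings from the response. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
	unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 };	/* INQUIRY, 96 bytes */
	unsigned char resp[96], sense[32];
	struct sg_io_hdr hdr;
	int fd = open("/dev/sg0", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';			/* sg version 3 interface */
	hdr.cmdp = cdb;
	hdr.cmd_len = sizeof(cdb);
	hdr.dxfer_direction = SG_DXFER_FROM_DEV;
	hdr.dxferp = resp;
	hdr.dxfer_len = sizeof(resp);
	hdr.sbp = sense;
	hdr.mx_sb_len = sizeof(sense);
	hdr.timeout = 5000;			/* ms */

	if (ioctl(fd, SG_IO, &hdr) < 0) {
		perror("SG_IO");
		return 1;
	}
	/* bytes 8..31 of the INQUIRY response hold the vendor and product id */
	printf("vendor/product: %.24s\n", resp + 8);
	return 0;
}
---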
How does it work ?
bidi request
bsg_setup_queue
---
// A new request_queue
q = blk_alloc_queue(GFP_KERNEL);
if (!q)
return ERR_PTR(-ENOMEM);
q->cmd_size = sizeof(struct bsg_job) + dd_job_size;
q->init_rq_fn = bsg_init_rq;
q->exit_rq_fn = bsg_exit_rq;
q->initialize_rq_fn = bsg_initialize_rq;
q->request_fn = bsg_request_fn;
ret = blk_init_allocated_queue(q);
if (ret)
goto out_cleanup_queue;
q->queuedata = dev;
q->bsg_job_fn = job_fn;
blk_queue_flag_set(QUEUE_FLAG_BIDI, q);
blk_queue_softirq_done(q, bsg_softirq_done);
blk_queue_rq_timeout(q, BLK_DEFAULT_SG_TIMEOUT);
ret = bsg_register_queue(q, dev, name, &bsg_transport_ops, release);
---
take write as example:
bsg_write
-> __bsg_write
-> bsg_map_hdr
-> blk_get_request
-> q->bsg_dev.ops->fill_hdr
-> blk_rq_map_user //hdr->dout_xferp points to userland buffer
-> blk_rq_map_user_iov // userland buffer will be mapped directly for zero copy I/O
-> bsg_add_command
-> blk_execute_rq_nowait
bsg_request_fn
-> blk_fetch_request
-> blk_peek_request
-> blk_start_request
-> bsg_prepare_job // kref_init(&job->kref)
-> q->bsg_job_fn
bsg_softirq_done
-> bsg_job_put
-> kref_put(&job->kref, bsg_teardown_job)
bsg_teardown_job
-> blk_end_request_all
This is a very interesting approach: the bsg request will not be completed until
job->kref reaches zero, which fixes the race between the blk-timeout and completion
paths.
Look at the following code:
fc_bsg_job_timeout
---
inflight = bsg_job_get(job);
if (inflight && i->f->bsg_timeout) {
/* call LLDD to abort the i/o as it has timed out */
err = i->f->bsg_timeout(job);
if (err == -EAGAIN) {
bsg_job_put(job);
return BLK_EH_RESET_TIMER;
} else if (err)
printk(KERN_ERR "ERROR: FC BSG request timeout - LLD "
"abort failed with status %d\n", err);
}
/* the blk_end_sync_io() doesn't check the error */
if (!inflight)
return BLK_EH_NOT_HANDLED;
else
return BLK_EH_HANDLED;
---
bidi, aka bidirectional commands: such a command carries both output and input
data at the same time.
Look at bsg_map_hdr
---
if (hdr->dout_xfer_len && hdr->din_xfer_len) {
if (!test_bit(QUEUE_FLAG_BIDI, &q->queue_flags)) {
ret = -EOPNOTSUPP;
goto out;
}
next_rq = blk_get_request(q, REQ_OP_SCSI_IN, GFP_KERNEL);
if (IS_ERR(next_rq)) {
ret = PTR_ERR(next_rq);
goto out;
}
rq->next_rq = next_rq;
ret = blk_rq_map_user(q, next_rq, NULL, uptr64(hdr->din_xferp),
hdr->din_xfer_len, GFP_KERNEL);
if (ret)
goto out_free_nextrq;
}
---
What happens when we do direct IO on a block device?
__generic_file_write_iter
---
if (iocb->ki_flags & IOCB_DIRECT) {
loff_t pos, endbyte;
written = generic_file_direct_write(iocb, from);
if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
goto out;
// if direct_IO doesn't complete all of the IO, fallback to buffered IO.
status = generic_perform_write(file, from, pos = iocb->ki_pos);
...
/*
* We need to ensure that the page cache pages are written to
* disk and invalidated to preserve the expected O_DIRECT
* semantics.
*/
endbyte = pos + status - 1;
err = filemap_write_and_wait_range(mapping, pos, endbyte);
if (err == 0) {
iocb->ki_pos = endbyte + 1;
written += status;
invalidate_mapping_pages(mapping,
pos >> PAGE_SHIFT,
endbyte >> PAGE_SHIFT);
} else {
/*
* We don't know how much we wrote, so just return
* the number of bytes which were direct-written
*/
}
}
---
generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
{
...
if (iocb->ki_flags & IOCB_NOWAIT) {
/* If there are pages to writeback, return */
if (filemap_range_has_page(inode->i_mapping, pos,
pos + iov_iter_count(from)))
return -EAGAIN;
} else {
written = filemap_write_and_wait_range(mapping, pos,
pos + write_len - 1);
if (written)
goto out;
}
/*
* After a write we want buffered reads to be sure to go to disk to get
* the new data. We invalidate clean cached page from the region we're
* about to write. We do this *before* the write so that we can return
* without clobbering -EIOCBQUEUED from ->direct_IO().
*/
written = invalidate_inode_pages2_range(mapping,
pos >> PAGE_SHIFT, end);
...
written = mapping->a_ops->direct_IO(iocb, from);
...
if (written > 0) {
pos += written;
write_len -= written;
//Interesting thing here, the file is expanded by the direct IO.
// we have to modify the size of the inode.
if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
i_size_write(inode, pos);
mark_inode_dirty(inode);
}
iocb->ki_pos = pos;
}
iov_iter_revert(from, write_len - iov_iter_count(from));
out:
return written;
}
blkdev_direct_IO
-> __blkdev_direct_IO_simple // Let's look at the simpler case.
---
...
struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs, *bvec;
...
if (nr_pages <= DIO_INLINE_BIO_VECS)
vecs = inline_vecs;
else {
vecs = kmalloc_array(nr_pages, sizeof(struct bio_vec),
GFP_KERNEL);
if (!vecs)
return -ENOMEM;
}
bio_init(&bio, vecs, nr_pages);
bio_set_dev(&bio, bdev);
bio.bi_iter.bi_sector = pos >> 9;
bio.bi_write_hint = iocb->ki_hint;
bio.bi_private = current;
bio.bi_end_io = blkdev_bio_end_io_simple;
bio.bi_ioprio = iocb->ki_ioprio;
// The most important thing here is to fill the bi_io_vec
/
| bv_page
bio->bi_io_vec [ bio_vec ] < bv_len
[ bio_vec ] | bv_offset
[ bio_vec ] \
...
bio_iov_iter_get_pages
-> iov_iter_get_pages
-> get_user_pages_fast
It gets and pins the pages behind the userland buffers.
These pages are then handed to the block layer directly,
so we can call this zero-copy.
Note: get_user_pages_fast does not guarantee that all of the requested pages will be
got and pinned.
ret = bio_iov_iter_get_pages(&bio, iter);
if (unlikely(ret))
return ret;
ret = bio.bi_iter.bi_size;
if (iov_iter_rw(iter) == READ) {
bio.bi_opf = REQ_OP_READ;
if (iter_is_iovec(iter))
should_dirty = true;
} else {
bio.bi_opf = dio_bio_write_op(iocb);
task_io_account_write(ret);
}
qc = submit_bio(&bio);
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (!READ_ONCE(bio.bi_private))
break;
if (!(iocb->ki_flags & IOCB_HIPRI) ||
!blk_poll(bdev_get_queue(bdev), qc))
io_schedule();
}
// we will sleep here to wait for the completion.
// the blkdev_bio_end_io_simple will wake up us.
__set_current_state(TASK_RUNNING);
bio_for_each_segment_all(bvec, &bio, i) {
if (should_dirty && !PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
put_page(bvec->bv_page);
}
if (vecs != inline_vecs)
kfree(vecs);
if (unlikely(bio.bi_status))
ret = blk_status_to_errno(bio.bi_status);
bio_uninit(&bio);
---
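From userspace, this whole path is triggered simply by opening the block device with
O_DIRECT and submitting buffers that respect the alignment rules. A minimal sketch
(the device path and sizes are examples):
---
/* Minimal userspace sketch: O_DIRECT read from a block device.  The buffer,
 * offset and length are aligned so the request can take the zero-copy path. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 4096;	/* multiple of the logical block size */
	void *buf = NULL;
	ssize_t ret;
	int fd = open("/dev/sda", O_RDONLY | O_DIRECT);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* buffer must be aligned; 4096 satisfies typical 512/4096 requirements */
	if (posix_memalign(&buf, 4096, len)) {
		perror("posix_memalign");
		return 1;
	}
	ret = pread(fd, buf, len, 0);	/* offset 0 is block aligned */
	if (ret < 0)
		perror("pread");
	else
		printf("read %zd bytes directly, bypassing the page cache\n", ret);

	free(buf);
	close(fd);
	return 0;
}
---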
Traditional suspend/resume
Runtime suspend/resume
* Once the subsystem-level suspend callback (or the driver suspend callback,
if invoked directly) has completed successfully for the given device, the PM
core regards the device as suspended, which need not mean that it has been
put into a low power state. It is supposed to mean, however, that the
device will not process data and will not communicate with the CPU(s) and
RAM until the appropriate resume callback is executed for it. The runtime
PM status of a device after successful execution of the suspend callback is
'suspended'.
Hooks in blk-legacy
__elv_add_request
-> blk_pm_add_request
---
if q->dev // support RPM
&& !(rq->rq_flags & RQF_PM) // not a PM command
&& q->nr_pending++ == 0
&& (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
pm_request_resume(q->dev) // start resume
---
elv_requeue_request
-> blk_pm_requeue_request
---
if (rq->q->dev && !(rq->rq_flags & RQF_PM))
rq->q->nr_pending--;
---
-> __elv_add_request()//ELEVATOR_INSERT_REQUEUE
__blk_put_request
-> blk_pm_put_request
---
if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
pm_runtime_mark_last_busy(rq->q->dev);
---
blk_peek_request
-> elv_next_request
-> iterate q->queue_head
if blk_pm_allow_request
return it
---
switch (rq->q->rpm_status) {
case RPM_RESUMING:
case RPM_SUSPENDING:
return rq->rq_flags & RQF_PM;
case RPM_SUSPENDED:
return false;
}
return true;
---
Don't process normal requests when queue is suspended
or in the process of suspending/resuming
The normal process of the runtime PM running in block layer is:
blk_pre_runtime_suspend
if q->nr_pending is zero
set q->rpm_status to RPM_SUSPENDING
|
v
sdev_runtime_suspend
-> pm->runtime_suspend
|
v
blk_post_runtime_suspend
-> set state to RPM_SUSPENDED
When new request is added:
__elv_add_request
-> blk_pm_add_request
---
if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
(q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
pm_request_resume(q->dev);
---
The resume process is started here.
Until the resume completes, requests will not be issued to the LLDD.
blk_peek_request
-> elv_next_request
---
list_for_each_entry(rq, &q->queue_head, queuelist) {
if (blk_pm_allow_request(rq))
return rq;
---
During the process of pm runtime resuming:
blk_pre_runtime_resume
-> set rpm_status to RPM_RESUMING
pm->runtime_resume
blk_post_runtime_resume
---
q->rpm_status = RPM_ACTIVE;
__blk_run_queue(q);
pm_runtime_mark_last_busy(q->dev);
pm_request_autosuspend(q->dev);
---
rpm_suspend // if RPM_AUTO
-> pm_runtime_autosuspend_expiration
-> last_busy = READ_ONCE(dev->power.last_busy);
It checks whether the device has been idle for some time;
if so, the suspend proceeds, otherwise the suspend_timer is set up.
The check depends on dev->power.last_busy,
which is updated around the blk-legacy layer;
the most important update is in blk_pm_put_request.
pm_suspend_timer_fn
---
if (expires > 0 && !time_after(expires, jiffies)) {
dev->power.timer_expires = 0;
rpm_suspend(dev, dev->power.timer_autosuspends ?
(RPM_ASYNC | RPM_AUTO) : RPM_ASYNC);
}
---
pm_runtime_put
-> __pm_runtime_idle //RPM_GET_PUT | RPM_ASYNC
---
if (rpmflags & RPM_GET_PUT) {
if (!atomic_dec_and_test(&dev->power.usage_count))
return 0;
}
might_sleep_if(!(rpmflags & RPM_ASYNC) && !dev->power.irq_safe);
spin_lock_irqsave(&dev->power.lock, flags);
//This spinlock will serialize all the things
retval = rpm_idle(dev, rpmflags);
spin_unlock_irqrestore(&dev->power.lock, flags);
---
rpm_idle
---
...
callback = RPM_GET_CALLBACK(dev, runtime_idle);
if (callback)
retval = __rpm_callback(callback, dev);
// __rpm_callback will unlock the dev->power.lock before invokes the
// driver's callback.
...
return retval ? retval : rpm_suspend(dev, rpmflags | RPM_AUTO);
---
scsi_runtime_idle always returns -EBUSY.
Let's look at rpm_suspend, which moves the device to RPM_SUSPENDED:
---
repeat:
retval = rpm_check_suspend_allowed(dev);
-> if dev->power.runtime_status == RPM_SUSPENDED, return 1
...
if (retval)
goto out;
...
/* Other scheduled or pending requests need to be canceled. */
pm_runtime_cancel_pending(dev);
if (dev->power.runtime_status == RPM_SUSPENDING) {
DEFINE_WAIT(wait);
...
/* Wait for the other suspend running in parallel with us. */
for (;;) {
prepare_to_wait(&dev->power.wait_queue, &wait,
TASK_UNINTERRUPTIBLE);
if (dev->power.runtime_status != RPM_SUSPENDING)
break;
spin_unlock_irq(&dev->power.lock);
schedule();
spin_lock_irq(&dev->power.lock);
}
finish_wait(&dev->power.wait_queue, &wait);
goto repeat;
}
__update_runtime_status(dev, RPM_SUSPENDING);
callback = RPM_GET_CALLBACK(dev, runtime_suspend);
dev_pm_enable_wake_irq_check(dev, true);
retval = rpm_callback(callback, dev);
if (retval)
goto fail;
no_callback:
__update_runtime_status(dev, RPM_SUSPENDED);
pm_runtime_deactivate_timer(dev);
if (dev->parent) {
parent = dev->parent;
atomic_add_unless(&parent->power.child_count, -1, 0);
}
wake_up_all(&dev->power.wait_queue);
---
Some storage controllers have DMA alignment requirements, which are often set
through blk_queue_dma_alignment, e.g. 512 bytes.
One of the usages of the request_queue's dma_alignment:
blk_rq_map_kern
---
do_copy = !blk_rq_aligned(q, addr, len) || object_is_on_stack(kbuf);
//unsigned int alignment = queue_dma_alignment(q) | q->dma_pad_mask;
//return !(addr & alignment) && !(len & alignment);
if (do_copy)
bio = bio_copy_kern(q, kbuf, len, gfp_mask, reading);
//New page will be allocated and copy data in it.
//When bio is done, the data will be copied back to the original buffer.
//Refer to bio_copy_kern_endio_read
else
bio = bio_map_kern(q, kbuf, len, gfp_mask);
//Add the page associated with the buffer into bio.
---
The callers of blk_rq_map_kern:
- __scsi_execute
- __nvme_submit_sync_cmd
Another similar interface is blk_rq_map_user_iov.
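A minimal sketch of the alignment check behind blk_rq_aligned() (the mask
combines dma_alignment and dma_pad_mask, as in the commented code above):
---
/* The buffer can be mapped directly (no bounce copy) only if both its
 * address and its length are aligned to (dma_alignment | dma_pad_mask). */
static bool buf_is_dma_aligned(unsigned long addr, unsigned long len,
			       unsigned int dma_alignment,
			       unsigned int dma_pad_mask)
{
	unsigned long mask = dma_alignment | dma_pad_mask;

	return !(addr & mask) && !(len & mask);
}
---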
The blocksize of the filesystem and the block device.
Block: The smallest unit writable by a disk or file system. Everything a file system does is
composed of operations done on blocks. A file system block is always the same size as or larger
(in integer multiples) than the disk block size.
The bdev_logical_block_size is the q->limits.logical_block_size.
Look at how nvme sets it.
__nvme_revalidate_disk
---
ns->lba_shift = id->lbaf[id->flbas & NVME_NS_FLBAS_LBA_MASK].ds;
...
nvme_update_disk_info
---
unsigned short bs = 1 << ns->lba_shift;
blk_mq_freeze_queue(disk->queue);
blk_integrity_unregister(disk);
blk_queue_logical_block_size(disk->queue, bs);
blk_queue_physical_block_size(disk->queue, bs);
blk_queue_io_min(disk->queue, bs);
---
---
The most important point here is that the filesystem blocksize is chosen at mkfs time.
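As a quick cross-check from user space, the values set above can be read back
through the BLKSSZGET/BLKPBSZGET ioctls (a small sketch; the device path is
just an example, blockdev --getss/--getpbsz report the same values):
---
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* BLKSSZGET, BLKPBSZGET */

int main(int argc, char **argv)
{
	int lbs = 0;
	unsigned int pbs = 0;
	int fd = open(argc > 1 ? argv[1] : "/dev/nvme0n1", O_RDONLY);

	if (fd < 0)
		return 1;
	ioctl(fd, BLKSSZGET, &lbs);	/* q->limits.logical_block_size */
	ioctl(fd, BLKPBSZGET, &pbs);	/* q->limits.physical_block_size */
	printf("logical=%d physical=%u\n", lbs, pbs);
	close(fd);
	return 0;
}
---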
What is the gap?
It is indicated by queue_virt_boundary.
For NVMe, the PRP descriptors require PAGE_SIZE-aligned segments:
page A+-----+ page A+-----+
| | \ PAGE_SIZE | | \ PAGE_SIZE
| | / | | /
page B+-----+ page B+-----+
| | \ PAGE_SIZE |_ _ _| > PAGE_SIZE/2
| | / | GAP |
page C+-----+ page C+-----+
| | \ PAGE_SIZE | | \ PAGE_SIZE
| | / | | /
+-----+ +-----+
So to handle IO that is not PAGE_SIZE aligned, we need to
split the IO into 3 parts as follows,
page A+-----+ page B+-----+ page C+-----+
| | \ PAGE_SIZE |_ _ _| > PAGE_SIZE/2 | | \ PAGE_SIZE
| | / | | /
+-----+ +-----+
This is done by blk_queue_split.
blk_queue_split
-> blk_bio_segment_split
---
bio_for_each_segment(bv, bio, iter) {
/*
* If the queue doesn't support SG gaps and adding this
* offset would create a gap, disallow it.
*/
if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
goto split;
....
}
split:
*segs = nsegs;
if (do_split) {
new = bio_split(bio, sectors, GFP_NOIO, bs);
if (new)
bio = new;
}
---
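The core of the gap check is simple (a sketch of the logic behind
bvec_gap_to_prev(); for NVMe the virt boundary mask is PAGE_SIZE - 1):
---
/* The new segment gaps against the previous one if it does not start on a
 * boundary, or if the previous segment does not end exactly on a boundary. */
static bool segments_gap(unsigned int prev_offset, unsigned int prev_len,
			 unsigned int next_offset,
			 unsigned long virt_boundary_mask)
{
	if (!virt_boundary_mask)
		return false;	/* the queue imposes no virt boundary limit */

	return (next_offset & virt_boundary_mask) ||
	       ((prev_offset + prev_len) & virt_boundary_mask);
}
---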
Other places that check this:
// the buffer may come from userspace and not aligned
blk_rq_map_user_iov
// don't merge bios or requests if will gap
bio_will_gap <- req_gap_back_merge <- ll_back_merge_fn
<- ll_merge_requests_fn
bvec_gap_to_prev <- bio_integrity_add_page
<- bio_add_pc_page
<- integrity_req_gap_back_merge
Before queue_virt_boundary was introduced, QUEUE_FLAG_SG_GAPS was used instead.
The flag was checked in the following places:
__bio_add_page
ll_merge_requests_fn
blk_rq_merge_ok
write amplification
|----| Write granularity (e.g 32K)
|----------------------------| Erase granularity (e.g 128K)
These are contiguous user data blocks.
If we want to write a 32K block inside an erase block, we have to
- read in the 128K of data and update the 32K inside it
- erase the block
- write the 128K back
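In other words, a 32K host write costs a 128K flash write; a back-of-the-envelope
helper (illustrative numbers only):
---
/* Worst-case write amplification for this pattern:
 * bytes written to flash / bytes written by the host = 128K / 32K = 4. */
static unsigned int write_amplification(unsigned int erase_granularity,
					unsigned int write_granularity)
{
	return erase_granularity / write_granularity;
}
---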
Wear leveling
A write can only occur to pages that have been erased, therefore host write commands
invoke flash erase cycles prior to writing to the flash. This write/erase cycling causes
cell wear, which imposes the limited write life. Host writes can target any location,
which can create hot-spots and premature wear in those locations.
Wear leveling is used to prevent the hot-spots.
Mapping
In most cases, the controller maintains a lookup table to translate the memory array physical
block address (PBA) to the logical block address (LBA) used by the host system. The controller's
wear-leveling algorithm determines which physical block to use each time data is programmed,
eliminating the relevance of the physical location of data and enabling data to be stored
anywhere within the memory array.
Selecting
The controller typically either writes to the available erased block with the lowest erase count
(dynamic wear leveling), or it selects an available target block with the lowest overall erase
count and erases it first if necessary (static wear leveling).
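A toy sketch of the dynamic policy described above (assumed data structures,
not any real FTL):
---
#include <stdbool.h>

struct flash_block {
	unsigned int erase_count;
	bool erased;			/* ready to be programmed */
};

/* Dynamic wear leveling: among the blocks that are already erased, pick the
 * one with the lowest erase count as the next write target. */
static int pick_write_block(const struct flash_block *blk, int nr_blocks)
{
	int best = -1;

	for (int i = 0; i < nr_blocks; i++) {
		if (!blk[i].erased)
			continue;
		if (best < 0 || blk[i].erase_count < blk[best].erase_count)
			best = i;
	}
	return best;	/* -1: no erased block left, GC has to run first */
}
---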
Garbage collection
Given that previously written-to blocks must be erased before they are able to receive data again,
the SSD controller must, for performance, actively pre-erase blocks so new write commands can always
get an empty block.
What is the discard command for ?
If the user or operating system erases a file (not just removes parts of it), the file
will typically be marked for deletion, but the actual contents on the disk are never
actually erased. Because of this, the SSD does not know that it can erase the LBAs
previously occupied by the file, so the SSD will keep including such LBAs in the
garbage collection.
Discard enables the operating system to tell an SSD which blocks of previously saved data are
no longer needed as a result of file deletions or volume formatting. When an LBA is
replaced by the OS, as with an overwrite of a file, the SSD knows that the original
LBA can be marked as stale or invalid and it will not save those blocks during Garbage
collection.
A simple example of SSD writes
(assume the application only writes in whole erase blocks):
|----| erase block
- free
o used
i invalid
|ooooo|-----|-----|-----|-----|
\__ __/ \__ __/
v v
File1 Reserved
When we write to File1,
RMW
.-----.
/ v
|iiiii|ooooo|-----|-----|-----|
\__ __/ \__ __/
v v
File1 Reserved
The original position of File1 will then be reclaimed.
If we delete File1 at the filesystem layer,
|-----|ooooo|-----|-----|-----|
\__ __/
v
Reserved
The SSD controller doesn't know that File1 has been deleted,
so it still thinks there is valid data in the block. If this
happens multiple times, we end up with,
|ooooo|ooooo|ooooo|ooooo|-----|
\__ __/ \__ __/ \__ __/
v v v
File2 File3 Reserved
Only two of them hold valid files (the filesystem knows
which blocks are free).
When we write data to File2 and File3 in parallel, the
SSD controller has to use the Reserved blocks. However, there
is only one in our case, so while one write is ongoing, the other
has to wait.
This is why SSDs become slower as they fill up.
With DISCARD support enabled in the filesystem, when a file is deleted, the filesystem
will tell the SSD controller that the associated blocks are invalid
and could be reclaimed. Then we would have,
|ooooo|-----|ooooo|-----|-----|
\__ __/ \__ __/ \__ __/
v v v
File2 File3 Reserved
Another useful reference on this topic: "Block layer discard requests".
Linux calls this operation DISCARD.
Different storage protocols use different names, e.g. TRIM (ATA), UNMAP (SBC), Deallocate (NVMe).
The merge restrictions for DISCARD are more relaxed, because the corresponding commands in the
underlying storage protocols support ranges.
For example:
- TRIM in ATA
Supports up to 64 ranges
16 bits worth of blocks per range
- UNMAP in SBC
Supports an implementation specific number of ranges
32 bits worth of blocks per range
- Deallocate in NVMe
Supports up to 256 ranges
32 bits worth of blocks per range
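Roughly how much a single command can deallocate with the limits above
(assuming 512-byte blocks; a back-of-the-envelope sketch):
---
/* max bytes per discard command = ranges * blocks per range * block size */
static unsigned long long max_discard_bytes(unsigned int nr_ranges,
					    unsigned int blocks_per_range_bits,
					    unsigned int block_size)
{
	return (unsigned long long)nr_ranges *
	       (1ULL << blocks_per_range_bits) * block_size;
}
/*
 * ATA TRIM:        max_discard_bytes(64, 16, 512)  ~   2 GiB per command
 * NVMe Deallocate: max_discard_bytes(256, 32, 512) ~ 512 TiB per command
 */
---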
Look at how nvme sets up the discard command.
nvme_setup_discard
---
unsigned short segments = blk_rq_nr_discard_segments(req), n = 0;
...
range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
if (!range)
return BLK_STS_RESOURCE;
__rq_for_each_bio(bio, req) {
u64 slba = nvme_block_nr(ns, bio->bi_iter.bi_sector);
u32 nlb = bio->bi_iter.bi_size >> ns->lba_shift;
if (n < segments) {
range[n].cattr = cpu_to_le32(0);
range[n].nlb = cpu_to_le32(nlb);
range[n].slba = cpu_to_le64(slba);
}
n++;
}
---
So discontiguous bios or requests can be merged together.
blk_mq_bio_list_merge
---
if (!blk_rq_merge_ok(rq, bio))
continue;
switch (blk_try_merge(rq, bio)) {
...
case ELEVATOR_DISCARD_MERGE:
merged = bio_attempt_discard_merge(q, rq, bio);
break;
default:
continue;
}
---
In blk_try_merge
---
if (req_op(rq) == REQ_OP_DISCARD &&
queue_max_discard_segments(rq->q) > 1)
return ELEVATOR_DISCARD_MERGE;
else if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
return ELEVATOR_BACK_MERGE;
else if (blk_rq_pos(rq) - bio_sectors(bio) == bio->bi_iter.bi_sector)
return ELEVATOR_FRONT_MERGE;
return ELEVATOR_NO_MERGE;
---
We can see that only the request op matters; the ranges need not be contiguous.
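For reference, the discard merge itself is trivial; a simplified sketch of
what bio_attempt_discard_merge() does (error handling and accounting
trimmed): the bio is just chained onto the request, and each bio later
becomes one range in nvme_setup_discard.
---
	unsigned short segments = blk_rq_nr_discard_segments(req);

	if (segments >= queue_max_discard_segments(q))
		goto no_merge;

	req->biotail->bi_next = bio;
	req->biotail = bio;
	req->__data_len += bio->bi_iter.bi_size;
	req->nr_phys_segments = segments + 1;
---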
DISCARD is throttled by WBT.
Look at the comment of the patch from Jens that limits DISCARD in WBT:
"
Throttle discards like we would any background write. Discards should
be background activity, so if they are impacting foreground IO, then
we will throttle them down.
"
In kyber, DISCARD is counted as KYBER_OTHER which has very low priority.
There is a danger with discard in filesystems:
the filesystem may well discard a set of sectors, then write new data to them once they are allocated to
a new file. It would be a serious mistake to reorder the new writes ahead of the discard operation,
causing the newly-written data to be lost.
Let's look at how individual filesystems handle this.
xlog_cil_committed
//the transaction log of the free-block operation is on disk
//these blocks can now be allocated by others
//so it is safe to discard these blocks
---
xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
ctx->start_lsn, abort);
xfs_extent_busy_sort(&ctx->busy_extents);
xfs_extent_busy_clear(mp, &ctx->busy_extents,
(mp->m_flags & XFS_MOUNT_DISCARD) && !abort);
---
list_for_each_entry_safe(busyp, n, list, list) {
...
if (do_discard && busyp->length &&
!(busyp->flags & XFS_EXTENT_BUSY_SKIP_DISCARD)) {
busyp->flags = XFS_EXTENT_BUSY_DISCARDED;
} else {
xfs_extent_busy_clear_one(mp, pag, busyp);
wakeup = true;
}
---
...
if (!list_empty(&ctx->busy_extents))
xlog_discard_busy_extents(mp, ctx);
---
xlog_discard_busy_extents
---
blk_start_plug(&plug);
list_for_each_entry(busyp, list, list) {
error = __blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
XFS_FSB_TO_BB(mp, busyp->length),
GFP_NOFS, 0, &bio);
...
}
if (bio) {
bio->bi_private = ctx;
bio->bi_end_io = xlog_discard_endio;
submit_bio(bio);
} else {
xlog_discard_endio_work(&ctx->discard_endio_work);
}
blk_finish_plug(&plug);
---
This function is called by the jbd2 layer once the commit has finished,
so we know we can free the blocks that were released with that commit.
ext4_process_freed_data
---
if (test_opt(sb, DISCARD)) {
list_for_each_entry(entry, &freed_data_list, efd_list) {
err = ext4_issue_discard(sb, entry->efd_group,
entry->efd_start_cluster,
entry->efd_count,
&discard_bio);
...
if (discard_bio) {
submit_bio_wait(discard_bio);
bio_put(discard_bio);
}
}
---
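Besides the online "-o discard" path shown above, discards can also be issued
in batches from user space through the FITRIM ioctl, which is what fstrim(8)
uses (a minimal sketch; the mount point is just an example):
---
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* struct fstrim_range, FITRIM */

int main(int argc, char **argv)
{
	struct fstrim_range range;
	int fd = open(argc > 1 ? argv[1] : "/mnt", O_RDONLY);

	if (fd < 0)
		return 1;
	memset(&range, 0, sizeof(range));
	range.len = ULLONG_MAX;		/* trim the whole filesystem */
	range.minlen = 0;		/* no minimum extent length */
	if (ioctl(fd, FITRIM, &range) == 0)
		printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	close(fd);
	return 0;
}
---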
https://lwn.net/Articles/347511/
At the ATA protocol level, a discard request is implemented by a "TRIM" command sent to the device.
For reasons unknown to your editor, the protocol committee designed TRIM as a non-queued command.
That means that, before sending a TRIM command to the device, the block layer must first wait for
all outstanding I/O operations on that device to complete; no further operations can be started
until the TRIM command completes. So every TRIM operation stalls the request queue. Even if TRIM
were completely free, its non-queued nature would impose a significant I/O performance cost. (It's
worth noting that the SCSI equivalent to TRIM is a tagged command which doesn't suffer from this
problem).
With current SSDs, TRIM appears to be anything but free. Mark Lord has measured regular delays of
hundreds of milliseconds. Delays on that scale would be most unwelcome on a rotating storage device.
On an SSD, hundred-millisecond latencies are simply intolerable.
In a word, discard is not free.
Someone complained that
XFS has had async discard support, but it has been problematic for our
fleet. We were seeing bursts of large discard requests caused by async
discard in XFS. This resulted in degraded drive performance increasing
latency for dependent services.
And proposed an alternative: the filesystem layer could reuse blocks that have just been freed.
|ooooo|-----|-----|-----|-----|
\__ __/ \__ __/
v v
File1 Reserved
Delete File1 and then create File2,
|ooooo|-----|-----|-----|-----|
\__ __/ \__ __/
v v
File2 Reserved