SCSI Model
Initiator Target
+------+ +-------------------+
| | | +----+ |
| | | ,->| LU | |
| | | / +----+ |
| +-----+---+ Connecting +---+ +---+ / +----+ |
| | HBA | P |- - - - - - - - -| P |-| R |- - ->| LU | |
| +-----+---+ +---+ +---+ +----+ |
| | | / \ |
+------+ +------------/ \--+
/ \
/ \
/ \
/ \
+---------------------+
| Task Manager |
[T] | +------------+ |
- - - - - ->| [T] | |
| | [T] | |
| | [T] [T] [T]| |
| +------------+ |
| Device Server |
| +------------+ |
| | | |
| +------------+ |
+-----------------------
Task/T - scsi command
LU - logic unit, SCSI targets have logical units that provide the processing context for SCSI commands.
The work of the logical unit is split between two different functionsthe device server and the task
manager.
Task Manager - The task manager is the work scheduler for the logical unit, determining the order in which commands
are processed in the queue.
Device Server- The device server executes commands received from initiators and is responsible for detecting and
reporting errors that might occur
Connecting - sas/fc/iscsi
P - connecting port, a NIC for iscsi
R - task router
Nexus Object
The nexus object describes the initiator/storage communication relationship.
The type of nexus object used determines the number of concurrent commands that can be pending at any time. An I_T
nexus allows only a single command between an initiator and a specific target. An I_T_L nexus allows a single command
between an initiator and a specific logical unit. An I_T_L_Q nexus allows many possible commands to be pending, as
long as the commands are tagged.
Tagged Command Queuing
The most important feature of tagging in SCSI is tagged command queuing (T CQ), a mechanism that allows the logical
unit's task manager to reorder tasks to optimize the performance of a storage device or subsystem.Tagged command queuing was developed to optimize the performance of mechanical components in disk drives,
particularly the disk arms and actuators. The basic idea is to reorder a group of commands to reduce the overall latency
involved in seeking tracks on disk platters.
Assume there are 20 tagged tasks in a task set, each with a directive to read or write data across a random distribution
of tracks on disk media. W ithout the ability to rearrange tasks, the seek time latency would be the average seek time for
the drive. Using command queuing, the tasks could be structured so the actuator moves the minimal amount for each
task as it moves from one task's track to the track of its nearest neighbor.
The logical unit number (LUN) identifies a specific logical unit.
But LUN is not in the scsi cdb. How to use it to route to specific LU ?
This is done by scsi transport layer.
Take iscsi as example:
/* iSCSI PDU Header */
struct iscsi_scsi_req {
uint8_t opcode;
uint8_t flags;
__be16 rsvd2;
uint8_t hlength;
uint8_t dlength[3];
struct scsi_lun lun;
itt_t itt; /* Initiator Task Tag */
__be32 data_length;
__be32 cmdsn;
__be32 exp_statsn;
uint8_t cdb[ISCSI_CDB_SIZE]; /* SCSI Command Block */
/* Additional Data (Command Dependent) */
};
static int iscsi_prep_scsi_cmd_pdu(struct iscsi_task *task)
{
...
hdr->opcode = ISCSI_OP_SCSI_CMD;
hdr->flags = ISCSI_ATTR_SIMPLE;
int_to_scsilun(sc->device->lun, &hdr->lun);//scsi_cmnd->device->lun
task->lun = hdr->lun;
...
}
SCSI_SCAN_MANUAL
scsi_scan
-> scsi_scan_host_selected
-> scsi_scan_channel
-> __scsi_scan_target
-> scsi_probe_and_add_lun //Scan LUN 0, if there is some response, scan further.
->scsi_alloc_sdev //the sdev associated with LUN 0
-> sdev->request_queue = scsi_mq_alloc_queue
-> shost->hostt->slave_alloc
-> scsi_probe_lun //probe a single LUN using a SCSI INQUIRY
-> scsi_execute
There is still no sd driver loaded now, how
to setup the command ?
The req op is REQ_OP_SCSI_OUT/IN here
scsi_mq_prep_fn
-> scsi_setup_cmnd
-> scsi_setup_scsi_cmnd // blk_rq_is_scsi
-> scsi_add_lun
-> set state to SDEV_RUNNING
-> shost->hostt->slave_configure(sdev);
-> scsi_sysfs_add_sdev
...
-> really_probe
-> sd_probe
-> async_schedule_domain(sd_probe_async, sdkp, &scsi_sd_probe_domain)
-> scsi_report_lun_scan
-> send out REPORT_LUNS to LUN0
-> scsi_probe_and_add_lun based on REPORT_LUNS's result
sd_probe_async()
-> add_disk()
-> register_disk()
-> bdget_disk() // other partitions will be scanned here
There are two ways to get scmds completed:
During this, the scmd->result will be set.
scsi_cmnd->result
(((result) >> 1) & 0x7f) status byte = set from target device
(((result) >> 8) & 0xff) msg_byte = return status from host adapter itself.
(((result) >> 16) & 0xff) host_byte = set by low-level driver to indicate status.
(((result) >> 24) & 0xff) driver_byte = set by mid-level.
scsi_decide_disposition will check them to decide what to do next.
It will also check the sense data status byte is CHECK_CONDITION and sense data
is valid.
scsi_decide_disposition
--
case CHECK_CONDITION:
rtn = scsi_check_sense(scmd);
if (rtn == NEEDS_RETRY)
goto maybe_retry;
/* if rtn == FAILED, we have no sense information;
* returning FAILED will wake the error handler thread
* to collect the sense and redo the decide
* disposition */
return rtn;
--
4 results will be returned by scsi_decide_disposition
Next, scsi_finish_command will check the whether the scmd has been completed
correctly.
__scsi_queue_insert
-> scsi_set_blocked
--
case SCSI_MLQUEUE_DEVICE_BUSY:
case SCSI_MLQUEUE_EH_RETRY:
atomic_set(&device->device_blocked,
device->max_device_blocked);
--
-> blk_requeue_request
scsi_finish_command
--
good_bytes = scsi_bufflen(cmd);
..
scsi_io_completion(cmd, good_bytes);
--
scsi_io_completion
--
/*
* special case: failed zero length commands always need to
* drop down into the retry code. Otherwise, if we finished
* all bytes in the request we are done now.
*/
if (!(blk_rq_bytes(req) == 0 && error) &&
!scsi_end_request(req, error, good_bytes, 0))
return;
--
It will check whether the scmd has been finished completely.
If not, further action will be done based on the scsi_cmnd->result and sense data.
The error handling procedure is done in scsi EH thread context.
Before the acutal procedure, it need to take over the host.
Take over scsi host
How SCSI EH work
scsi_queue_rq
-> scsi_host_queue_ready
-> scsi_host_in_recovery
if yes, goto out_dec_target_busy
-> scsi_mq_prep_fn // if !RQF_DONTPREP
-> scsi_dispatch_command
scsi_queue_rq
-> scsi_host_queue_ready
--
busy = atomic_inc_return(&shost->host_busy) - 1;
...
if (shost->can_queue > 0 && busy >= shost->can_queue)
goto starved;
--
shost->host_failed means the scmds entering into EH through scsi_eh_scmd_add.
Even through we say "shost->host_busy == shost->host_failed", the LLDD cannot be
really quiescent, the timeout path may not clean up the
scmds on the target, it may be still active there, the irq commpletion could
occur at any time. Currently, we will try to abort the scmd firstly in the
timeout path, but it still may fail.
On the other hand, all such completions are ignored as the scmds have been marked
completed by the timeout path (scsi_times_out usually return BLK_EH_NOT_HANDLED).
Quote from the Documentation/scsi/scsi_eh.txt
If eh_strategy_handler() is not present, SCSI midlayer takes charge
of driving error handling. EH's goals are two - make LLDD, host and
device forget about timed out scmds and make them ready for new
commands. A scmd is said to be recovered if the scmd is forgotten by
lower layers and lower layers are ready to process or fail the scmd
again.
To achieve these goals, EH performs recovery actions with increasing
severity. Some actions are performed by issuing SCSI commands and
others are performed by invoking one of the following fine-grained
hostt EH callbacks. Callbacks may be omitted and omitted ones are
considered to fail always.
int (* eh_abort_handler)(struct scsi_cmnd *);
int (* eh_device_reset_handler)(struct scsi_cmnd *);
int (* eh_bus_reset_handler)(struct scsi_cmnd *);
int (* eh_host_reset_handler)(struct scsi_cmnd *);
Higher-severity actions are taken only when lower-severity actions
cannot recover some of failed scmds. Also, note that failure of the
highest-severity action means EH failure and results in offlining of
all unrecovered devices.
More details we could refer to Documentation/scsi/scsi_eh.txt.
Here, let's talk about a theme below:
When we issue a r/w command to scsi host adapter, we will do the dma map for the
sglist (usually stream mappings). When the command times out, can we complete
the command directly ? In the other words, can we unmap the DMA mappings
directly ?
The answer should be NO, given that the mappings are
stream ones, so they will be unmapped when the request is completed. And then
the DMA resource maybe used by other context. If this
DMA resource is still active in the scsi host adapter, memory corruption will
come up.
How does the SCSI EH handle this ? Let's look into it
enum blk_eh_timer_return scsi_times_out(struct request *req)
{
struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
struct Scsi_Host *host = scmd->device->host;
...
if (host->hostt->eh_timed_out)
rtn = host->hostt->eh_timed_out(scmd);
// eh_timed_out is usually NULL
if (rtn == BLK_EH_NOT_HANDLED) {
// hand over scmd to abort_work
if (scsi_abort_command(scmd) != SUCCESS) {
// if abort_work has been scheduled, hand over to EH
set_host_byte(scmd, DID_TIME_OUT);
scsi_eh_scmd_add(scmd);
}
}
// BLK_EH_NOT_HANDLED will be always returned.
return rtn;
}
timeout path will not complete the request.
Let's look at how does abort work handle it.
void
scmd_eh_abort_handler(struct work_struct *work)
{
...
rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
if (rtn == SUCCESS) {
/*If the scmd is aborted successfully,
the DMA mapping is cleaned. Right now, we could finish/reissue it safely */
set_host_byte(scmd, DID_TIME_OUT);
if (scsi_host_eh_past_deadline(sdev->host)) {
...
} else if (!scsi_noretry_cmd(scmd) &&
(++scmd->retries <= scmd->allowed)) {
SCSI_LOG_ERROR_RECOVERY(3,
scmd_printk(KERN_WARNING, scmd,
"retry aborted command\n"));
scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
return;
} else {
...
scsi_finish_command(scmd);
return;
}
}
...
// If abort fails, hand over to EH
scsi_eh_scmd_add(scmd);
}
scmd_eh_abort_handler will not complete the request if abort fails, but hand
over it to SCSI EH.
scsi_error_handler
-> scsi_unjam_host
-> scsi_eh_get_sense //will not work on the scmd with failed aborting
-> scsi_eh_ready_devs
void scsi_eh_ready_devs(struct Scsi_Host *shost,
struct list_head *work_q,
struct list_head *done_q)
{
if (!scsi_eh_stu(shost, work_q, done_q))
if (!scsi_eh_bus_device_reset(shost, work_q, done_q))
if (!scsi_eh_target_reset(shost, work_q, done_q))
if (!scsi_eh_bus_reset(shost, work_q, done_q))
if (!scsi_eh_host_reset(shost, work_q, done_q))
scsi_eh_offline_sdevs(work_q,
done_q);
}
The scsi_eh_stu will also not work for scmd with failed aborted.
static int scsi_eh_stu(struct Scsi_Host *shost,
struct list_head *work_q,
struct list_head *done_q)
{
...
stu_scmd = NULL;
list_for_each_entry(scmd, work_q, eh_entry)
if (scmd->device == sdev && SCSI_SENSE_VALID(scmd) &&
scsi_check_sense(scmd) == FAILED ) {
stu_scmd = scmd;
break;
}
if (!stu_scmd)
continue;
...
if (!scsi_eh_try_stu(stu_scmd)) {
...
}
SCSI EH will try bus_device_reset/target_reset/bus_reset/host_reset.
If succeeds, the DMA mapping must could be cleaned (the target will also be
operational), then we could finish the command or reissue safely.
If all of them fail, it indicates the target or host adapter is totally dead.
It should be ok to complete the request.
void scsi_eh_flush_done_q(struct list_head *done_q)
{
struct scsi_cmnd *scmd, *next;
list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
list_del_init(&scmd->eh_entry);
if (scsi_device_online(scmd->device) &&
!scsi_noretry_cmd(scmd) &&
(++scmd->retries <= scmd->allowed)) {
scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
} else {
/*
* If just we got sense for the device (called
* scsi_eh_get_sense), scmd->result is already
* set, do not set DRIVER_TIMEOUT.
*/
if (!scmd->result)
scmd->result |= (DRIVER_TIMEOUT << 24);
scsi_finish_command(scmd);
}
}
}