scsi

Concepts
Implementation

Concepts


SCSI Model


 Initiator                                Target
 +------+                            +-------------------+
 |      |                            |            +----+ |
 |      |                            |         ,->| LU | |
 |      |                            |        /   +----+ |
 |   +-----+---+   Connecting    +---+ +---+ /    +----+ |
 |   | HBA | P |- - - - - - - - -| P |-| R |- - ->| LU | |
 |   +-----+---+                 +---+ +---+      +----+ |
 |      |                            |             / \   |
 +------+                            +------------/   \--+
                                                /       \ 
                                              /           \
                                            /               \
                                          /                   \
                                         +---------------------+
                                         |    Task Manager     |
                                    [T]  |    +------------+   |
                                  - - - - - ->| [T]        |   |
                                         |    | [T]        |   |
                                         |    | [T] [T] [T]|   | 
                                         |    +------------+   |
                                         |    Device Server    |
                                         |    +------------+   |
                                         |    |            |   |
                                         |    +------------+   |
                                         +-----------------------

 Task/T       - scsi command
 LU           - logic unit, SCSI targets have logical units that provide the processing context for SCSI commands.
                The work of the logical unit is split between two different functionsthe device server and the task
                manager.
 Task Manager - The task manager is the work scheduler for the logical unit, determining the order in which commands
                are processed in the queue.
 Device Server- The device server executes commands received from initiators and is responsible for detecting and
                reporting errors that might occur
 Connecting   - sas/fc/iscsi
 P            - connecting port, a NIC for iscsi
 R            - task router
Nexus Object
The nexus object describes the initiator/storage communication relationship.
The type of nexus object used determines the number of concurrent commands that can be pending at any time. An I_T
nexus allows only a single command between an initiator and a specific target. An I_T_L nexus allows a single command
between an initiator and a specific logical unit. An I_T_L_Q nexus allows many possible commands to be pending, as
long as the commands are tagged.
Tagged Command Queuing
The most important feature of tagging in SCSI is tagged command queuing (T CQ), a mechanism that allows the logical
unit's task manager to reorder tasks to optimize the performance of a storage device or subsystem.Tagged command queuing was developed to optimize the performance of mechanical components in disk drives,
particularly the disk arms and actuators. The basic idea is to reorder a group of commands to reduce the overall latency
involved in seeking tracks on disk platters.
Assume there are 20 tagged tasks in a task set, each with a directive to read or write data across a random distribution
of tracks on disk media. W ithout the ability to rearrange tasks, the seek time latency would be the average seek time for
the drive. Using command queuing, the tasks could be structured so the actuator moves the minimal amount for each
task as it moves from one task's track to the track of its nearest neighbor.

The logical unit number (LUN) identifies a specific logical unit.
But LUN is not in the scsi cdb. How to use it to route to specific LU ?
This is done by scsi transport layer.
Take iscsi as example:
/* iSCSI PDU Header */
struct iscsi_scsi_req {
	uint8_t opcode;
	uint8_t flags;
	__be16 rsvd2;
	uint8_t hlength;
	uint8_t dlength[3];
	struct scsi_lun lun;
	itt_t	 itt;	/* Initiator Task Tag */
	__be32 data_length;
	__be32 cmdsn;
	__be32 exp_statsn;
	uint8_t cdb[ISCSI_CDB_SIZE];	/* SCSI Command Block */
	/* Additional Data (Command Dependent) */
};
static int iscsi_prep_scsi_cmd_pdu(struct iscsi_task *task)
{
	...
	hdr->opcode = ISCSI_OP_SCSI_CMD;
	hdr->flags = ISCSI_ATTR_SIMPLE;
	int_to_scsilun(sc->device->lun, &hdr->lun);//scsi_cmnd->device->lun
	task->lun = hdr->lun;
	...
}

Implementation


LUN Probe

SCSI_SCAN_MANUAL

scsi_scan
  -> scsi_scan_host_selected
    -> scsi_scan_channel
      -> __scsi_scan_target
        -> scsi_probe_and_add_lun //Scan LUN 0, if there is some response, scan further.
            ->scsi_alloc_sdev //the sdev associated with LUN 0
                -> sdev->request_queue = scsi_mq_alloc_queue
                -> shost->hostt->slave_alloc
            -> scsi_probe_lun //probe a single LUN using a SCSI INQUIRY
                -> scsi_execute
                There is still no sd driver loaded now, how
                to setup the command ?
                The req op is REQ_OP_SCSI_OUT/IN here
                scsi_mq_prep_fn
                  -> scsi_setup_cmnd
                    -> scsi_setup_scsi_cmnd // blk_rq_is_scsi
                
            -> scsi_add_lun
              -> set state to SDEV_RUNNING
              -> shost->hostt->slave_configure(sdev);
                -> scsi_sysfs_add_sdev
                ...
                  -> really_probe
                    -> sd_probe
                      -> async_schedule_domain(sd_probe_async, sdkp, &scsi_sd_probe_domain)
        -> scsi_report_lun_scan
          -> send out REPORT_LUNS to LUN0
          -> scsi_probe_and_add_lun based on REPORT_LUNS's result

sd_probe_async()
  -> add_disk()
    -> register_disk()
      -> bdget_disk() // other partitions will be scanned here

Completion

There are two ways to get scmds completed:

During this, the scmd->result will be set.
scsi_cmnd->result
(((result) >> 1) & 0x7f)     status byte = set from target device
(((result) >> 8) & 0xff)     msg_byte    = return status from host adapter itself.
(((result) >> 16) & 0xff)    host_byte   = set by low-level driver to indicate status.
(((result) >> 24) & 0xff)    driver_byte = set by mid-level.
scsi_decide_disposition will check them to decide what to do next.
It will also check the sense data status byte is CHECK_CONDITION and sense data
is valid.
scsi_decide_disposition
--
case CHECK_CONDITION:
        rtn = scsi_check_sense(scmd);
        if (rtn == NEEDS_RETRY)
            goto maybe_retry;
        /* if rtn == FAILED, we have no sense information;
         * returning FAILED will wake the error handler thread
         * to collect the sense and redo the decide
         * disposition */
        return rtn;
--
4 results will be returned by scsi_decide_disposition Next, scsi_finish_command will check the whether the scmd has been completed correctly.
scsi_finish_command
--
    good_bytes = scsi_bufflen(cmd);
    ..
    scsi_io_completion(cmd, good_bytes);
--

scsi_io_completion
--
    /*
     * special case: failed zero length commands always need to
     * drop down into the retry code. Otherwise, if we finished
     * all bytes in the request we are done now.
     */
    if (!(blk_rq_bytes(req) == 0 && error) &&
        !scsi_end_request(req, error, good_bytes, 0))
        return;
--
It will check whether the scmd has been finished completely.
If not, further action will be done based on the scsi_cmnd->result and sense data.

Error Handler

The error handling procedure is done in scsi EH thread context.
Before the acutal procedure, it need to take over the host.
Take over scsi host

How SCSI EH work
Quote from the Documentation/scsi/scsi_eh.txt
If eh_strategy_handler() is not present, SCSI midlayer takes charge
of driving error handling.  EH's goals are two - make LLDD, host and
device forget about timed out scmds and make them ready for new
commands.  A scmd is said to be recovered if the scmd is forgotten by
lower layers and lower layers are ready to process or fail the scmd
again.

 To achieve these goals, EH performs recovery actions with increasing
severity.  Some actions are performed by issuing SCSI commands and
others are performed by invoking one of the following fine-grained
hostt EH callbacks.  Callbacks may be omitted and omitted ones are
considered to fail always.

int (* eh_abort_handler)(struct scsi_cmnd *);
int (* eh_device_reset_handler)(struct scsi_cmnd *);
int (* eh_bus_reset_handler)(struct scsi_cmnd *);
int (* eh_host_reset_handler)(struct scsi_cmnd *);

 Higher-severity actions are taken only when lower-severity actions
cannot recover some of failed scmds.  Also, note that failure of the
highest-severity action means EH failure and results in offlining of
all unrecovered devices.

More details we could refer to Documentation/scsi/scsi_eh.txt.


Here, let's talk about a theme below:
When we issue a r/w command to scsi host adapter, we will do the dma map for the sglist (usually stream mappings). When the command times out, can we complete the command directly ? In the other words, can we unmap the DMA mappings directly ?
The answer should be NO, given that the mappings are stream ones, so they will be unmapped when the request is completed. And then the DMA resource maybe used by other context. If this DMA resource is still active in the scsi host adapter, memory corruption will come up.
How does the SCSI EH handle this ? Let's look into it
enum blk_eh_timer_return scsi_times_out(struct request *req)
{
    struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
    enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
    struct Scsi_Host *host = scmd->device->host;
...
    if (host->hostt->eh_timed_out)
        rtn = host->hostt->eh_timed_out(scmd);
    // eh_timed_out is usually NULL
    if (rtn == BLK_EH_NOT_HANDLED) {
    // hand over scmd to abort_work
        if (scsi_abort_command(scmd) != SUCCESS) {
            // if abort_work has been scheduled, hand over to EH
            set_host_byte(scmd, DID_TIME_OUT);
            scsi_eh_scmd_add(scmd);
        }
    }
    // BLK_EH_NOT_HANDLED will be always returned.
    return rtn;
}
timeout path will not complete the request.
Let's look at how does abort work handle it.
void
scmd_eh_abort_handler(struct work_struct *work)
{
    ...
        rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
        if (rtn == SUCCESS) {
            /*If the scmd is aborted successfully,
              the DMA mapping is cleaned. Right now, we could finish/reissue it safely */
            set_host_byte(scmd, DID_TIME_OUT);
            if (scsi_host_eh_past_deadline(sdev->host)) {
                ...
            } else if (!scsi_noretry_cmd(scmd) &&
                (++scmd->retries <= scmd->allowed)) {
                SCSI_LOG_ERROR_RECOVERY(3,
                    scmd_printk(KERN_WARNING, scmd,
                            "retry aborted command\n"));
                scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
                return;
            } else {
                ...
                scsi_finish_command(scmd);
                return;
            }
        }
        ...
    // If abort fails, hand over to EH
    scsi_eh_scmd_add(scmd);
}
scmd_eh_abort_handler will not complete the request if abort fails, but hand over it to SCSI EH.
scsi_error_handler
  -> scsi_unjam_host
    -> scsi_eh_get_sense //will not work on the scmd with failed aborting 
    -> scsi_eh_ready_devs
void scsi_eh_ready_devs(struct Scsi_Host *shost,
            struct list_head *work_q,
            struct list_head *done_q)
{
    if (!scsi_eh_stu(shost, work_q, done_q))
        if (!scsi_eh_bus_device_reset(shost, work_q, done_q))
            if (!scsi_eh_target_reset(shost, work_q, done_q))
                if (!scsi_eh_bus_reset(shost, work_q, done_q))
                    if (!scsi_eh_host_reset(shost, work_q, done_q))
                        scsi_eh_offline_sdevs(work_q,
                                      done_q);
}

The scsi_eh_stu will also not work for scmd with failed aborted.
static int scsi_eh_stu(struct Scsi_Host *shost,
                  struct list_head *work_q,
                  struct list_head *done_q)
{
...
        stu_scmd = NULL;
        list_for_each_entry(scmd, work_q, eh_entry)
            if (scmd->device == sdev && SCSI_SENSE_VALID(scmd) &&
                scsi_check_sense(scmd) == FAILED ) {
                stu_scmd = scmd;
                break;
            }

        if (!stu_scmd)
            continue;
...
        if (!scsi_eh_try_stu(stu_scmd)) {
...
}
SCSI EH will try bus_device_reset/target_reset/bus_reset/host_reset.
If succeeds, the DMA mapping must could be cleaned (the target will also be operational), then we could finish the command or reissue safely.
If all of them fail, it indicates the target or host adapter is totally dead. It should be ok to complete the request.
void scsi_eh_flush_done_q(struct list_head *done_q)
{
    struct scsi_cmnd *scmd, *next;

    list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
        list_del_init(&scmd->eh_entry);
        if (scsi_device_online(scmd->device) &&
            !scsi_noretry_cmd(scmd) &&
            (++scmd->retries <= scmd->allowed)) {
                scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
        } else {
            /*
             * If just we got sense for the device (called
             * scsi_eh_get_sense), scmd->result is already
             * set, do not set DRIVER_TIMEOUT.
             */
            if (!scmd->result)
                scmd->result |= (DRIVER_TIMEOUT << 24);
            scsi_finish_command(scmd);
        }
    }
}