scsi

Concepts

SCSI Model


 Initiator                                Target
 +------+                            +-------------------+
 |      |                            |            +----+ |
 |      |                            |         ,->| LU | |
 |      |                            |        /   +----+ |
 |   +-----+---+   Connecting    +---+ +---+ /    +----+ |
 |   | HBA | P |- - - - - - - - -| P |-| R |- - ->| LU | |
 |   +-----+---+                 +---+ +---+      +----+ |
 |      |                            |             / \   |
 +------+                            +------------/   \--+
                                                /       \ 
                                              /           \
                                            /               \
                                          /                   \
                                         +---------------------+
                                         |    Task Manager     |
                                    [T]  |    +------------+   |
                                  - - - - - ->| [T]        |   |
                                         |    | [T]        |   |
                                         |    | [T] [T] [T]|   | 
                                         |    +------------+   |
                                         |    Device Server    |
                                         |    +------------+   |
                                         |    |            |   |
                                         |    +------------+   |
                                         +---------------------+

 Task/T       - SCSI command
 LU           - logical unit. SCSI targets have logical units that provide the processing context for SCSI commands.
                The work of the logical unit is split between two functions: the device server and the task
                manager.
 Task Manager - The task manager is the work scheduler for the logical unit, determining the order in which commands
                in the queue are processed.
 Device Server- The device server executes commands received from initiators and is responsible for detecting and
                reporting errors that might occur.
 Connecting   - the interconnect: SAS/FC/iSCSI
 P            - connecting port, e.g. a NIC for iSCSI
 R            - task router

Nexus Object
The nexus object describes the initiator/storage communication relationship.

Initiator/target (an I_T nexus)
Initiator/target/LUN (an I_T_L nexus)
Initiator/target/LUN/tag (an I_T_L_Q nexus)

The type of nexus object used determines the number of concurrent commands that can be pending at any time. An I_T
nexus allows only a single command between an initiator and a specific target. An I_T_L nexus allows a single command
between an initiator and a specific logical unit. An I_T_L_Q nexus allows many possible commands to be pending, as
long as the commands are tagged.

Tagged Command Queuing

The most important feature of tagging in SCSI is tagged command queuing (TCQ), a mechanism that allows the logical
unit's task manager to reorder tasks to optimize the performance of a storage device or subsystem.
Tagged command queuing was developed to optimize the performance of mechanical components in disk drives,
particularly the disk arms and actuators. The basic idea is to reorder a group of commands to reduce the overall latency
involved in seeking tracks on disk platters.
Assume there are 20 tagged tasks in a task set, each with a directive to read or write data across a random distribution
of tracks on disk media. Without the ability to rearrange tasks, the seek time latency would be the average seek time for
the drive. Using command queuing, the tasks could be structured so the actuator moves the minimal amount for each
task as it moves from one task's track to the track of its nearest neighbor.

The logical unit number (LUN) identifies a specific logical unit.
However, the LUN is not carried in the SCSI CDB, so how is a command routed to a specific LU?
This is done by the SCSI transport layer.
Take iSCSI as an example, where the LUN travels in the PDU header:

/* iSCSI PDU Header */
struct iscsi_scsi_req {
	uint8_t opcode;
	uint8_t flags;
	__be16 rsvd2;
	uint8_t hlength;
	uint8_t dlength[3];
	struct scsi_lun lun;
	itt_t	 itt;	/* Initiator Task Tag */
	__be32 data_length;
	__be32 cmdsn;
	__be32 exp_statsn;
	uint8_t cdb[ISCSI_CDB_SIZE];	/* SCSI Command Block */
	/* Additional Data (Command Dependent) */
};
static int iscsi_prep_scsi_cmd_pdu(struct iscsi_task *task)
{
	...
	hdr->opcode = ISCSI_OP_SCSI_CMD;
	hdr->flags = ISCSI_ATTR_SIMPLE;
	int_to_scsilun(sc->device->lun, &hdr->lun);//scsi_cmnd->device->lun
	task->lun = hdr->lun;
	...
}

Implementation

LUN Probe

SCSI_SCAN_MANUAL

scsi_scan
  -> scsi_scan_host_selected
    -> scsi_scan_channel
      -> __scsi_scan_target
        -> scsi_probe_and_add_lun //Scan LUN 0, if there is some response, scan further.
            ->scsi_alloc_sdev //the sdev associated with LUN 0
                -> sdev->request_queue = scsi_mq_alloc_queue
                -> shost->hostt->slave_alloc
            -> scsi_probe_lun //probe a single LUN using a SCSI INQUIRY
                -> scsi_execute
                Note: the sd driver is not bound at this point, so how
                is the command set up?
                The req op is REQ_OP_SCSI_OUT/IN here:
                scsi_mq_prep_fn
                  -> scsi_setup_cmnd
                    -> scsi_setup_scsi_cmnd // blk_rq_is_scsi
                
            -> scsi_add_lun
              -> set state to SDEV_RUNNING
              -> shost->hostt->slave_configure(sdev);
              -> scsi_sysfs_add_sdev
                ...
                  -> really_probe
                    -> sd_probe
                      -> async_schedule_domain(sd_probe_async, sdkp, &scsi_sd_probe_domain)
        -> scsi_report_lun_scan
          -> send out REPORT_LUNS to LUN0
          -> scsi_probe_and_add_lun based on REPORT_LUNS's result

sd_probe_async()
  -> add_disk()
    -> register_disk()
      -> bdget_disk() // other partitions will be scanned here

Completion

There are two ways for a scmd to complete:

The LLDD completes the scmd by calling scsi_done.

The block layer times the scmd out.

In either case, scmd->result is set.

scsi_cmnd->result
(((result) >> 1) & 0x7f)     status byte = set from target device
(((result) >> 8) & 0xff)     msg_byte    = return status from host adapter itself.
(((result) >> 16) & 0xff)    host_byte   = set by low-level driver to indicate status.
(((result) >> 24) & 0xff)    driver_byte = set by mid-level.

scsi_decide_disposition checks these fields to decide what to do next.

When the status byte is CHECK_CONDITION, it also examines the sense data,
provided the sense data is valid.
scsi_decide_disposition
--
case CHECK_CONDITION:
        rtn = scsi_check_sense(scmd);
        if (rtn == NEEDS_RETRY)
            goto maybe_retry;
        /* if rtn == FAILED, we have no sense information;
         * returning FAILED will wake the error handler thread
         * to collect the sense and redo the decide
         * disposition */
        return rtn;
--

scsi_decide_disposition returns one of four kinds of result:

SUCCESS: scsi_finish_command is invoked next.

NEEDS_RETRY/ADD_TO_MLQUEUE: the scmd is requeued to the block layer queue. While this happens, the scsi_device is marked blocked.

__scsi_queue_insert
  -> scsi_set_blocked
  --
    case SCSI_MLQUEUE_DEVICE_BUSY:
    case SCSI_MLQUEUE_EH_RETRY:
        atomic_set(&device->device_blocked,
               device->max_device_blocked);
  --
  -> blk_requeue_request

Anything else: the scmd is handed over to EH.

Next, scsi_finish_command checks whether the scmd has completed correctly.

scsi_finish_command
--
    good_bytes = scsi_bufflen(cmd);
    ..
    scsi_io_completion(cmd, good_bytes);
--

scsi_io_completion
--
    /*
     * special case: failed zero length commands always need to
     * drop down into the retry code. Otherwise, if we finished
     * all bytes in the request we are done now.
     */
    if (!(blk_rq_bytes(req) == 0 && error) &&
        !scsi_end_request(req, error, good_bytes, 0))
        return;
--

It checks whether the scmd has been finished completely.
If not, further action is taken based on scsi_cmnd->result and the sense data:

We can call scsi_requeue_command(). The request is unprepared and put back on the queue, and a new command is then created for it. This should be used if we made forward progress, or if we want to switch from READ(10) to READ(6), for example by setting cmd->device->use_10_for_rw = 0.
We can call __scsi_queue_insert(). The request is put back on the queue and retried using the same command as before, possibly after a delay.
We can call scsi_end_request() with -EIO to fail the remainder of the request.

Error Handler

Error handling is done in the context of the SCSI EH thread.
Before the actual procedure starts, the EH thread needs to take over the host.
Take over the SCSI host

SHOST_RECOVERY is set in shost->shost_state, after which new scmds can no longer enter the LLDD.

scsi_queue_rq
  -> scsi_host_queue_ready
    -> scsi_host_in_recovery
    if yes, goto out_dec_target_busy
  -> scsi_mq_prep_fn // if !RQF_DONTPREP
  -> scsi_dispatch_command

Wait for all in-flight scmds to fail. Note: 'fail' here means that the scmd enters EH through scsi_eh_scmd_add, either because of the target's response or because the abort failed in the timeout path. The EH kthread is woken up only when shost->host_failed == shost->host_busy, where shost->host_busy counts the in-flight scmds on the SCSI host.

scsi_queue_rq
  -> scsi_host_queue_ready
--
    busy = atomic_inc_return(&shost->host_busy) - 1;
    ...
    if (shost->can_queue > 0 && busy >= shost->can_queue)
        goto starved;
--

shost->host_failed counts the scmds that entered EH through scsi_eh_scmd_add. Even when shost->host_busy == shost->host_failed holds, the LLDD may not really be quiescent: the timeout path may not have cleaned up the scmds on the target, so they may still be active there and an irq completion could occur at any time. Currently, the timeout path first tries to abort the scmd, but that may still fail. On the other hand, all such completions are ignored because the scmds have already been marked completed by the timeout path (scsi_times_out usually returns BLK_EH_NOT_HANDLED).

How SCSI EH works
Quoted from Documentation/scsi/scsi_eh.txt:

If eh_strategy_handler() is not present, SCSI midlayer takes charge
of driving error handling.  EH's goals are two - make LLDD, host and
device forget about timed out scmds and make them ready for new
commands.  A scmd is said to be recovered if the scmd is forgotten by
lower layers and lower layers are ready to process or fail the scmd
again.

 To achieve these goals, EH performs recovery actions with increasing
severity.  Some actions are performed by issuing SCSI commands and
others are performed by invoking one of the following fine-grained
hostt EH callbacks.  Callbacks may be omitted and omitted ones are
considered to fail always.

int (* eh_abort_handler)(struct scsi_cmnd *);
int (* eh_device_reset_handler)(struct scsi_cmnd *);
int (* eh_bus_reset_handler)(struct scsi_cmnd *);
int (* eh_host_reset_handler)(struct scsi_cmnd *);

 Higher-severity actions are taken only when lower-severity actions
cannot recover some of failed scmds.  Also, note that failure of the
highest-severity action means EH failure and results in offlining of
all unrecovered devices.

For more details, refer to Documentation/scsi/scsi_eh.txt.

Now consider the following question:
When we issue a r/w command to the SCSI host adapter, we DMA-map the sglist (usually with streaming mappings). When the command times out, can we complete the command directly? In other words, can we unmap the DMA mappings right away?
The answer is NO. Because the mappings are streaming ones, they are unmapped when the request is completed, and the DMA resources may then be reused by another context. If the SCSI host adapter is still using them, memory corruption will result.
How does SCSI EH handle this? Let's look into it.

enum blk_eh_timer_return scsi_times_out(struct request *req)
{
    struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
    enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
    struct Scsi_Host *host = scmd->device->host;
...
    if (host->hostt->eh_timed_out)
        rtn = host->hostt->eh_timed_out(scmd);
    // eh_timed_out is usually NULL
    if (rtn == BLK_EH_NOT_HANDLED) {
        // hand the scmd over to abort_work
        if (scsi_abort_command(scmd) != SUCCESS) {
            // the abort could not be scheduled (e.g. one was
            // already scheduled for this scmd): hand over to EH
            set_host_byte(scmd, DID_TIME_OUT);
            scsi_eh_scmd_add(scmd);
        }
    }
    // BLK_EH_NOT_HANDLED is returned unless eh_timed_out overrides it.
    return rtn;
}

The timeout path does not complete the request.
Let's look at how the abort work handles it.

void
scmd_eh_abort_handler(struct work_struct *work)
{
    ...
        rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
        if (rtn == SUCCESS) {
            /* If the scmd is aborted successfully, the DMA mapping has
             * been cleaned up, so it is now safe to finish or reissue it. */
            set_host_byte(scmd, DID_TIME_OUT);
            if (scsi_host_eh_past_deadline(sdev->host)) {
                ...
            } else if (!scsi_noretry_cmd(scmd) &&
                (++scmd->retries <= scmd->allowed)) {
                SCSI_LOG_ERROR_RECOVERY(3,
                    scmd_printk(KERN_WARNING, scmd,
                            "retry aborted command\n"));
                scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
                return;
            } else {
                ...
                scsi_finish_command(scmd);
                return;
            }
        }
        ...
    // If abort fails, hand over to EH
    scsi_eh_scmd_add(scmd);
}

scmd_eh_abort_handler does not complete the request if the abort fails; it hands the scmd over to SCSI EH instead.

scsi_error_handler
  -> scsi_unjam_host
    -> scsi_eh_get_sense // does nothing for a scmd whose abort failed
    -> scsi_eh_ready_devs
void scsi_eh_ready_devs(struct Scsi_Host *shost,
            struct list_head *work_q,
            struct list_head *done_q)
{
    if (!scsi_eh_stu(shost, work_q, done_q))
        if (!scsi_eh_bus_device_reset(shost, work_q, done_q))
            if (!scsi_eh_target_reset(shost, work_q, done_q))
                if (!scsi_eh_bus_reset(shost, work_q, done_q))
                    if (!scsi_eh_host_reset(shost, work_q, done_q))
                        scsi_eh_offline_sdevs(work_q,
                                      done_q);
}

scsi_eh_stu also does nothing for a scmd whose abort failed:
static int scsi_eh_stu(struct Scsi_Host *shost,
                  struct list_head *work_q,
                  struct list_head *done_q)
{
...
        stu_scmd = NULL;
        list_for_each_entry(scmd, work_q, eh_entry)
            if (scmd->device == sdev && SCSI_SENSE_VALID(scmd) &&
                scsi_check_sense(scmd) == FAILED ) {
                stu_scmd = scmd;
                break;
            }

        if (!stu_scmd)
            continue;
...
        if (!scsi_eh_try_stu(stu_scmd)) {
...
}

SCSI EH then tries bus_device_reset/target_reset/bus_reset/host_reset in turn.
If one of them succeeds, the DMA mappings must have been cleaned up (and the target is operational again), so we can safely finish or reissue the command.
If all of them fail, the target or host adapter is completely dead, and it should be OK to complete the request.

void scsi_eh_flush_done_q(struct list_head *done_q)
{
    struct scsi_cmnd *scmd, *next;

    list_for_each_entry_safe(scmd, next, done_q, eh_entry) {
        list_del_init(&scmd->eh_entry);
        if (scsi_device_online(scmd->device) &&
            !scsi_noretry_cmd(scmd) &&
            (++scmd->retries <= scmd->allowed)) {
                scsi_queue_insert(scmd, SCSI_MLQUEUE_EH_RETRY);
        } else {
            /*
             * If just we got sense for the device (called
             * scsi_eh_get_sense), scmd->result is already
             * set, do not set DRIVER_TIMEOUT.
             */
            if (!scmd->result)
                scmd->result |= (DRIVER_TIMEOUT << 24);
            scsi_finish_command(scmd);
        }
    }
}