Multipath

How to identify pathes to one LU
Deliver request
path select
path group
device handler
retry_tick

Concepts

multiple path

The storage process could be broken into 3 different functional areas as following:

Connecting - Provides the transmission of data between systems and storage
Storing    - A lower-level application with specialized commands and control protocols for 
             system/device interactions.
Filing     - Directs the placement of data objects in storage and is responsible for
             presenting data to applications and users

For the iSCSI, the TCP/IP is _connecting_, iSCSI is the _storing_.
For the NVMe, the NVMe is storing, the _connecting_ could be pcie, fc or rdma....
Filing is commonly the filesystem.

We could easily infer that the multipath is in _connecting_, it depends on the redundant IO path between
system and storage.

The connection from the server through the HBA to the storage controller is referred as a path. When multiple paths
exists to a storage device(LUN) on a storage subsystem, it is referred as multipath connectivity. It is a enterprise
level storage capability. Main purpose of multipath connectivity is to provide redundant access to the storage devices,
i.e to have access to the storage device when one or more of the components in a path fail. Another advantage of
multipathing is the increased throughput by way of load balancing.

Note: Multipathing protects against the failure of path(s) and not the failure of a specific storage device.

Common example of multipath is a SAN connected storage device. Usually one or more fibre channel HBAs from the host will be
connected to the fabric switch and the storage controllers will be connected to the same switch.
A simple example of multipath could be: 2 HBAs connected to a switch to which the storage controllers are connected. In this
case the storage controller can be accessed from either of the HBAs and hence we have multipath connectivity.
Following is an example of multipath

                              +----------------------+
                              |                      |  
                              |        SERVER        |
                              |                      |
                              +-----+----------+-----+
                              |HBA_0|          |HBA_1|
                              +--+--+          +--+--+
                                 |                |
                                 |   +--------+   |
                                 +---| SWITCH |---+ 
                                     +-+----+-+
                                       |    | 
                                 +-----+    +-----+
                                 |                |
                             +---+---+        +---+---+
                             |CONTRL0|        |CONTRL1|
                             +-------+--------+-------+
                             |       STORAGE          |
                             +------------------------+

a/a a/p alua

active/active vs active/passive storage controllers
What is alua

Active-Active storage controllers describe an architecture where both controllers are doing the above functions and the load
 on the storage array is automatically distributed evenly between all the controllers.

Active-Passive storage controllers refer to the approach where only one controller is doing the above functions (that’s the
 “active” controller) and the second controller is not doing anything (you guessed it, that’s the “passive” controller).

Asymmetric Logical Unit Access or ALUA refers to a hybrid approach. I know that you are thinking, “ALUA is a horrible
 acronym.” In this approach, both controllers are at work, but Logical Unit Numbers (LUNs) have an affinity to a specific controller and
 usually, if you access the LUN from a different controller, you’ll pay a performance price. In addition, the performance per LUN is limited to
 that of a single controller (yes, also in a scale-out architecture). So you need to start manual load balancing of the LUNs between
 controllers. Did I mention it’s a headache?

Implementation

The kernal part of multipath is based on Device mapper and it is the only one request-based dm driver.

                 Filesystem
 ---------------------------------- generic_make_request
                 BIO
 ---------------------------------- blk_make_request
                 Blk-mq
                 Queue
 ---------------------------------- dm_mq_queue_rq
                 Blk-mq 
                 w/o sched
         Queue0         Queue1
 ---------------------------------- scsi_queue_rq
                 SCSI-mq

How to identify pathes to one LU

path_discovery
---
    udev_iter = udev_enumerate_new(udev);
    if (!udev_iter)
        return -ENOMEM;

    udev_enumerate_add_match_subsystem(udev_iter, "block");
    udev_enumerate_add_match_is_initialized(udev_iter);
    udev_enumerate_scan_devices(udev_iter);


    udev_list_entry_foreach(entry,
                udev_enumerate_get_list_entry(udev_iter)) {

        const char *devtype;
        devpath = udev_list_entry_get_name(entry);
        condlog(4, "Discover device %s", devpath);
        udevice = udev_device_new_from_syspath(udev, devpath);
        if (!udevice) {
            condlog(4, "%s: no udev information", devpath);
            continue;
        }
        devtype = udev_device_get_devtype(udevice);
        if(devtype && !strncmp(devtype, "disk", 4)) {
            total_paths++;
            conf = get_multipath_config();
            pthread_cleanup_push(put_multipath_config, conf);

            if (path_discover(pathvec, conf,
                      udevice, flag) == PATHINFO_OK)

                num_paths++;
            pthread_cleanup_pop(1);
        }
        udev_device_unref(udevice);
    }
---

configure
  -> path_discover
    -> store_pathinfo
      -> path_info // collect information.
        -> sysfs_pathinfo
        -> scsi_ioctl_pathinfo

all the paths will be collected together.
then coalesce the paths to a same target/LU together
a dm-mpath will be built on top of these paths.

  -> coalesce_paths
    -> add_map_with_path
      -> adopt_paths
---
    vector_foreach_slot (pathvec, pp, i) {

        if (!strncmp(mpp->wwid, pp->wwid, WWID_SIZE)) {

            condlog(3, "%s: ownership set to %s",
                pp->dev, mpp->alias);
            pp->mpp = mpp;

            if (!mpp->paths && !(mpp->paths = vector_alloc()))
                return 1;

            if (!find_path_by_dev(mpp->paths, pp->dev) &&
                store_path(mpp->paths, pp))
                    return 1;
            conf = get_multipath_config();
            pthread_cleanup_push(put_multipath_config, conf);
            ret = pathinfo(pp, conf,
                       DI_PRIO | DI_CHECKER);
            pthread_cleanup_pop(1);
            if (ret)
                return 1;
        }
    }

---

The wwid is the flag that identify paths that are connecting the same target/LU. Where does it come form ?

configure //DI_ALL
  -> path_discover
    -> store_pathinfo
      -> path_info //DI_WWID
        -> get_uid //SYSFS_BUS_SCSI
          -> get_vpd_uid
            -> get_vpd_sysfs //0x83
              -> sysfs_get_vpd

read /sys/block/sda/device/vpd_pg83

vps: vital product data

scsi_add_lun
  -> scsi_attach_vpd
    -> scsi_update_vpd_page
      -> scsi_get_vpd_buf
        -> scsi_vpd_inquiry
---

    cmd[0] = INQUIRY;
    cmd[1] = 1;        /* EVPD */
    cmd[2] = page; //0x83

    cmd[3] = len >> 8;
    cmd[4] = len & 0xff;
    cmd[5] = 0;        /* Control byte */

    /*
     * I'm not convinced we need to try quite this hard to get VPD, but
     * all the existing users tried this hard.
     */
    result = scsi_execute_req(sdev, cmd, DMA_FROM_DEVICE, buffer,
                  len, NULL, 30 * HZ, 3, NULL);

---
SCSI_Command_Manual_Reference

The application client requests the vital product data information by setting the EVPD bit to one 
and specifying the page code of a vital product data.

5.4.11 Device Identification VPD page (83h)
Device identifiers consist of one or more of the following:

 Logical unit names;
 SCSI target port identifiers;
 SCSI target port names;
 SCSI target device names;
 Relative target port identifiers;
 SCSI target port group number; or
 Logical unit group number.

Deliver request


  ----------------------------
         Queue of DM

   Merge, Sort, IO scheduling
  ----------------------------
    Path0          Path1           
  Queue of Dev    Queue of Dev
  -----------     ------------
      | issue         | issue
      v               v


All the merge, sort and io scheduling work is done in the queue of DM device.
The requests will only passthrough the queue of target device.
This is done by blk_insert_cloned_request.

blk_insert_cloned_request used to queue the original request to underlying
request_queues' hctx->dispatch and run the hctx directly.
When the underlying driver returns BLK_STS_DEV_RESOURCE/BLK_STS_RESOURCE,
the cloned request will be queued to underlying hctx->dispatch and wait for
the resource. And also, the original request has been issued from the top layer
queue (dm queue) and cannot do any merge on it.

A patch from leiming improve this.
396eaf2 (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback)
---
    Fix this by updating blk_insert_cloned_request() to use
    blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
    request to be issued directly to the underlying queue and returns the
    dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
    returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
    to _not_ destage the request.  Whereby preserving the opportunity to
    merge IO.
---
See map_request()
---
    case DM_MAPIO_REMAPPED:
        if (setup_clone(clone, rq, tio, GFP_ATOMIC)) {
            /* -ENOMEM */
            ti->type->release_clone_rq(clone);
            return DM_MAPIO_REQUEUE;
        }

        /* The target has remapped the I/O so dispatch it */
        trace_block_rq_remap(clone->q, clone, disk_devt(dm_disk(md)),
                     blk_rq_pos(rq));
        ret = dm_dispatch_clone_request(clone, rq);
        if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
            blk_rq_unprep_clone(clone);
            tio->ti->type->release_clone_rq(clone);
            tio->clone = NULL;

            //release the cloned request and requeue the original request.
q->mq_ops)
                r = DM_MAPIO_DELAY_REQUEUE;
            else
                r = DM_MAPIO_REQUEUE;
            goto check_again;
        }
        break;
    case DM_MAPIO_REQUEUE:
        /* The target wants to requeue the I/O */
        break;
    case DM_MAPIO_DELAY_REQUEUE:
        /* The target wants to requeue the I/O after a delay */
        dm_requeue_original_request(tio, true);
        break;

---

path select

multipath_clone_and_map
---
    /* Do we need to select a new pgpath? */
    pgpath = READ_ONCE(m->current_pgpath);
    if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
        pgpath = choose_pgpath(m, nr_bytes);
---
We will choose path in following cases:

 m->current_pgpath is NULL
The typical case which will set current_pgpath to NULL is fail_path.
When io error occurs on the path.
dm_softirq_done
  -> dm_done
    -> multipath_end_io //blk_path_error
      -> fail_path
When user mode message indicates that (multipathd)
multipath_message
 MPATHF_QUEUE_IO is not set
It is set by __switch_pg when device handler is set
---
    if (m->hw_handler_name) {
        set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
        set_bit(MPATHF_QUEUE_IO, &m->flags);
    }
---
The PG initialization process is started by pg_init_all_paths.
When it is completed, pg_init_done is invoked and MPATHF_QUEUE_IO is cleared.
after pg_init_done clears MPATHF_QUEUE_IO, it will also kick the requeue list.
as we know, when MPATHF_QUEUE_IO is set, the reqs are all requeued.
pg_init_done
  -> process_queued_io_list
    -> dm_mq_kick_requeue_list.
      -> __dm_mq_kick_requeue_list
        -> blk_mq_delay_kick_requeue_list



Therefore, we will select path every time!

choose_pgpath
---

/* Don't change PG until it has no remaining paths */

check_current_pg:
    pg = READ_ONCE(m->current_pg);
    if (pg) {
        pgpath = choose_path_in_pg(m, pg, nr_bytes);
        if (!IS_ERR_OR_NULL(pgpath))
            return pgpath;
    }
---
static struct pgpath *choose_path_in_pg(struct multipath *m,
                    struct priority_group *pg,
                    size_t nr_bytes)
{
    unsigned long flags;
    struct dm_path *path;
    struct pgpath *pgpath;

    path = pg->ps.type->select_path(&pg->ps, nr_bytes);
    if (!path)
        return ERR_PTR(-ENXIO);

    pgpath = path_to_pgpath(path);

    if (unlikely(READ_ONCE(m->current_pg) != pg)) {
        /* Only update current_pgpath if pg changed */
        spin_lock_irqsave(&m->lock, flags);
        m->current_pgpath = pgpath;

        __switch_pg(m, pg);

        spin_unlock_irqrestore(&m->lock, flags);
    }

    return pgpath;
}
Let's look at the round-robin selector:
static struct dm_path *rr_select_path(struct path_selector *ps, size_t nr_bytes)
{
    unsigned long flags;
    struct selector *s = ps->context;
    struct path_info *pi = NULL;

    spin_lock_irqsave(&s->lock, flags);
    if (!list_empty(&s->valid_paths)) {

    select and move it to the back of the list

        pi = list_entry(s->valid_paths.next, struct path_info, list);
        list_move_tail(&pi->list, &s->valid_paths);
    }
    spin_unlock_irqrestore(&s->lock, flags);

    return pi ? pi->path : NULL;
}

Except for round-robin selector, there are queue-length and service-time ones
which looks smarter.

path group

http://www.linux-admins.net/2010/10/multipath-usage-guide-for-san-in-rhel.html
Determines how the path group(s) are formed using the available paths. There are five different policies:

 multibus: One path group is formed with all paths to a LUN. Suitable for devices that are in Active/Active mode.
 failover: Each path group will have only one path.
 group_by_serial: One path group per storage controller(serial). All paths that connect to the LUN through a controller are assigned to a 
 path group. Suitable for devices that are in Active/Passive mode.
 group_by_prio: Paths with same priority will be assigned to a path group.
 group_by_node_name: Paths with same target node name will be assigned to a path group.

device handler

retry_tick

queue_if_no_path will not be set forever in multipath.
Look at checkerloop()
static void * checkerloop (void *ap)
{
    ...
    while (1) {
        ...

        if (vecs->pathvec) {
            vector_foreach_slot (vecs->pathvec, pp, i) {
                check_path(vecs, pp);
            }
        }
        if (vecs->mpvec) {
            defered_failback_tick(vecs->mpvec);
            retry_count_tick(vecs->mpvec);    // ----> here !
        }
        ...
        sleep(1);
    }
    return NULL;
}
static void
retry_count_tick(vector mpvec)
{
    struct multipath *mpp;
    unsigned int i;

    vector_foreach_slot (mpvec, mpp, i) {
        if (mpp->retry_tick) {
            mpp->stat_total_queueing_time++;
            condlog(4, "%s: Retrying.. No active path", mpp->alias);
            if(--mpp->retry_tick == 0) {
                dm_queue_if_no_path(mpp->alias, 0);     // ----> here !
                condlog(2, "%s: Disable queueing", mpp->alias);
            }
        }
    }
}

retry_tick is set in

extern void set_no_path_retry(struct multipath *mpp)
{
    mpp->retry_tick = 0;
    mpp->nr_active = pathcount(mpp, PATH_UP) + pathcount(mpp, PATH_GHOST);
    select_no_path_retry(mpp); // Usually, it will set mpp->no_path_retry based on the value 
                                   // of 'no_path_retry' in multipath.conf

    switch (mpp->no_path_retry) {
    ...
    default:
        dm_queue_if_no_path(mpp->alias, 1);
        if (mpp->nr_active == 0) {
            /* Enter retry mode */
            mpp->retry_tick = mpp->no_path_retry * conf->checkint;
            condlog(1, "%s: Entering recovery mode: max_retries=%d",
                mpp->alias, mpp->no_path_retry);
        }
        break;
    }
}

We could see that when there is no active path in one multipath, queue_if_no_path will be disabled.
There are two parameters decides the final time before disabling the queue_if_no_path.
no_path_retry   --  no_path_retry in multipath.conf
checkint        --  polling_interval in multipath.conf (by default 5)

Let's look at the dmesg log:
[541068.033114]  connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4835727978, last ping 4835732984, now 4835738000
[541068.033485]  connection4:0: detected conn error (1022)
[541073.033131]  session4: session recovery timed out after 5 secs

// the sdevs will be offlined after iscsi recovery timed out.

[541073.033140] sd 9:0:0:1: rejecting I/O to offline device
[541073.033448] sd 9:0:0:1: [sde] killing request
[541073.033451] sd 9:0:0:1: rejecting I/O to offline device
[541073.033730] sd 9:0:0:1: [sde] killing request
[541073.033732] sd 9:0:0:1: rejecting I/O to offline device
[541073.034007] sd 9:0:0:1: [sde] killing request
[541073.034016] sd 9:0:0:1: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[541073.034018] sd 9:0:0:1: [sde] CDB: Read(10) 28 00 00 00 00 00 00 00 08 00
[541073.034020] blk_update_request: I/O error, dev sde, sector 0
[541073.034315] sd 9:0:0:1: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[541073.034317] sd 9:0:0:1: [sde] CDB: Write(10) 2a 00 00 00 36 28 00 00 18 00
[541073.034318] blk_update_request: I/O error, dev sde, sector 13864
[541073.034593] sd 9:0:0:1: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[541073.034595] sd 9:0:0:1: [sde] CDB: Write(10) 2a 00 03 f1 14 38 00 01 c8 00
[541073.034596] blk_update_request: I/O error, dev sde, sector 66131000
[541073.034866] device-mapper: multipath: Failing path 8:64.

// We could see io error on target device layer, but not dm layer, it is because queue_if_no_path was still enabled
// the failed io and and new io will be requeued.

// After about 50s, io error on dm layer came up.
// This is due to the retry_tick had been exhausted.
// when queue_if_no_path is disabled, kernel will run the request_queue to fail the requeued io.
// refer to queue_if_no_path in driver/md/dm-mpath.c
[541123.486079] blk_update_request: I/O error, dev dm-5, sector 66131000
[541123.486342] Buffer I/O error on dev dm-5, logical block 8266375, lost async page write
[541123.486629] Buffer I/O error on dev dm-5, logical block 8266376, lost async page write
[541123.486900] Buffer I/O error on dev dm-5, logical block 8266377, lost async page write
[541123.487170] Buffer I/O error on dev dm-5, logical block 8266378, lost async page write
[541123.487438] Buffer I/O error on dev dm-5, logical block 8266379, lost async page write
[541123.487705] Buffer I/O error on dev dm-5, logical block 8266380, lost async page write
[541123.487958] Buffer I/O error on dev dm-5, logical block 8266381, lost async page write
[541123.488214] Buffer I/O error on dev dm-5, logical block 8266382, lost async page write
[541123.488475] Buffer I/O error on dev dm-5, logical block 8266383, lost async page write
[541123.488748] Buffer I/O error on dev dm-5, logical block 8266384, lost async page write


So the final result above is expected. There is no bug here.
In addition, I guess the 'no_path_retry' of the target device is 10. Please confirm with customer.

Thanks
Jianchao