IRQ

Framework


struct irq_chip
 - A set of methods describing how to drive the interrupt controller
 - Directly called by core IRQ code
struct irq_domain
 - A pointer to the firmware node for a given interrupt controller (fwnode)
 - A method to convert a firmware description of an IRQ into an ID local to this interrupt controller (hwirq)
 - A way to retrieve the Linux view of an IRQ from the hwirq
struct irq_desc
 - Linux’s view of an interrupt
 - Contains all the core stuff
 - 1:1 mapping to the Linux interrupt number
struct irq_data
 - Contains the data that is relevant to the irq_chip managing this interrupt
 - Both the Linux IRQ number and the hwirq
 - A pointer to the irq_chip
 - Embedded in irq_desc (for now)
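A trimmed-down sketch of how these pieces reference each other (field names follow include/linux/irq.h, but most members are omitted; this is not the full definition):

    /* Simplified sketch, not the complete kernel definitions. */
    struct irq_data {
        unsigned int      irq;        /* Linux IRQ number */
        unsigned long     hwirq;      /* controller-local number */
        struct irq_chip   *chip;      /* how to drive the controller */
        struct irq_domain *domain;    /* hwirq <-> Linux irq translation */
        void              *chip_data;
    };

    struct irq_desc {
        struct irq_data    irq_data;  /* embedded (for now) */
        irq_flow_handler_t handle_irq;
        struct irqaction   *action;   /* driver handlers for this irq */
        /* ... core bookkeeping: status, depth, per-cpu statistics, ... */
    };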

IRQ Domain

The Linux kernel introduced irq_domain mainly to handle systems with multiple
interrupt controllers, where a hardware interrupt number (hwirq) is only
meaningful relative to its own controller and has to be mapped to a globally
unique Linux IRQ number.

The irq_domain library adds a mapping between hwirq and Linux IRQ numbers on
top of the irq_alloc_desc*() API.
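For illustration only (the "foo" controller, its hwirq count and its callbacks are made up), a driver for a simple interrupt controller typically creates a linear domain and maps hwirqs through it:

    /* Hypothetical driver fragment; the foo_* names are placeholders. */
    static struct irq_chip foo_irq_chip = {
        .name = "FOO",
        /* .irq_mask / .irq_unmask / .irq_ack would drive the real hardware */
    };

    static int foo_irq_map(struct irq_domain *d, unsigned int virq,
                           irq_hw_number_t hwirq)
    {
        irq_set_chip_and_handler(virq, &foo_irq_chip, handle_level_irq);
        return 0;
    }

    static const struct irq_domain_ops foo_irq_domain_ops = {
        .map   = foo_irq_map,
        .xlate = irq_domain_xlate_onecell,
    };

    /* In probe(): 32 hwirqs, linear reverse map from hwirq to Linux irq */
    domain = irq_domain_add_linear(np, 32, &foo_irq_domain_ops, NULL);

    /* Create (or look up) the Linux irq number for controller-local hwirq 5 */
    virq = irq_create_mapping(domain, 5);

    /* The chained flow handler would later use the reverse lookup */
    virq = irq_find_mapping(domain, 5);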

Where do IRQ numbers live in today's kernel?
nr_irqs = NR_IRQS

static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS);

struct irq_desc *irq_to_desc(unsigned int irq)
{
	return radix_tree_lookup(&irq_desc_tree, irq);
}
All irq_desc structures are stored in the radix tree irq_desc_tree, indexed by
the Linux irq number (with CONFIG_SPARSE_IRQ).

An irq domain example on the x86 platform

On some architectures, there may be multiple interrupt controllers
involved in delivering an interrupt from the device to the target CPU.
Let's look at a typical interrupt delivery path on x86 platforms::

  Device -> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU

There are three interrupt controllers involved:

1) IOAPIC controller
2) Interrupt remapping controller
3) Local APIC controller

To support such a hardware topology and make software architecture match
hardware architecture, an irq_domain data structure is built for each
interrupt controller and those irq_domains are organized into hierarchy.
When building irq_domain hierarchy, the irq_domain near to the device is
child and the irq_domain near to CPU is parent. So a hierarchy structure
as below will be built for the example above::

	CPU Vector irq_domain (root irq_domain to manage CPU vectors)
		^
		|
	Interrupt Remapping irq_domain (manage irq_remapping entries)
		^
		|
	IOAPIC irq_domain (manage IOAPIC delivery entries/pins)

The following is the irq debugfs output for such a hierarchy:
domain:  INTEL-IR-MSI-1-2
 hwirq:   0x100003
 chip:    IR-PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-1
     hwirq:   0x160000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x80
         chip:    APIC
          flags:   0x0
         Vector:    33
         Target:     3
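In such a hierarchy a child domain does not pick CPU vectors itself: its .alloc callback first asks its parent domain to allocate, then installs its own chip and hwirq for the same virq. A rough sketch of the pattern (the fake_* names and the first_pin() helper are placeholders, not the real x86 code):

    static int fake_ioapic_domain_alloc(struct irq_domain *domain,
                                        unsigned int virq, unsigned int nr_irqs,
                                        void *arg)
    {
        int ret, i;

        /* Ask the parent domain (remapping/vector) to allocate first. */
        ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
        if (ret < 0)
            return ret;

        /* Then install this level's chip and hwirq on each descriptor. */
        for (i = 0; i < nr_irqs; i++)
            irq_domain_set_hwirq_and_chip(domain, virq + i,
                                          first_pin(arg) + i,
                                          &fake_ioapic_chip, NULL);
        return 0;
    }

    static const struct irq_domain_ops fake_ioapic_domain_ops = {
        .alloc = fake_ioapic_domain_alloc,
        .free  = irq_domain_free_irqs_common,
    };

    /* The child domain records its parent when it is created. */
    domain = irq_domain_create_hierarchy(parent, 0, NR_PINS, fwnode,
                                         &fake_ioapic_domain_ops, NULL);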

IRQ Affinity


cpumask

Before looking into irq affinity, let's first look at the cpumask infrastructure.

The following particular system cpumasks and operations manage
possible, present, active and online cpus.

cpu_possible_mask - has bit 'cpu' set iff cpu is populatable
cpu_present_mask  - has bit 'cpu' set iff cpu is populated
cpu_online_mask   - has bit 'cpu' set iff cpu is available to the scheduler
cpu_active_mask   - has bit 'cpu' set iff cpu is available to migration

If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.

The cpu_possible_mask is fixed at boot time, as the set of CPU id’s
that it is possible might ever be plugged in at anytime during the
life of that system boot. The cpu_present_mask is dynamic(*),
representing which CPUs are currently plugged in. And
cpu_online_mask is the dynamic subset of cpu_present_mask,
indicating those CPUs available for scheduling.

If HOTPLUG is enabled, then cpu_possible_mask is forced to have
all NR_CPUS bits set, otherwise it is just the set of CPUs that
ACPI reports present at boot.

If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
depending on what ACPI reports as currently plugged in, otherwise
cpu_present_mask is just a copy of cpu_possible_mask.

(*) Well, cpu_present_mask is dynamic in the hotplug case. If not
hotplug, it’s a copy of cpu_possible_mask, hence fixed at boot.
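A small kernel-side sketch of how these masks are typically consulted (the function and its messages are made up for illustration):

    #include <linux/cpumask.h>
    #include <linux/printk.h>

    static void dump_cpu_masks(void)
    {
        unsigned int cpu;

        pr_info("possible=%u present=%u online=%u active=%u\n",
                cpumask_weight(cpu_possible_mask),
                cpumask_weight(cpu_present_mask),
                cpumask_weight(cpu_online_mask),
                cpumask_weight(cpu_active_mask));

        /* Iterate only over the CPUs the scheduler can currently use. */
        for_each_online_cpu(cpu)
            pr_info("cpu%u is online\n", cpu);
    }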

managed irq

Managed here means that the kernel is responsible for managing the cpu
affinity of the irqs. If the driver provides affinity information when
allocating the vectors, the resulting irqs become managed ones.
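A minimal sketch of how a PCI driver requests managed vectors ('pdev' and the vector counts are placeholders; pre_vectors reserves one non-managed vector, see the prev/post layout below):

    /* Illustrative fragment only. */
    struct irq_affinity affd = {
        .pre_vectors  = 1,    /* e.g. an admin/config interrupt */
        .post_vectors = 0,
    };
    int nvecs;

    nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, 33,
                                           PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
                                           &affd);
    if (nvecs < 0)
        return nvecs;
    /*
     * Vectors 1 .. nvecs-1 get kernel-managed, spread affinity masks;
     * vector 0 keeps the default affinity (it is part of 'resv' below).
     */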

The kernel-internal path that marks them as managed:

irq_remapping_setup_msi_irqs
  -> do_setup_msix_irqs
    -> irq_alloc_hwirq_affinity
      -> __irq_alloc_hwirqs
        -> __irq_alloc_descs
          -> alloc_descs
          ---
          flags = affinity ? IRQD_AFFINITY_MANAGED : 0;
          // If an affinity mask is provided, the affinity will be managed by the kernel.
          ---
The layout of the vector space when the affinity masks are generated:

 prev       affinity spreading     post
 |- - - | - - - - - - - - - - - - | - - - -|

 resv = prev + post

The path that generates the irq affinity masks:

pci_alloc_irq_vectors_affinity
  -> __pci_enable_msix_range
    -> irq_calc_affinity_vectors
    --
    ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
    by default, we prefer one-to-one mapping between hw queue and cpu
    --
    -> __pci_enable_msix
      -> msix_setup_entries
        -> irq_create_affinity_masks
Take an example here.

4 irq vectors v0,v1,v2,v3.

cpu topology

               Node0                              Node1
{ [core0, core2], [core1, core3]}  { [core0, core2], [core1, core3]}

[] -> siblings

The irq spreading has gone through two stages. Before commit 84676c1f21
("genirq/affinity: assign vectors to all possible CPUs"), the irq vectors were
spread across the present cpus only, so we may get the following scenario.
               Node0                              Node1
{ [core0, core2], [core1, core3]}  { [core0, core2], [core1, core3]}
                                                        X      X
[] -> siblings
X  -> not present

Due to irq_calc_affinity_vectors, at most 6 irq vectors can be allocated:
v0, v1, v2, v3, v4, v5

Let's look at how the irq vectors will be spread:
    for_each_node_mask(n, nodemsk) {
        int ncpus, v, vecs_to_assign, vecs_per_node;

        /* Spread the vectors per node */
        vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
        /* 2 nodes, 6 vectors => vecs_per_node is 3 */

        /* Get the cpus on this node which are in the mask */
        cpumask_and(nmsk, cpu_present_mask, node_to_present_cpumask[n]);

        /* Calculate the number of cpus per vector */
        ncpus = cpumask_weight(nmsk);
        /* node0 has 4 present cpus */
        vecs_to_assign = min(vecs_per_node, ncpus);
        /* vecs_to_assign is 3 here */

        /* Account for rounding errors */
        extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
        /* extra_vecs is 1 */

        for (v = 0; curvec < last_affv && v < vecs_to_assign;
             curvec++, v++) {
            cpus_per_vec = ncpus / vecs_to_assign;
            /* cpus_per_vec is 1 */

            /* Account for extra vectors to compensate rounding errors */
            if (extra_vecs) {
                cpus_per_vec++;
                --extra_vecs;
            }
            /*
             * node0 has 4 cpus but only 3 vectors are assigned to it,
             * so one vector gets 2 cpus assigned.
             */
            irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
        }

        if (curvec >= last_affv)
            break;
        --nodes;

        /*
         * On the second iteration there are 3 vectors left, but node1 only
         * has 2 present cpus, so vecs_to_assign = min(vecs_per_node, ncpus)
         * allows only 2 of them to be assigned cpus.  The leftover vector's
         * mask gets irq_default_affinity (see the fill-out loop below).
         */
    }
done:
    put_online_cpus();

    /* Fill out vectors at the end that don't need affinity */
    for (; curvec < nvecs; curvec++)
        cpumask_copy(masks + curvec, irq_default_affinity);
    free_node_to_possible_cpumask(node_to_possible_cpumask);
out:
    free_cpumask_var(nmsk);
    return masks;

Here is a case where a patch I sent upstream caused a big bug, found when the
block layer maintainer tested it.
This is the patch: https://lkml.org/lkml/2018/3/13/227
What is the issue?
It uses pre_vectors = 1 when allocating the irq vectors and assigns separate
irq vectors to the adminq and ioq1, so every nvme io queue gets an irq vector
equal to its qid. But I missed that blk_mq_pci_map_queues() still assumes the
irq vector is equal to qid - 1.

To reproduce this issue, add "possible_cpus=16" to the kernel command line and
boot it on my machine with 8 cores. Then we get the following irq affinity
result:
nvme0q1   nvme0q2   nvme0q3    nvme0q4    nvme0q5    nvme0q6     nvme0q7    nvme0q8
0,4       1,5       2,6        3,7        8-9        10-11       12-13      14-15
We can see that the irqs of nvme0q5~nvme0q8 have their affinity set to cpus
that are not present.
This is due to 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs").
Ming Lei has submitted a patch series to fix it:
genirq/affinity: irq vector spread among online CPUs as far as possible

Let's return to our issue.
int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev)
{
    const struct cpumask *mask;
    unsigned int queue, cpu;
    /*
     * 'queue' is 0-based here, but with pre_vectors = 1 the affinity mask
     * for vector 0 belongs to the adminq and contains all possible cpus,
     * so this produces a wrong mapping between ctxs and hctxs.
     */
    for (queue = 0; queue < set->nr_hw_queues; queue++) {
            mask = pci_irq_get_affinity(pdev, queue);
            if (!mask)
                goto fallback;

        for_each_cpu(cpu, mask)
            set->mq_map[cpu] = queue;
    }

	return 0;
...
}
We get the following mapping (note how everything is shifted by one queue):

ctxs            14,15     0,4       1,5        2,6        3,7        8,9         10,11      12,13
                hctx0     hctx1     hctx2      hctx3      hctx4      hctx5       hctx6      hctx7
                nvme0q1   nvme0q2   nvme0q3    nvme0q4    nvme0q5    nvme0q6     nvme0q7    nvme0q8
cpu affinities  0,4       1,5       2,6        3,7        8-9        10-11       12-13      14-15

The issue is:
if we submit IO on cpu3 or cpu7, hctx4/nvme0q5 will handle it, but after
completion the irq is directed at cpus 8-9, which are not present on my machine.

taskset -c 3 dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1

And we get the following log in dmesg:
[  350.800366] nvme nvme0: I/O 253 QID 5 timeout, completion polled
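With pre_vectors = 1, the affinity mask of vector 0 belongs to the adminq, so the per-queue lookup has to skip the reserved vector(s). A minimal sketch of the idea (not the actual upstream fix; 'pre_vectors' here is just the count the driver reserved):

    /* Hypothetical fix sketch for the loop above. */
    for (queue = 0; queue < set->nr_hw_queues; queue++) {
        mask = pci_irq_get_affinity(pdev, queue + pre_vectors);
        if (!mask)
            goto fallback;

        for_each_cpu(cpu, mask)
            set->mq_map[cpu] = queue;
    }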
The userland daemon (irqbalance) cannot set the affinity of managed irqs:
write_irq_affinity
  -> irq_can_set_affinity_usr
    -> !irqd_affinity_is_managed //Affinity is auto-managed by the kernel, userland cannot modify it.
  -> irq_select_affinity_usr
    -> setup_affinity
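A userspace illustration of the effect (irq number 42 is a placeholder; for a managed irq the write is expected to fail, since the path above returns -EIO):

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/irq/42/smp_affinity", "w");

        if (!f) {
            perror("open smp_affinity");
            return 1;
        }
        fprintf(f, "f\n");                  /* try to pin to cpus 0-3 */
        if (fclose(f) == EOF)               /* the write is flushed here */
            perror("write smp_affinity");   /* expect EIO for a managed irq */
        return 0;
    }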

irqbalance

Build the cpu hierarchy

main
  -> build_object_tree
    -> build_numa_node_list // parse /sys/devices/system/node/ 
    -> parse_cpu_tree

The cpu hierarchy is built as:

 - At the top are numa_nodes
 - All CPU packages belong to a single numa_node
 - All Cache domains belong to a CPU package
 - All CPU cores belong to a cache domain

Objects are built in that order (top down)
Object workload is the aggregate sum of the workload of the objects below it
All the components in the hierarchy are represented by struct topo_obj:
struct topo_obj {
    uint64_t load;
    uint64_t last_load;
    enum obj_type_e obj_type;
    int number;
    int powersave_mode;
    cpumask_t mask;
    GList *interrupts;           // GList entry of irq_info
    struct topo_obj *parent;
    GList *children;
    GList **obj_type_list;
};

enum obj_type_e {
    OBJ_TYPE_CPU,
    OBJ_TYPE_CACHE,
    OBJ_TYPE_PACKAGE,
    OBJ_TYPE_NODE
};
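A minimal sketch of the "aggregate load" idea using glib's GList and the struct topo_obj above, in the spirit of irqbalance's for_each_object helper (this is not irqbalance's actual code):

    #include <glib.h>
    #include <stdint.h>

    /* Illustration only: sum each child's load into the parent's total. */
    static void add_child_load(gpointer data, gpointer user_data)
    {
        struct topo_obj *child = data;
        uint64_t *total = user_data;

        *total += child->load;
    }

    static uint64_t aggregate_load(struct topo_obj *obj)
    {
        uint64_t total = 0;

        g_list_foreach(obj->children, add_child_load, &total);
        return total;
    }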

Collect irq info
main
  -> build_object_tree
    -> rebuild_irq_db
Calculate the positions where the irqs should be migrated
calculate_placement
---
    if (g_list_length(rebalance_irq_list) > 0) {

        // Place irqs in the best node.

        for_each_irq(rebalance_irq_list, place_irq_in_node, NULL);
        for_each_object(numa_nodes, place_irq_in_object, NULL);
        for_each_object(packages, place_irq_in_object, NULL);
        for_each_object(cache_domains, place_irq_in_object, NULL);
    }
---

The basic principle to balance irqs is:

    asign = place.least_irqs ? place.least_irqs : place.best;

The least_irqs and best are from find_best_object
---
    newload = d->load;
    if (newload < best->best_cost) {
        best->best = d;
        best->best_cost = newload;
        best->least_irqs = NULL;
    }

    if (newload == best->best_cost) {
        if (g_list_length(d->interrupts) < g_list_length(best->best->interrupts))
            best->least_irqs = d;
    }
---

Assign the irq to the object with the least load; if the loads are equal,
select the object with the fewest irqs.

The affinity_hint also plays a role here.

Look at:
find_best_object
---
    /*
      * If the hint policy is subset, then we only want 
      * to consider objects that are within the irqs hint, but
      * only if that irq in fact has published a hint
      */
    if (hint_policy == HINT_POLICY_SUBSET) {
        if (!cpus_empty(best->info->affinity_hint)) {
            cpus_and(subset, best->info->affinity_hint, d->mask);
            if (cpus_empty(subset))
                return;
        }
    }

---
Where does irq_info->load come from?
static void assign_load_slice(struct irq_info *info, void *data)
{
    uint64_t *load_slice = data;
    info->load = (info->irq_count - info->last_irq_count) * *load_slice;

    /*
      * Every IRQ has at least a load of 1
      */
    if (!info->load)
        info->load++;
}
How is the mapping activated?
activate_mapping
---

    /*
      * only activate mappings for irqs that have moved
      */

    if (!info->moved)
        return;

    if ((hint_policy == HINT_POLICY_EXACT) &&
        (!cpus_empty(info->affinity_hint))) {
        if (cpus_intersects(info->affinity_hint, banned_cpus))
            log(TO_ALL, LOG_WARNING,
                "irq %d affinity_hint and banned cpus confict\n",
                info->irq);
        else {
            applied_mask = info->affinity_hint;
            valid_mask = 1;
        }
    } else if (info->assigned_obj) {

        // The assigned_obj here could be a cpu object, but it could also be a
        // cache, package or even node obj. [1]

        applied_mask = info->assigned_obj->mask;
        if ((hint_policy == HINT_POLICY_SUBSET) &&
            (!cpus_empty(info->affinity_hint))) {
            cpus_and(applied_mask, applied_mask, info->affinity_hint);
            if (!cpus_intersects(applied_mask, unbanned_cpus))
                log(TO_ALL, LOG_WARNING,
                    "irq %d affinity_hint subset empty\n",
                   info->irq);
            else
                valid_mask = 1;
        } else {
            valid_mask = 1;
        }
    }

    /*
      * Don't activate anything for which we have an invalid mask 
      */
    if (!valid_mask || check_affinity(info, applied_mask))
        return;

    if (!info->assigned_obj)
        return;

    sprintf(buf, "/proc/irq/%i/smp_affinity", info->irq);
    file = fopen(buf, "w");
    if (!file)
        return;

    cpumask_scnprintf(buf, PATH_MAX, applied_mask);
    fprintf(file, "%s", buf);
    fclose(file);
    info->moved = 0; /*migration is done*/
}
[1] Look at calculate_placement:
        for_each_irq(rebalance_irq_list, place_irq_in_node, NULL);
        for_each_object(numa_nodes, place_irq_in_object, NULL);
        for_each_object(packages, place_irq_in_object, NULL);
        for_each_object(cache_domains, place_irq_in_object, NULL);
The placement proceeds level by level. Look at the following example:
given an irq that has been assigned a package topo_obj, if the cpu in
irq->affinity_hint has been offlined, then due to the following code fragment in
find_best_object
---
    if (hint_policy == HINT_POLICY_SUBSET) {
        if (!cpus_empty(best->info->affinity_hint)) {
            cpus_and(subset, best->info->affinity_hint, d->mask);
            if (cpus_empty(subset))
                return;
        }
    }
---
this irq will not get further assignment.
Its final assigned obj is the package topo_obj and its final smp_affinity is the
mask of the package topo_obj.
balance level

Initially, all the irq_infos are in rebalance_irq_list.
Then they are queued onto different topo_objs based on their balance level.

For example, if an irq's balance level is BALANCE_PACKAGE, it is placed on a
package topo_obj and is not pushed further down the hierarchy, so its
smp_affinity covers the whole package.

force_rebalance_irq will move an irq back to rebalance_irq_list, after which it
is rebalanced again from the top down.

x86 irq vector

x86 assign irq vector

For managed vectors,
assign_managed_vector
---
    cpumask_and(vector_searchmask, vector_searchmask, affmsk);
    cpu = cpumask_first(vector_searchmask);
    ...
    vector = irq_matrix_alloc_managed(vector_matrix, cpu);
    if (vector < 0)
        return vector;
    apic_update_vector(irqd, vector, cpu);
    apic_update_irq_cfg(irqd, vector, cpu);
---

For the others,
assign_vector_locked
  -> irq_matrix_alloc
---

    // Select the online cpu with the most available vectors (fewest allocated).

    for_each_cpu(cpu, msk) {
        cm = per_cpu_ptr(m->maps, cpu);

        if (!cm->online || cm->available <= maxavl)
            continue;

        best_cpu = cpu;
        maxavl = cm->available;
    }
    if (maxavl) {
        cm = per_cpu_ptr(m->maps, best_cpu);
        bit = matrix_alloc_area(m, cm, 1, false);
        if (bit < m->alloc_end) {
            cm->allocated++;
            cm->available--;
            m->total_allocated++;
            m->global_available--;
            if (reserved)
                m->global_reserved--;
            *mapped_cpu = best_cpu;
            trace_irq_matrix_alloc(bit, best_cpu, m, cm);
            return bit;
        }
    }
---