struct irq_chip
- A set of methods describing how to drive the interrupt controller
- Directly called by core IRQ code
struct irq_domain
- A pointer to the firmware node for a given interrupt controller (fwnode)
- A method to convert a firmware description of an IRQ into an ID local to this interrupt controller (hwirq)
- A way to retrieve the Linux view of an IRQ from the hwirq
struct irq_desc
- Linux’s view of an interrupt
- Contains all the core stuff
- 1:1 mapping to the Linux interrupt number
struct irq_data
- Contains the data that is relevant to the irq_chip managing this interrupt
- Both the Linux IRQ number and the hwirq
- A pointer to the irq_chip
- Embedded in irq_desc (for now)
There are two main reasons for the Linux kernel to introduce irq_domain: hardware irq numbers from
different interrupt controllers can overlap, so they must be decoupled from the Linux irq numbers,
and interrupt controller drivers should not open code their own reverse mapping scheme.
The irq_domain library adds a mapping between hwirq and IRQ numbers on top of the irq_alloc_desc*() API.
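As a minimal sketch of how these structures fit together, consider a hypothetical memory-mapped
controller with 32 lines; all the foo_* names below are made up for illustration, not taken from
any real driver:
---
#include <linux/irq.h>
#include <linux/irqdomain.h>
#include <linux/of.h>

/* Hypothetical chip callbacks; a real driver would poke mask registers here. */
static void foo_irq_mask(struct irq_data *d)   { }
static void foo_irq_unmask(struct irq_data *d) { }

static struct irq_chip foo_chip = {
	.name       = "FOO",
	.irq_mask   = foo_irq_mask,
	.irq_unmask = foo_irq_unmask,
};

/* Called by the core when a hwirq of this domain is mapped to a Linux irq. */
static int foo_domain_map(struct irq_domain *d, unsigned int virq,
			  irq_hw_number_t hwirq)
{
	irq_set_chip_and_handler(virq, &foo_chip, handle_level_irq);
	return 0;
}

static const struct irq_domain_ops foo_domain_ops = {
	.map   = foo_domain_map,
	.xlate = irq_domain_xlate_onecell,	/* one firmware cell -> hwirq */
};

static struct irq_domain *foo_domain;

static int foo_probe(struct device_node *np)
{
	unsigned int virq;

	/* 32 hwirqs, linear hwirq -> virq lookup table */
	foo_domain = irq_domain_add_linear(np, 32, &foo_domain_ops, NULL);
	if (!foo_domain)
		return -ENOMEM;

	/* Map hwirq 5: allocates an irq_desc and returns the Linux irq number. */
	virq = irq_create_mapping(foo_domain, 5);

	return virq ? 0 : -EINVAL;
}
---
Later, irq_find_mapping(foo_domain, 5) returns the same Linux irq number, which is the
hwirq -> Linux irq direction mentioned above.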
Where does the irq number live in today's kernel?
nr_irqs = NR_IRQS
static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS);
struct irq_desc *irq_to_desc(unsigned int irq)
{
return radix_tree_lookup(&irq_desc_tree, irq);
}
With CONFIG_SPARSE_IRQ, all the irq_desc structures are stored in the radix tree irq_desc_tree, indexed by the Linux irq number.
An irq_domain example on the x86 platform
On some architectures, there may be multiple interrupt controllers
involved in delivering an interrupt from the device to the target CPU.
Let's look at a typical interrupt delivering path on x86 platforms::
Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU
There are three interrupt controllers involved:
1) IOAPIC controller
2) Interrupt remapping controller
3) Local APIC controller
To support such a hardware topology and make software architecture match
hardware architecture, an irq_domain data structure is built for each
interrupt controller and those irq_domains are organized into hierarchy.
When building irq_domain hierarchy, the irq_domain near to the device is
child and the irq_domain near to CPU is parent. So a hierarchy structure
as below will be built for the example above::
CPU Vector irq_domain (root irq_domain to manage CPU vectors)
^
|
Interrupt Remapping irq_domain (manage irq_remapping entries)
^
|
IOAPIC irq_domain (manage IOAPIC delivery entries/pins)
The following is the irq debugfs output (/sys/kernel/debug/irq/irqs/<n>) for one interrupt in such a hierarchy:
domain: INTEL-IR-MSI-1-2
hwirq: 0x100003
chip: IR-PCI-MSI
flags: 0x10
IRQCHIP_SKIP_SET_WAKE
parent:
domain: INTEL-IR-1
hwirq: 0x160000
chip: INTEL-IR
flags: 0x0
parent:
domain: VECTOR
hwirq: 0x80
chip: APIC
flags: 0x0
Vector: 33
Target: 3
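Each level of such a hierarchy implements struct irq_domain_ops; a child domain's .alloc typically
asks its parent for resources first and then installs its own hwirq and chip. A rough sketch of the
pattern follows (the foo_* names are illustrative, not the actual x86 code):
---
#include <linux/irq.h>
#include <linux/irqdomain.h>

static struct irq_chip foo_child_chip = {
	.name = "FOO-CHILD",
};

static int foo_domain_alloc(struct irq_domain *domain, unsigned int virq,
			    unsigned int nr_irqs, void *arg)
{
	irq_hw_number_t hwirq = 0;	/* a real driver derives this from 'arg' */
	unsigned int i;
	int ret;

	/* Let the parent domain (closer to the CPU) allocate first. */
	ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
	if (ret < 0)
		return ret;

	/* Then install this level's hwirq and irq_chip for each Linux irq. */
	for (i = 0; i < nr_irqs; i++)
		irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq + i,
					      &foo_child_chip, NULL);
	return 0;
}

static const struct irq_domain_ops foo_hier_ops = {
	.alloc = foo_domain_alloc,
	.free  = irq_domain_free_irqs_common,
};
---
The domain itself would be created with irq_domain_create_hierarchy(parent_domain, 0, size,
fwnode, &foo_hier_ops, NULL), which is what links a child domain to its parent in the picture above.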
Before looking into irq affinity, let's first look at the cpumasks.
The following particular system cpumasks and operations manage
possible, present, active and online cpus.
cpu_possible_mask - has bit 'cpu' set iff cpu is populatable
cpu_present_mask  - has bit 'cpu' set iff cpu is populated
cpu_online_mask   - has bit 'cpu' set iff cpu available to scheduler
cpu_active_mask   - has bit 'cpu' set iff cpu available to migration
If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
The cpu_possible_mask is fixed at boot time, as the set of CPU id’s
that it is possible might ever be plugged in at anytime during the
life of that system boot. The cpu_present_mask is dynamic(*),
representing which CPUs are currently plugged in. And
cpu_online_mask is the dynamic subset of cpu_present_mask,
indicating those CPUs available for scheduling.
If HOTPLUG is enabled, then cpu_possible_mask is forced to have
all NR_CPUS bits set, otherwise it is just the set of CPUs that
ACPI reports present at boot.
If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
depending on what ACPI reports as currently plugged in, otherwise
cpu_present_mask is just a copy of cpu_possible_mask.
(*) Well, cpu_present_mask is dynamic in the hotplug case. If not
hotplug, it’s a copy of cpu_possible_mask, hence fixed at boot.
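A quick, illustrative sketch of how kernel code consumes these masks (dump_cpu_masks is a made-up
helper):
---
#include <linux/cpumask.h>
#include <linux/printk.h>

static void dump_cpu_masks(void)
{
	unsigned int cpu;

	pr_info("%u possible, %u present, %u online cpus\n",
		cpumask_weight(cpu_possible_mask),
		cpumask_weight(cpu_present_mask),
		cpumask_weight(cpu_online_mask));

	/* Iterate only over the cpus the scheduler can currently use. */
	for_each_online_cpu(cpu)
		pr_info("cpu%u is online\n", cpu);
}
---
With the possible_cpus=16 setup used later in these notes, cpumask_weight(cpu_possible_mask) would
report 16 while only 8 cpus are present and online.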
Managed here means that the kernel is responsible for managing the cpu affinity
of the irqs.
If the driver provides an affinity descriptor when allocating vectors, those irqs become
managed ones.
irq_remapping_setup_msi_irqs
-> do_setup_msix_irqs
-> irq_alloc_hwirq_affinity
-> __irq_alloc_hwirqs
-> __irq_alloc_descs
-> alloc_descs
---
flags = affinity ? IRQD_AFFINITY_MANAGED : 0;
// If affinity mask is provided, affinity will be managed by kernel.
---
The path that generates the irq affinity masks. The vector space is split into a reserved prefix,
the spread range, and a reserved suffix:

  pre        affinity spreading          post
|- - - | - - - - - - - - - - - - - | - - - -|

resv = pre + post
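On the driver side this is requested through pci_alloc_irq_vectors_affinity(). A minimal sketch
(error handling trimmed; foo_setup_irqs and nr_queues are made up) that reserves one pre vector,
e.g. for an admin queue, and lets the rest be spread:
---
#include <linux/pci.h>
#include <linux/interrupt.h>

static int foo_setup_irqs(struct pci_dev *pdev, unsigned int nr_queues)
{
	/* Vector 0 is reserved and excluded from the affinity spreading. */
	struct irq_affinity affd = {
		.pre_vectors = 1,
	};
	int nvecs;

	nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
					       PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					       &affd);
	if (nvecs < 0)
		return nvecs;

	/* Vectors 1 .. nvecs-1 now carry kernel-managed affinity masks. */
	return nvecs;
}
---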
pci_alloc_irq_vectors_affinity
-> __pci_enable_msix_range
-> irq_calc_affinity_vectors
--
ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
by default, we prefer one-to-one mapping between hw queue and cpu
--
-> __pci_enable_msix
-> msix_setup_entries
-> irq_create_affinity_masks
Take an example here.
4 irq vectors v0,v1,v2,v3.
cpu topology
Node0 Node1
{ [core0, core2], [core1, core3]} { [core0, core2], [core1, core3]}
[] -> siblings
There are two stages to spread the irqs: first across the nodes, then across the cpus within each node.
Before commit 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
every node will get two irq vectors
v0,v1 -> Node0
v2,v3 -> Node1
v0 -> Node0-core0,core2
v1 -> Node0-core1,core3
Before that commit, the irq vectors are spread across the present cpus only, so we may get the
following scenario.
Node0 Node1
{ [core0, core2], [core1, core3]} { [core0, core2], [core1, core3]}
X X
[] -> siblings
X -> not present
At most, 6 irq vectors can be allocated, due to irq_calc_affinity_vectors:
v0,v1,v2,v3,v4,v5
Let's look at how the irq vectors will be spread.
for_each_node_mask(n, nodemsk) {
int ncpus, v, vecs_to_assign, vecs_per_node;
/* Spread the vectors per node */
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
2 nodes, 6 vectors, vecs_per_node is 3
/* Get the cpus on this node which are in the mask */
cpumask_and(nmsk, cpu_present_mask, node_to_present_cpumask[n]);
/* Calculate the number of cpus per vector */
ncpus = cpumask_weight(nmsk);
node0 has 4 cpus
vecs_to_assign = min(vecs_per_node, ncpus);
vecs_to_assign will be 3 here
/* Account for rounding errors */
extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
extra_vecs is 1
for (v = 0; curvec < last_affv && v < vecs_to_assign;
curvec++, v++) {
cpus_per_vec = ncpus / vecs_to_assign;
cpus_per_vec is 1
/* Account for extra vectors to compensate rounding errors */
if (extra_vecs) {
cpus_per_vec++;
--extra_vecs;
}
node0 has 4 cpus, but 3 vectors are assigned to
it, so 1 vector will have 2 cpus assigned
irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
}
if (curvec >= last_affv)
break;
--nodes;
There are 3 vectors left, but node1 has just 2 cpus present.
vecs_to_assign = min(vecs_per_node, ncpus);
Only 2 vectors can be assigned cpus.
The remaining one's mask will get irq_default_affinity.
}
done:
put_online_cpus();
/* Fill out vectors at the end that don't need affinity */
for (; curvec < nvecs; curvec++)
cpumask_copy(masks + curvec, irq_default_affinity);
free_node_to_possible_cpumask(node_to_possible_cpumask);
out:
free_cpumask_var(nmsk);
return masks;
Here is a case where a patch I sent upstream caused a big bug when the block layer maintainer tested it.
This is the patch: https://lkml.org/lkml/2018/3/13/227
What is the issue?
It employs pre_vectors=1 when allocating irq vectors and assigns separate irq vectors to the adminq and ioq1.
So every nvme io queue gets an irq vector equal to its qid. But I missed
blk_mq_pci_map_queues(), which still assumes the irq vector equals qid-1.
To reproduce this issue, add "possible_cpus=16" to the kernel command line and boot on my machine with
8 cores. Then we get the following irq affinity result:
nvme0q1 nvme0q2 nvme0q3 nvme0q4 nvme0q5 nvme0q6 nvme0q7 nvme0q8
0,4 1,5 2,6 3,7 8-9 10-11 12-13 14-15
We can see that the irqs of nvme0q5~nvme0q8 have their affinity set to cpus that are not present.
This is due to commit 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs").
Leiming has submitted a patch series to fix it:
genirq/affinity: irq vector spread among online CPUs as far as possible
Let's return to our issue.
int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev)
{
const struct cpumask *mask;
unsigned int queue, cpu;
/*
 * queue is 0-based here, but the mask for vector 0 (the reserved admin
 * vector) contains all the possible cpus, so this does a wrong mapping
 * between ctxs and hctxs.
 */
for (queue = 0; queue < set->nr_hw_queues; queue++) {
mask = pci_irq_get_affinity(pdev, queue);
if (!mask)
goto fallback;
for_each_cpu(cpu, mask)
set->mq_map[cpu] = queue;
}
return 0;
...
}
We get the following mapping; note how the ctxs end up shifted against the irq affinities:

ctxs          14,15    0,4      1,5      2,6      3,7      8,9      10,11    12,13
hctxs         hctx0    hctx1    hctx2    hctx3    hctx4    hctx5    hctx6    hctx7
irqs          nvme0q1  nvme0q2  nvme0q3  nvme0q4  nvme0q5  nvme0q6  nvme0q7  nvme0q8
cpu affinity  0,4      1,5      2,6      3,7      8-9      10-11    12-13    14-15
The issue is:
if we submit IO on cpu3 or cpu7, hctx4/nvme0q5 will handle it. After completion, the irq will be sent to cpus 8-9, which are
not present on my machine.
taskset -c 3 dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1
And we get the following log in dmesg:
[ 350.800366] nvme nvme0: I/O 253 QID 5 timeout, completion polled
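One way to keep the ctx->hctx mapping consistent with such a vector layout is to skip the reserved
pre vector when looking up the affinity mask. The sketch below is hypothetical (foo_map_queues and
vec_offset are made up; the actual upstream fix took its own shape) and assumes the same mq_map
layout as the kernel code quoted above:
---
#include <linux/blk-mq.h>
#include <linux/pci.h>
#include <linux/cpumask.h>

/*
 * Map each hw queue to the affinity mask of the irq vector it actually
 * uses, i.e. queue + vec_offset when vector 0 is a reserved admin vector.
 */
static int foo_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev,
			  unsigned int vec_offset)
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		mask = pci_irq_get_affinity(pdev, queue + vec_offset);
		if (!mask)
			return -EINVAL;
		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;
	}
	return 0;
}
---
With vec_offset == 1, hctx0 would pick up ctxs 0,4, which matches nvme0q1's affinity, so the
completion irq is always delivered to a cpu in the submitting ctx's mask.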
The userland daemon (irqbalance) cannot set the affinity of managed irqs.
write_irq_affinity
-> irq_can_set_affinity_usr
-> !irqd_affinity_is_managed //Affinity is auto-managed by the kernel, userland cannot modify it.
-> irq_select_affinity_usr
-> setup_affinity
Build the cpu hierarchy
main
-> build_object_tree
-> build_numa_node_list // parse /sys/devices/system/node/
-> parse_cpu_tree
The cpu hierarchy is built as:
- At the top are numa_nodes
- All CPU packages belong to a single numa_node
- All Cache domains belong to a CPU package
- All CPU cores belong to a cache domain
Objects are built in that order (top down)
Object workload is the aggregate sum of the workload of the objects below it
All the components in the hierarchy are topo_obj structures:
struct topo_obj {
uint64_t load;
uint64_t last_load;
enum obj_type_e obj_type;
int number;
int powersave_mode;
cpumask_t mask;
GList *interrupts; // GList entry of irq_info
struct topo_obj *parent;
GList *children;
GList **obj_type_list;
};
enum obj_type_e {
OBJ_TYPE_CPU,
OBJ_TYPE_CACHE,
OBJ_TYPE_PACKAGE,
OBJ_TYPE_NODE
};
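The aggregation rule above ("object workload is the aggregate sum of the workload of the objects
below it") can be pictured with a simplified standalone sketch; this is not the actual irqbalance
code, just the idea, reusing the GList-based children list:
---
#include <glib.h>
#include <stdint.h>

struct obj {
	uint64_t load;
	GList *children;	/* list of struct obj * */
};

/* Recursively fold the load of every descendant into this object. */
static uint64_t aggregate_load(struct obj *o)
{
	GList *entry;

	for (entry = o->children; entry; entry = g_list_next(entry))
		o->load += aggregate_load(entry->data);

	return o->load;
}
---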
Collect irq info
main
-> build_object_tree
-> rebuild_irq_db
Calculate the positions where the irqs should be migrated
calculate_placement
---
if (g_list_length(rebalance_irq_list) > 0) {
// Place irqs in the best node.
for_each_irq(rebalance_irq_list, place_irq_in_node, NULL);
for_each_object(numa_nodes, place_irq_in_object, NULL);
for_each_object(packages, place_irq_in_object, NULL);
for_each_object(cache_domains, place_irq_in_object, NULL);
}
---
The basic principle to balance irqs is:
asign = place.least_irqs ? place.least_irqs : place.best;
The least_irqs and best come from find_best_object:
---
newload = d->load;
if (newload < best->best_cost) {
best->best = d;
best->best_cost = newload;
best->least_irqs = NULL;
}
if (newload == best->best_cost) {
if (g_list_length(d->interrupts) < g_list_length(best->best->interrupts))
best->least_irqs = d;
}
---
Assign the irq to the object with the least load; if loads are equal, select the one with the
fewest irqs.
The affinity_hint also plays a role during this.
Look at:
find_best_object
---
/*
* If the hint policy is subset, then we only want
* to consider objects that are within the irqs hint, but
* only if that irq in fact has published a hint
*/
if (hint_policy == HINT_POLICY_SUBSET) {
if (!cpus_empty(best->info->affinity_hint)) {
cpus_and(subset, best->info->affinity_hint, d->mask);
if (cpus_empty(subset))
return;
}
}
---
Where is the irq_info->load from ?
static void assign_load_slice(struct irq_info *info, void *data)
{
uint64_t *load_slice = data;
info->load = (info->irq_count - info->last_irq_count) * *load_slice;
/*
* Every IRQ has at least a load of 1
*/
if (!info->load)
info->load++;
}
How to activate the mapping ?
activate_mapping
---
/*
* only activate mappings for irqs that have moved
*/
if (!info->moved)
return;
if ((hint_policy == HINT_POLICY_EXACT) &&
(!cpus_empty(info->affinity_hint))) {
if (cpus_intersects(info->affinity_hint, banned_cpus))
log(TO_ALL, LOG_WARNING,
"irq %d affinity_hint and banned cpus confict\n",
info->irq);
else {
applied_mask = info->affinity_hint;
valid_mask = 1;
}
} else if (info->assigned_obj) {
// This assigned_obj could be a cpu object, but it could also be a
// cache, package or even a node obj. [1]
applied_mask = info->assigned_obj->mask;
if ((hint_policy == HINT_POLICY_SUBSET) &&
(!cpus_empty(info->affinity_hint))) {
cpus_and(applied_mask, applied_mask, info->affinity_hint);
if (!cpus_intersects(applied_mask, unbanned_cpus))
log(TO_ALL, LOG_WARNING,
"irq %d affinity_hint subset empty\n",
info->irq);
else
valid_mask = 1;
} else {
valid_mask = 1;
}
}
/*
* Don't activate anything for which we have an invalid mask
*/
if (!valid_mask || check_affinity(info, applied_mask))
return;
if (!info->assigned_obj)
return;
sprintf(buf, "/proc/irq/%i/smp_affinity", info->irq);
file = fopen(buf, "w");
if (!file)
return;
cpumask_scnprintf(buf, PATH_MAX, applied_mask);
fprintf(file, "%s", buf);
fclose(file);
info->moved = 0; /*migration is done*/
}
[1] Look at calculate_placement:
for_each_irq(rebalance_irq_list, place_irq_in_node, NULL);
for_each_object(numa_nodes, place_irq_in_object, NULL);
for_each_object(packages, place_irq_in_object, NULL);
for_each_object(cache_domains, place_irq_in_object, NULL);
The balance steps are:
1. balance all the irqs between different numa nodes.
2. balance all the irqs in one node between different packages.
3. balance all the irqs in one package between different cache domains.
4. balance all the irqs in one cache domain between different cpus.
Look at the following example.
Suppose an irq has been assigned a package topo_obj, but the cpus in its
irq->affinity_hint have been offlined. Then, due to the following code fragment in
find_best_object,
---
if (hint_policy == HINT_POLICY_SUBSET) {
if (!cpus_empty(best->info->affinity_hint)) {
cpus_and(subset, best->info->affinity_hint, d->mask);
if (cpus_empty(subset))
return;
}
}
---
this irq will not get further assignment.
Its final assigned obj is the package topo_obj and its final smp_affinity is the
mask of the package topo_obj.
balance level
Initially, all the irq_infos are in rebalance_irq_list.
Then they are queued onto different topo_objs based on their balance level.
For example, suppose an irq's balance level is BALANCE_PACKAGE.
Then:
1. select the best node and move the irq from rebalance_irq_list to that node obj's interrupts list.
2. select the best package within that node:
   find_best_object_for_irq is invoked on the best node's obj.
3. find_best_object_for_irq is invoked on the best package's obj;
   because the irq's balance level is BALANCE_PACKAGE, it returns immediately.
4. the irq is preserved on the package obj's interrupts list.
5. even if find_best_object_for_irq is invoked on the underlying objects, this irq
   will not be involved, because it is not on their interrupts lists.
force_rebalance_irq moves an irq back to rebalance_irq_list, so it will be
rebalanced again from top to bottom.
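As a simplified, self-contained sketch of that descent rule (this is not the actual irqbalance
source; the balance-level enum and the level field are assumed here to mirror irqbalance's):
---
/* Assumed to mirror irqbalance's balance levels and object types. */
enum balance_level { BALANCE_NONE, BALANCE_PACKAGE, BALANCE_CACHE, BALANCE_CORE };
enum obj_type_e { OBJ_TYPE_CPU, OBJ_TYPE_CACHE, OBJ_TYPE_PACKAGE, OBJ_TYPE_NODE };

struct irq_info { int irq; enum balance_level level; };
struct topo_obj { enum obj_type_e obj_type; };

/*
 * Return nonzero when the top-down placement should stop at this object:
 * the irq then stays on this object's interrupts list and finally inherits
 * this object's cpumask as its smp_affinity.
 */
static int balanced_deep_enough(const struct irq_info *info,
				const struct topo_obj *obj)
{
	switch (obj->obj_type) {
	case OBJ_TYPE_NODE:    return info->level == BALANCE_NONE;
	case OBJ_TYPE_PACKAGE: return info->level == BALANCE_PACKAGE;
	case OBJ_TYPE_CACHE:   return info->level == BALANCE_CACHE;
	default:               return 0;	/* keep descending to cpus */
	}
}
---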
x86 irq vector assignment
For managed vectors:
assign_managed_vector
---
cpumask_and(vector_searchmask, vector_searchmask, affmsk);
cpu = cpumask_first(vector_searchmask);
...
vector = irq_matrix_alloc_managed(vector_matrix, cpu);
if (vector < 0)
return vector;
apic_update_vector(irqd, vector, cpu);
apic_update_irq_cfg(irqd, vector, cpu);
---
For the others:
assign_vector_locked
-> irq_matrix_alloc
---
// Select the cpu with the most available vectors, i.e. the fewest irqs assigned.
for_each_cpu(cpu, msk) {
cm = per_cpu_ptr(m->maps, cpu);
if (!cm->online || cm->available <= maxavl)
continue;
best_cpu = cpu;
maxavl = cm->available;
}
if (maxavl) {
cm = per_cpu_ptr(m->maps, best_cpu);
bit = matrix_alloc_area(m, cm, 1, false);
if (bit < m->alloc_end) {
cm->allocated++;
cm->available--;
m->total_allocated++;
m->global_available--;
if (reserved)
m->global_reserved--;
*mapped_cpu = best_cpu;
trace_irq_matrix_alloc(bit, best_cpu, m, cm);
return bit;
}
}
---