Atomic Operations
Percpu reference count
Sequence lock
ACCESS_ONCE Series
RCU
On x86, an atomic operation turns into a lock-prefixed assembly instruction, like LOCK XADD.
Being a single instruction, it is non-interruptible. As an added "feature", the lock prefix results in a full memory barrier:
Intel® 64 and IA-32 Architectures Software Developer's Manual: Vol. 3A
8.1.2.2
Locked operations are atomic with respect to all other memory operations and all externally visible events. Only
instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize
data written by one processor and read by another processor.
A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the Java JITs on x86/x64.
So you have a full fence on x86 as an added bonus, whether you like it or not. :)
On PPC, it is different. An LL/SC pair - lwarx & stwcx. - with a subtraction in between can be used to load the memory operand into a register, subtract one, and then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC sequence can be interrupted.
It also does not imply an automatic full fence.
This does not however compromise the atomicity of the counter in any way.
It just means that in the x86 case, you happen to get a fence as well, "for free".
On PPC, one can insert a full fence by emitting a (lw)sync instruction.
All in all, explicit memory barriers are not necessary for the atomic counter to work properly.
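To see the difference in generated code, here is a minimal user-space sketch (not kernel code; it uses the GCC __atomic builtins instead of atomic_t). Compiling it with gcc -O2 -S should show a single lock-prefixed instruction on x86 and an lwarx/stwcx. retry loop on PPC:
---
#include <stdio.h>

/* Shared counter; in real code this would be touched by several threads. */
static long counter = 10;

/* Atomically decrement and report whether we hit zero.
 * __ATOMIC_RELAXED asks for atomicity only, no ordering; on x86 the
 * lock prefix still gives a full fence "for free". */
static int counter_dec_and_test(void)
{
        return __atomic_sub_fetch(&counter, 1, __ATOMIC_RELAXED) == 0;
}

int main(void)
{
        while (!counter_dec_and_test())
                ;
        printf("counter reached zero\n");
        return 0;
}
---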
Reference counting is used by the kernel to know when a data structure is unused and can be freed.
Most of the time, reference counts are represented by an atomic_t variable, perhaps wrapped by a
structure like a kref. If references are added and removed frequently over an object's lifetime,
though, that atomic_t variable can become a performance bottleneck.
Frequent write access to the shared reference count variable can result in heavy cache line
bouncing between CPUs. On the other hand, the atomic operations themselves introduce extra
overhead in a parallel environment.
The percpu reference count, created by Kent Overstreet, was introduced in the 3.11 kernel to
improve the scalability of reference counting in parallel environments. Let's look at how it works next.
Typical usage will involve embedding a percpu_ref structure within the data structure being tracked.
[1] The counter must be initialized with:
int percpu_ref_init(struct percpu_ref *ref, percpu_ref_release *release);
Where release() is the function to be called when the reference count drops to zero.
The call to percpu_ref_init() will initialize the reference count to one.
[2] References are added and removed with:
void percpu_ref_get(struct percpu_ref *ref);
void percpu_ref_put(struct percpu_ref *ref);
These two functions operate on a per-cpu array of reference counters, so they do not cause
cacheline bouncing across the system. But the question is: how does percpu_ref_put() determine
that the reference count has dropped to zero and invoke the release callback?
[3] Drop to zero
There is a basic convention in the kernel: the reference count cannot reach zero as long as the
initial reference is held. So percpu_ref_put() need not check for zero until someone decides to
drop the initial reference. That is done through percpu_ref_kill().
void percpu_ref_kill(struct percpu_ref *ref);
After this call, the reference count degrades to the usual model with a single atomic_t counter.
This atomic_t counter will be decremented and checked by percpu_ref_put().
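A minimal usage sketch, assuming a hypothetical struct my_obj and helper names, and using the two-argument percpu_ref_init() prototype quoted above (later kernels also take flags and a gfp_t argument):
---
#include <linux/percpu-refcount.h>
#include <linux/slab.h>

/* Hypothetical object tracked by an embedded percpu_ref. */
struct my_obj {
        struct percpu_ref ref;
        /* ... payload ... */
};

static void my_obj_release(struct percpu_ref *ref)
{
        struct my_obj *obj = container_of(ref, struct my_obj, ref);

        kfree(obj);
}

static struct my_obj *my_obj_create(void)
{
        struct my_obj *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

        if (!obj)
                return NULL;
        /* Initializes the count to one: the "initial reference". */
        if (percpu_ref_init(&obj->ref, my_obj_release)) {
                kfree(obj);
                return NULL;
        }
        return obj;
}

/* Fast path: per-cpu increment/decrement, no cacheline bouncing. */
static void my_obj_use(struct my_obj *obj)
{
        percpu_ref_get(&obj->ref);
        /* ... use obj ... */
        percpu_ref_put(&obj->ref);
}

/* Teardown: drop the initial reference; the ref degrades to atomic
 * mode and my_obj_release() runs once the last reference is put. */
static void my_obj_destroy(struct my_obj *obj)
{
        percpu_ref_kill(&obj->ref);
}
---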
Kill
percpu_ref_kill_and_confirm()
-> __percpu_ref_switch_mode()
---
// A global lock here to protect the following operations.
spin_lock_irqsave(&percpu_ref_switch_lock, flags);
ref->percpu_count_ptr |= __PERCPU_REF_DEAD;
__percpu_ref_switch_mode(ref, confirm_kill);
percpu_ref_put(ref); //put the initial ref
spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
---
__percpu_ref_switch_mode
-> __percpu_ref_switch_to_atomic
---
/* switching from percpu to atomic */
ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC;
/*
* Non-NULL ->confirm_switch is used to indicate that switching is
* in progress. Use noop one if unspecified.
*/
ref->confirm_switch = confirm_switch ?: percpu_ref_noop_confirm_switch;
percpu_ref_get(ref); /* put after confirmation */
call_rcu_sched(&ref->rcu, percpu_ref_switch_to_atomic_rcu);
---
All get/put operations run inside rcu_sched read-side critical sections, so call_rcu_sched()
guarantees that, by the time percpu_ref_switch_to_atomic_rcu() is invoked, every CPU sees
__PERCPU_REF_ATOMIC_DEAD, and subsequent get/put operations only touch the atomic counter.
percpu_ref_switch_to_atomic_rcu
---
for_each_possible_cpu(cpu)
        count += *per_cpu_ptr(percpu_count, cpu);

// PERCPU_COUNT_BIAS is (1LU << (BITS_PER_LONG - 1)).
// The bias ensures the atomic counter cannot reach zero before this point.
// In addition, an extra reference was taken in __percpu_ref_switch_to_atomic()
// before the call_rcu_sched(); it is put in percpu_ref_call_confirm_rcu().
atomic_long_add((long)count - PERCPU_COUNT_BIAS, &ref->count);
percpu_ref_call_confirm_rcu(rcu);
---
Counting
Think about the following scenarios:
A task gets a ref on one cpu, then the task is migrated to another cpu, where the
held ref is put.
A request is allocated on one cpu and holds a ref on q->q_usage_counter; the completion irq
associated with the request is raised on another cpu and completes the request there, so the
ref held by the request is put on a different cpu.
In the above cases, the get/put operations are balanced across the whole system, but
not on any single cpu.
A more complicated scenario:
---
percpu [active]
cpu0 cpu1 cpu2 cpu3
1 2 1 1
atomic
1 + PERCPU_COUNT_BIAS
---
percpu_ref_switch_to_atomic
---
percpu
cpu0 cpu1 cpu2 cpu3
1 2 1 1
atomic [active]
6
---
percpu_ref_put_many 1 on cpu3
---
percpu
cpu0 cpu1 cpu2 cpu3
1 2 1 1
atomic [active]
5
---
percpu_ref_switch_to_percpu
---
percpu [active]
cpu0 cpu1 cpu2 cpu3
0 0 0 0
atomic
5 + PERCPU_COUNT_BIAS
---
Is this OK?
Certainly!
The zero check only happens in atomic mode, and the counts left in the atomic
counter are still accounted for when the ref switches to atomic mode again.
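As a simplified paraphrase of the put path (based on percpu_ref_put_many() in include/linux/percpu-refcount.h; not the verbatim kernel code, and details vary across versions), the zero check indeed exists only on the atomic branch:
---
static inline void percpu_ref_put_sketch(struct percpu_ref *ref)
{
        unsigned long __percpu *percpu_count;

        rcu_read_lock_sched();

        if (__ref_is_percpu(ref, &percpu_count))
                /* percpu mode: no zero check, the initial ref is still held */
                this_cpu_dec(*percpu_count);
        else if (unlikely(atomic_long_dec_and_test(&ref->count)))
                /* atomic mode: the zero check happens only here */
                ref->release(ref);

        rcu_read_unlock_sched();
}
---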
sched rcu
Why is sched RCU used here instead of normal RCU?
commit a4244454df1296e90cc961c1b636b1176ef0d9a0
percpu-refcount: use RCU-sched insted of normal RCU
rcu_read_[un]lock() for the preemptible RCU implementation -
CONFIG_TREE_PREEMPT_RCU, chosen when CONFIG_PREEMPT - are slightly
more expensive than preempt_disable/enable().
In a contrived microbench which repeats the followings,
- percpu_ref_get()
- copy 32 bytes of data into percpu buffer
- percpu_put_get()
- copy 32 bytes of data into percpu buffer
rcu_read_[un]lock() used in percpu_ref_get/put() makes it go slower by
about 15% when compared to using sched-RCU.
As the RCU critical sections are extremely short, using sched-RCU
shouldn't have any latency implications. Convert to RCU-sched.
The sequence lock is used in the Linux kernel for read-mostly data that must be seen in a consistent state by readers. However, unlike a reader-writer lock, the readers do not exclude the writer; instead, a reader _must_ retry the operation if it detects activity from a concurrent writer.
In fact, in the Linux kernel, the sequence lock is just a combination of a sequence count and a spinlock, the latter being used to exclude concurrent writers. So the sequence lock is nearly equivalent to the sequence count.
static inline void write_seqlock(seqlock_t *sl)
{
        spin_lock(&sl->lock);
        write_seqcount_begin(&sl->seqcount);
}

static inline void write_sequnlock(seqlock_t *sl)
{
        write_seqcount_end(&sl->seqcount);
        spin_unlock(&sl->lock);
}
Next, let's focus on the implementation of the sequence count.
The basic idea of the sequence count is:
The writer increments the sequence count both before and after its update. So if the sequence count is an even value, no write is in progress; if it is an odd value, a writer is updating the data. A reader snapshots the value of the sequence count: if it gets an odd value, a write is in progress; if two snapshots differ, a concurrent update completed in between, so the reader may have seen inconsistent data and must discard it and retry.
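A minimal sketch of the idea, assuming a hypothetical 64-bit value foo_stamp with a single, serialized writer:
---
#include <linux/seqlock.h>

static seqcount_t foo_seq = SEQCNT_ZERO(foo_seq);
static u64 foo_stamp;

/* Writer side: must be serialized against other writers
 * (e.g. by a spinlock, as write_seqlock() does). */
static void foo_update(u64 new_stamp)
{
        write_seqcount_begin(&foo_seq);         /* sequence becomes odd */
        foo_stamp = new_stamp;
        write_seqcount_end(&foo_seq);           /* sequence becomes even again */
}

/* Reader side: retry until a consistent snapshot is observed. */
static u64 foo_read(void)
{
        unsigned int seq;
        u64 val;

        do {
                seq = read_seqcount_begin(&foo_seq);
                val = foo_stamp;
        } while (read_seqcount_retry(&foo_seq, seq));

        return val;
}
---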
read_seqcount_begin()
-> raw_read_seqcount_begin()
-> __read_seqcount_begin()
>>>>
repeat:
        ret = READ_ONCE(s->sequence);
        if (unlikely(ret & 1)) {
                cpu_relax();
                goto repeat;
        } // spin until any in-progress update is completed
        return ret;
>>>>
  -> smp_rmb()
// smp_rmb() serves two purposes here:
// 1. ensure all reads in the critical section happen after __read_seqcount_begin(),
//    sharpening the boundary of the read-side critical section.
// 2. ensure those reads observe the updates made by the writer on the other cpu.
read_seqcount_begin() is guaranteed to return an even value, which ensures that any concurrent update has completed by the time read_seqcount_begin() returns.
write_seqcount_begin()
updating ...
write_seqcount_end()
                              read_seqcount_begin()
                              reading
                              read_seqcount_retry()
Most important, this could avoid the scenario below:
context A                     context B
write_seqcount_begin();
updating ...
                              read_seqcount_begin()
                              reading
                              read_seqcount_retry()
write_seqcount_end();
But this could introduce another issue; look at the following scenario.
context A                     context B
write_seqcount_begin();
updating ...
    ---preempt--->
                              __read_seqcount_begin()
                              while (seq & 1)
                                      cpu_relax();
write_seqcount_end();
If context B is in irq context, a livelock comes up. If context B is a normal task, it has to use up its time slice (its ideal runtime under CFS) before the scheduler gets back to context A to restore the sequence count to an even value. But if context B is a FIFO task, it livelocks again.
To fix this, irqs/preemption must be disabled around the write side; a hedged sketch follows.
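A minimal sketch of the rule, assuming a hypothetical foo_value protected by a full seqlock: the writer disables irqs across the odd->even window so that a reader in irq context (or a higher-priority task) on the same cpu can never spin forever on an odd sequence.
---
#include <linux/seqlock.h>

static DEFINE_SEQLOCK(foo_lock);
static u64 foo_value;

static void foo_write(u64 v)
{
        unsigned long flags;

        /* Disable irqs for the whole odd->even window. */
        write_seqlock_irqsave(&foo_lock, flags);
        foo_value = v;
        write_sequnlock_irqrestore(&foo_lock, flags);
}

static u64 foo_read(void)
{
        unsigned int seq;
        u64 val;

        do {
                seq = read_seqbegin(&foo_lock);
                val = foo_value;
        } while (read_seqretry(&foo_lock, seq));

        return val;
}
---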
A classic usage example in the Linux kernel is timekeeping.
void update_wall_time(void)
{
>>>>
        raw_spin_lock_irqsave(&timekeeper_lock, flags);
>>>>
        clock_set |= accumulate_nsecs_to_secs(tk);

        write_seqcount_begin(&tk_core.seq);
        /*
         * Update the real timekeeper.
         *
         * We could avoid this memcpy by switching pointers, but that
         * requires changes to all other timekeeper usage sites as
         * well, i.e. move the timekeeper pointer getter into the
         * spinlocked/seqcount protected sections. And we trade this
         * memcpy under the tk_core.seq against one before we start
         * updating.
         */
        timekeeping_update(tk, clock_set);
        memcpy(real_tk, tk, sizeof(*tk));
        /* The memcpy must come last. Do not put anything here! */
        write_seqcount_end(&tk_core.seq);
out:
        raw_spin_unlock_irqrestore(&timekeeper_lock, flags);

        if (clock_set)
                /* Have to call _delayed version, since in irq context*/
                clock_was_set_delayed();
}
ktime_t ktime_get_raw(void)
{
        struct timekeeper *tk = &tk_core.timekeeper;
        unsigned int seq;
        ktime_t base;
        u64 nsecs;

        do {
                seq = read_seqcount_begin(&tk_core.seq);
                base = tk->tkr_raw.base;
                nsecs = timekeeping_get_ns(&tk->tkr_raw);
        } while (read_seqcount_retry(&tk_core.seq, seq));

        return ktime_add_ns(base, nsecs);
}
There are a lot of uses of the ACCESS_ONCE() macro in the kernel; what are they for?
https://lwn.net/Articles/508991/
Its functionality is actually well described by its name: its purpose is to ensure that the address passed as its parameter is accessed exactly once in the final assembly code generated by the compiler. Could the address really be accessed twice, or not at all? Yes, it could; the optimization passes of the C compiler can do either.
If not given reasons to the contrary, the compiler assumes that there is only one thread of execution in the address space of the program it is compiling. Concurrency is not built into the C language itself, so mechanisms for dealing with concurrent access must be built on top of the language; ACCESS_ONCE() is one such mechanism.
Consider the following example:
#include <stdio.h>
#include <stdlib.h>

int foo(int x)
{
        x = x*x;
        x = x>>6;
        printf("result is %d\n", x);
        return x;
}

int main()
{
        int x;
        int conti;
        int *p_value = (int *)malloc(sizeof(int));

        for (;;) {
                x = *p_value;
                conti = foo(x);
                if (conti)
                        break;
        }
        return 0;
}
gcc -O2 test.c -S -o test.asm
main:
.LFB39:
.cfi_startproc
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movl $4, %edi
call malloc
movl (%rax), %ebx
imull %ebx, %ebx
sarl $6, %ebx
.p2align 4,,10
.p2align 3
.L4:
xorl %eax, %eax
movl %ebx, %edx
movl $.LC0, %esi
movl $1, %edi
call __printf_chk
testl %ebx, %ebx
je .L4
xorl %eax, %eax
popq %rbx
.cfi_def_cfa_offset 8
ret
The actual scenario in the kernel may be more complicated than the above, but the result is similar: the compiler hoists the load out of the loop. To prevent this optimization, we can introduce the ACCESS_ONCE() macro.
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
Then
#include <stdio.h>
#include <stdlib.h>

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

int foo(int x)
{
        x = x*x;
        x = x>>6;
        printf("result is %d\n", x);
        return x;
}

int main()
{
        int x;
        int conti;
        int *p_value = (int *)malloc(sizeof(int));

        for (;;) {
                x = ACCESS_ONCE(*p_value);
                conti = foo(x);
                if (conti)
                        break;
        }
        return 0;
}
gcc -O2 test.c -S -o test.asm
main:
.LFB39:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
pushq %rbx
.cfi_def_cfa_offset 24
.cfi_offset 3, -24
movl $4, %edi
subq $8, %rsp
.cfi_def_cfa_offset 32
call malloc
movq %rax, %rbp
.p2align 4,,10
.p2align 3
.L4:
movl 0(%rbp), %ebx
xorl %eax, %eax
movl $.LC0, %esi
movl $1, %edi
imull %ebx, %ebx
sarl $6, %ebx
movl %ebx, %edx
call __printf_chk
testl %ebx, %ebx
je .L4
addq $8, %rsp
.cfi_def_cfa_offset 24
xorl %eax, %eax
As it happens, an optimized-out access is not the only peril that this code could encounter. Some processor architectures (x86, for example) are not richly endowed with registers; on such systems, the compiler must make careful choices regarding which values to keep in registers if it is to generate the highest-performing code. Specific values may be pushed out of the register set, then pulled back in later; the address then ends up being accessed twice, which may introduce inconsistency into the kernel.
There are two other macros that achieve the same functionality, READ_ONCE() and WRITE_ONCE(); their implementation is similar to ACCESS_ONCE().
Another example is:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4444.html
while (tmp = atomic_load_explicit(a, memory_order_relaxed))
do_something_with(tmp);
The compiler would be permitted to unroll it as follows:
while (tmp = atomic_load_explicit(a, memory_order_relaxed)) {
        do_something_with(tmp);
        do_something_with(tmp);
        do_something_with(tmp);
        do_something_with(tmp);
}
This would be unacceptable for real-time applications, which need the value to be reloaded from a on each iteration, unrolled or not. The volatile qualifier prevents this transformation. For example, consider the following loop.
while (tmp = READ_ONCE(a))
do_something_with(tmp);
This loop could still be unrolled, but the read would also need to be unrolled, for example, like this:
for (;;) {
        if (!(tmp = READ_ONCE(a)))
                break;
        do_something_with(tmp);
        if (!(tmp = READ_ONCE(a)))
                break;
        do_something_with(tmp);
        if (!(tmp = READ_ONCE(a)))
                break;
        do_something_with(tmp);
        if (!(tmp = READ_ONCE(a)))
                break;
        do_something_with(tmp);
}
Let's look at the implementation of READ_ONCE:
#define __READ_ONCE_SIZE                                                \
({                                                                      \
        switch (size) {                                                 \
        case 1: *(__u8 *)res = *(volatile __u8 *)p; break;              \
        case 2: *(__u16 *)res = *(volatile __u16 *)p; break;            \
        case 4: *(__u32 *)res = *(volatile __u32 *)p; break;            \
        case 8: *(__u64 *)res = *(volatile __u64 *)p; break;            \
        default:                                                        \
                barrier();                                              \
                __builtin_memcpy((void *)res, (const void *)p, size);   \
                barrier();                                              \
        }                                                               \
})
Why do we need these __u8/__u16/__u32/__u64 cases here?
At one time, gcc guaranteed that properly aligned accesses to machine-word-sized variables would be atomic. Although gcc _no longer_ documents this guarantee, there is still code in the Linux kernel that relies on it. These accesses could be modeled as non-volatile memory_order_relaxed accesses.
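As a hedged usage sketch (hypothetical flag and function names), READ_ONCE() and WRITE_ONCE() are typically paired like this; they keep the compiler from re-reading, fusing, or eliminating the access (and avoid tearing for word-sized types), but they do not add any ordering:
---
#include <linux/compiler.h>
#include <asm/processor.h>

static int stop_flag;

/* e.g. called from irq context or another task */
static void request_stop(void)
{
        WRITE_ONCE(stop_flag, 1);
}

/* e.g. a polling loop; without READ_ONCE() the compiler could hoist
 * the load out of the loop, as in the earlier ACCESS_ONCE example. */
static void poll_until_stopped(void)
{
        while (!READ_ONCE(stop_flag))
                cpu_relax();
}
---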
                        rcu_state
                            |
        +------------+------+------+------------+
        |            |             |            |
    rcu_node     rcu_node     rcu_node     rcu_node   <--+
        |            |             |            |        | mynode
    rcu_data     rcu_data     rcu_data     rcu_data   ---+

rcu_state - one per flavor: rcu (maps to sched or preemptible), rcu_sched, rcu_bh
rcu_node  - just the RCU tree organization
rcu_data  - per-cpu, represents a specific CPU;
            obtained through raw_cpu_ptr(rsp->rda)
rcu_node is just for the RCU tree organization. We can get the topology from the
following members.
rcu_node->qsmask, qsmaskinit, grpmask
  qsmask      marks which CPUs or child nodes under this node have not yet passed
              a quiescent state; each bit represents one CPU or child node.
  qsmaskinit  used to initialize qsmask when a new grace period starts.
  grpmask     this node's bit position in its parent's qsmaskinit.
rcu_data->grpmask
  grpmask     the bit position in the qsmaskinit of the node this CPU belongs to.
A quiescent state is reported through:
    rcu_report_qs_rdp
      -> rcu_report_qs_rnp      (leaf rcu_node)
           ...                  (walks up the tree, level by level)
      -> rcu_report_qs_rnp      (root rcu_node)
           -> rcu_report_qs_rsp
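A heavily simplified sketch of this propagation, assuming nothing more than the qsmask/grpmask/parent fields described above (struct rcu_node itself is private to kernel/rcu/tree.h, so a stand-in struct is used here):
---
struct rnp_sketch {
        unsigned long qsmask;           /* CPUs/children still owing a QS */
        unsigned long grpmask;          /* our bit in the parent's qsmask */
        struct rnp_sketch *parent;      /* NULL for the root node */
};

static void report_qs_sketch(struct rnp_sketch *rnp, unsigned long grpmask)
{
        for (;;) {
                rnp->qsmask &= ~grpmask;        /* this CPU/child passed a QS */
                if (rnp->qsmask)                /* siblings still pending */
                        return;
                if (!rnp->parent)               /* root emptied: all CPUs done, */
                        break;                  /* the GP can end (rcu_report_qs_rsp) */
                grpmask = rnp->grpmask;         /* our bit in the parent */
                rnp = rnp->parent;
        }
        /* ... wake the grace-period kthread to finish the grace period ... */
}
---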
There are 3 flavors of RCU in the kernel.
We just talk about the sched and bh flavors in this section.
When all CPUs have passed a quiescent state, we say the system has passed a grace
period.
rcu_gp_in_progress() tells us whether a gp is in progress.
static int rcu_gp_in_progress(struct rcu_state *rsp)
{
        return READ_ONCE(rsp->completed) != READ_ONCE(rsp->gpnum);
}
rcu_start_gp() will start a new grace period.
rcu_start_gp()
  -> rcu_start_gp_advanced()
     -> WRITE_ONCE(rsp->gp_flags, RCU_GP_FLAG_INIT);
rcu_gp_kthread will then be woken up to handle the grace-period start.
This is done in rcu_gp_init():
- record_gp_stall_check_time(rsp): calculate a new rsp->jiffies_stall, which
  will be used in RCU stall detection.
- smp_store_release(&rsp->gpnum, rsp->gpnum + 1)
- iterate over all the rnp and rdp:
  - rnp->qsmask = rnp->qsmaskinit
  - WRITE_ONCE(rnp->gpnum, rsp->gpnum)
  - __note_gp_changes()
__note_gp_changes()
----
if (rdp->gpnum != rnp->gpnum || unlikely(READ_ONCE(rdp->gpwrap))) {
        rdp->gpnum = rnp->gpnum;
        need_gp = !!(rnp->qsmask & rdp->grpmask);
        rdp->cpu_no_qs.b.norm = need_gp;
        rdp->core_needs_qs = need_gp;
}
----
How does a cpu report that it has passed a qs?
In RCU_SOFTIRQ context:
rcu_process_callbacks()  // iterates over all the rsp
  -> __rcu_process_callbacks()
     -> rcu_check_quiescent_state()
rcu_check_quiescent_state()
----
if (!rdp->core_needs_qs)
        return;
if (rdp->cpu_no_qs.b.norm)
        return;
rcu_report_qs_rdp(rdp->cpu, rsp, rdp);
----
How is it checked whether a cpu has passed a qs?
bh qs
In the cpu tick irq context:
update_process_times()
  -> rcu_check_callbacks()
     -> rcu_bh_qs()  // in user mode, or kernel mode with !in_softirq()
Note: local_bh_disable() increases the SOFTIRQ part of preempt_count, so
in_softirq() returns true from then on.
__do_softirq()
  -> rcu_bh_qs()  // after the loop that invokes the softirq actions
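A hedged usage sketch of the bh flavor (hypothetical foo names; synchronize_rcu_bh() exists in the older kernels discussed here, before the RCU flavors were consolidated): readers run with softirqs disabled, and the paths above report the bh quiescent state once softirq processing is done.
---
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
        int data;
};

static struct foo __rcu *foo_ptr;

static int foo_get(void)
{
        struct foo *f;
        int val = -1;

        rcu_read_lock_bh();                     /* disables softirqs */
        f = rcu_dereference_bh(foo_ptr);
        if (f)
                val = f->data;
        rcu_read_unlock_bh();

        return val;
}

static void foo_replace(struct foo *newf)
{
        struct foo *old;

        old = rcu_dereference_protected(foo_ptr, 1);
        rcu_assign_pointer(foo_ptr, newf);
        synchronize_rcu_bh();                   /* wait for all bh readers */
        kfree(old);
}
---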
scsi_host_queue_ready                               scsi_eh_scmd_add
---                                                 ---
if (scsi_host_in_recovery(shost))
        return 0;

busy = atomic_inc_return(&shost->host_busy) - 1;
// We add host_busy here atomically, but this is
// not the final result. We could decrement
// host_busy again below for multiple reasons.
                                                    scsi_host_set_state(shost, SHOST_RECOVERY)
                                                    scsi_eh_wakeup
                                                    ---
                                                    if (atomic_read(&shost->host_busy) ==
                                                                    shost->host_failed) {
                                                            wake_up_process(shost->ehandler);
                                                    ---
                                                    // The host_busy read here is not the final
                                                    // result: each operation on host_busy is
                                                    // atomic, but scsi_host_queue_ready as a
                                                    // whole is not.
if (atomic_read(&shost->host_blocked) > 0) {
        if (busy)
                goto starved;
        if (atomic_dec_return(&shost->host_blocked) > 0)
                goto out_dec;
}
if (shost->can_queue > 0 && busy >= shost->can_queue)
        goto starved;
if (shost->host_self_blocked)
        goto starved;
...
return 1;

starved:
        spin_lock_irq(shost->host_lock);
        if (list_empty(&sdev->starved_entry))
                list_add_tail(&sdev->starved_entry, &shost->starved_list);
        spin_unlock_irq(shost->host_lock);
out_dec:
        atomic_dec(&shost->host_busy);
        return 0;
---
The worst result could be:
scsi_host_queue_ready() returns 0 because busy >= shost->can_queue, and
scsi_eh_wakeup() does not wake up the EH kthread because of the race. Then the
requests owned by the EH are never handled, shost->host_busy never drops back
below shost->can_queue, and an IO hang finally comes up.
The fix could be:
scsi_host_queue_ready                               scsi_eh_scmd_add
---                                                 ---
rcu_read_lock();                                    scsi_host_set_state(shost, SHOST_RECOVERY)
if (scsi_host_in_recovery(shost)) {
        rcu_read_unlock();
        return 0;
}
host_busy operations;
rcu_read_unlock();
                                                    synchronize_rcu();
                                                    shost->host_failed++;
                                                    scsi_eh_wakeup(shost);
---                                                 ---
Then scsi_eh_wakeup() will not race with scsi_host_queue_ready() any more, thanks to the
SHOST_RECOVERY check under RCU plus the synchronize_rcu().
scsi_eh_scmd_add() is always invoked in process context, so I think this should be
OK.
The fix in upstream is as follows:
scsi_host_queue_ready
  -> scsi_dec_host_busy
---
rcu_read_lock();
atomic_dec(&shost->host_busy);
if (unlikely(scsi_host_in_recovery(shost))) {
        spin_lock_irqsave(shost->host_lock, flags);
        if (shost->host_failed || shost->host_eh_scheduled)
                scsi_eh_wakeup(shost);
        spin_unlock_irqrestore(shost->host_lock, flags);
}
rcu_read_unlock();
---
scsi_eh_scmd_add
-> scsi_host_set_state(shost, SHOST_RECOVERY)
-> call_rcu(&scmd->rcu, scsi_eh_inc_host_failed);
scsi_eh_inc_host_failed
---
spin_lock_irqsave(shost->host_lock, flags);
shost->host_failed++;
scsi_eh_wakeup(shost);
spin_unlock_irqrestore(shost->host_lock, flags);
---
Weird solution.