Synchronization Primitives

Atomic Operations

Percpu reference count
Sequence lock
ACCESS_ONCE Series
RCU

Atomic Operations


Atomic operations and memory barriers

On x86, an atomic operation turns into a LOCK-prefixed assembly instruction, like LOCK XADD.
Being a single instruction, it is non-interruptible. As an added "feature", the LOCK prefix results in a full memory barrier:

Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A
8.1.2.2

Locked operations are atomic with respect to all other memory operations and all externally visible events. Only
instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize
data written by one processor and read by another processor

A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the JAVA JIT on x86/x64.
So you have a full fence on x86 as an added bonus, whether you like it or not. :)

On PPC, it is different. An LL/SC pair - lwarx & stwcx - with a subtraction inside can be used to load the memory operand into a register, subtract one, then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC can be interrupted.
Nor does it imply an automatic full fence.
This does not, however, compromise the atomicity of the counter in any way.

It just means that in the x86 case, you happen to get a fence as well, "for free".

On PPC, one can insert a full fence by emitting a (lw)sync instruction.

All in all, explicit memory barriers are not necessary for the atomic counter to work properly.
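
To make the LL/SC loop concrete, here is a sketch of an atomic decrement on PPC, modeled on the kernel's powerpc atomic code (simplified and illustrative, not the exact upstream source):

static inline void atomic_dec(atomic_t *v)
{
	int t;

	__asm__ __volatile__(
"1:	lwarx	%0,0,%2\n"	/* load-linked: read v->counter, take a reservation */
"	addic	%0,%0,-1\n"	/* decrement in the register */
"	stwcx.	%0,0,%2\n"	/* store-conditional: fails if another store hit the location */
"	bne-	1b"		/* reservation lost - retry the whole loop */
	: "=&r" (t), "+m" (v->counter)
	: "r" (&v->counter)
	: "cc", "xer");
}

Note that there is no sync/lwsync anywhere in the loop: the counter stays atomic, but no full fence is implied.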

Percpu reference count


Reference counting is used by the kernel to know when a data structure is unused and can be freed.
Most of the time, reference counts are represented by an atomic_t variable, perhaps wrapped by a
structure like a kref. If references are added and removed frequently over an object's lifetime,
though, that atomic_t variable can become a performance bottleneck.

The frequent write access to a shared reference count variable can result in heavy cache-line
bouncing between CPUs. On the other hand, the atomic operations themselves introduce extra
overhead in a parallel environment.

Usage

The percpu reference count, created by Kent Overstreet, was merged in the 3.11 kernel to improve the
scalability of reference counting in parallel environments. Let's look at how it works next.

Typical usage will involve embedding a percpu_ref structure within the data structure being tracked.
[1] The counter must be initialized with:
    int percpu_ref_init(struct percpu_ref *ref, percpu_ref_release *release);
    where release() is the function to be called when the reference count drops to zero.
    The call to percpu_ref_init() initializes the reference count to one.

[2] References are added and removed with:

    void percpu_ref_get(struct percpu_ref *ref);
    void percpu_ref_put(struct percpu_ref *ref);

    These two functions operate on a per-cpu array of reference counters, so they do not cause
    cacheline bouncing across the system. But how does percpu_ref_put() detect that the
    reference count has dropped to zero and invoke the release callback?

[3] Drop to zero
    There is a common convention in the kernel: the reference count cannot reach zero as long as
    the initial reference is held. So percpu_ref_put() need not check for zero until someone decides
    to drop the initial reference, which is done through percpu_ref_kill().

    void percpu_ref_kill(struct percpu_ref *ref);

    After this call, the reference count degrades to the usual model with a single atomic_t counter.
    This atomic_t counter will be decremented and checked by percpu_ref_put().
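
Putting the pieces together, a typical lifecycle looks like the sketch below. The object, its fields and the my_object_* helpers are hypothetical; the two-argument percpu_ref_init() matches the interface described above.

struct my_object {
	struct percpu_ref ref;
	/* ... payload ... */
};

static void my_object_release(struct percpu_ref *ref)
{
	struct my_object *obj = container_of(ref, struct my_object, ref);

	kfree(obj);		/* the last reference is gone, safe to free */
}

static struct my_object *my_object_create(void)
{
	struct my_object *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

	if (!obj)
		return NULL;
	/* the count starts at one: the "initial" reference */
	if (percpu_ref_init(&obj->ref, my_object_release)) {
		kfree(obj);
		return NULL;
	}
	return obj;
}

static void my_object_use(struct my_object *obj)
{
	percpu_ref_get(&obj->ref);	/* per-cpu increment, no shared cacheline */
	/* ... use obj ... */
	percpu_ref_put(&obj->ref);	/* per-cpu decrement, no zero check yet */
}

static void my_object_destroy(struct my_object *obj)
{
	/* drop the initial reference; my_object_release() runs once every
	 * outstanding reference has been put */
	percpu_ref_kill(&obj->ref);
}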

Implementation

Kill

percpu_ref_kill_and_confirm()
  -> __percpu_ref_switch_mode()
---

    // A global lock here to protect the following operations.

    spin_lock_irqsave(&percpu_ref_switch_lock, flags);

    ref->percpu_count_ptr |= __PERCPU_REF_DEAD;
    __percpu_ref_switch_mode(ref, confirm_kill);
    percpu_ref_put(ref); //put the initial ref

    spin_unlock_irqrestore(&percpu_ref_switch_lock, flags)
---

__percpu_ref_switch_mode
  -> __percpu_ref_switch_to_atomic
---
    /* switching from percpu to atomic */
    ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC;

    /*
     * Non-NULL ->confirm_switch is used to indicate that switching is
     * in progress.  Use noop one if unspecified.
     */
    ref->confirm_switch = confirm_switch ?: percpu_ref_noop_confirm_switch;

    percpu_ref_get(ref);    /* put after confirmation */
    call_rcu_sched(&ref->rcu, percpu_ref_switch_to_atomic_rcu);
---
All get/put operations run under the rcu_sched read lock, so call_rcu_sched() guarantees that
by the time percpu_ref_switch_to_atomic_rcu() is invoked, every CPU sees __PERCPU_REF_ATOMIC_DEAD
and subsequent get/put operations act only on the atomic counter.

percpu_ref_switch_to_atomic_rcu
---
    for_each_possible_cpu(cpu)
        count += *per_cpu_ptr(percpu_count, cpu);

    // PERCPU_COUNT_BIAS is (1LU << (BITS_PER_LONG - 1)).
    // It ensures the atomic counter cannot reach zero before this point.
    // In addition, an extra reference was taken in __percpu_ref_switch_to_atomic()
    // before the call_rcu_sched(), and is put after the confirmation.

    atomic_long_add((long)count - PERCPU_COUNT_BIAS, &ref->count);

    percpu_ref_call_confirm_rcu(rcu);
---
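
For reference, the get/put fast path is roughly the sketch below, simplified from the kernel's percpu_ref_get_many()/percpu_ref_put_many() (debug checks omitted):
---
static inline void percpu_ref_get_many(struct percpu_ref *ref, unsigned long nr)
{
	unsigned long __percpu *percpu_count;

	rcu_read_lock_sched();

	if (__ref_is_percpu(ref, &percpu_count))
		this_cpu_add(*percpu_count, nr);	/* percpu mode: no shared cacheline */
	else
		atomic_long_add(nr, &ref->count);	/* atomic (or dead) mode */

	rcu_read_unlock_sched();
}

static inline void percpu_ref_put_many(struct percpu_ref *ref, unsigned long nr)
{
	unsigned long __percpu *percpu_count;

	rcu_read_lock_sched();

	if (__ref_is_percpu(ref, &percpu_count))
		this_cpu_sub(*percpu_count, nr);	/* never checks for zero here */
	else if (unlikely(atomic_long_sub_and_test(nr, &ref->count)))
		ref->release(ref);			/* zero is only ever observed in atomic mode */

	rcu_read_unlock_sched();
}
---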



Counting
Think about the following scenarios:
A task gets a ref on one cpu, then the task is migrated to another cpu and the held ref is put there.

A request is allocated on one cpu and takes a ref on q->q_usage_counter; the completion irq
associated with the request is raised on another cpu and completes the request there, so the
ref held by the request is put on a different cpu.

In the above cases, the get/put operations are balanced across the whole system, but not on any single cpu.
A more complicated scenario:

---
percpu [active]
cpu0    cpu1    cpu2    cpu3
1       2       1       1

atomic
1 + PERCPU_COUNT_BIAS
---

percpu_ref_switch_to_atomic

---
percpu
cpu0    cpu1    cpu2    cpu3
1       2       1       1

atomic [active]
6
---

percpu_ref_put_many 1 on cpu3

---
percpu
cpu0    cpu1    cpu2    cpu3
1       2       1       1

atomic [active]
5
---

percpu_ref_switch_to_percpu

---
percpu
cpu0    cpu1    cpu2    cpu3
0       0       0       0

atomic [active]
5 + PERCPU_COUNT_BIAS
---
Is this OK?
Certainly!
The zero check only happens in atomic mode, and the counts left behind in the atomic counter are added in again the next time the ref switches back to atomic mode.
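
The switch back to percpu mode that produces the last table can be sketched as follows (abridged from the upstream __percpu_ref_switch_to_percpu(); the wait for a pending atomic switch and the DEAD handling are omitted):

__percpu_ref_switch_to_percpu
---
    if (!(ref->percpu_count_ptr & __PERCPU_REF_ATOMIC))
        return; /* already in percpu mode */

    /* re-add the bias so the atomic counter cannot reach zero by accident;
     * the "real" counts (5 in the example above) stay in ref->count */
    atomic_long_add(PERCPU_COUNT_BIAS, &ref->count);

    /* restart every per-cpu counter from zero */
    for_each_possible_cpu(cpu)
        *per_cpu_ptr(percpu_count, cpu) = 0;

    /* publish the mode change; the rcu_read_lock_sched() sections in the
     * get/put fast path will start using the per-cpu counters again */
    smp_store_release(&ref->percpu_count_ptr,
                      ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);
---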



sched rcu
Why is sched RCU used here instead of normal RCU?

commit a4244454df1296e90cc961c1b636b1176ef0d9a0
percpu-refcount: use RCU-sched insted of normal RCU
    
rcu_read_[un]lock() for the preemptible RCU implementation -
CONFIG_TREE_PREEMPT_RCU, chosen when CONFIG_PREEMPT - are slightly
more expensive than preempt_disable/enable().

In a contrived microbench which repeats the followings,

 - percpu_ref_get()
 - copy 32 bytes of data into percpu buffer
 - percpu_put_get()
 - copy 32 bytes of data into percpu buffer

rcu_read_[un]lock() used in percpu_ref_get/put() makes it go slower by
about 15% when compared to using sched-RCU.

As the RCU critical sections are extremely short, using sched-RCU
shouldn't have any latency implications.  Convert to RCU-sched.

Sequence lock


The sequence lock is used in the Linux kernel for read-mostly data that must be seen in a consistent state by readers. Unlike a reader-writer lock, the readers don't exclude the writer; instead, a reader _must_ retry its operation if it detects activity from a concurrent writer.
In the Linux kernel, the sequence lock is just the combination of a sequence count and a spinlock, where the spinlock is used to exclude concurrent writers. So a sequence lock is essentially a sequence count plus writer serialization.

static inline void write_seqlock(seqlock_t *sl)
{
	spin_lock(&sl->lock);
	write_seqcount_begin(&sl->seqcount);
}

static inline void write_sequnlock(seqlock_t *sl)
{
	write_seqcount_end(&sl->seqcount);
	spin_unlock(&sl->lock);
}
Next, let's focus on the implementation of the sequence count.
The basic idea of the sequence count is:
The writer increments the sequence count both before and after its update. So if the sequence count is an even value, no writer is in progress; if it is odd, a writer is updating the data. A reader snapshots the sequence count before and after reading: if the first snapshot is odd, a write is in progress, and if the two snapshots differ, a concurrent update has completed in between, so the reader may have seen inconsistent data and must discard it and retry.
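The write side is the simpler half; stripped of lockdep annotations it is essentially the following (a simplified sketch of the raw_write_seqcount_begin()/end() pair):

static inline void write_seqcount_begin(seqcount_t *s)
{
	s->sequence++;		/* count becomes odd: a write is in progress */
	smp_wmb();		/* order the bump before the data updates */
}

static inline void write_seqcount_end(seqcount_t *s)
{
	smp_wmb();		/* order the data updates before the bump */
	s->sequence++;		/* count becomes even again */
}

The read side, walked through next, is where the retry logic lives.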
read_seqcount_begin()
    -> raw_read_seqcount_begin()
        -> __read_seqcount_begin()
        >>>>
repeat:
    ret = READ_ONCE(s->sequence);
    if (unlikely(ret & 1)) {
        cpu_relax();
        goto repeat;
    } // ensure the concurrent update is completed.
    return ret;
        >>>>
        -> smp_rmb() 
    //smp_rmb() has two purposes:
    //1. ensure all read operations happen after __read_seqcount_begin(), sharpening
    //the boundary of the read-side critical section.
    //2. ensure all the read operations see the updates from the other cpu.
read_seqcount_begin() is guaranteed to return an even value, which ensures that any concurrent update has completed by the time read_seqcount_begin() returns.
write_seqcount_begin()

updating

write_seqcount_end()        read_seqcount_begin()
                            reading  
                            read_seqcount_retry()        
Most importantly, this avoids the scenario below:
          context A                  context B                         
	write_seqcount_begin();
	updating ...
	                          
		                     read_seqcount_begin()
			             reading
				     read_seqcount_retry()
	write_seqcount_end();

But this can introduce another issue; look at the following scenario.
          context A                  context B                         
	write_seqcount_begin();
	updating ...
	                  ---preempt--->           
		                     __read_seqcount_begin()
			                 while (seq & 1)
				             cpu_relax();
	write_seqcount_end();
If context B runs in irq context, a livelock comes up. If context B is a normal task, it has to use up its time slice (ideal runtime in CFS) before context A can run again and bring the sequence count back to an even value. If context B is a FIFO task, we get a livelock again.
To avoid this, irq/preempt disabling is needed on the write side.
A classic usage example in the Linux kernel is timekeeping.
void update_wall_time(void)
{
    >>>>
    raw_spin_lock_irqsave(&timekeeper_lock, flags);
    >>>>
    clock_set |= accumulate_nsecs_to_secs(tk);

    write_seqcount_begin(&tk_core.seq);
    /*
     * Update the real timekeeper.
     *
     * We could avoid this memcpy by switching pointers, but that
     * requires changes to all other timekeeper usage sites as
     * well, i.e. move the timekeeper pointer getter into the
     * spinlocked/seqcount protected sections. And we trade this
     * memcpy under the tk_core.seq against one before we start
     * updating.
     */
    timekeeping_update(tk, clock_set);
    memcpy(real_tk, tk, sizeof(*tk));
    /* The memcpy must come last. Do not put anything here! */
    write_seqcount_end(&tk_core.seq);
out:
    raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
    if (clock_set)
        /* Have to call _delayed version, since in irq context*/
        clock_was_set_delayed();

}

ktime_t ktime_get_raw(void)
{
    struct timekeeper *tk = &tk_core.timekeeper;
    unsigned int seq;
    ktime_t base;
    u64 nsecs;

    do {
        seq = read_seqcount_begin(&tk_core.seq);
        base = tk->tkr_raw.base;
        nsecs = timekeeping_get_ns(&tk->tkr_raw);

    } while (read_seqcount_retry(&tk_core.seq, seq));

    return ktime_add_ns(base, nsecs);
}
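
For completeness, the same pattern with the seqlock_t wrapper looks like the sketch below (the lock and the two fields are made up purely for illustration):

static DEFINE_SEQLOCK(foo_lock);
static u64 foo_a, foo_b;

/* writer: the embedded spinlock serializes concurrent writers */
void foo_update(u64 a, u64 b)
{
	write_seqlock(&foo_lock);
	foo_a = a;
	foo_b = b;
	write_sequnlock(&foo_lock);
}

/* reader: lockless; retries if a writer was active in between */
u64 foo_read(void)
{
	unsigned int seq;
	u64 a, b;

	do {
		seq = read_seqbegin(&foo_lock);
		a = foo_a;
		b = foo_b;
	} while (read_seqretry(&foo_lock, seq));

	return a + b;
}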

ACCESS_ONCE Series


There are a lot of ACCESS_ONCE() uses in the kernel; what are they for?
https://lwn.net/Articles/508991/ The functionality is well described by the name: its purpose is to ensure that the location passed as the parameter is accessed exactly once by the final assembly code generated by the compiler. Could the location really be accessed twice, or not at all? Yes, it could; compiler optimization can do either.
If not given reasons to the contrary, the compiler assumes that there is only one thread of execution in the address space of the program it is compiling. Concurrency is not built into the C language itself, so mechanisms for dealing with concurrent access must be built on top of the language; ACCESS_ONCE() is one such mechanism.
Consider the following example:

#include <stdio.h>
#include <stdlib.h>

int foo(int x)
{
    x = x*x;
    x = x>>6;
    printf("result is %d\n", x);
    return x;
}

int main()
{
    int x;
    int conti;
    int *p_value = (int *)malloc(sizeof(int));

    for (;;) {
        x = *p_value;
        conti = foo(x);
        if (conti)
            break;
    }

    return 0;
}
gcc -O2 test.c -S -o test.asm
main:
.LFB39:
    .cfi_startproc
    pushq    %rbx
    .cfi_def_cfa_offset 16
    .cfi_offset 3, -16
    movl    $4, %edi
    call    malloc
    movl    (%rax), %ebx
    imull    %ebx, %ebx
    sarl    $6, %ebx
    .p2align 4,,10
    .p2align 3
.L4:
    xorl    %eax, %eax
    movl    %ebx, %edx
    movl    $.LC0, %esi
    movl    $1, %edi
    call    __printf_chk
    testl    %ebx, %ebx
    je    .L4
    xorl    %eax, %eax
    popq    %rbx
    .cfi_def_cfa_offset 8
    ret
The compiler has hoisted the load of *p_value (the movl (%rax), %ebx right after the malloc call) and the whole computation out of the loop; the loop body at .L4 only calls printf with the cached value in %ebx. The actual scenario in the kernel may be more complicated than this, but the result is similar. To prevent this optimization, we can introduce the ACCESS_ONCE() macro.
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
Then
#include <stdio.h>
#include <stdlib.h>

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

int foo(int x)
{
    x = x*x;
    x = x>>6;
    printf("result is %d\n", x);
    return x;
}

int main()
{
    int x;
    int conti;
    int *p_value = (int *)malloc(sizeof(int));

    for (;;) {
        x = ACCESS_ONCE(*p_value);
        conti = foo(x);
        if (conti)
            break;
    }

    return 0;
}
gcc -O2 test.c -S -o test.asm
main:
.LFB39:
    .cfi_startproc
    pushq    %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    pushq    %rbx
    .cfi_def_cfa_offset 24
    .cfi_offset 3, -24
    movl    $4, %edi
    subq    $8, %rsp
    .cfi_def_cfa_offset 32
    call    malloc
    movq    %rax, %rbp
    .p2align 4,,10
    .p2align 3
.L4:
    movl    0(%rbp), %ebx
    xorl    %eax, %eax
    movl    $.LC0, %esi
    movl    $1, %edi
    imull    %ebx, %ebx
    sarl    $6, %ebx
    movl    %ebx, %edx
    call    __printf_chk
    testl    %ebx, %ebx
    je    .L4
    addq    $8, %rsp
    .cfi_def_cfa_offset 24
    xorl    %eax, %eax
With ACCESS_ONCE(), the load is back inside the loop: movl 0(%rbp), %ebx at .L4 re-reads *p_value on every iteration. As it happens, an optimized-out access is not the only peril that this code could encounter. Some processor architectures (x86, for example) are not richly endowed with registers; on such systems, the compiler must make careful choices regarding which values to keep in registers if it is to generate the highest-performing code. Specific values may be pushed out of the register set, then pulled back in later, so an address may actually be accessed twice, which can introduce inconsistency into the kernel.

There are two other macros that achieve the same functionality, READ_ONCE() and WRITE_ONCE(); their implementation is similar to ACCESS_ONCE().
Another example is: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4444.html
	while (tmp = atomic_load_explicit(a, memory_order_relaxed))
		do_something_with(tmp);
The compiler would be permitted to unroll it as follows:
	while (tmp = atomic_load_explicit(a, memory_order_relaxed)) {
		do_something_with(tmp);
		do_something_with(tmp);
		do_something_with(tmp);
		do_something_with(tmp);
	}
This would be unacceptable for real-time applications, which need the value to be reloaded from a on each iteration, unrolled or not. The volatile qualifier prevents this transformation. For example, consider the following loop.
	while (tmp = READ_ONCE(a))
		do_something_with(tmp);
This loop could still be unrolled, but the read would also need to be unrolled, for example, like this:
	for (;;) {
		if (!(tmp = READ_ONCE(a)))
			break;
		do_something_with(tmp);
		if (!(tmp = READ_ONCE(a)))
			break;
		do_something_with(tmp);
		if (!(tmp = READ_ONCE(a)))
			break;
		do_something_with(tmp);
		if (!(tmp = READ_ONCE(a)))
			break;
		do_something_with(tmp);
	}
Let's look at the implementation of READ_ONCE:
#define __READ_ONCE_SIZE						\
({									\
	switch (size) {							\
	case 1: *(__u8 *)res = *(volatile __u8 *)p; break;		\
	case 2: *(__u16 *)res = *(volatile __u16 *)p; break;		\
	case 4: *(__u32 *)res = *(volatile __u32 *)p; break;		\
	case 8: *(__u64 *)res = *(volatile __u64 *)p; break;		\
	default:							\
		barrier();						\
		__builtin_memcpy((void *)res, (const void *)p, size);	\
		barrier();						\
	}								\
})
Why do we need the __u8/__u16/__u32... cases here?
At one time, gcc guaranteed that properly aligned accesses to machine-word-sized variables would be atomic. Although gcc _no longer_ documents this guarantee, there is still code in the Linux kernel that relies on it. These accesses could be modeled as non-volatile memory_order_relaxed accesses.
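
A typical use of the pair looks like the sketch below (the shared structure and the STATE_READY value are made up for illustration; only compiler ordering is involved, no CPU memory ordering is implied):

enum { STATE_IDLE, STATE_READY };

struct shared_state {
	int state;
};

/* writer side: publish the new state with exactly one store */
static void publish(struct shared_state *shared)
{
	WRITE_ONCE(shared->state, STATE_READY);
}

/* reader side: READ_ONCE forces a fresh load on every iteration; the
 * compiler may neither hoist the load out of the loop nor split it */
static void wait_for_ready(struct shared_state *shared)
{
	while (READ_ONCE(shared->state) != STATE_READY)
		cpu_relax();
}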

RCU

Rcu Tree

                    rcu_state                 
                       |
    +-----------+-----------+----------+
    |           |           |          |
 rcu_node    rcu_node    rcu_node   rcu_node <,  
                                               \ 
    |           |           |          |      mynode
                                               /
 rcu_data    rcu_data   rcu_data    rcu_data  /

rcu_state - one instance per RCU flavor: rcu (sched or preemptible, depending on CONFIG_PREEMPT), rcu_sched, rcu_bh
rcu_node  - just the RCU tree organization
rcu_data  - per-CPU, describes a specific CPU
                               Get it through raw_cpu_ptr(rsp->rda)
rcu_node exists purely for the RCU tree organization. We can read the topology from the following members.
rcu_node->qsmask, qsmaskinit, grpmask
qsmask           Bitmask marking which CPUs or child nodes under this node have not
                 yet passed a quiescent state. Each bit represents one CPU or one
                 child node.
qsmaskinit       Used to initialize qsmask when a new grace period starts.
grpmask          The node's position in its parent's qsmaskinit.

rcu_data->grpmask
grpmask          The CPU's position in the qsmaskinit of the rcu_node it belongs to.
A quiescent state is reported through:
   ^                       -> rcu_report_qs_rsp
   |               -> rcu_report_qs_rnp
   |            ...
           -> rcu_report_qs_rnp
   rcu_report_qs_rdp
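
The upward walk can be sketched roughly as follows (a simplified illustration of the rcu_report_qs_rnp() logic; locking, grace-period checks and the final rcu_report_qs_rsp() at the root are elided, and the helper name is made up):

static void report_qs_up(struct rcu_node *rnp, unsigned long mask)
{
	for (;;) {
		rnp->qsmask &= ~mask;		/* clear our bit in this node */
		if (rnp->qsmask)
			return;			/* siblings still outstanding */
		if (!rnp->parent)
			break;			/* root reached: grace period can end */
		mask = rnp->grpmask;		/* our position in the parent */
		rnp = rnp->parent;
	}
	/* all CPUs have reported: the grace-period kthread can finish the GP */
}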

How does RCU run

There are three types of RCU in the kernel.

We only talk about sched and bh RCU in this section.

Grace period and Quiescent state

When every cpu has passed a quiescent state, we say the system has passed a grace period.
rcu_gp_in_progress() tells us whether a grace period is in progress.

static int rcu_gp_in_progress(struct rcu_state *rsp)
{
    return READ_ONCE(rsp->completed) != READ_ONCE(rsp->gpnum);
}
rcu_start_gp() will start a new grace period.
rcu_start_gp()
    -> rcu_start_gp_advanced()
        -> WRITE_ONCE(rsp->gp_flags, RCU_GP_FLAG_INIT);
The rcu_gp_kthread will be woken up to handle the grace-period start.
This is done in rcu_gp_init().
 - record_gp_stall_check_time(rsp) calculates a new rsp->jiffies_stall, which
   will be used in RCU stall detection.
 - smp_store_release(&rsp->gpnum, rsp->gpnum + 1)
 - iterate all the rnp and rdp:
   - rnp->qsmask = rnp->qsmaskinit
   - WRITE_ONCE(rnp->gpnum, rsp->gpnum);
   - __note_gp_changes()

__note_gp_changes()
----
    if (rdp->gpnum != rnp->gpnum || unlikely(READ_ONCE(rdp->gpwrap))) {
        rdp->gpnum = rnp->gpnum;
        need_gp = !!(rnp->qsmask & rdp->grpmask);
        rdp->cpu_no_qs.b.norm = need_gp;
        rdp->core_needs_qs = need_gp;
    }
----



How does a cpu report that it has passed a qs?
RCU_SOFTIRQ context
rcu_process_callbacks() // iterates over all the rsp
    -> __rcu_process_callbacks()
        -> rcu_check_quiescent_state()
rcu_check_quiescent_state()
----
    if (!rdp->core_needs_qs)
        return;

    if (rdp->cpu_no_qs.b.norm)
        return;

    rcu_report_qs_rdp(rdp->cpu, rsp, rdp);
----

How to check whether a cpu has passed a qs?
bh qs
cpu tick irq context
update_process_times()
    -> rcu_check_callbacks()
        -> rcu_bh_qs() // user mode or !in_softirq

Note: local_bh_disable() increases the SOFTIRQ part of preempt_count, so in_softirq()
returns true while it is held.

__do_softirq()
    -> rcu_bh_qs() //after loop of invoking softirq action

RCU Usage

scsi_host_busy

scsi_host_queue_ready
---
    if (scsi_host_in_recovery(shost))
        return 0;

    busy = atomic_inc_return(&shost->host_busy) - 1;

    // We increment host_busy here atomically, but this is not
    // the final result. We could decrement host_busy below for
    // multiple reasons.
                                                              scsi_host_set_state(shost, SHOST_RECOVERY)    
                                                                     scsi_eh_wakeup
    if (atomic_read(&shost->host_blocked) > 0) {                     ---
        if (busy)                                                    if (atomic_read(&shost->host_busy) == shost->host_failed) {
            goto starved;                                                wake_up_process(shost->ehandler);
                                                                     ---
        if (atomic_dec_return(&shost->host_blocked) > 0)             The host_busy here is not the final result
            goto out_dec;                                            operation on host_busy is atomic, but the whole scsi_host_queue_ready is not atomic
    }

    if (shost->can_queue > 0 && busy >= shost->can_queue)
        goto starved;
    if (shost->host_self_blocked)
        goto starved;
    ...
    return 1;

starved:
    spin_lock_irq(shost->host_lock);
    if (list_empty(&sdev->starved_entry))
        list_add_tail(&sdev->starved_entry, &shost->starved_list);
    spin_unlock_irq(shost->host_lock);
out_dec:
    atomic_dec(&shost->host_busy);
    return 0;
---

The worst result could be:
scsi_host_queue_ready() returns 0 because busy >= shost->can_queue, while
scsi_eh_wakeup() fails to wake up the EH kthread due to the race. The commands
owned by the EH are then never handled, shost->host_busy never drops, and an
IO hang finally comes up.

The fix could be:
scsi_host_queue_ready                        scsi_eh_scmd_add
---                                          ---
    rcu_read_lock();                         scsi_host_set_state(shost, SHOST_RECOVERY)
    if (scsi_host_in_recovery(shost)) {   
        rcu_read_unlock();
        return 0;
    }
    host_busy operations;
    rcu_read_unlock();
                                             synchronize_rcu();
                                             shost->host_failed++;
                                             scsi_eh_wakeup(shost);
---
Then scsi_eh_wakeup() will no longer race with scsi_host_queue_ready(), due to the
SHOST_RECOVERY state and the synchronize_rcu().
scsi_eh_scmd_add() is always invoked in process context, so I think this should be
OK.

The fix in upstream is as follows:
scsi_host_queue_ready
  -> scsi_dec_host_busy
  ---
    rcu_read_lock();
    atomic_dec(&shost->host_busy);
    if (unlikely(scsi_host_in_recovery(shost))) {
        spin_lock_irqsave(shost->host_lock, flags);
        if (shost->host_failed || shost->host_eh_scheduled)
            scsi_eh_wakeup(shost);
        spin_unlock_irqrestore(shost->host_lock, flags);
    }
    rcu_read_unlock();
  ---

scsi_eh_scmd_add
  -> scsi_host_set_state(shost, SHOST_RECOVERY)
  -> call_rcu(&scmd->rcu, scsi_eh_inc_host_failed);

scsi_eh_inc_host_failed
---
    spin_lock_irqsave(shost->host_lock, flags);
    shost->host_failed++;
    scsi_eh_wakeup(shost);
    spin_unlock_irqrestore(shost->host_lock, flags);
---
Weird solution.