Container

cgroup

data structure

basis
life cycle

cgroup kernfs

Files
Hierarchy

cgroup tasks

data structure
attach a task

cgroup

data structure

basis

cgroup

cgroup carries tasks and controller information, constructs a hierarchy and
act as the interface that interacts with users through kernfs.

css, cgroup_subsys_state

As the name, css is a data structure for a cgroup-subsystem pair.
Take blkio subsystem as example,
blkcg_css_alloc()
---
    mutex_lock(&blkcg_pol_mutex);

    if (!parent_css) {
        blkcg = &blkcg_root;
    } else {
        blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
    }

    for (i = 0; i < BLKCG_MAX_POLS ; i++) {
        struct blkcg_policy *pol = blkcg_policy[i];
        struct blkcg_policy_data *cpd;

        cpd = pol->cpd_alloc_fn(GFP_KERNEL);
        blkcg->cpd[i] = cpd;
        cpd->blkcg = blkcg;
        cpd->plid = i;
        if (pol->cpd_init_fn)
            pol->cpd_init_fn(cpd);
    }

    list_add_tail(&blkcg->all_blkcgs_node, &all_blkcgs);
    mutex_unlock(&blkcg_pol_mutex);

    return &blkcg->css;

---

css looks like the vfs inode in filesystem. Most of time, the filesystem
would embed vfs inode into their own inode structure.

css/blkcg contains the information to control the io of one cgroup

css_set, set of css's and tasks that controlled by them

In the other word, css_set is a collection of cgroups from different hierarchy
Look at the task_struct, one task is linked into the css_set and also get
the information of css through css_set
task_struct
---
#ifdef CONFIG_CGROUPS
    /* Control Group info protected by css_set_lock: */
    struct css_set __rcu        *cgroups;
    /* cg_list protected by css_set_lock and tsk->alloc_lock: */
    struct list_head        cg_list;
#endif
---

blkcg_css()
  -> task_css(current, io_cgrp_id);
    -> task_css_check(task, subsys_id, false)
      -> task_css_set_check((task), (__c))->subsys[(subsys_id)]
        -> rcu_dereference_check((task)->cgroups,                \

css_set_move_task()
---
        cgroup_move_task(task, to_cset);
        list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
                                 &to_cset->tasks);
---

And here are some critial questions about the data structures

Why do we need the css_set ?

The answer is cgroup V1 and its special hierarchy.

    cpu-cg-root        '    mem-cg-root          '   blkio-cg-root
     /      \          '       /      \          '        /      \
cpu-cg-0  cpu-cg-1     ' mem-cg-0  mem-cg-1      ' blkio-cg-0 blkio-cg-1
            /  \       '              /  \       '               /  \
     cpu-cg-2  cpu-cg-3'       mem-cg-2  mem-cg-3'      blkio-cg-2  blkio-cg-3
                       '                         '
In cgroup V1, different subsystem has different hierarchy, and 
 - every cgroup can only have one subsystem
 - every task can be attached to different hierarchy

Due to this, css_set is created to bind these tasks and css's together.

When cgroup V2 is introduced, to be compatible with V1, css_set is still needed.

Is css really per-cgroup-subsystem ?

The answer is NO !
In cgroup V2, a cgroup would share its parent's css if the subsystem is not enabled.

                  root (cpu, pid, io, mem)
                /      \
     cg0 (cpu, pid)   cg1 (io, mem)
                      / \
                     /   \
                    t0 t1 cg2 (io)
                          /\
                         t2 t3

t0&t1 -> cpu: cg.root pid: cg.root, io: cg1, mem: cg1
t2&t3 -> cpu: cg.root pid: cg.root, io: cg2, mem: cg1

Is cgroup and css-set one to one ?

The answer is NO !

In cgroup V1, there could be multiple hierarchy, and then
a cgroup can contains tasks from different css_set.
Look at following example,

      cpu-root            mem-root         blkio-root
      /     \             /     \          /      \
    cg0    cg1          cg2     cg3      cg4     cg5
           (t0, t1)    (t0)     (t1)    (t0)     (t1)

t0 is in css-set (cpu-cg1, mem-cg2, blkio-cg4)
t1 is in css-set (cpu-cg1, mem-cg3, blkio-cg5)

In this case, cg1 are associated with two css-sets.

In fact, the case above is the 'flexibility' provided by
cgroup V1. However, do we really need this ? Most of time,
t0 and t1 are in same shape in different hierarchy as following

      cpu-root       mem-root         blkio-root
      /     \        /     \          /      \
    cg0    cg1     cg2     cg3      cg4     cg5
    (t0)   (t1)    (t0)     (t1)    (t0)     (t1)

There is no such issue in cgroup V2 which has a uinified hierarchy.

relationship

To make clear the life cycle of components in cgroup, we must know the
reference relationship between them.

                     task
                   /
            css_set ----- css
                   \      /
                    cgroup

task and css_set

(1) task.cgroups points css_set
(2) task.cg_list is linked into css_set.tasks

This relationships between then are setup in
css_set_move_task()
---
        cgroup_move_task(task, to_cset);
        list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
                                 &to_cset->tasks);
---

cgroup and css

(1) css->cgroup
(2) cgroup->subsys[] -> css

css_create()
---
    css = ss->css_alloc(parent_css);
    ...
    init_and_link_css(css, ss, cgrp);
    ...
    online_css()
---

css_set and cgroup,css

(1) cgroup->cset_links <--> cgrp_cset_link <--> cgrp_links<-css_set
(2) cgroup->e_csets[] -> css_set->e_cset_node[]
(3) css_set->subsys[] -> css

Both of the relationships above are built in find_css_set()

find_css_set is to setup a new css_set based on old_cset with one cgroup updated.
The first step is to setup a css template based on the old cset and new cgroup.

find_css_set()
  -> find_existing_css_set()
---
    for_each_subsys(ss, i) {
        if (root->subsys_mask & (1UL << i)) {
            /*
             * @ss is in this hierarchy, so we want the
               ^^^^^^^^^^^^^^^^^^^^^^^^
             * effective css from @cgrp. If one css is not enabled in the cgroup,
             * use it's ancestor's
             */
            template[i] = cgroup_e_css_by_mask(cgrp, ss)
        } else {
            /*
             * @ss is not in this hierarchy, so we don't want
             * to change the css.
             */
            template[i] = old_cset->subsys[i];
        }
    }
---

find_css_set()
---

    memcpy(cset->subsys, template, sizeof(cset->subsys))  (3)

    ...
    spin_lock_irq(&css_set_lock);
    /* Add reference counts and links from the new css_set. */
    list_for_each_entry(link, &old_cset->cgrp_links, cgrp_link) {
        struct cgroup *c = link->cgrp;

        if (c->root == cgrp->root)
            c = cgrp;

        link_css_set(&tmp_links, cset, c); (1)

    }
    ...
    for_each_subsys(ss, ssid) {
        struct cgroup_subsys_state *css = cset->subsys[ssid];

        list_add_tail(&cset->e_cset_node[ssid],
                  &css->cgroup->e_csets[ssid]); (2)

        css_get(css);
    }
---

life cycle

css_set

create

find_css_set()
---
    refcount_set(&cset->refcount, 1);
---

get

When move a task into the cgroup,
cgroup_migrate_execute()
---
    spin_lock_irq(&css_set_lock);
    list_for_each_entry(cset, &tset->src_csets, mg_node) {
        list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list) {
            get_css_set(to_cset);
            to_cset->nr_tasks++;
            css_set_move_task(task, from_cset, to_cset, true);
            from_cset->nr_tasks--;
            put_css_set_locked(from_cset);
        }
    }
    spin_unlock_irq(&css_set_lock);
---

put

When move task out of the cgroup,
See cgroup_migrate_execute()
When task exits,
release_task()
  -> cgroup_release()
     ---
        spin_lock_irq(&css_set_lock);
        css_set_skip_task_iters(task_css_set(task), task);
        list_del_init(&task->cg_list);
        spin_unlock_irq(&css_set_lock);
     ---

__put_task_struct()
  -> cgroup_free()
    -> put_css_set()

When move task from one cgroup to another, or in the other word,
move task from one css_set to another, due to there could be multiple
tasks to be moved when echo pid to cgroup.proc, to guarantee 'all or nothing',
the whole process is split into 4 steps,
(1) prepare the source css_set, get_css_set is invoked there
(2) prepare the destination css_set with find_css_set,
    get_css_set is invoked or initial css_set is created,
(3) do the real migration, get the reference of destination css_set
    and put reference of source css_set
(4) finish the migration, put_css_set_locked all of the source and destination
    css_set. See cgroup_migrate_finish()

cgroup

cgroup's reference is based on cgroup.self which is a css

create

get

When a new cgroup is create
cgroup_create()
---
    cgroup_get_live(parent);
---

When cgroup is linked with css
init_and_link_css()
  -> cgroup_get_live(cgrp)
 
When cgroup is linked with css_set
link_css_set()
---
    if (cgroup_parent(cgrp)) //non-root cgroup
        cgroup_get_live(cgrp);
---

put

Base reference,
cgroup_rmdir()
  -> cgroup_destroy_locked()
    -> percpu_ref_kill(&cgrp->self.refcnt)

When css_set is freed,
put_css_set_locked()
---
    list_for_each_entry_safe(link, tmp_link, &cset->cgrp_links, cgrp_link) {
        list_del(&link->cset_link);
        list_del(&link->cgrp_link);
        if (cgroup_parent(link->cgrp))
            cgroup_put(link->cgrp);
        kfree(link);
    }
---

When css is freed,
css_free_rwork_fn()
---
    if (ss) {
        /* css free path */
        struct cgroup_subsys_state *parent = css->parent;
        int id = css->id;

        ss->css_free(css);
        cgroup_idr_remove(&ss->css_idr, id);
                cgroup_put(cgrp);

        if (parent)
            css_put(parent);
    }
---

css

There are two reference count working for css
(1) css.refcnt a percpu_ref
(2) css.online_cnt

create

get

refcnt
find_css_set()
---
    for_each_subsys(ss, ssid) {
        struct cgroup_subsys_state *css = cset->subsys[ssid];

        list_add_tail(&cset->e_cset_node[ssid],
                  &css->cgroup->e_csets[ssid]);
        css_get(css);
    }
---
online_cnt:
online_css()
---
        atomic_inc(&css->online_cnt);
        if (css->parent)
            atomic_inc(&css->parent->online_cnt);
---

put

online_cnt:
cgroup_destroy_locked()
---
    for_each_css(css, ssid, cgrp)
        kill_css(css);
        ---
        css_get(css);
        percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
        ---
---


This is the percpu_ref confirm callback not drained callback.
                       ^^^^^^^^^^^^^^^      ^^^^^^^^^^^^^^^^
confirm callback means all of the cpu has seen the DEAD flags on percpu_ref.
The drain callback of css is css_release()

css_killed_ref_fn()
---
    if (atomic_dec_and_test(&css->online_cnt)) {
        INIT_WORK(&css->destroy_work, css_killed_work_fn);
        queue_work(cgroup_destroy_wq, &css->destroy_work);
    }
---
refcnt
put_css_set_locked()
---
    for_each_subsys(ss, ssid) {
        list_del(&cset->e_cset_node[ssid]);
        css_put(cset->subsys[ssid]);
    }
---

css_killed_work_fn()
---
    do {
        offline_css(css);
        css_put(css);
        /* @css can't go away while we're holding cgroup_mutex */
        css = css->parent;
    } while (css && atomic_dec_and_test(&css->online_cnt));
---

In summary

                 .-------------------.
                /                    v
     task -> css_set <-> cgroup <-> css
                                [1]
-> : hold reference
[1]: cgroup holds the css.online_css, and they are bound together.

Think of following cases,
(1) Task A in cg0 issues IO
    At the moment, the task hold reference of css_set, and css_set holds
    reference of cgroup, cgroup hold reference to css, so the blkcg(css)
    will not be gone during this.
(2) cgroup writeback control
    At the moment, the css info is carried by writeback kworker, so it
    need to hold reference to the css to keep it alive.
    refer to cgwb_create() which hold reference of memcg_css and blkcg_css

Do we need css_get during cgroup file read/write method ?

Look at the comment in cgroup_file_write()
    /*
     * kernfs guarantees that a file isn't deleted with operations in
     * flight, which means that the matching css is and stays alive and
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     * doesn't need to be pinned.  The RCU locking is not necessary
       ^^^^^^^^^^^^^^^^^^^^^^^^^
     * either.  It's just for the convenience of using cgroup_css().
     */

How does cgroup-core implement it ?
Look at the code of cgroup_destroy_locked()
---
    /* initiate massacre of all css's */
    for_each_css(css, ssid, cgrp)
        kill_css(css);

    /* clear and remove @cgrp dir, @cgrp has an extra ref on its kn */
    css_clear_dir(&cgrp->self);
    kernfs_remove(cgrp->kn);
    ...
    percpu_ref_kill(&cgrp->self.refcnt);
---

The css's percpu ref and online_cnt will be put after kill_css, and
kernfs_remove() is invoked after that. How to guarantee the css isn't freed
during cgroup file read/write method where css is referenced ?

The magic is cgroup_mutex
cgroup_rmdir()
---
    cgrp = cgroup_kn_lock_live(kn, false);
    ---
        if (drain_offline)
            cgroup_lock_and_drain_offline(cgrp);
        else
            mutex_lock(&cgroup_mutex);
    ---

    ret = cgroup_destroy_locked(cgrp);

    cgroup_kn_unlock(kn);
---

The cgroup_destroy_locked() is invoked under cgroup_mutex().
Then let's look at the
css_killed_work_fn()
---
    mutex_lock(&cgroup_mutex);

    do {
        offline_css(css);
        css_put(css);
        /* @css can't go away while we're holding cgroup_mutex */
        css = css->parent;
    } while (css && atomic_dec_and_test(&css->online_cnt));

    mutex_unlock(&cgroup_mutex);
---

The cgroup_mutex guarantees the css won't be offlined before
cgroup_destroy_locked() returns.

cgroup kernfs

In this section, we would like to know how does cgroup interact with users in userland through kernfs.
Let's start from the root of cgroup fs which is initialized in cgroup_init

cgroup_init()
---
    BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
    BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
    ...
    BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0));
---

These are the key of the cgroup kernfs, files and hierarchy

Files

cgroup_base_files is an array of cftype which defines the control file in cgroup directorys,
such as 'cgroup.type', 'cgroup.procs'. cgroup_init_cftypes mainly asigns a kernfs_ops for every cftype.

static struct kernfs_ops cgroup_kf_ops = {
    .atomic_write_len    = PAGE_SIZE,
    .open            = cgroup_file_open,
    .release        = cgroup_file_release,
    .write            = cgroup_file_write,
    .poll            = cgroup_file_poll,
    .seq_start        = cgroup_seqfile_start,
    .seq_next        = cgroup_seqfile_next,
    .seq_stop        = cgroup_seqfile_stop,
    .seq_show        = cgroup_seqfile_show,
};

cgroup_kf_ops is a wrapper which translates the semantics from kernfs to cgroup.

cgroup_file_write()
---
    struct cgroup *cgrp = of->kn->parent->priv;
    struct cftype *cft = of->kn->priv;

    //parent, namely the directory, stands for the cgroup
    //file is the attibutes in the cgroup directory

    ...
    if (cft->write)
        return cft->write(of, buf, nbytes, off);

    rcu_read_lock();
    css = cgroup_css(cgrp, cft->ss);
    rcu_read_unlock();

    if (cft->write_u64) {
        unsigned long long v;
        ret = kstrtoull(buf, 0, &v);
        if (!ret)
            ret = cft->write_u64(css, cft, v);
    } else if (cft->write_s64) {
        long long v;
        ret = kstrtoll(buf, 0, &v);
        if (!ret)
            ret = cft->write_s64(css, cft, v);
    }
---

The kernfs would provide two guarantees regarding to a file

read/write on the file is atomic

For the read,
kernfs_fop_open()
  -> seq_open(file, &kernfs_seq_ops)

kernfs_fop_read()
  -> seq_read()
    -> kernfs_seq_start()
      -> mutex_lock(&of->mutex)
    -> kernfs_seq_stop()
      -> mutex_unlock(&of->mutex)

kernfs_fop_write()
---
    mutex_lock(&of->mutex);
    if (!kernfs_get_active(of->kn)) {
        mutex_unlock(&of->mutex);
        len = -ENODEV;
        goto out_free;
    }

    ops = kernfs_ops(of->kn);
    if (ops->write)
        len = ops->write(of, buf, len, *ppos);
    else
        len = -EINVAL;

    kernfs_put_active(of->kn);
    mutex_unlock(&of->mutex);
---

the file isn't deleted with operations in flight

kernfs_get_active()
---
    if (!atomic_inc_unless_negative(&kn->active))
        return NULL;
---

void kernfs_put_active(struct kernfs_node *kn)
{
    int v;

    if (unlikely(!kn))
        return;

    if (kernfs_lockdep(kn))
        rwsem_release(&kn->dep_map, _RET_IP_);
    v = atomic_dec_return(&kn->active);
    if (likely(v != KN_DEACTIVATED_BIAS))
        return;

    wake_up_all(&kernfs_root(kn)->deactivate_waitq);
}

__kernfs_remove()
---
    /*
     * Kill all of its children
     */
    pos = NULL;
    while ((pos = kernfs_next_descendant_post(pos, kn)))
        if (kernfs_active(pos))
            atomic_add(KN_DEACTIVATED_BIAS, &pos->active);

    /* deactivate and unlink the subtree node-by-node */
    do {
        pos = kernfs_leftmost_descendant(kn);
        kernfs_get(pos);

        if (kn->flags & KERNFS_ACTIVATED)
            kernfs_drain(pos);

        ---
            /* but everyone should wait for draining */
            wait_event(root->deactivate_waitq,
                   atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
        ---

        else
            WARN_ON_ONCE(atomic_read(&kn->active) != KN_DEACTIVATED_BIAS);

        ...

        kernfs_put(pos);
    } while (pos != kn);

---

Hierarchy

Currently, there are two different kinds of cgroup hierarchy in kernel,

cgroup V1

    cpu-cg-root        '    mem-cg-root          '   blkio-cg-root
     /      \          '       /      \          '        /      \
cpu-cg-0  cpu-cg-1     ' mem-cg-0  mem-cg-1      ' blkio-cg-0 blkio-cg-1
            /  \       '              /  \       '               /  \
     cpu-cg-2  cpu-cg-3'       mem-cg-2  mem-cg-3'      blkio-cg-2  blkio-cg-3
                       '                         '

cgroup V2

                     cg-dfl-root (cpu, mem, blkio)
                          /               \
                    cg-0(cpu, mem)     cg-1 (cpu, mem, blkio)
                                         /          \            
                                    cg-2 (blkio)   cg-3(blkio)

cgroupfs

To understand a filesystem, we must to know how to mkfs, mount, and use it.

mkfs

cgroup_init()
  -> cgroup_setup_root(&cgrp_dfl_root)
---
    kf_sops = root == &cgrp_dfl_root ?
        &cgroup_kf_syscall_ops : &cgroup1_kf_syscall_ops;

    root->kf_root = kernfs_create_root(kf_sops,
                       KERNFS_ROOT_CREATE_DEACTIVATED |
                       KERNFS_ROOT_SUPPORT_EXPORTOP |
                       KERNFS_ROOT_SUPPORT_USER_XATTR,
                       root_cgrp);
    root_cgrp->kn = root->kf_root->kn;
    ...

    /*
       Add the cgroup_base_files into the kernfs_root
       Refer to __kernfs_create_file, kernfs_new_node create a new kernfs_node,
       and kernfs_add_one will add this new node into its parent's children
       rb_tree which indexed by name hash.
    */


    ret = css_populate_dir(&root_cgrp->self);

    kernfs_root is likely to the superblock of the normal disk filesystem.
    There are two important fields in it,
     - kernfs_node, it could be deemed as the root inode on disk
     - kernfs_syscall_ops, vfs op -> kernfs op -> syscall op
                                                  ^^^^^^^^^^

mount

kernfs_fill_super
---

    // This could be deemed as loading root inode

    mutex_lock(&kernfs_mutex);
    inode = kernfs_get_inode(sb, info->root->kn);
    mutex_unlock(&kernfs_mutex);
    ...
    /* instantiate and link root dentry */
    root = d_make_root(inode);
    ...
    sb->s_root = root;
---

use it

Let's look at the inode operations of kernfs

const struct inode_operations kernfs_dir_iops = {
    .lookup        = kernfs_iop_lookup,
    .permission    = kernfs_iop_permission,
    .setattr    = kernfs_iop_setattr,
    .getattr    = kernfs_iop_getattr,
    .listxattr    = kernfs_iop_listxattr,

    .mkdir        = kernfs_iop_mkdir,
    .rmdir        = kernfs_iop_rmdir,
    .rename        = kernfs_iop_rename,
};

There is no .create and .unlink, this means that you cannot create/remove
regular files.

Look into the kernfs_iop_mkdir
kernfs_iop_mkdir()
---
    struct kernfs_node *parent = dir->i_private;
    struct kernfs_syscall_ops *scops = kernfs_root(parent)->syscall_ops;
    int ret;

    if (!scops || !scops->mkdir)
        return -EPERM;

    if (!kernfs_get_active(parent))
        return -ENODEV;

    ret = scops->mkdir(parent, dentry->d_name.name, mode);
    kernfs_put_active(parent);
    return ret;
---
kernfs_syscall_ops is provided by the backend of kernfs.

static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
    .show_options        = cgroup_show_options,
    .mkdir            = cgroup_mkdir,
    .rmdir            = cgroup_rmdir,
    .show_path        = cgroup_show_path,
};

cgroup_mkdir()
---
    cgrp = cgroup_create(parent, name, mode);
    if (IS_ERR(cgrp)) {
        ret = PTR_ERR(cgrp);
        goto out_unlock;
    }

    /*
     * This extra ref will be put in cgroup_free_fn() and guarantees
     * that @cgrp->kn is always accessible.
     */
    kernfs_get(cgrp->kn);
    ...
    ret = css_populate_dir(&cgrp->self);
    ...
    /* let's create and online css's */
    kernfs_activate(cgrp->kn);
---
As expected, it creates a cgroup

V1 and V2

Let's look at the difference between V1 and V2

mkfs

V2:
cgroup_init()
  -> cgroup_setup_root(&cgrp_dfl_root, 0);
  ---
    kf_sops = root == &cgrp_dfl_root ?
        &cgroup_kf_syscall_ops : &cgroup1_kf_syscall_ops;
    ...
    ret = rebind_subsystems(root, ss_mask);
  ---
V1
cgroup1_get_tree()
  -> cgroup1_root_to_use()
  ---
    root = kzalloc(sizeof(*root), GFP_KERNEL);
    ctx->root = root;
    init_cgroup_root(ctx);
    ret = cgroup_setup_root(root, ctx->subsys_mask);
  ---

The difference:
(1) root of cgroup1 is dynamically allocated
(2) all of the subsystem are bound to default root, namely V2, by default,
    when the mount cgroup1, the required subsystems are rebound to new root.
    cgroup_init_subsys
    ---
        ss->root = &cgrp_dfl_root;
    ---
    When root of cgroup1 is created, the corresponding subsystem to hierarchy
    is rebound to that root.
(3) V1 and V2 have different syscall ops,
    Look into cgroup_kf_syscall_ops and cgroup1_kf_syscall_ops, the only
    difference is cgroup1 has .rename callback.

mount

The biggest difference is, during mount cgroup V1, a new root is allocated and subsystem is rebound.

use it

cgroup_mkdir()
  -> cgroup_create()
  ---

    /*
     * On the default hierarchy, a child doesn't automatically inherit
     * subtree_control from the parent.  Each is configured manually.
     */

    if (!cgroup_on_dfl(cgrp))
        cgrp->subtree_control = cgroup_control(cgrp);
  ---

cgroup tasks

data structure

There are 3 components in cgroup, tasks, subsystem and cgroup


        root
        /  \
     cg0   cg1 (cpu, io, mem)
            |
            +----------+
            |-css_cpu  |- task0
            |-css_io   |- task1
            |_css_mem  |- task2
                       |- ...

css : cgroup_subsys_state, per-subsystem/per-cgroup state
css_set : a structure holding pointers to a set of css

blkcg_css()
---
    return task_css(current, io_cgrp_id);
      -> task->cgroups->subsys[]
---

Why we need css_set ?


                  root (cpu, pid, io, mem)
                /      \
     cg0 (cpu, pid)   cg1 (io, mem)
                      / \
                     /   \
                    t0 t1 cg2 (io)
                          /\
                         t2 t3

How many css in the diagram above ?
 - root-cpu, root-pid, root-io, root-mem
 - cg0-cpu, cg0-pid
 - cg1-io, cg1-mem
 - cg2-io

t0 and t1 are associated with css (root-cpu, root-pid, cg1-io, cg1-mem)
t2 and t3 are associated with css (root-cpu, root-pid, cg1-mem, cg2-io)

Because the cgroup2 and cgroup1 both exist in kernel, a task could be both
of them.
                 net-root
                 /     \
             ncg0     ncg1
                       /\
                     t3  t4

Therefore the t3 has a css set:
(root-cpu, root-pid, cg1-mem, cg2-io, ncg1-net)

This is the css_set, and could be shared by all of the tasks that have the
same cgroup assignment.

Which cgroup css set (root-cpu, root-pid, cg1-mem, cg2-io, ncg1-net) belongs to ?

The answer is cg2 and ncg1 even though this css set shares the css of
root cgroup.

(css_set and cgroup will be linked together in link_css_set())
Look into find_existing_css_set()
---
    for_each_subsys(ss, i) {
        if (root->subsys_mask & (1UL << i)) { //>
            /*
             * @ss is in this hierarchy, so we want the
             * effective css from @cgrp.
             */
            template[i] = cgroup_e_css_by_mask(cgrp, ss);
        } else {
            /*
             * @ss is not in this hierarchy, so we don't want
             * to change the css.
             */
            template[i] = old_cset->subsys[i];
        }
    }

    key = css_set_hash(template);
    hash_for_each_possible(css_set_table, cset, hlist, key) {
        if (!compare_css_sets(cset, old_cset, cgrp, template))
            continue;

        /* This css_set matches what we need */
        return cset;
    }

    /* No existing cgroup group matched */
    return NULL;

---

In summary, the relationship between cgroup, css, css-set and task is as following,

         cgrp_dfl_root

         cg-root (cpu, io, mem)     cg-root-pid (pid)
         /     \                    /     \
    t0 t1 t2   cg0 (io, mem)    t0 t1     cg1
               / \                       /    \
             t3  t4                     t2 t3 t4

t3 and t4 belongs to a css-set [cg-root:(cpu), cg0:(io, mem), cg1:(pid)], called css-set-A
                                \_____ _____/
                                      v
                               this is a css
t2 belongs to a css-set [cg-root:(cpu, io, mem), cg1:(pid)], called css-set-B

And we could get following diagram,

         css-set-A     .--- cg-root
                   \  /
                    \/
         css-set-B-./\----- cg0
                    \ \
                     \ \
                      '-'-- cg1

Note : (1) css-set-A use css cg-root:(cpu) but it belongs to cg0
       (2) multiple to multiple relationship between css-set and cgroup
           is due to both cgroup v2 and v2 exists in kernel.
       (2) cg-root is actually the cgrp_dfl_root, as well as the root of
           cgroup v2

attach task

What need to be done to attach a task to a cgroup ?