Container

cgroup


data structure


basis

And here are some critical questions about the data structures.

relationship


To understand the life cycle of the components in cgroup, we must know the
reference relationships between them.

                     task
                   /
            css_set ----- css
                   \      /
                    cgroup

life cycle


In summary
                 .-------------------.
                /                    v
     task -> css_set <-> cgroup <-> css
                                [1]
-> : hold reference
[1]: cgroup holds the css.online_cnt, and they are bound together.

Consider the following cases:
(1) Task A in cg0 issues IO
    At this moment, the task holds a reference to the css_set, the css_set
    holds a reference to the cgroup, and the cgroup holds a reference to the
    css, so the blkcg (css) cannot go away while the IO is in flight.
(2) cgroup writeback control
    Here the css info is carried by the writeback kworker, so a reference
    to the css must be held to keep it alive.
    Refer to cgwb_create(), which takes references on memcg_css and blkcg_css.
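
The reference chain in case (1) can be replayed in user space. The following
is only a minimal sketch, not kernel code: plain integer counters stand in for
the kernel's refcount types, and every model_ name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of task -> css_set -> cgroup -> css. */
struct model_css    { int refcnt; bool released; };
struct model_cgroup { int refcnt; bool released; struct model_css *css; };
struct model_cset   { int refcnt; bool released; struct model_cgroup *cgrp; };
struct model_task   { struct model_cset *cset; };

static void model_css_put(struct model_css *css)
{
    assert(css->refcnt > 0);
    if (--css->refcnt == 0)
        css->released = true;
}

static void model_cgroup_put(struct model_cgroup *cgrp)
{
    assert(cgrp->refcnt > 0);
    if (--cgrp->refcnt == 0) {
        cgrp->released = true;
        model_css_put(cgrp->css);      /* drop the cgroup's pin on its css */
    }
}

static void model_cset_put(struct model_cset *cset)
{
    assert(cset->refcnt > 0);
    if (--cset->refcnt == 0) {
        cset->released = true;
        model_cgroup_put(cset->cgrp);  /* drop the css_set's pin on its cgroup */
    }
}

/* Link a css_set to a cgroup: the css_set pins the cgroup. */
static void model_cset_link(struct model_cset *cset, struct model_cgroup *cgrp)
{
    cset->cgrp = cgrp;
    cgrp->refcnt++;
}

/* Attach a task: the task pins the css_set. */
static void model_task_attach(struct model_task *t, struct model_cset *cset)
{
    cset->refcnt++;
    t->cset = cset;
}

/* Task exit: drop the task's pin, which may cascade down the chain. */
static void model_task_exit(struct model_task *t)
{
    model_cset_put(t->cset);
    t->cset = NULL;
}
```

In this model, as long as one task is attached, dropping the cgroup's own base
reference leaves both the cgroup and the css alive; they are released only
after the task drops its css_set.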
Do we need css_get() in the cgroup file read/write methods ?
Look at the comment in cgroup_file_write():
    /*
     * kernfs guarantees that a file isn't deleted with operations in
     * flight, which means that the matching css is and stays alive and
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     * doesn't need to be pinned.  The RCU locking is not necessary
       ^^^^^^^^^^^^^^^^^^^^^^^^^
     * either.  It's just for the convenience of using cgroup_css().
     */

How does cgroup-core implement it ?
Look at the code of cgroup_destroy_locked()
---
    /* initiate massacre of all css's */
    for_each_css(css, ssid, cgrp)
        kill_css(css);

    /* clear and remove @cgrp dir, @cgrp has an extra ref on its kn */
    css_clear_dir(&cgrp->self);
    kernfs_remove(cgrp->kn);
    ...
    percpu_ref_kill(&cgrp->self.refcnt);
---

The css's percpu ref and online_cnt are put after kill_css(), and
kernfs_remove() is invoked only after that. How is it guaranteed that the css
isn't freed while a cgroup file read/write method still references it ?

The magic is cgroup_mutex
cgroup_rmdir()
---
    cgrp = cgroup_kn_lock_live(kn, false);
    ---
        if (drain_offline)
            cgroup_lock_and_drain_offline(cgrp);
        else
            mutex_lock(&cgroup_mutex);
    ---

    ret = cgroup_destroy_locked(cgrp);

    cgroup_kn_unlock(kn);
---

cgroup_destroy_locked() is invoked with cgroup_mutex held.
Now let's look at
css_killed_work_fn()
---
    mutex_lock(&cgroup_mutex);

    do {
        offline_css(css);
        css_put(css);
        /* @css can't go away while we're holding cgroup_mutex */
        css = css->parent;
    } while (css && atomic_dec_and_test(&css->online_cnt));

    mutex_unlock(&cgroup_mutex);
---

The cgroup_mutex guarantees the css won't be offlined before
cgroup_destroy_locked() returns.
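
The ordering argument can be replayed as a single-threaded model. This is only
a sketch under loud assumptions: a boolean stands in for cgroup_mutex, the
worker returns false where a real worker would sleep in mutex_lock(), and all
model_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct model_css { bool online; bool offline_queued; };

static bool model_mutex_held;   /* stands in for cgroup_mutex */

/* Stand-in for the rmdir path up to and including kill_css(). */
static void model_rmdir_begin(struct model_css *css)
{
    assert(!model_mutex_held);
    model_mutex_held = true;        /* cgroup_kn_lock_live() */
    css->offline_queued = true;     /* kill_css() schedules the offline work */
}

/* Stand-in for cgroup_kn_unlock(). */
static void model_rmdir_end(void)
{
    assert(model_mutex_held);
    model_mutex_held = false;
}

/*
 * Stand-in for css_killed_work_fn(): it can only make progress once the
 * mutex is free. Returns true if the css was offlined by this call.
 */
static bool model_run_offline_work(struct model_css *css)
{
    if (model_mutex_held || !css->offline_queued)
        return false;
    model_mutex_held = true;
    css->online = false;            /* offline_css() */
    css->offline_queued = false;
    model_mutex_held = false;
    return true;
}
```

Running the worker between model_rmdir_begin() and model_rmdir_end() makes no
progress in this model, which mirrors why the css stays online while the
destroy path still holds the mutex.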

cgroup kernfs


In this section, we would like to know how cgroup interacts with userland through kernfs.
Let's start from the root of the cgroup fs, which is initialized in cgroup_init().

cgroup_init()
---
    BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
    BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
    ...
    BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0));
---
These are the two key parts of the cgroup kernfs: files and hierarchy.

Files


cgroup_base_files is an array of cftype which defines the control files in cgroup directories,
such as 'cgroup.type' and 'cgroup.procs'. cgroup_init_cftypes() mainly assigns a kernfs_ops to every cftype.

static struct kernfs_ops cgroup_kf_ops = {
    .atomic_write_len    = PAGE_SIZE,
    .open            = cgroup_file_open,
    .release        = cgroup_file_release,
    .write            = cgroup_file_write,
    .poll            = cgroup_file_poll,
    .seq_start        = cgroup_seqfile_start,
    .seq_next        = cgroup_seqfile_next,
    .seq_stop        = cgroup_seqfile_stop,
    .seq_show        = cgroup_seqfile_show,
};
cgroup_kf_ops is a wrapper which translates the semantics from kernfs to cgroup.
cgroup_file_write()
---
    struct cgroup *cgrp = of->kn->parent->priv;
    struct cftype *cft = of->kn->priv;

    // the parent, namely the directory, stands for the cgroup
    // the file is an attribute in the cgroup directory

    ...
    if (cft->write)
        return cft->write(of, buf, nbytes, off);

    rcu_read_lock();
    css = cgroup_css(cgrp, cft->ss);
    rcu_read_unlock();

    if (cft->write_u64) {
        unsigned long long v;
        ret = kstrtoull(buf, 0, &v);
        if (!ret)
            ret = cft->write_u64(css, cft, v);
    } else if (cft->write_s64) {
        long long v;
        ret = kstrtoll(buf, 0, &v);
        if (!ret)
            ret = cft->write_s64(css, cft, v);
    }
---
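
The parse-then-dispatch step of cgroup_file_write() can be sketched in user
space. This is a hedged approximation, not the kernel code: strtoull() and
strtoll() only roughly match kstrtoull() and kstrtoll(), every failure is
collapsed to -EINVAL, returning -EINVAL when no handler is set is this
model's own choice, and all model_ names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct cftype: only the two typed handlers. */
struct model_cft {
    int (*write_u64)(unsigned long long v);
    int (*write_s64)(long long v);
};

static unsigned long long model_last_u64;
static long long model_last_s64;

/* Example handlers that just remember the last value written. */
static int model_store_u64(unsigned long long v) { model_last_u64 = v; return 0; }
static int model_store_s64(long long v) { model_last_s64 = v; return 0; }

/* Parse the buffer according to the handler type, then call the handler. */
static int model_file_write(const struct model_cft *cft, const char *buf)
{
    char *end;

    errno = 0;
    if (cft->write_u64) {
        unsigned long long v;

        if (buf[0] == '-')          /* strtoull() would silently wrap */
            return -EINVAL;
        v = strtoull(buf, &end, 0);
        if (errno || end == buf || *end != '\0')
            return -EINVAL;
        return cft->write_u64(v);
    }
    if (cft->write_s64) {
        long long v = strtoll(buf, &end, 0);

        if (errno || end == buf || *end != '\0')
            return -EINVAL;
        return cft->write_s64(v);
    }
    return -EINVAL;                 /* no typed handler: the model's choice */
}
```

Base 0 lets the same handler accept decimal, octal and hexadecimal input, as
in the kstrtoull(buf, 0, &v) call above.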
kernfs provides two guarantees regarding a file

Hierarchy


Currently, there are two different kinds of cgroup hierarchy in kernel,

cgroupfs

To understand a filesystem, we must know how to mkfs, mount, and use it.

V1 and V2

Let's look at the difference between V1 and V2

cgroup tasks


data structure

There are 3 components in cgroup: tasks, subsystems and cgroups.


        root
        /  \
     cg0   cg1 (cpu, io, mem)
            |
            +----------+
            |-css_cpu  |- task0
            |-css_io   |- task1
            |_css_mem  |- task2
                       |- ...

css : cgroup_subsys_state, per-subsystem/per-cgroup state
css_set : a structure holding pointers to a set of css

blkcg_css()
---
    return task_css(current, io_cgrp_id);
      -> task->cgroups->subsys[]
---
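
The lookup in blkcg_css() is two pointer hops and an array index. A minimal
user-space sketch, assuming hypothetical model_ types and a three-entry
subsystem enum that does not match the kernel's real ids:

```c
#include <assert.h>
#include <stddef.h>

enum model_subsys_id {
    MODEL_CPU_ID,
    MODEL_IO_ID,
    MODEL_MEM_ID,
    MODEL_SUBSYS_COUNT
};

struct model_css  { const char *name; };

/* One slot per subsystem, like css_set->subsys[]. */
struct model_cset { struct model_css *subsys[MODEL_SUBSYS_COUNT]; };

/* Like task->cgroups: each task points at exactly one css_set. */
struct model_task { struct model_cset *cgroups; };

/* task -> css_set -> subsys[id] */
static struct model_css *model_task_css(const struct model_task *task,
                                        enum model_subsys_id id)
{
    assert(task->cgroups != NULL);
    return task->cgroups->subsys[id];
}
```

Two tasks that point at the same model_cset get the same css for every
subsystem without any per-task css pointers.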
Why do we need css_set ?

                  root (cpu, pid, io, mem)
                /      \
     cg0 (cpu, pid)   cg1 (io, mem)
                      / \
                     /   \
                    t0 t1 cg2 (io)
                          /\
                         t2 t3

How many css are there in the diagram above ?
 - root-cpu, root-pid, root-io, root-mem
 - cg0-cpu, cg0-pid
 - cg1-io, cg1-mem
 - cg2-io

t0 and t1 are associated with css (root-cpu, root-pid, cg1-io, cg1-mem)
t2 and t3 are associated with css (root-cpu, root-pid, cg1-mem, cg2-io)

Because cgroup2 and cgroup1 both exist in the kernel, a task could be in both
of them.
                 net-root
                 /     \
             ncg0     ncg1
                       /\
                     t3  t4

Therefore t3 has a css set:
(root-cpu, root-pid, cg1-mem, cg2-io, ncg1-net)

This is the css_set, and it can be shared by all of the tasks that have the
same cgroup assignment.

Which cgroup does the css set (root-cpu, root-pid, cg1-mem, cg2-io, ncg1-net) belong to ?

The answer is cg2 and ncg1 even though this css set shares the css of
root cgroup.

(css_set and cgroup will be linked together in link_css_set())
Look into find_existing_css_set()
---
    for_each_subsys(ss, i) {
        if (root->subsys_mask & (1UL << i)) {
            /*
             * @ss is in this hierarchy, so we want the
             * effective css from @cgrp.
             */
            template[i] = cgroup_e_css_by_mask(cgrp, ss);
        } else {
            /*
             * @ss is not in this hierarchy, so we don't want
             * to change the css.
             */
            template[i] = old_cset->subsys[i];
        }
    }

    key = css_set_hash(template);
    hash_for_each_possible(css_set_table, cset, hlist, key) {
        if (!compare_css_sets(cset, old_cset, cgrp, template))
            continue;

        /* This css_set matches what we need */
        return cset;
    }

    /* No existing cgroup group matched */
    return NULL;

---
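
The template logic above can be tried out in user space. This is a simplified
sketch, not the kernel's implementation: a linear scan replaces the hash
table, the match is on css pointers only (as far as we can tell,
compare_css_sets() also compares cgroup links, which is omitted here), and
all model_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { MODEL_SS_CPU, MODEL_SS_IO, MODEL_SS_MEM, MODEL_SS_COUNT };

struct model_css  { int id; };
struct model_cset { struct model_css *subsys[MODEL_SS_COUNT]; };

/*
 * For subsystems bound to the target hierarchy (bit set in subsys_mask),
 * take the css of the destination cgroup; otherwise keep the old css.
 */
static void model_build_template(struct model_css *tmpl[MODEL_SS_COUNT],
                                 const struct model_cset *old_cset,
                                 struct model_css *cgrp_css[MODEL_SS_COUNT],
                                 unsigned long subsys_mask)
{
    for (int i = 0; i < MODEL_SS_COUNT; i++) {
        if (subsys_mask & (1UL << i))
            tmpl[i] = cgrp_css[i];
        else
            tmpl[i] = old_cset->subsys[i];
    }
}

/* Return the existing css_set whose slots all match the template, or NULL. */
static struct model_cset *model_find_cset(struct model_cset *table, size_t n,
                                          struct model_css *tmpl[MODEL_SS_COUNT])
{
    for (size_t k = 0; k < n; k++) {
        int i;

        for (i = 0; i < MODEL_SS_COUNT; i++)
            if (table[k].subsys[i] != tmpl[i])
                break;
        if (i == MODEL_SS_COUNT)
            return &table[k];
    }
    return NULL;
}
```

When no entry matches, the caller is expected to allocate a new css_set from
the template, which is what makes sharing between tasks possible.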

In summary, the relationship between cgroup, css, css-set and task is as follows:
         cgrp_dfl_root

         cg-root (cpu, io, mem)     cg-root-pid (pid)
         /     \                    /     \
    t0 t1 t2   cg0 (io, mem)    t0 t1     cg1
               / \                       /    \
             t3  t4                     t2 t3 t4

t3 and t4 belong to a css-set [cg-root:(cpu), cg0:(io, mem), cg1:(pid)], called css-set-A
                               \_____ _____/
                                     v
                               this is a css
t2 belongs to a css-set [cg-root:(cpu, io, mem), cg1:(pid)], called css-set-B

This gives the following diagram:

         css-set-A     .--- cg-root
                   \  /
                    \/
         css-set-B-./\----- cg0
                    \ \
                     \ \
                      '-'-- cg1

Note : (1) css-set-A uses the css cg-root:(cpu), but it belongs to cg0
       (2) the many-to-many relationship between css-set and cgroup
           exists because both cgroup v1 and v2 exist in the kernel.
       (3) cg-root is actually the cgrp_dfl_root, which is also the root of
           cgroup v2
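
The many-to-many relationship can be represented by one record per
(css-set, cgroup) pair. To our understanding the kernel uses a dedicated link
structure (struct cgrp_cset_link) for this; the sketch below is a much
simpler, hypothetical model that stores names only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical link record: one entry per (css-set, cgroup) pair. */
struct model_link {
    const char *cset;
    const char *cgrp;
};

/*
 * Count the links that match the given css-set and/or cgroup name.
 * A NULL argument matches anything.
 */
static size_t model_count_links(const struct model_link *links, size_t n,
                                const char *cset, const char *cgrp)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (cset && strcmp(links[i].cset, cset) != 0)
            continue;
        if (cgrp && strcmp(links[i].cgrp, cgrp) != 0)
            continue;
        count++;
    }
    return count;
}
```

With the four links of the example (css-set-A to cg0 and cg1, css-set-B to
cg-root and cg1), one css-set maps to two cgroups and cg1 maps to two
css-sets.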

attach task

What needs to be done to attach a task to a cgroup ?