VFS

readahead

writeback
BH dcache pagecache fs misc

readahead

concepts

implementation

(Quote from the comment in mm/readahead.c)
The fields in struct file_ra_state represent the most-recently readahead
attempt.


                        |<----- async_size ---------|
     |------------------- size -------------------->|
     |==================#===========================|
     ^start             ^page marked with PG_readahead


pipelining

To overlap application thinking time and disk I/O time, we do `readahead pipelining':

Do not wait until the application consumed all  readahead pages and stalled on the
missing page at readahead_index;  Instead, submit an asynchronous readahead I/O as
soon as there are only async_size pages left in the readahead window.


Normally async_size will be equal to size, for maximum pipelining.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
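
As a quick way to poke this machinery from userspace, the readahead(2) syscall triggers the same page-cache readahead explicitly instead of relying on the on-demand heuristics. A minimal sketch (the file name "data.bin" is just a placeholder):

---
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Ask the kernel to read the first 1 MiB into the page cache
     * asynchronously; later read()s of this range should hit the cache. */
    if (readahead(fd, 0, 1024 * 1024) < 0)
        perror("readahead");

    close(fd);
    return 0;
}
---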


ondemand_readahead
---
    if ((offset == (ra->start + ra->size - ra->async_size) ||
         offset == (ra->start + ra->size))) {
        ra->start += ra->size;
        ra->size = get_next_ra_size(ra, max_pages);
        ra->async_size = ra->size; //for maximum pipelining
        goto readit;
    }
---


ra_submit
---
    return __do_page_cache_readahead(mapping, filp,
                    ra->start, ra->size, ra->async_size);
---

__do_page_cache_readahead
---
    for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
        pgoff_t page_offset = offset + page_idx;

        if (page_offset > end_index)
            break;

        page = xa_load(&mapping->i_pages, page_offset);
        ...

        page = __page_cache_alloc(gfp_mask);
        if (!page)
            break;
        page->index = page_offset;
        list_add(&page->lru, &page_pool);

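        /*
         * Mark the page that sits lookahead_size pages before the end of
         * this window; touching it later triggers the next async readahead.
         */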
        if (page_idx == nr_to_read - lookahead_size)
            SetPageReadahead(page);

        nr_pages++;
    }

    if (nr_pages)
        read_pages(mapping, filp, &page_pool, nr_pages, gfp_mask);

---


generic_file_buffered_read
---
        if (PageReadahead(page)) {
            page_cache_async_readahead(mapping,
                    ra, filp, page,
                    index, last_index - index);
        }
---

REQ_RAHEAD

When to set this flag ?

In the block layer, REQ_RAHEAD is not critical: it only marks a read as readahead, so such an I/O can fail silently and the data will simply be read again on demand.

fs metadata

Since the fs metadata goes through the pagecache of the block device,
could it use this readahead mechanism ?
Let's take some examples from ext4.

So we can see the fs metadata doesn't use this readahead path directly; a filesystem can implement its own readahead for metadata.

writeback

dirty balance

Look at the comment of balance_dirty_pages

balance_dirty_pages() must be called by processes which are generating dirty
data.  It looks at the number of dirty pages in the machine and will force
                                                                ^^^^^^^^^^
the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If we're over `background_thresh' then the writeback threads are woken to
perform some writeout.
In the normal case, writeback can avoid blocking the application,
but what about the case where there is continuous write I/O ?
                        pagecache
               write    +-------+
  application ------->  | Dirty |
                        +-------+
                        | Dirty |
                        +-------+
                        | Dirty |                  __________
                        +-------+  writeback      /         /|
                        | Dirty | ----------->   /_________/ |
                        +-------+                |         | /
                                                 |_________|/ 

With the dirty balance mechanism here, the bandwidth of the application's writes is
actually limited to the disk bandwidth. So with regard to latency, the writeback here
is not so helpful.

But combined with delayed allocation, writeback could avoid fragmentation.
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Anyway, let's look at how it works.
generic_perform_write
---
    do {
        ...
        offset = (pos & (PAGE_SIZE - 1));
        bytes = min_t(unsigned long, PAGE_SIZE - offset,
                        iov_iter_count(i));

again:
        ...
        status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                        &page, &fsdata);
        if (unlikely(status < 0))
            break;

        if (mapping_writably_mapped(mapping))
            flush_dcache_page(page);

        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
        flush_dcache_page(page);

        status = a_ops->write_end(file, mapping, pos, bytes, copied,
                        page, fsdata);
        if (unlikely(status < 0))
            break;
        copied = status;

        cond_resched();

        iov_iter_advance(i, copied);
        ...
        pos += copied;
        written += copied;

        balance_dirty_pages_ratelimited(mapping);
    } while (iov_iter_count(i));
---
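
To watch this throttling from userspace, one rough sketch (hypothetical file name "dirty.bin"; needs enough free disk space) is to keep issuing buffered writes and sample the Dirty counter in /proc/meminfo; once the thresholds are crossed, balance_dirty_pages_ratelimited() slows the write loop down to roughly the disk bandwidth.

---
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long read_dirty_kb(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Dirty: %ld kB", &kb) == 1)
            break;
    }
    fclose(f);
    return kb;
}

int main(void)
{
    FILE *out = fopen("dirty.bin", "w");
    char *buf = malloc(1 << 20);

    if (!out || !buf)
        return 1;
    memset(buf, 0xab, 1 << 20);

    for (int i = 0; i < 1024; i++) {    /* 1 GiB of buffered writes */
        fwrite(buf, 1, 1 << 20, out);
        if ((i + 1) % 128 == 0)
            printf("written %4d MiB, Dirty: %ld kB\n",
                   i + 1, read_dirty_kb());
    }
    fclose(out);
    free(buf);
    return 0;
}
---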

dirty timeout

queue_io is used to move the expired (timed-out) dirty inodes from wb->b_dirty or wb->b_dirty_time to wb->b_io.

move_expired_inodes
---
    if ((flags & EXPIRE_DIRTY_ATIME) == 0)
        older_than_this = work->older_than_this;
    else if (!work->for_sync) {
        expire_time = jiffies - (dirtytime_expire_interval * HZ);
        older_than_this = &expire_time;
    }
    while (!list_empty(delaying_queue)) {
        inode = wb_inode(delaying_queue->prev);

        // If the dirty time is before older_than_this, it will be moved.
                                ^^^^^^^^^^^^^^^^^^^^^^

        if (older_than_this &&
            inode_dirtied_after(inode, *older_than_this))
            break;
        list_move(&inode->i_io_list, &tmp);
        ...
    }
---
So how is older_than_this set ?
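
The short answer (please verify against wb_writeback() and wb_check_old_data_flush()) is that for the periodic kupdate-style writeback it is derived from the dirty expire interval, which is exposed as a sysctl. A tiny sketch to dump the relevant knobs (values in centiseconds):

---
#include <stdio.h>

static long read_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long v = -1;

    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    printf("dirty_expire_centisecs    = %ld\n",
           read_long("/proc/sys/vm/dirty_expire_centisecs"));
    printf("dirty_writeback_centisecs = %ld\n",
           read_long("/proc/sys/vm/dirty_writeback_centisecs"));
    return 0;
}
---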

hand page to writeback

per-queue wb

The bdi is per-queue instead of per-fs.

#STEP 0
blk_alloc_queue_node
---
    q->backing_dev_info = bdi_alloc_node(gfp_mask, node_id);
    if (!q->backing_dev_info)
        goto fail_split;

    q->stats = blk_alloc_queue_stats();
    if (!q->stats)
        goto fail_stats;

    q->backing_dev_info->ra_pages = VM_READAHEAD_PAGES;
    q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
    q->backing_dev_info->name = "block";
    q->node = node_id;

    timer_setup(&q->backing_dev_info->laptop_mode_wb_timer,
            laptop_mode_timer_fn, 0);
---

#STEP 1
__blkdev_get
---
    if (!bdev->bd_openers) {
        first_open = true;
        bdev->bd_disk = disk;
        bdev->bd_queue = disk->queue;
        bdev->bd_contains = bdev;
        bdev->bd_partno = partno;

        if (!partno) {
            ret = -ENXIO;
            bdev->bd_part = disk_get_part(disk, partno);
            if (!bdev->bd_part)
                goto out_clear;

            ret = 0;
            if (disk->fops->open) {
                ret = disk->fops->open(bdev, mode);
                ...
            }

            if (!ret) {
                bd_set_size(bdev,(loff_t)get_capacity(disk)<<9);
                set_init_blocksize(bdev);
            }

            ...
        } else {
            struct block_device *whole;
            whole = bdget_disk(disk, 0);
            ret = -ENOMEM;
            if (!whole)
                goto out_clear;
            BUG_ON(for_part);
            ret = __blkdev_get(whole, mode, 1);
            if (ret)
                goto out_clear;
            bdev->bd_contains = whole;
            bdev->bd_part = disk_get_part(disk, partno);
            if (!(disk->flags & GENHD_FL_UP) ||
                !bdev->bd_part || !bdev->bd_part->nr_sects) {
                ret = -ENXIO;
                goto out_clear;
            }
            bd_set_size(bdev, (loff_t)bdev->bd_part->nr_sects << 9);
            set_init_blocksize(bdev);
        }

        if (bdev->bd_bdi == &noop_backing_dev_info)
            bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
---

#STEP 2
mount_dev
---
    s = sget(fs_type, test_bdev_super, set_bdev_super, flags | SB_NOSEC,
         bdev);
---

static int set_bdev_super(struct super_block *s, void *data)
{
    s->s_bdev = data;
    s->s_dev = s->s_bdev->bd_dev;
    s->s_bdi = bdi_get(s->s_bdev->bd_bdi);

    return 0;
}

#STEP 3
balance_dirty_pages_ratelimited
  -> inode_to_bdi
  ---
    sb = inode->i_sb;
#ifdef CONFIG_BLOCK
    if (sb_is_blkdev_sb(sb))
        return I_BDEV(inode)->bd_bdi;
#endif
    return sb->s_bdi;
  ---
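
A quick way to see the per-queue bdi from userspace: every request queue registers its bdi under /sys/class/bdi/<major:minor>/ with its own read_ahead_kb and ratio settings. A small sketch that lists them:

---
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *dir = opendir("/sys/class/bdi");
    struct dirent *de;
    char path[512];

    if (!dir) {
        perror("opendir");
        return 1;
    }

    while ((de = readdir(dir)) != NULL) {
        FILE *f;
        long kb;

        if (de->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "/sys/class/bdi/%s/read_ahead_kb",
                 de->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;
        if (fscanf(f, "%ld", &kb) == 1)
            printf("bdi %s: read_ahead_kb=%ld\n", de->d_name, kb);
        fclose(f);
    }
    closedir(dir);
    return 0;
}
---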


writeback(legacy)

wb_do_writeback


The main difference between these writeback paths (explicit work items, periodic
kupdate flush, background flush) is how many pages they are allowed to write out.
For example,


BH

Historically, a buffer_head was used to map a single block within a page, and of
course as the unit of I/O through the filesystem and block layers.
Nowadays the basic I/O unit is the bio, and buffer_heads are used for extracting
block mappings (via a get_block_t call), for tracking state within a page and for
wrapping bio submission for backward compatibility reasons (e.g. submit_bh), as
the comment in include/linux/buffer_head.h puts it.

bh to bio

Look at the submit_bh_wbc to know the basic steps:

    ---
    bio = bio_alloc(GFP_NOIO, 1);

    if (wbc) {
        wbc_init_bio(wbc, bio);
        wbc_account_io(wbc, bh->b_page, bh->b_size);
    }

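    /* convert the fs block number into a 512-byte sector number */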
    bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
    bio_set_dev(bio, bh->b_bdev);
    bio->bi_write_hint = write_hint;

    bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
    BUG_ON(bio->bi_iter.bi_size != bh->b_size);

    bio->bi_end_io = end_bio_bh_io_sync;
                      -> bh->b_end_io(bh, !bio->bi_status);
    bio->bi_private = bh;

    /* Take care of bh's that straddle the end of the device */
    guard_bio_eod(op, bio);

    if (buffer_meta(bh))
        op_flags |= REQ_META;
    if (buffer_prio(bh))
        op_flags |= REQ_PRIO;
    bio_set_op_attrs(bio, op, op_flags);

    submit_bio(bio);
    ---

BH state

enum bh_state_bits defines the state bits of a bh and is stored in bh->b_state.
For each state bit there are macros that generate the set, clear and test helpers. They are defined in include/linux/buffer_head.h.


Let's look at how filesystems use them.
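
A standalone, simplified sketch of that pattern (plain bit operations instead of the kernel's atomic set_bit/clear_bit/test_bit; the real macro is BUFFER_FNS() in include/linux/buffer_head.h):

---
#include <stdio.h>

enum bh_state_bits { BH_Uptodate, BH_Dirty, BH_Lock };

struct buffer_head { unsigned long b_state; };

/* one macro expands into the set/clear/test helpers for a given state bit */
#define BUFFER_FNS(bit, name)                                           \
static void set_buffer_##name(struct buffer_head *bh)                   \
{ bh->b_state |= (1UL << (bit)); }                                       \
static void clear_buffer_##name(struct buffer_head *bh)                 \
{ bh->b_state &= ~(1UL << (bit)); }                                      \
static int buffer_##name(const struct buffer_head *bh)                  \
{ return (bh->b_state >> (bit)) & 1; }

BUFFER_FNS(BH_Uptodate, uptodate)
BUFFER_FNS(BH_Dirty, dirty)

int main(void)
{
    struct buffer_head bh = { 0 };

    set_buffer_uptodate(&bh);
    set_buffer_dirty(&bh);
    printf("uptodate=%d dirty=%d\n", buffer_uptodate(&bh), buffer_dirty(&bh));

    clear_buffer_dirty(&bh);
    clear_buffer_uptodate(&bh);
    printf("uptodate=%d dirty=%d\n", buffer_uptodate(&bh), buffer_dirty(&bh));
    return 0;
}
---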

dcache


dcache: dentry cache, the directory entry cache.

A dentry's core job is to represent a directory or file in the filesystem and to cache
the mapping between that file/directory and the associated inode. The inode
carries the core operations of the filesystem.
The dentries encode the fs tree structure and the names of files. The main parts of a dentry are the name (d_name), the parent pointer (d_parent), the associated inode (d_inode), and the linkage into the hash, LRU and sibling lists.

path_walking

The path walking is mainly done in link_path_walk(); let's look at its skeleton.

    for(;;) {
        ...
        hash_len = hash_name(nd->path.dentry, name);

        hash_name() calculates the length and hash of the path component:
        hash_len = len << 32 | hash (see the small sketch after this skeleton).
        The hash value is calculated from the pointer of the parent dentry and the component name.

        ...
        nd->last.hash_len = hash_len;
        nd->last.name = name;
        nd->last_type = type;

        nd->last is the name component we are currently walking on.
        link_path_walk() leaves the last component of the path to do_last().

        name += hashlen_len(hash_len);
        if (!*name)
            goto OK;
        /*
         * If it wasn't NUL, we know it was '/'. Skip that
         * slash, and continue until no more slashes.
         */
        do {
            name++;
        } while (unlikely(*name == '/'));
        if (unlikely(!*name)) {
            ...
        } else {
            /* not the last component */
            err = walk_component(nd, WALK_FOLLOW | WALK_MORE);
        }
        ...
    }
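
A tiny standalone sketch of the hash_len packing mentioned above (mirroring hashlen_create()/hashlen_len()/hashlen_hash() from include/linux/stringhash.h; the hash value here is just a placeholder):

---
#include <stdint.h>
#include <stdio.h>

static uint64_t hashlen_create(uint32_t hash, uint32_t len)
{
    return ((uint64_t)len << 32) | hash;    /* len in the top 32 bits */
}

static uint32_t hashlen_len(uint64_t hash_len)  { return hash_len >> 32; }
static uint32_t hashlen_hash(uint64_t hash_len) { return (uint32_t)hash_len; }

int main(void)
{
    uint64_t hash_len = hashlen_create(0xdeadbeef, 7);  /* e.g. "Desktop" */

    printf("len=%u hash=%#x\n", hashlen_len(hash_len), hashlen_hash(hash_len));
    return 0;
}
---
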
walk_component will mainly do 3 things:
1. try to get the dentry in the cache
lookup_fast
  -> __d_lookup(&nd->path.dentry, &nd->last)
    -> get hash list by d_hash(name->hash)
2. if not in cache, try to get it from fs
lookup_slow
  -> __lookup_slow
    -> d_alloc_parallel // allocate dentry
    -> inode->i_op->lookup

    // this may issue some I/O to read the filesystem metadata of the directory
    // and the inode.

3. follow_managed
   mountpoint will be resolved here.

locks_of_dentry

Refer to Documentation/filesystems/path-lookup.txt
The dcache is used to speed up looking up the inode associated with a path name. These lookups can come from multiple cores concurrently and frequently, so the locking mechanism is very important. Let's look into the locking of the dentry cache next and find out how it improves performance.

Documentation/filesystems/path-lookup.txt keeps saying that we "would like to do path walking without taking locks or reference
counts of intermediate dentries along the path." Why ?

Look into the path lookup process,
    [0]            [2]                    [4]
   +---+      +---------+            +-----------+
   |   v      |         v            |           v
/home/will/Desktop/wangjianchao/source_code/linux-stable/Makefile 
       |      ^         |            ^           |           ^
       +------+         +------------+           +-----------+
          [1]                 [3]                     [5]

[0]  dentry of "/", "home"
[1]  dentry of "home", "will"
[2]  dentry of "will", "Desktop"
[3]  dentry of "Desktop", "wangjianchao"
[4]  dentry of "wangjianchao", "source_code"
[5]  dentry of "source_code", "Makefile"

walk_component will be executed for [0] ~ [5], and lookup_fast will be invoked
every time. Each time, the component dentry's d_lock has to be taken to
serialize access to the dentry.
__d_lookup
---
        spin_lock(&dentry->d_lock);
        if (dentry->d_parent != parent)
            goto next;
        if (d_unhashed(dentry))
            goto next;

        if (!d_same_name(dentry, parent, name))
            goto next;

        dentry->d_lockref.count++;
        found = dentry;
        spin_unlock(&dentry->d_lock);
---

The contention on the lock of the dentries of "home", "will" and "Desktop" should be
very high. On a system with a lot of cores, the dentry cache could become a
scalability problem for workloads which perform a lot of lookups.
Currently, there are two path walking modes: ref-walk and rcu-walk.
The 'storing to shared data' means that, in ref-walk, it needs to take the d_lock and bump the reference count of every intermediate dentry along the path.
To kill these writes to shared data, rcu-walk does the following: it walks under rcu_read_lock() and validates each dentry against its d_seq sequence count instead of locking it.
Who would write_seqcount the dentry->d_seq ?
There are two points why a seqlock is better than a spinlock in an almost-read scenario: readers do not dirty any shared cache line, and readers never block each other.
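
A minimal userspace sketch of the seqcount read side that rcu-walk relies on (simplified, not the kernel implementation; all names hypothetical): the reader writes nothing shared and simply retries if the sequence changed underneath it.

---
#include <stdatomic.h>
#include <stdio.h>

struct name_seq {
    atomic_uint seq;    /* even = stable, odd = writer in progress */
    char name[32];
};

static unsigned read_begin(struct name_seq *s)
{
    unsigned v;

    while ((v = atomic_load(&s->seq)) & 1)
        ;               /* writer active, wait for a stable snapshot */
    return v;
}

static int read_retry(struct name_seq *s, unsigned v)
{
    return atomic_load(&s->seq) != v;
}

int main(void)
{
    struct name_seq s = { .seq = 0, .name = "Desktop" };
    char copy[32];
    unsigned v;

    do {
        v = read_begin(&s);
        snprintf(copy, sizeof(copy), "%s", s.name);
    } while (read_retry(&s, v));    /* retry if a rename raced with us */

    printf("read: %s\n", copy);
    return 0;
}
---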

lookup in parallel

There are two parts to a dentry lookup, the fast path and the slow path.
Let's look at the slow path here.
Quote from here:

https://lwn.net/Articles/685108/

All directory operations are done with the inode mutex (i_mutex) held, which prevents anything else 
from touching that directory. But the most common operation, lookup, is non-destructive, so there is
no real conceptual reason to stop it from happening in parallel.
The typical scenario could be

CPU0     CPU1      CPU2     CPU3     CPU4
T0       T1        T2       T3       T4
  \      \         |        /        /
    \      \       |      /        /
      \      \     |    /        /
               
              /var/log/ 
          T0  T1  T2  T3  T4

If all of the dentries of T0 ~ T4 happen to be not in memory, all of them have
to invoke lookup_slow.

If the lock here is a mutex, the performance will be very bad.
So a rw_semaphore was introduced to replace the mutex.
static struct dentry *lookup_slow(const struct qstr *name,
                  struct dentry *dir,
                  unsigned int flags)
{
    struct inode *inode = dir->d_inode;
    struct dentry *res;
    inode_lock_shared(inode);
      -> down_read(&inode->i_rwsem); 
    res = __lookup_slow(name, dir, flags);
    inode_unlock_shared(inode);
    return res;
}

But there is a problem: the mutex currently protects the directory entry (dentry). A lookup operation can cause dentries to be created, which can lead to races if two dentries are created for the same name.
How to handle this ?
Look at the d_alloc_parallel.
---
    struct hlist_bl_head *b = in_lookup_hash(parent, hash);

    // alloc a dentry structure here.

    struct dentry *new = d_alloc(parent, name);
 
retry:
    rcu_read_lock();
    r_seq = read_seqbegin(&rename_lock);

    // look up a dentry with (parent, name) in the hash cache;
    // someone could be creating the same one concurrently.

    dentry = __d_lookup_rcu(parent, name, &d_seq);
    if (unlikely(dentry)) {
        ...

    // did anything change on the dentry ?

        if (read_seqcount_retry(&dentry->d_seq, d_seq)) {
            rcu_read_unlock();
            dput(dentry);
            goto retry;
        }
        rcu_read_unlock();
        dput(new);
        return dentry;
    }
    if (unlikely(read_seqretry(&rename_lock, r_seq))) {
        rcu_read_unlock();
        goto retry;
    }

    hlist_bl_lock(b);

    // A spin lock here.
    // So only one task can enter this critical section at a time; namely,
    // only one of the concurrent lookups with the same (parent, name) pair
    // can add its dentry to the hash cache, the others have to wait. When
    // they enter this critical section, a dentry with the same (parent, name)
    // pair is already there.
    // At that moment, there are 2 cases:
    //  - the dentry is in-lookup, indicating inode->i_op->lookup is ongoing;
    //    we have to wait.
    //  - otherwise, the lookup has completed, we can return this dentry
    //    directly.

    hlist_bl_for_each_entry(dentry, node, b, d_u.d_in_lookup_hash) {
        if (dentry->d_name.hash != hash)
            continue;
        if (dentry->d_parent != parent)
            continue;
        if (!d_same_name(dentry, parent, name))
            continue;
        hlist_bl_unlock(b);
        /* now we can try to grab a reference */
        if (!lockref_get_not_dead(&dentry->d_lockref)) {
            rcu_read_unlock();
            goto retry;
        }

        rcu_read_unlock();
        /*
         * somebody is likely to be still doing lookup for it;
         * wait for them to finish
         */
        spin_lock(&dentry->d_lock);
        d_wait_lookup(dentry);
        if (unlikely(dentry->d_name.hash != hash))
            goto mismatch;
        if (unlikely(dentry->d_parent != parent))
            goto mismatch;
        if (unlikely(d_unhashed(dentry)))
            goto mismatch;
        if (unlikely(!d_same_name(dentry, parent, name)))
            goto mismatch;
        /* OK, it *is* a hashed match; return it */
        spin_unlock(&dentry->d_lock);
        dput(new);
        return dentry;
    }
    rcu_read_unlock();
    /* we can't take ->d_lock here; it's OK, though. */
    new->d_flags |= DCACHE_PAR_LOOKUP; // dentry in-lookup is set here.
    new->d_wait = wq;
    hlist_bl_add_head_rcu(&new->d_u.d_in_lookup_hash, b);
    hlist_bl_unlock(b);
    return new;

---

pagecache

lifecycle of pagecache


grow pagecache
pagecache_get_page
  -> find_get_entry
  -> __page_cache_alloc
  -> add_to_page_cache_lru
    -> __add_to_page_cache_locked
    -> lru_cache_add
      -> __lru_cache_add
      ---
        struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

        get_page(page);
        if (!pagevec_add(pvec, page) || PageCompound(page))

             ^^^^^^^^^^^ [1]

            __pagevec_lru_add(pvec);

           ^^^^^^^^^^^^^^^^^^^^^^^^ [2]

        put_cpu_var(lru_add_pvec);
      ---

The interesting thing here is that the page is first added into a per-cpu pagevec.
If the per-cpu pagevec is full, __pagevec_lru_add() drains its pages onto the
lru list in one go.
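
A userspace sketch of that batching idea (hypothetical stand-in types, not kernel code): additions go into a small per-cpu array and the shared, lock-protected LRU list is only touched once per batch, which amortizes the cost of the lru lock.

---
#include <stdio.h>

#define PAGEVEC_SIZE 15

struct pagevec {
    unsigned nr;
    int pages[PAGEVEC_SIZE];    /* stand-in for struct page pointers */
};

/* returns the space left after adding; 0 means "now full, please drain" */
static unsigned pagevec_add(struct pagevec *pvec, int page)
{
    pvec->pages[pvec->nr++] = page;
    return PAGEVEC_SIZE - pvec->nr;
}

/* stand-in for __pagevec_lru_add(): take the (expensive) lru lock once
 * and move the whole batch onto the LRU list */
static void pagevec_lru_add(struct pagevec *pvec)
{
    printf("draining %u pages onto the LRU under one lock acquisition\n",
           pvec->nr);
    pvec->nr = 0;
}

int main(void)
{
    struct pagevec pvec = { 0 };

    for (int page = 0; page < 40; page++)
        if (!pagevec_add(&pvec, page))
            pagevec_lru_add(&pvec);
    if (pvec.nr)
        pagevec_lru_add(&pvec);
    return 0;
}
---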

__pagevec_lru_add
  -> __pagevec_lru_add_fn
  ---
    SetPageLRU(page);
    smp_mb();

    if (page_evictable(page)) {
        lru = page_lru(page);
    } else {
        ...
    }

    add_page_to_lru_list(page, lruvec, lru);
  ---

When draining pages into the lru, we need to select an lru list for each page. This is done by page_lru().

static __always_inline enum lru_list page_lru(struct page *page)
{
    enum lru_list lru;

    if (PageUnevictable(page))
        lru = LRU_UNEVICTABLE;
    else {
        lru = page_lru_base_type(page);
        if (PageActive(page))
            lru += LRU_ACTIVE;
    }
    return lru;
}

enum lru_list {
    LRU_INACTIVE_ANON = LRU_BASE,
    LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
    LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
    LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
    LRU_UNEVICTABLE,
    NR_LRU_LISTS
};

pagecache in meminfo
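
Roughly speaking, the page cache shows up in /proc/meminfo as "Cached" (file pages, excluding the buffer cache and swap cache) plus "Buffers" (block device pages, e.g. fs metadata), with "Dirty" and "Writeback" tracking its dirty state. A small sketch to dump those counters:

---
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "Cached:", 7) ||
            !strncmp(line, "Buffers:", 8) ||
            !strncmp(line, "Dirty:", 6) ||
            !strncmp(line, "Writeback:", 10))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
---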

fs misc


The truth of page lock

block_read_full_page(), __block_write_full_page() and __block_write_begin() all will create buffer_heads for the page.

create_page_buffers()
    -> create_empty_buffers()
        -> attach_page_buffers()
            -> SetPagePrivate()
            -> set_page_private()
The truth of the page lock and bh lock
In the process of read operations
do_generic_file_read()
    -> page_cache_sync_readahead() // if page is not present
        -> ondemand_readahead()
            -> ra_submit()
                -> __do_page_cache_readahead()
                    -> read_pages()
                        -> mapping->a_ops->readpages()
                           ext4_mpage_readpages()
                            -> add_to_page_cache_lru()
                                -> __set_page_locked() // page is locked ------> Here
                            -> block_read_full_page() // if page has buffers
                                -> lock_buffer() // the buffer_head is locked -----> Here
                                -> mark_buffer_async_read()
                                    //bh->b_end_io = end_buffer_async_read
                                -> submit_bh()
    -> !PageUptodate() && !trylock_page() , go to page_not_up_to_date
    -> lock_page_killable()
    -> if PageUptodate(), unlock_page() and goto page_ok
Where are the page and buffer_head unlocked ?
static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
{
    unsigned long flags;
    struct buffer_head *first;
    struct buffer_head *tmp;
    struct page *page;
    int page_uptodate = 1;

    BUG_ON(!buffer_async_read(bh));

    page = bh->b_page;
    if (uptodate) {
        set_buffer_uptodate(bh);
    } else {
        clear_buffer_uptodate(bh);
        buffer_io_error(bh, ", async page read");
        SetPageError(page);
    }

    /*
     * Be _very_ careful from here on. Bad things can happen if
     * two buffer heads end IO at almost the same time and both
     * decide that the page is now completely done.
     */
    first = page_buffers(page);
    local_irq_save(flags);
    bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
    clear_buffer_async_read(bh);
    unlock_buffer(bh);
    tmp = bh;
    do {
        if (!buffer_uptodate(tmp))
            page_uptodate = 0;
        if (buffer_async_read(tmp)) {
            BUG_ON(!buffer_locked(tmp)); //This could prove that the buffer_head is locked during the read process
            goto still_busy;
        }
        tmp = tmp->b_this_page;
    } while (tmp != bh);
    bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
    local_irq_restore(flags);

    /*
     * If none of the buffers had errors and they are all
     * uptodate then we can set the page uptodate.
     */
    if (page_uptodate && !PageError(page))
        SetPageUptodate(page);
    unlock_page(page);
    return;

still_busy:
    bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
    local_irq_restore(flags);
    return;
}
If all the buffers in that page are uptodate, the page will be set uptodate and unlocked.
We can see that both the page and the buffer_head are locked during the read process. The lock keeps the page exclusive because the device needs to write data into the page through DMA. What about the write operations ?
The write process is divided into two parts.
1> write the user data into page cache
generic_perform_write()
    -> a_ops->write_begin()
       ext4_write_begin()
        -> grab_cache_page_write_begin()
            -> pagecache_get_page()
                -> find_get_entry()
                If found
                -> lock_page() //page is locked
                otherwise
                -> add_to_page_cache_lru()
                    -> __set_page_locked() //page is locked
        -> unlock_page()
        -> ext4_journal_start() // For why unlock_page() is called before
    ext4_journal_start(), refer to the comment in ext4_write_begin()
        -> lock_page() // the page is relocked
        -> wait_for_stable_page()

    -> iov_iter_copy_from_user_atomic()

    -> a_ops->write_end()
        -> block_write_end()
            -> __block_commit_write()
                -> set_buffer_uptodate()
                -> mark_buffer_dirty()
                -> SetPageUptodate() // if no partial
        -> unlock_page() // page is unlocked
The page lock will ensure the page is exclusive from other operations while the user data is being copied into it.
2> write back the dirty page
ext4_writepages()
    -> blk_start_plug()
    -> write_cache_pages() //Go here when in journal mode because this mode
    does not support delayed allocation. We use this branch to demonstrate the
    page and bh lock because I really didn't find where the lock_page is
    located in the other path.
        -> lock_page() -------> Here
        -> wait_on_page_writeback() when PageWriteback() // keep the write back atomic
        -> clear_page_dirty_for_io()
        -> __writepage()
            -> ext4_writepage()
                -> ext4_bio_write_page()
                    -> set_page_writeback() //very important
                    -> set_buffer_async_write()
                    -> io_submit_add_bh()
                        -> ext4_io_submit()
                        -> io_submit_init_bio()
                            // set bi_end_io = ext4_end_bio()
                    -> unlock_page() ----> Here
                    
    -> blk_finish_plug()


ext4_end_bio()
    -> ext4_finish_bio()
        -> clear_buffer_async_write()
        -> end_page_writeback() // !under_io
            -> test_clear_page_writeback()
            -> wake_up_page(page, PG_writeback);

The page is not locked during the write-back I/O. Instead, the PG_writeback flag is set to ensure the atomicity of the operations on the page.
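
A userspace analogy (pthreads; all names here are hypothetical stand-ins) for the PG_writeback protocol described above: the page is not kept locked for the whole I/O, only a writeback flag is set; anyone who needs the page stable waits on that flag, and the I/O completion clears it and wakes the waiters, as end_page_writeback() does via wake_up_page().

---
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool writeback;

static void set_page_writeback(void)
{
    pthread_mutex_lock(&lock);
    writeback = true;
    pthread_mutex_unlock(&lock);
}

static void end_page_writeback(void)
{
    pthread_mutex_lock(&lock);
    writeback = false;
    pthread_cond_broadcast(&cond);  /* like wake_up_page(page, PG_writeback) */
    pthread_mutex_unlock(&lock);
}

static void wait_on_page_writeback(void)
{
    pthread_mutex_lock(&lock);
    while (writeback)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}

static void *io_completion(void *arg)
{
    (void)arg;
    sleep(1);                       /* pretend the bio is in flight */
    end_page_writeback();
    return NULL;
}

int main(void)
{
    pthread_t t;

    set_page_writeback();
    pthread_create(&t, NULL, io_completion, NULL);

    wait_on_page_writeback();       /* e.g. a second writeback of the same page */
    printf("writeback finished, page is stable again\n");

    pthread_join(t, NULL);
    return 0;
}
---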