IO-URING

Submission
Blocking
Linking
Completion

Poll is not polling
io-wq is not wq

Submission


SYSCALL io_uring_enter
  -> io_submit_sqes // runs under the ctx->uring_lock mutex
---
    struct io_submit_state state, *statep = NULL;
    struct io_kiocb *link = NULL;
    int i, submitted = 0;

    ...
    /* make sure SQ entry isn't read before tail */
    nr = min3(nr, ctx->sq_entries, io_sqring_entries(ctx));

    if (!percpu_ref_tryget_many(&ctx->refs, nr))
        return -EAGAIN;

    if (nr > IO_PLUG_THRESHOLD) { // IO_PLUG_THRESHOLD == 2
        io_submit_state_start(&state, nr);

        // blk_start_plug() is invoked here.
        // Even on high-speed devices, plugging and merging are welcome,
        // because the overhead of the IO path can be shared by multiple IOs.

        statep = &state;
    }

    ctx->ring_fd = ring_fd;
    ctx->ring_file = ring_file;

    for (i = 0; i < nr; i++) {
        const struct io_uring_sqe *sqe;
        struct io_kiocb *req;
        int err;

        sqe = io_get_sqe(ctx);
        ...
        req = io_alloc_req(ctx, statep);

        // Try to allocate reqs in batch


        err = io_init_req(ctx, req, sqe, statep);
        io_consume_sqe(ctx); //only update ctx->cached_sq_head here
        /* will complete beyond this point, count as submitted */
        submitted++;
        ...
        err = io_submit_sqe(req, sqe, &link);
    }
    ...
    if (statep)
        io_submit_state_end(&state);
        // blk_finish_plug() is invoked here

    /* Commit SQ ring head once we've consumed and submitted all SQEs */
    io_commit_sqring(ctx);
---
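
For context, here is a minimal userspace sketch (using liburing; the file
path is only an example) of what drives this path: io_uring_submit() calls
io_uring_enter(2), which is where io_submit_sqes() above runs.

---
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[4096];
    int fd, ret;

    /* io_uring_setup(2): create the SQ/CQ rings with 8 entries */
    if (io_uring_queue_init(8, &ring, 0) < 0)
        exit(1);

    fd = open("/etc/hostname", O_RDONLY); /* any readable file */
    if (fd < 0)
        exit(1);

    /* fill one sqe; this only writes into the shared SQ ring */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

    /* io_uring_submit() calls io_uring_enter(2), which runs
     * io_submit_sqes() under ctx->uring_lock */
    ret = io_uring_submit(&ring);

    /* wait for the matching cqe and mark it consumed */
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return ret < 0 ? 1 : 0;
}
---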

Blocking


io_uring supports real async IO for both direct IO and buffered IO.


// Link-handling code is omitted here; it is covered in the Linking section

io_submit_sqe
  -> io_queue_sqe
    -> __io_queue_sqe
    ---
    ret = io_issue_sqe(req, sqe, true);

      -> io_read_prep()
        -> io_prep_rw()
        ---
        if (force_nonblock)
            kiocb->ki_flags |= IOCB_NOWAIT;
        if (ctx->flags & IORING_SETUP_IOPOLL) {
            if (!(kiocb->ki_flags & IOCB_DIRECT) ||
                !kiocb->ki_filp->f_op->iopoll)
                return -EOPNOTSUPP;
                                                        
            kiocb->ki_flags |= IOCB_HIPRI; // do polling in block layer
            kiocb->ki_complete = io_complete_rw_iopoll;
            req->result = 0;
            req->iopoll_completed = 0;
        }
        ---
      -> io_read()


    /*
     * We async punt it if the file wasn't marked NOWAIT, or if the file
     * doesn't support non-blocking read/write attempts
     */
    if (ret == -EAGAIN && (!(req->flags & REQ_F_NOWAIT) ||
        (req->flags & REQ_F_MUST_PUNT))) {
        if (io_arm_poll_handler(req)) {
            if (linked_timeout)
                io_queue_linked_timeout(linked_timeout);
            goto exit;
        }
punt:
        io_req_init_async(req);
        ...
        /*
         * Queued up for async execution, worker will release
         * submit reference when the iocb is actually submitted.
         */
        io_queue_async_work(req);
        goto exit;
    }

    ---
There are two points here. First, the initial issue attempt is made non-blocking: io_issue_sqe() is called with force_nonblock set, so io_prep_rw() adds IOCB_NOWAIT. Second, if that attempt returns -EAGAIN, the request is punted instead of blocking the submitter: either a poll handler is armed on the file (io_arm_poll_handler(), see "Poll is not polling" below) or the request is handed to the io-wq workers (io_queue_async_work(), see "io-wq is not wq" below).
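
As an aside, the IORING_SETUP_IOPOLL branch in io_prep_rw() above only
accepts files opened with O_DIRECT whose f_op implements ->iopoll. A rough
userspace sketch of that mode follows; the NVMe device path and the 4KB size
are just examples, and it assumes the driver actually has polling queues
enabled (e.g. nvme poll_queues).

---
#define _GNU_SOURCE /* O_DIRECT */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    void *buf;
    int fd;

    /* IOPOLL mode: completions are reaped by polling the block layer
     * (IOCB_HIPRI) instead of waiting for an interrupt */
    if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL) < 0)
        exit(1);

    /* io_prep_rw() rejects IOPOLL unless the file is O_DIRECT and the
     * driver implements ->iopoll (e.g. NVMe) */
    fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0)
        exit(1);

    /* O_DIRECT needs an aligned buffer */
    if (posix_memalign(&buf, 4096, 4096))
        exit(1);

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    /* with IOPOLL, waiting for the cqe polls for completion in the
     * kernel rather than sleeping on an irq-driven wakeup */
    io_uring_wait_cqe(&ring, &cqe);
    printf("res=%d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
---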

Linking


Linked sqes provide a way to describe dependencies between a sequence
of sqes within the greater submission ring, where each sqe execution
depends on the successful completion of the previous sqe.
At submit time, a chain of linked sqes looks like this:

     sqe(REQ_F_LINK) -> sqe(REQ_F_LINK) -> sqe(REQ_F_LINK) -> sqe
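
From userspace, such a chain is built with the IOSQE_IO_LINK sqe flag, which
becomes REQ_F_LINK on the in-kernel request. A minimal liburing sketch,
assuming the ring and fd are already set up (write-then-fsync is the classic
use case; the helper name is made up):

---
#include <liburing.h>
#include <stdlib.h>

/* Submit "write, then fsync" as one linked chain: the fsync sqe is only
 * issued after the write has completed successfully. */
int write_and_fsync(struct io_uring *ring, int fd,
                    const char *buf, unsigned len)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, buf, len, 0);
    /* IOSQE_IO_LINK turns into REQ_F_LINK on the kernel side */
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, fd, 0);
    /* no link flag here: this sqe terminates the chain */

    /* one io_uring_enter() submits the whole chain */
    return io_uring_submit(ring);
}
---
If the write fails, the fsync is not issued; it is completed with -ECANCELED
instead (see the completion side below).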

io_submit_sqe
---
    if (*link) {
        struct io_kiocb *head = *link;
        ...
        if (io_alloc_async_ctx(req))
            return -EAGAIN;

        ret = io_req_defer_prep(req, sqe);
        ...
        list_add_tail(&req->link_list, &head->link_list);

        /* last request of a link, enqueue the link */

        if (!(req->flags & (REQ_F_LINK | REQ_F_HARDLINK))) {
            io_queue_link_head(head);

            ---
            if (unlikely(req->flags & REQ_F_FAIL_LINK)) {
                io_cqring_add_event(req, -ECANCELED);
                io_double_put_req(req);
            } else
                io_queue_sqe(req, NULL);
            ---

            *link = NULL;
        }
    } else {
        ...
        if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) {
            req->flags |= REQ_F_LINK_HEAD;
            INIT_LIST_HEAD(&req->link_list);

            if (io_alloc_async_ctx(req))
                return -EAGAIN;

            ret = io_req_defer_prep(req, sqe);
            if (ret)
                req->flags |= REQ_F_FAIL_LINK;
            *link = req;
        } else {
            io_queue_sqe(req, sqe);
        }
    }
---
On the completion side, io_free_req() kicks off the next request in the chain:
    io_free_req()
    ---
    io_req_find_next(req, &nxt);
    __io_free_req(req);


    if (nxt)
        io_queue_async_work(nxt);

    ---

io_req_find_next()
  -> io_fail_links() // if REQ_F_FAIL_LINK is set,
     fails all of the remaining linked sqes with -ECANCELED
  -> io_req_link_next()
     otherwise, gets the next req in the chain and marks it REQ_F_LINK_HEAD
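
The -ECANCELED propagation is visible from userspace. A small sketch,
assuming an initialized ring (the invalid fd is deliberate and the function
name is made up):

---
#include <liburing.h>
#include <stdio.h>

/* Demonstrate io_fail_links(): when the head of a chain fails, the rest
 * of the chain is completed with -ECANCELED. */
void show_fail_link(struct io_uring *ring)
{
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[64];
    int i;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, -1 /* bad fd */, buf, sizeof(buf), 0);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_nop(sqe); /* linked behind the failing read */

    io_uring_submit(ring);

    for (i = 0; i < 2; i++) {
        io_uring_wait_cqe(ring, &cqe);
        /* expected: -EBADF for the read, -ECANCELED for the nop */
        printf("cqe %d: res=%d\n", i, cqe->res);
        io_uring_cqe_seen(ring, cqe);
    }
}
---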

Completion


How does the userland application know that its IO has been completed?
Every request eventually posts a cqe on the completion ring; the application
either polls the CQ tail from userspace or sleeps in io_uring_enter(2) with
IORING_ENTER_GETEVENTS until completions arrive. The more interesting question
is who posts that cqe after a request has been punted, and there are two
points here, covered by the next two sections.
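
With liburing the consumer side looks roughly like this (a sketch; user_data
bookkeeping is omitted and the helper name is made up):

---
#include <liburing.h>
#include <stdio.h>

/* Reap one completion: peek without a syscall first, then block in
 * io_uring_enter(IORING_ENTER_GETEVENTS) if nothing is ready yet. */
void reap_one(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    if (io_uring_peek_cqe(ring, &cqe) != 0) {
        /* nothing in the CQ ring yet: sleep in io_uring_enter(2)
         * until the kernel posts a cqe */
        io_uring_wait_cqe(ring, &cqe);
    }

    printf("user_data=%llu res=%d\n",
           (unsigned long long)cqe->user_data, cqe->res);

    /* advance the CQ head so the kernel can reuse the slot */
    io_uring_cqe_seen(ring, cqe);
}
---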

Poll is not polling



io_arm_poll_handler()
---
    ipt.pt._qproc = io_async_queue_proc;

    ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask,
                    io_async_wake);
---
It installs a wait_queue_entry on the wait_queue_head_t of the underlying file,
with io_async_wake as the wake-up function.

tcp_poll
  -> sock_poll_wait()
    -> poll_wait(filp, &sock->wq.wait, p)
      -> p->_qproc(filp, wait_address, p)
         io_async_queue_proc()
           -> __io_queue_proc()
           ---
            pt->error = 0;
            poll->head = head;
            add_wait_queue(head, &poll->wait);
           ---
         
tcp_data_ready()
  -> sk->sk_data_ready()
     sock_def_readable
  ---
    rcu_read_lock();
    wq = rcu_dereference(sk->sk_wq);
    if (skwq_has_sleeper(wq))
        wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN | EPOLLPRI |
                        EPOLLRDNORM | EPOLLRDBAND);
    ...
    rcu_read_unlock();
  ---

The wakeup callback is invoked at this point; in our case it is io_async_wake:
io_async_wake
  -> __io_async_wake
  ---
    list_del_init(&poll->wait.entry);

    tsk = req->task;

    req->result = mask;
    init_task_work(&req->task_work, func);

    ret = task_work_add(tsk, &req->task_work, true);
    ...
    wake_up_process(tsk);
  ---

The interesting thing is that the retry is still executed by the original task
that armed the poll handler: task_work_add() queues the work on that task, and
task_work_run() is hooked into the task's return-to-usermode path (and the
signal-delivery path), so the request is re-issued in the submitting task's
context the next time it leaves the kernel. The wake_up_process() above nudges
the task in case it is sleeping, e.g. in io_cqring_wait().
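
The effect is visible from userspace: a recv on a socket with no pending data
neither blocks the submitter nor ties up an io-wq worker. A sketch, assuming a
connected TCP socket (the helper name is made up):

---
#include <liburing.h>
#include <stdio.h>

/* Async recv on a socket that has no data yet: __io_queue_sqe() gets
 * -EAGAIN, io_arm_poll_handler() hooks io_async_wake into the socket's
 * waitqueue, and no io-wq worker is consumed in the meantime. */
void async_recv(struct io_uring *ring, int sockfd, void *buf, unsigned len)
{
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, sockfd, buf, len, 0);
    io_uring_submit(ring);

    /* when the peer sends data, sock_def_readable() calls io_async_wake(),
     * which queues task_work; the retry runs in this task's context and
     * posts the cqe we reap here */
    io_uring_wait_cqe(ring, &cqe);
    printf("recv res=%d\n", cqe->res);
    io_uring_cqe_seen(ring, cqe);
}
---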

io-wq is not wq


LWN io-wq

Hashed work


A call to io_wq_enqueue() adds the work to the queue for future execution.
The io_wq_enqueue_hashed() variant, instead, is one of the reasons for the
creation of new mechanism; it guarantees that no two jobs enqueued with the
same val will run concurrently. If an application submits multiple buffered
I/O requests for a single file, they should not be run concurrently or they
are likely to just block each other via lock contention. Buffered I/O on
different files can and should run concurrently, though. "Hashed" work entries
make it easy for io_uring to manage that concurrency in an optimal way.
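
io-wq enforces this inside its own work queues (the next section walks through
the real code), but the guarantee itself can be sketched in userspace with one
lock per hash bucket; every name below is made up for illustration:

---
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NR_BUCKETS 64

/* One lock per hash bucket: two items that hash to the same key never run
 * at the same time, items with different keys usually can. */
static pthread_mutex_t bucket_lock[NR_BUCKETS];

struct work {
    uint64_t key;              /* e.g. the inode of the target file */
    void (*fn)(struct work *);
};

static void run_hashed(struct work *w)
{
    unsigned int b = w->key % NR_BUCKETS;

    pthread_mutex_lock(&bucket_lock[b]);
    w->fn(w);                  /* serialized per key */
    pthread_mutex_unlock(&bucket_lock[b]);
}

static void do_buffered_write(struct work *w)
{
    printf("write for key %llu\n", (unsigned long long)w->key);
}

static void *worker(void *arg)   /* stand-in for an io-wq worker thread */
{
    run_hashed(arg);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    struct work w[4] = {
        { .key = 1, .fn = do_buffered_write },  /* same file: serialized */
        { .key = 1, .fn = do_buffered_write },
        { .key = 2, .fn = do_buffered_write },  /* other files: may run */
        { .key = 3, .fn = do_buffered_write },  /* concurrently */
    };
    int i;

    for (i = 0; i < NR_BUCKETS; i++)
        pthread_mutex_init(&bucket_lock[i], NULL);
    for (i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, &w[i]);
    for (i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
---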
For comparison, the earlier async_list mechanism optimized sequential buffered
read/write IO by letting a single workqueue item handle multiple sequential
requests; its commit message explains the motivation:
commit 31b515106428b9717d2b6475b6f6182cf231b1e6
Author: Jens Axboe 
Date:   Fri Jan 18 22:56:34 2019 -0700

    io_uring: allow workqueue item to handle multiple buffered requests
    
    Right now we punt any buffered request that ends up triggering an
    -EAGAIN to an async workqueue. This works fine in terms of providing
    async execution of them, but it also can create quite a lot of work
    queue items. For sequentially buffered IO, it's advantageous to
    serialize the issue of them. For reads, the first one will trigger a
    read-ahead, and subsequent request merely end up waiting on later pages
    to complete. For writes, devices usually respond better to streamed
    sequential writes.
    
    Add state to track the last buffered request we punted to a work queue,
    and if the next one is sequential to the previous, attempt to get the
    previous work item to handle it. We limit the number of sequential
    add-ons to the a multiple (8) of the max read-ahead size of the file.
    This should be a good number for both reads and wries, as it defines the
    max IO size the device can do directly.
    
    This drastically cuts down on the number of context switches we need to
    handle buffered sequential IO, and a basic test case of copying a big
    file with io_uring sees a 5x speedup.
    
    Reviewed-by: Hannes Reinecke 
    Signed-off-by: Jens Axboe 

Let's look at how hashed work is implemented in io-wq.