RDMA

Concepts

A Transfer Example
Software ROCE

Concepts

Upper Level Protocols (ULPs)
ULPs fall into the following categories:
Network
• IPoIB (IP over IB)
• SDP (Sockets Direct Protocol)
• WSD (Winsock Direct, for Windows)
Storage
• NFS over RDMA
• SCSI RDMA Protocol (SRP)
• iSCSI Extensions for RDMA (iSER)
Computing – Clustering
• MPI (Message Passing Interface)
IB Software Transport Interface
Channel Interface (CI)
A combination of hardware, firmware, and software which provides services to the host

Verbs
Operations which a CI is expected to perform

Queue Pair (QP)
Represents a communication endpoint, like a socket
Consists of a Send Queue (SQ) and a Receive Queue (RQ)


Work Queue Element (WQE)
Requests a communication operation


Completion Queue (CQ)
Provides the status of completed operations


Memory Regions
System memory is "registered" so that local and remote channel adapters can access it
to source/sink communication data


A Transfer Example

The requester QP's SQ logic in CA X reads a 5KB message from its local memory and sends it to the target responder QP's RQ logic in CA Y.
                           CQ                   CQ
                           SQ                   RQ 
                           QP                   QP
                         CA  X       ---      CA  Y

Upon receipt of the message data, the responder QP's RQ logic uses the top entry (WQE) posted to its RQ to determine where to write
the incoming data in its CA's local memory.

Step 1: Posting the Message Receive Request

The software application on CA Y has posted a WR before the first request packet of the Send operation arrives.
This WR is supplied to CA Y's QP RQ by executing the Post Receive Request verb call.
It needs the following parameters:

                 CQ                   CQ
                 SQ                   RQ -> WQE (Receive)
                 QP                   QP
               CA  X       ---      CA  Y

Step 2: Posting the Message Send Request

The software application on CA X posts a WR on the QP's SQ through the Post Send Request verb call.

                 CQ                   CQ
   WQE (Send) <- SQ                   RQ -> WQE (Receive)
                 QP                   QP
               CA  X       ---      CA  Y

Step 3: "Send First" Request Packet Sent

Construct the packet based on the PMTU and the scatter buffer list in the WQE, then send it to CA Y through the underlying port.

PMTU 2KB
                 CQ                   CQ
   WQE (Send) <- SQ                   RQ -> WQE (Receive)
              PSN 101              ePSN 100
                 QP                   QP
               CA  X       --->     CA  Y
                      Send First 2KB
                         PSN==100

When the QP's RQ logic in CA Y receives this Send First packet, it will:
• Check whether there is a WQE on the RQ; if not, it sends an RNR NAK to the QP's SQ in CA X.
• Send a positive ACK.
• Write the data in the Send First packet to CA Y's local memory, as described by the WQE on the RQ.

Step 4: First ACK Packet Returned

PMTU 2KB
                 CQ                   CQ
   WQE (Send) <- SQ                   RQ -> WQE (Receive)
              PSN 101              ePSN 101
                 QP                   QP
               CA  X       <---     CA  Y
                        First ACK
                         PSN==100

Step 5: "Send Middle" Request Packet Sent and ACK Returned

Note: the SQ logic doesn't wait for the ACK of the request packet it just issued before it launches the next request into the fabric.

PMTU 2KB
                 CQ                   CQ
   WQE (Send) <- SQ                   RQ -> WQE (Receive)
              PSN 102              ePSN 101
                 QP                   QP
               CA  X       --->     CA  Y
                      Send Middle 2KB
                         PSN==101

Once CA Y receives the Send Middle packet:
• If the packet's PSN is greater than the ePSN, it sends back a PSN Sequence Error NAK.
• If the packet's PSN falls within the range of PSNs of request packets that were previously received,
  the RQ logic doesn't rewrite the packet data payload to memory, but it does send back a positive ACK.

PMTU 2KB
                 CQ                   CQ
   WQE (Send) <- SQ                   RQ -> WQE (Receive)
              PSN 102              ePSN 102
                 QP                   QP
               CA  X       <---     CA  Y
                           ACK
                         PSN==101

Step 6: "Send Last" Request Packet Sent

PMTU 2KB
                 CQ                   CQ
   WQE (Send) <- SQ                   RQ -> WQE (Receive)
              PSN 103              ePSN 102
                 QP                   QP
               CA  X       --->     CA  Y
                       Send Last 1KB
                         PSN==102

All packets of the message Send operation have now been received and written to CA Y's local memory.
• The RQ logic updates the ePSN and waits for the arrival of the first request packet of the next message transfer operation.
• The top WQE of the RQ is retired and a CQE is posted on the CQ associated with the RQ (an interrupt could be triggered at this moment).
• A positive ACK is sent.

PMTU 2KB
                 CQ                   CQ -> CQE
   WQE (Send) <- SQ                   RQ
              PSN 103              ePSN 103
                 QP                   QP
               CA  X       <---     CA  Y
                           ACK
                         PSN==102

Step 7: Final ACK Returned

PMTU 2KB
          CQE <- CQ                   CQ -> CQE
                 SQ                   RQ
              PSN 103              ePSN 103
                 QP                   QP
               CA  X       ---      CA  Y
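
The PSN bookkeeping in the steps above can be sketched as a small user-space model. This is only an illustration under stated assumptions: the type and function names (send_op, resp, responder_rx, send_message) are made up for this sketch and are not part of any verbs API, the opcodes are not the IB wire encoding, and the 24-bit PSN wraparound is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode names for this sketch (not the wire encoding). */
enum send_op { SEND_ONLY, SEND_FIRST, SEND_MIDDLE, SEND_LAST };

/* Hypothetical responder verdicts. */
enum resp { RESP_ACK, RESP_DUP_ACK, RESP_NAK_SEQ, RESP_RNR_NAK };

struct responder {
    uint32_t epsn;     /* expected PSN of the next new request packet */
    int      rq_wqes;  /* receive WQEs currently posted on the RQ */
    size_t   written;  /* bytes written to local memory */
    int      cqes;     /* CQEs posted on the CQ associated with the RQ */
};

/* Number of packets needed to carry len bytes with the given PMTU. */
static size_t packet_count(size_t len, size_t pmtu)
{
    return len == 0 ? 1 : (len + pmtu - 1) / pmtu;
}

/* Opcode of packet idx (0-based) out of n packets. */
static enum send_op packet_op(size_t idx, size_t n)
{
    if (n == 1)
        return SEND_ONLY;
    if (idx == 0)
        return SEND_FIRST;
    return idx == n - 1 ? SEND_LAST : SEND_MIDDLE;
}

/* RQ logic for one request packet (PSN wraparound ignored). */
static enum resp responder_rx(struct responder *r, uint32_t psn,
                              enum send_op op, size_t payload)
{
    if (psn > r->epsn)
        return RESP_NAK_SEQ;   /* a packet in between went missing */
    if (psn < r->epsn)
        return RESP_DUP_ACK;   /* duplicate: ACK it, do not rewrite memory */
    if (r->rq_wqes == 0)
        return RESP_RNR_NAK;   /* no receive WQE posted */

    r->written += payload;
    r->epsn++;
    if (op == SEND_LAST || op == SEND_ONLY) {
        r->rq_wqes--;          /* retire the top WQE of the RQ */
        r->cqes++;             /* post a CQE */
    }
    return RESP_ACK;
}

/* SQ logic: segment one message and hand every packet to the responder.
 * Returns the next PSN the requester would use. */
static uint32_t send_message(struct responder *r, uint32_t psn,
                             size_t len, size_t pmtu)
{
    size_t n = packet_count(len, pmtu);

    for (size_t i = 0; i < n; i++) {
        size_t payload = (i == n - 1) ? len - i * pmtu : pmtu;

        responder_rx(r, psn++, packet_op(i, n), payload);
    }
    return psn;
}
```

With a responder initialized to epsn = 100 and one posted receive WQE, calling send_message(&r, 100, 5 * 1024, 2 * 1024) walks through the same three packets (2KB, 2KB, 1KB) as the example and leaves both the requester's next PSN and the responder's ePSN at 103.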

Software ROCE

Components of rxe

rxe_task

rxe_task is an asynchronous execution mechanism.
Its asynchronous context is based on tasklets, which are rarely used nowadays.

Why does it use a tasklet instead of a workqueue?

The rxe_task is initialized through rxe_init_task.
rxe_do_task ensures that the task func is non-reentrant.

(between sync and async context)

This is achieved through the task state, which is protected by state_lock.

There are 3 task states.


When rxe_do_task finds that the task state is BUSY, it sets the state to ARMED and returns.
When the BUSY instance of rxe_do_task returns from the task func and finds that the task
state is ARMED, it executes the task func again.
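
The BUSY/ARMED handshake described above can be modelled in user space roughly as follows. This is a sketch, not the kernel code: the names TASK_START, task_init, do_task and example_func are invented here, the name of the idle state is an assumption, and a C11 atomic_flag spin loop stands in for state_lock. The example func kicks the task from inside itself to imitate a second caller arriving while the state is BUSY.

```c
#include <stdatomic.h>

/* TASK_START (idle) is an assumed name; BUSY and ARMED follow the notes. */
enum task_state { TASK_START, TASK_BUSY, TASK_ARMED };

struct task {
    atomic_flag state_lock;        /* stand-in for the kernel spinlock */
    enum task_state state;
    int (*func)(struct task *t);   /* returns nonzero when out of work */
    int runs;                      /* example bookkeeping */
    int depth;                     /* current nesting of func */
    int max_depth;                 /* stays 1 if func is never re-entered */
};

static void lock(struct task *t)
{
    while (atomic_flag_test_and_set(&t->state_lock))
        ;                          /* spin */
}

static void unlock(struct task *t)
{
    atomic_flag_clear(&t->state_lock);
}

static void task_init(struct task *t, int (*func)(struct task *t))
{
    atomic_flag_clear(&t->state_lock);
    t->state = TASK_START;
    t->func = func;
    t->runs = t->depth = t->max_depth = 0;
}

static void do_task(struct task *t)
{
    int cont;

    lock(t);
    if (t->state != TASK_START) {
        /* func is already running elsewhere: request one more pass */
        t->state = TASK_ARMED;
        unlock(t);
        return;
    }
    t->state = TASK_BUSY;
    unlock(t);

    do {
        int done = t->func(t);

        lock(t);
        if (t->state == TASK_ARMED) {
            t->state = TASK_BUSY;  /* somebody kicked us: run again */
            cont = 1;
        } else if (done) {
            t->state = TASK_START; /* idle again */
            cont = 0;
        } else {
            cont = 1;              /* func still has work */
        }
        unlock(t);
    } while (cont);
}

/* Example func: the first run kicks the task again while it is BUSY. */
static int example_func(struct task *t)
{
    t->depth++;
    if (t->depth > t->max_depth)
        t->max_depth = t->depth;
    t->runs++;
    if (t->runs == 1)
        do_task(t);                /* sees BUSY, sets ARMED, returns */
    t->depth--;
    return 1;
}
```

After task_init(&t, example_func) and a single do_task(&t), the func has run twice but never nested: the inner kick only flips the state to ARMED, and the outer loop picks it up after the func returns.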

rxe udp tunnel

https://lwn.net/Articles/614348/
Why UDP? Just about any network interface out there has hardware support for UDP at this
point, handling details like checksumming. UDP adds just enough information (port numbers,
in particular) to make the routing of encapsulated packets easy. UDP can also be made to
work with protocols like Receive Side Scaling (RSS) and the Equal-cost multipath routing
protocol (ECMP) to improve performance in highly connected settings. The advantages of UDP
tunneling are enough that some developers think it's going to become nearly ubiquitous in
the coming years.

The rxe UDP tunnel is set up as follows:
rxe_module_init
  -> rxe_net_init
    -> rxe_net_ipv4_init
      -> rxe_setup_udp_tunnel // port ROCE_V2_UDP_DPORT
        -> udp_sock_create
        -> setup_udp_tunnel_sock //encap_rcv rxe_udp_encap_recv


All packets sent to port ROCE_V2_UDP_DPORT are handled by this socket.


The encap_rcv callback is invoked:

udp_queue_rcv_skb
---
    if (static_key_false(&udp_encap_needed) && up->encap_type) {
        int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);

        encap_rcv = READ_ONCE(up->encap_rcv);
        if (encap_rcv) {
            int ret;

            /* Verify checksum before giving to encap */
            if (udp_lib_checksum_complete(skb))
                goto csum_error;

            ret = encap_rcv(sk, skb);
            if (ret <= 0) {
                __UDP_INC_STATS(sock_net(sk),
                        UDP_MIB_INDATAGRAMS,
                        is_udplite);
                return -ret;
            }
        }

        /* FALLTHROUGH -- it's a UDP Packet */
    }
---

rxe_udp_encap_recv delivers the skb to the rxe core through rxe_rcv.
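
The return-value contract visible in the excerpt above (a result <= 0 means the encap handler consumed the packet, a result > 0 means it falls through to normal UDP delivery) can be illustrated with a small user-space model. All names here (pkt, sock_model, queue_rcv, tunnel_consume, tunnel_decline) are hypothetical and exist only for this sketch; checksum verification and statistics are left out.

```c
#include <stddef.h>

struct pkt {
    size_t len;
};

typedef int (*encap_rcv_fn)(struct pkt *p);

struct sock_model {
    encap_rcv_fn encap_rcv;  /* NULL when no tunnel is attached */
    int encap_consumed;      /* packets taken by the encap handler */
    int udp_delivered;       /* packets handled as plain UDP */
};

/* A tunnel handler that always takes the packet. */
static int tunnel_consume(struct pkt *p)
{
    (void)p;
    return 0;
}

/* A handler that declines, so the packet stays a normal UDP packet. */
static int tunnel_decline(struct pkt *p)
{
    (void)p;
    return 1;
}

/* Mirrors the shape of the branch in the excerpt. */
static int queue_rcv(struct sock_model *sk, struct pkt *p)
{
    if (sk->encap_rcv) {
        int ret = sk->encap_rcv(p);

        if (ret <= 0) {
            sk->encap_consumed++;
            return -ret;
        }
    }
    /* fall through: it is a UDP packet */
    sk->udp_delivered++;
    return 0;
}
```

A socket whose encap_rcv is tunnel_consume never delivers anything to the plain UDP path, which matches the statement that all packets sent to the tunnel port are handled by the tunnel socket.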

rxe framework


Verbs 
               post send                      post receive
-------------------|------------------------------|-------
                   v [1]                          v [1]
                rxe_qp.sq                     rxe_qp.rq
                   | [2]                          | [2]
                   v                              v
rxe_task        rxe_requester  rxe_completer   rxe_responder
                (SQ Logic)        ^             (RQ Logic)
                  [3]             |                ^
                   | loopback ?   | [6]            |
                   +---------> rxe_rcv ------------'
                   | [4]          ^
                   v              | [5]
-----------------------------------------------------------
                         ip stack


[1]: producer, producer_addr/advance_producer under sq/rq.sq/rq_lock
[2]: consumer, req_next_wqe
[3]: construct UDP packets through init_req_packet and fill_packet
[4]: rxe_send->ip_local_out
[5]: rxe udp tunnel
[6]: delivered to the responder if the packet is a request, otherwise to the completer
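
Points [1], [2] and [6] can be sketched with a tiny ring queue and a dispatch helper. This is a simplified model under assumptions: the names (queue, post_wqe, next_wqe, advance_consumer, rcv_dispatch) and the fixed slot count are invented for illustration, and the locking that the posting side takes in the driver is omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SLOTS 4u  /* power of two; one slot always stays empty */

struct wqe {
    uint64_t wr_id;
};

struct queue {
    struct wqe slots[QUEUE_SLOTS];
    uint32_t producer;  /* advanced by post send / post receive, see [1] */
    uint32_t consumer;  /* advanced by the requester / responder, see [2] */
};

static int queue_empty(const struct queue *q)
{
    return q->producer == q->consumer;
}

static int queue_full(const struct queue *q)
{
    return ((q->producer + 1) & (QUEUE_SLOTS - 1)) == q->consumer;
}

/* [1] producer side: fill the slot, then advance the producer index. */
static int post_wqe(struct queue *q, uint64_t wr_id)
{
    if (queue_full(q))
        return -1;
    q->slots[q->producer].wr_id = wr_id;
    q->producer = (q->producer + 1) & (QUEUE_SLOTS - 1);
    return 0;
}

/* [2] consumer side: peek at the next WQE without removing it. */
static struct wqe *next_wqe(struct queue *q)
{
    return queue_empty(q) ? NULL : &q->slots[q->consumer];
}

static void advance_consumer(struct queue *q)
{
    q->consumer = (q->consumer + 1) & (QUEUE_SLOTS - 1);
}

/* [6] receive dispatch: requests go to the responder (RQ logic),
 * everything else (ACKs, read responses) goes to the completer. */
enum rcv_target { TO_RESPONDER, TO_COMPLETER };

static enum rcv_target rcv_dispatch(int is_request)
{
    return is_request ? TO_RESPONDER : TO_COMPLETER;
}
```

Keeping one slot empty lets the full and empty conditions be told apart using only the two indices, so a queue with 4 slots holds at most 3 outstanding WQEs in this model.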