OCFS2

Disk Format

O2CB Journal Replay
Cache Consistency

Disk Format

Basis


System Files


Ocfs2 Cluster Stack


OCFS2 is a shared-disk cluster filesystem.
It needs to sense the state of the cluster and serialize concurrent access to the shared disk.
o2cb is such a component, built into the ocfs2 kernel modules.

     +-----------------------------------+       +-----------------------------------+
     |              ocfs2                |       |              ocfs2                |
     +-------------------------+---------+       +---------+-------------------------+
     |       dlm glue          |         |       |         |        dlm glue         |
     +------------------+------+         |  OR   |         +-------------------------+
     |       o2dlm      | o2hb |         |       |  bdev   | dlm | /dev/ocfs2_control|
     +------+-----------+------+         |       |         +-------------------------+
     | o2nm |   o2net   |      bdev      |       +---------+
     +------+-----------+----------------+

Nodes


When setting up an ocfs2 cluster, we need ocfs2 to know the layout of the cluster:

/etc/ocfs2/cluster.conf

node:
        name = node0
        cluster = mycluster
        number = 0
        ip_address = 10.1.0.100
        ip_port = 7777

node:
        name = node1
        cluster = mycluster
        number = 1
        ip_address = 10.1.0.101
        ip_port = 7777

node:
        name = node2
        cluster = mycluster
        number = 2
        ip_address = 10.1.0.102
        ip_port = 7777

cluster:
        name = mycluster
        heartbeat_mode = local
        node_count = 3
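The stanza format above is simple enough to parse by hand. A minimal Python sketch of such a parser (the function name and inline sample are illustrative, not part of ocfs2-tools):

```python
def parse_cluster_conf(text):
    """Parse o2cb cluster.conf stanzas ("node:" / "cluster:") into dicts."""
    stanzas = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented "node:" / "cluster:" line starts a new stanza.
            current = {"type": line.rstrip(":").strip()}
            stanzas.append(current)
        else:
            # Indented "key = value" line belongs to the current stanza.
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return stanzas

conf = """\
node:
        name = node0
        cluster = mycluster
        number = 0
        ip_address = 10.1.0.100
        ip_port = 7777

cluster:
        name = mycluster
        heartbeat_mode = local
        node_count = 1
"""
stanzas = parse_cluster_conf(conf)
nodes = [s for s in stanzas if s["type"] == "node"]
print(nodes[0]["ip_address"])  # 10.1.0.100
```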

We install this information into the kernel through configfs:
+---- cluster
      +---- mycluster
            +---- fence_method
            +---- heartbeat
            +---- idle_timeout_ms
            +---- keepalive_delay_ms
            +---- node
            +     +---- node0
            +     +     +---- ipv4_address
            +     +     +---- ipv4_port
            +     +     +---- local
            +     +     +---- num
            +     +---- node1
            +           +---- ipv4_address
            +           +---- ipv4_port
            +           +---- local
            +           +---- num
            +---- reconnect_delay_ms

In the kernel, each node is represented by a struct o2nm_node.
Regarding the setup path, refer to the configfs attribute handlers o2nm_node_ipv4_address/ipv4_port/num/local_store. Note: there can be only one o2cb cluster per node
o2nm_cluster_group_make_group()
---
    if (o2nm_single_cluster)
        return ERR_PTR(-ENOSPC);
---
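Userspace installs these values by creating directories and writing attribute files under configfs (normally mounted at /sys/kernel/config): mkdir of a node directory triggers item creation in the kernel, and each attribute write lands in the corresponding *_store handler. A rough Python sketch of what a tool like o2cb_ctl does, with the mount point parameterized so the paths here are only illustrative (on real configfs the attribute files appear automatically after mkdir; this sketch creates plain files instead):

```python
import os

def install_node(configfs, cluster, name, num, addr, port, local=False):
    """Mimic o2cb_ctl: mkdir the node under the cluster, then write attrs."""
    node_dir = os.path.join(configfs, "cluster", cluster, "node", name)
    # On real configfs this mkdir invokes the kernel's make_item hook,
    # which allocates the o2nm_node and exposes its attribute files.
    os.makedirs(node_dir, exist_ok=True)
    for attr, value in [("ipv4_address", addr),
                        ("ipv4_port", str(port)),
                        ("num", str(num)),
                        ("local", "1" if local else "0")]:
        # Each write would hit o2nm_node_<attr>_store in the kernel.
        with open(os.path.join(node_dir, attr), "w") as f:
            f.write(value + "\n")

install_node("/tmp/fake_configfs", "mycluster", "node0", 0, "10.1.0.100", 7777,
             local=True)
```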

Heartbeat


The ocfs2 heartbeat is the way nodes in the cluster learn whether each other is alive.
It works by reading and writing a shared heartbeat region on disk: each node periodically writes to its own slot and reads every other node's slot.
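The idea can be modeled with per-node slots in a shared region: every node bumps a sequence number in its own slot, and a peer is declared dead once its sequence stops changing for enough read passes. A toy Python model (slot layout and threshold are made up for illustration; the real o2hb reads and writes disk blocks with timeouts):

```python
class HeartbeatRegion:
    """Toy shared-disk heartbeat region: one sequence counter per node slot."""
    def __init__(self, node_count):
        self.slots = [0] * node_count   # stands in for the on-disk blocks

    def write_beat(self, node):
        self.slots[node] += 1           # a node only ever writes its own slot

class HeartbeatMonitor:
    """One node's view of the region: detect peers whose slot stopped moving."""
    def __init__(self, region, dead_threshold=3):
        self.region = region
        self.last_seen = list(region.slots)
        self.misses = [0] * len(region.slots)
        self.dead_threshold = dead_threshold

    def check(self):
        """One read pass over all slots; returns the set of dead node numbers."""
        dead = set()
        for n, seq in enumerate(self.region.slots):
            if seq != self.last_seen[n]:
                self.last_seen[n] = seq
                self.misses[n] = 0
            else:
                self.misses[n] += 1
                if self.misses[n] >= self.dead_threshold:
                    dead.add(n)
        return dead

region = HeartbeatRegion(3)
mon = HeartbeatMonitor(region)
dead = set()
for _ in range(4):
    region.write_beat(0)   # nodes 0 and 1 keep beating,
    region.write_beat(1)   # node 2 never writes its slot
    dead = mon.check()
print(dead)  # node 2 is declared dead after enough missed beats
```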

Net

Setup


To set up the connections between the nodes in the cluster, we first need to configure each node's IPv4 address and port.
This is done through configfs when the cluster is set up. Refer to Nodes

The connection is established when heartbeat starts and the node sees that its siblings are alive.

Purpose


To understand the purpose of o2net, we need to see which messages it handles.

We can find this by tracking where message handlers are registered, namely the callers of o2net_register_handler(),
Heartbeat
 - O2HB_NEGO_TIMEOUT_MSG
 - O2HB_NEGO_APPROVE_MSG

O2DLM
 - DLM_MASTER_REQUEST_MSG
 - DLM_UNUSED_MSG1
 - DLM_ASSERT_MASTER_MSG
 - DLM_CREATE_LOCK_MSG
 - DLM_CONVERT_LOCK_MSG
 - DLM_PROXY_AST_MSG
 - DLM_UNLOCK_LOCK_MSG
 - DLM_DEREF_LOCKRES_MSG
 - DLM_MIGRATE_REQUEST_MSG
 - DLM_MIG_LOCKRES_MSG
 - DLM_QUERY_JOIN_MSG
 - DLM_ASSERT_JOINED_MSG
 - DLM_CANCEL_JOIN_MSG
 - DLM_EXIT_DOMAIN_MSG
 - DLM_MASTER_REQUERY_MSG
 - DLM_LOCK_REQUEST_MSG
 - DLM_RECO_DATA_DONE_MSG
 - DLM_BEGIN_RECO_MSG
 - DLM_FINALIZE_RECO_MSG
 - DLM_QUERY_REGION
 - DLM_QUERY_NODEINFO
 - DLM_BEGIN_EXIT_DOMAIN_MSG
 - DLM_DEREF_LOCKRES_DONE_MSG
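So o2net itself is just a transport: it dispatches each incoming message by its type (and key) to whatever handler a subsystem registered. A minimal Python model of that registry (the function names mirror o2net_register_handler/o2net_process_message, but the structure is simplified and the message-type value is made up):

```python
handlers = {}

def register_handler(msg_type, key, func):
    """Crude analogue of o2net_register_handler(): map (type, key) -> callback."""
    handlers[(msg_type, key)] = func

def process_message(msg_type, key, data):
    """Crude analogue of o2net_process_message(): look up and invoke."""
    func = handlers.get((msg_type, key))
    if func is None:
        return None     # no handler registered for this message type
    return func(data)

DLM_MASTER_REQUEST_MSG = 500    # illustrative value, not the kernel's
register_handler(DLM_MASTER_REQUEST_MSG, 0xabcd,
                 lambda data: ("master_request", data))
print(process_message(DLM_MASTER_REQUEST_MSG, 0xabcd, b"lockres"))
```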

Keepalive


A node which has given up on connecting to a majority
of nodes who are still heartbeating will fence itself.

o2net_process_message()
  -> o2net_sc_postpone_idle()
o2net_check_handshake()
  -> o2net_sc_reset_idle_timer()
  ---
    o2net_sc_cancel_delayed_work(sc, &sc->sc_keepalive_work);
    o2net_sc_queue_delayed_work(sc, &sc->sc_keepalive_work,
              msecs_to_jiffies(o2net_keepalive_delay()));
    o2net_set_sock_timer(sc);
    mod_timer(&sc->sc_idle_timeout,
           jiffies + msecs_to_jiffies(o2net_idle_timeout()));
  ---

When the delayed work sc_keepalive_work fires, it sends out a keepalive message;
when sc_idle_timeout expires, it means the peer node has lost its connection with us.
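The interaction of the two timers can be sketched as follows: any traffic re-arms both; the keepalive delay fires first and sends traffic, which resets the peer's idle timer; only when nothing at all arrives does the idle timeout expire and tear the connection down. A toy Python timeline (the 2000 ms / 30000 ms values follow the keepalive_delay_ms and idle_timeout_ms configfs knobs above; the class itself is illustrative):

```python
class Connection:
    """Toy model of o2net's keepalive and idle timers on one socket."""
    def __init__(self, keepalive_delay=2000, idle_timeout=30000):
        self.keepalive_delay = keepalive_delay
        self.idle_timeout = idle_timeout
        self.reset(now=0)

    def reset(self, now):
        """Analogue of o2net_sc_reset_idle_timer(): re-arm both timers."""
        self.keepalive_at = now + self.keepalive_delay
        self.idle_at = now + self.idle_timeout

    def on_message(self, now):
        """Analogue of o2net_sc_postpone_idle(): traffic postpones both."""
        self.reset(now)

    def tick(self, now):
        if now >= self.idle_at:
            return "idle: shut down connection"
        if now >= self.keepalive_at:
            self.keepalive_at = now + self.keepalive_delay
            return "send keepalive"
        return "ok"

c = Connection()
print(c.tick(2000))    # keepalive delay reached: send a keepalive
c.on_message(2500)     # incoming traffic re-arms both timers
print(c.tick(3000))    # still alive, nothing to do
print(c.tick(32500))   # nothing since t=2500: idle timeout fires
```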

Init when boot


The o2cb system service sets up the ocfs2 configuration at system boot.

o2cb.init.sh

online()
  -> online_o2cb()
    -> register_cluster_o2cb()
      -> o2cb register-cluster cluster
    -> start_global_heartbeat_o2cb()
      -> o2cb start-heartbeat cluster
When /etc/ocfs2/cluster.conf is set up properly, o2cb installs the configuration based on it.

Node Down


Besides providing communication capability (o2net), the main job of o2cb is
to sense the state of the nodes in the cluster. This is provided by o2hb.
Let's look at the callbacks registered for O2HB_NODE_DOWN_CB.
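Subsystems subscribe to node-up/node-down events by registering callbacks with o2hb, which runs them in priority order when an event fires. A minimal Python model of that callback chain (the event names mirror the kernel's, but the registry structure and the two example callbacks are simplified stand-ins):

```python
O2HB_NODE_DOWN_CB = "node_down"
O2HB_NODE_UP_CB = "node_up"

callbacks = {O2HB_NODE_DOWN_CB: [], O2HB_NODE_UP_CB: []}

def register_callback(event, func, priority=0):
    """Crude analogue of o2hb callback registration, ordered by priority."""
    callbacks[event].append((priority, func))
    callbacks[event].sort(key=lambda pf: pf[0])   # lower priority runs first

def fire(event, node_num):
    """Run every callback registered for this event, in order."""
    for _, func in callbacks[event]:
        func(node_num)

log = []
# Illustrative subscribers: the DLM evicts the dead node from its domain,
# and the filesystem kicks off recovery of that node's journal.
register_callback(O2HB_NODE_DOWN_CB, lambda n: log.append(f"dlm: evict node {n}"))
register_callback(O2HB_NODE_DOWN_CB,
                  lambda n: log.append(f"ocfs2: recover journal of node {n}"))
fire(O2HB_NODE_DOWN_CB, 2)
print(log)
```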

Journal Replay


When a node goes down, there may be journal records in its per-node journal file
that have not yet been checkpointed to the shared disk.
So a surviving node must replay the dead node's journal before the metadata it covers can be trusted.

Cache Consistency


    +----+      +----+     +----+ 
    | N0 |      | N1 |     | N2 |
    +----+      +----+     +----+
       |          |          |
       |          |          |
       -----------+-----------
                +---+
                | D |
                +---+
There are two key points in keeping the nodes' caches consistent: