Redundancy and Raft
This document describes the Raft-based redundancy, replication, and failover system implemented in the Garak db service. It covers architecture, data flow, configuration, HTTP endpoints, how to reproduce and validate failover, troubleshooting, and next steps.
Scope: Applies to the code in db/ (notably db/src/raft/*, db/src/oplog.rs, db/src/node.rs, db/src/handlers/*).
Quick links:
- Core implementation: db/src/raft
- Oplog: db/src/oplog.rs
- Node disk writes: db/src/node.rs
- Main server: db/src/main.rs
- Failover test script: db/scripts/test-failover.sh
- Simple cluster test: db/scripts/test-raft-cluster.sh
High-level summary
- Garak uses an internal Raft implementation to provide leader-based replication for writes.
- All client write handlers require the node to be the leader (or the cluster to be in degraded mode, for very small clusters) to accept writes.
- Writes follow: local disk write → local oplog → Raft proposal → replication to followers → commit → followers apply to disk using internal (non-proposing) write functions.
- The system supports 1-2 node clusters with a degraded mode (which allows writes when quorum cannot be reached), as well as automatic leader election and leader transfer.
Design details & data flow
1) Client request path (write)
- Client sends a write (e.g. `/insert`, `/set`, `/delete`) to any node.
- The HTTP handlers check authorization and then call into the `database`/`node` APIs.
- `node::write_node()` (or `node::delete_node()`) does:
  - Check `raft::can_accept_writes()` (returns true only for the leader or in degraded mode)
  - Call `write_node_internal()`, which writes JSON bytes into `.dat` files and updates index `.i` files (file-based storage). This is the physical disk persist.
  - Write an entry into the `oplog` file (human/diagnostic persistent log) using `oplog::write_oplog()`.
- `oplog::write_oplog()` appends a time-prefixed JSON line to the local `oplog` file and then calls `raft::propose_oplog()`.
- `raft::propose_oplog()` wraps the oplog entry into a `RaftEntryData::Oplog` and calls `RaftServer::propose()`.
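A minimal sketch of this write path, with stubbed-in internals. The signatures, error type, and stub bodies below are illustrative assumptions, not the actual db/src code:

```rust
// Sketch of the leader-gated write path described above (simplified assumptions).
fn can_accept_writes() -> bool {
    // Real logic: leader, or degraded mode on 1-2 node clusters.
    true
}

fn write_node_internal(_store: &str, _json: &[u8]) -> Result<u64, String> {
    // Real logic: append JSON bytes to a .dat file and update the .i index file.
    Ok(1)
}

fn write_oplog(_store: &str, _node_id: u64, _json: &[u8]) -> Result<(), String> {
    // Real logic: append a time-prefixed JSON line to the oplog, then propose to Raft.
    Ok(())
}

/// Public write path: gate on Raft, persist locally, then log and propose.
fn write_node(store: &str, json: &[u8]) -> Result<u64, String> {
    if !can_accept_writes() {
        return Err("not leader: client should retry against the current leader".into());
    }
    let node_id = write_node_internal(store, json)?;
    write_oplog(store, node_id, json)?;
    Ok(node_id)
}

fn main() {
    let id = write_node("items", br#"{"name":"example"}"#).unwrap();
    println!("wrote node {id}");
}
```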
2) Raft proposal & replication
- `RaftServer::propose()` constructs a `RaftLogEntry` with the current term and next index and appends it to the leader's Raft log.
- If in normal mode (not degraded), it calls `try_replicate_and_commit()`, which sends `AppendEntries` RPCs to the other nodes and waits (synchronously in this implementation) for quorum.
- `send_append_entries()` serializes an `AppendEntriesRequest` containing the `entries` vector (these entries include the oplog write payload) and POSTs it to `/raft/append_entries` on followers.
- The follower's `/raft/append_entries` handler runs `handle_append_entries()`, which:
  - Validates the term, updates leader/term if necessary, and resets the election timeout
  - Ensures log consistency (`prev_log_index`, `prev_log_term`) and deletes conflicting entries
  - Appends the new entries to its Raft log
  - Advances `commit_index` based on `leader_commit` and calls `apply_committed_entries()`
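The sketch below illustrates the AppendEntries exchange and the follower-side consistency check described above. The struct shapes and field names are assumptions and may differ from the actual db/src/raft types:

```rust
// Illustrative AppendEntries types and follower handling (not the real db/src/raft code).
#[derive(Clone)]
struct RaftLogEntry {
    term: u64,
    index: u64,
    data: Vec<u8>, // serialized RaftEntryData (e.g. an oplog write)
}

struct AppendEntriesRequest {
    term: u64,
    leader_id: String,
    prev_log_index: u64,
    prev_log_term: u64,
    entries: Vec<RaftLogEntry>,
    leader_commit: u64,
}

struct AppendEntriesResponse {
    term: u64,
    success: bool,
}

/// Follower-side handling, roughly mirroring what handle_append_entries() is described to do.
fn handle_append_entries(
    log: &mut Vec<RaftLogEntry>,
    current_term: &mut u64,
    commit_index: &mut u64,
    req: AppendEntriesRequest,
) -> AppendEntriesResponse {
    // Stale leader: reject.
    if req.term < *current_term {
        return AppendEntriesResponse { term: *current_term, success: false };
    }
    *current_term = req.term; // adopt the newer term; election-timeout reset omitted here

    // Consistency check: our log must contain prev_log_index with prev_log_term.
    let prev_ok = req.prev_log_index == 0
        || log.iter().any(|e| e.index == req.prev_log_index && e.term == req.prev_log_term);
    if !prev_ok {
        return AppendEntriesResponse { term: *current_term, success: false };
    }

    // Drop any conflicting suffix and append the leader's entries.
    log.retain(|e| e.index <= req.prev_log_index);
    log.extend(req.entries);

    // Advance commit_index toward leader_commit; apply_committed_entries() would run here.
    let last_index = log.last().map(|e| e.index).unwrap_or(0);
    *commit_index = req.leader_commit.min(last_index);

    AppendEntriesResponse { term: *current_term, success: true }
}

fn main() {
    let mut log = Vec::new();
    let (mut term, mut commit) = (0u64, 0u64);
    let resp = handle_append_entries(&mut log, &mut term, &mut commit, AppendEntriesRequest {
        term: 1, leader_id: "node1".into(), prev_log_index: 0, prev_log_term: 0,
        entries: vec![RaftLogEntry { term: 1, index: 1, data: b"write".to_vec() }],
        leader_commit: 1,
    });
    println!("accepted: {}, commit_index: {commit}", resp.success);
}
```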
3) Applying committed entries
- `apply_committed_entries()` increments `last_applied` up to `commit_index` and calls `apply_entry()` for each entry.
- `apply_entry()` inspects `RaftEntryData`:
  - For `Oplog(...)::Write` it calls `node::write_node_internal()` on the follower (not the public `write_node`), so followers write to disk without re-proposing to Raft.
  - For deletes, it calls `node::delete_node_internal()`.
  - For `CreateIndex` entries, it performs `indexing::registry::write_index()`.
- This ensures the state machine is kept consistent across nodes in the order of Raft commits.
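A compact sketch of that apply step, using simplified stand-ins for `RaftEntryData` and the internal write functions. The variant names and signatures here are illustrative, not the exact ones in db/src:

```rust
// Sketch: applying committed entries via internal (non-proposing) calls.
enum RaftEntryData {
    OplogWrite { store: String, node_id: u64, json: Vec<u8> },
    OplogDelete { store: String, node_id: u64 },
    CreateIndex { store: String, field: String },
}

fn write_node_internal(store: &str, node_id: u64, _json: &[u8]) {
    println!("apply write: {store}/{node_id}"); // real code persists to .dat/.i files
}
fn delete_node_internal(store: &str, node_id: u64) {
    println!("apply delete: {store}/{node_id}");
}
fn write_index(store: &str, field: &str) {
    println!("apply create_index: {store}.{field}");
}

/// Followers apply entries in commit order with the internal calls,
/// so an applied entry is never re-proposed back into Raft.
fn apply_entry(entry: &RaftEntryData) {
    match entry {
        RaftEntryData::OplogWrite { store, node_id, json } =>
            write_node_internal(store, *node_id, json),
        RaftEntryData::OplogDelete { store, node_id } =>
            delete_node_internal(store, *node_id),
        RaftEntryData::CreateIndex { store, field } =>
            write_index(store, field),
    }
}

fn main() {
    apply_entry(&RaftEntryData::OplogWrite {
        store: "items".into(), node_id: 1, json: b"{}".to_vec(),
    });
}
```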
Degraded Mode
- For small clusters (1-2 nodes) the implementation supports a degraded mode when quorum can't be reached.
- Behavior: the leader enters degraded mode (logs show `Entering DEGRADED MODE`), applies writes locally, and increments a `degraded_writes_count` metric.
- When connectivity is restored and heartbeats show quorum, the leader attempts to exit degraded mode and resume normal replication.
- Note: while in degraded mode the node updates its commit state to allow followers to catch up later (implementation detail: commit index updated when applying entries in degraded mode).
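A rough sketch of the degraded-mode decision, under the assumptions that quorum is a simple majority and that only 1-2 node clusters may fall back to local writes. Field and function names are illustrative, not the actual db/src/raft state:

```rust
// Sketch: when a leader may accept writes despite lacking quorum (assumed thresholds).
struct RaftState {
    cluster_size: usize,
    reachable_peers: usize, // peers that answered recent heartbeats
    is_leader: bool,
    degraded: bool,
    degraded_writes_count: u64,
}

impl RaftState {
    fn has_quorum(&self) -> bool {
        // Self counts toward quorum: strict majority of cluster_size.
        self.reachable_peers + 1 > self.cluster_size / 2
    }

    /// Leader-side check used by the write path.
    fn can_accept_writes(&mut self) -> bool {
        if !self.is_leader {
            return false;
        }
        if self.has_quorum() {
            self.degraded = false; // connectivity restored: resume normal replication
            return true;
        }
        // Only very small clusters may fall back to degraded local writes.
        if self.cluster_size <= 2 {
            if !self.degraded {
                println!("Entering DEGRADED MODE");
            }
            self.degraded = true;
            self.degraded_writes_count += 1;
            return true;
        }
        false
    }
}

fn main() {
    let mut s = RaftState {
        cluster_size: 2, reachable_peers: 0, is_leader: true,
        degraded: false, degraded_writes_count: 0,
    };
    println!("accept write: {}", s.can_accept_writes()); // true, via degraded mode
}
```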
Leader election & transfer
- Nodes run a periodic `tick()` (background thread) which checks for election timeouts.
- When an election starts, `start_election()` sends `RequestVote` RPCs; nodes grant votes only if the candidate's log is at least as up-to-date as theirs.
- When a node becomes leader, it sends initial heartbeats (AppendEntries with no-op entries).
- There is a `transfer_leadership` RPC to request that a leader step down and hand leadership to a caught-up node; this is used for the primary-bias behavior.
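The "at least as up-to-date" rule follows the standard Raft comparison: higher last log term wins, and ties compare last log index. A small illustrative sketch (the request shape is assumed, not the exact db/src/raft type):

```rust
// Sketch of the vote-granting log comparison used during elections.
struct RequestVoteRequest {
    term: u64,
    candidate_id: String,
    last_log_index: u64,
    last_log_term: u64,
}

/// A voter grants its vote only if the candidate's log is at least as
/// up-to-date as its own.
fn candidate_log_is_up_to_date(req: &RequestVoteRequest,
                               my_last_term: u64, my_last_index: u64) -> bool {
    req.last_log_term > my_last_term
        || (req.last_log_term == my_last_term && req.last_log_index >= my_last_index)
}

fn main() {
    let req = RequestVoteRequest {
        term: 5, candidate_id: "node2".into(), last_log_index: 10, last_log_term: 4,
    };
    // Our last entry is term 4, index 12 -> candidate is behind, so the vote is denied.
    println!("grant vote: {}", candidate_log_is_up_to_date(&req, 4, 12));
}
```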
HTTP endpoints (Raft & admin)
Raft endpoints (protected by the `MASTER_TOKEN` header in the existing server routing):
- `POST /raft/append_entries` → `AppendEntriesRequest` → `AppendEntriesResponse`
- `POST /raft/request_vote` → `RequestVoteRequest` → `RequestVoteResponse`
- `POST /raft/transfer_leadership` → `TransferLeadershipRequest` → `TransferLeadershipResponse`
- `POST /raft/divergence_check` → `DivergenceCheckRequest` → `DivergenceCheckResponse`
- `GET /raft/status` → JSON status (term, role, leader, commit index, degraded-mode stats)
- `GET /is_primary` → JSON `{ raft_enabled, is_primary, node_id }` (unauthenticated; lightweight primary check)
Example usage:
curl -s http://127.0.0.1:7001/is_primary | jq .
Client-visible data endpoints (authenticated via Bearer tokens, or `MASTER_TOKEN` for admin):
- `POST /insert` – insert a node
- `POST /get` – read a node
- `POST /set` – set a field
- `POST /delete` – delete a node
- `POST /inc` – increment
- `POST /add_to_set` and `POST /pull` – set operations
- `POST /create_index` and `POST /reindex` – indexing ops (index creation is proposed through Raft)
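The write endpoints above only accept requests on the leader (or in degraded mode). A framework-agnostic sketch of how a handler might refuse a non-leader write with a 503 and a leader hint follows; the response shape and helper names are hypothetical, not the actual handler code:

```rust
// Sketch: rejecting writes on a non-leader with a 503 and a leader hint (illustrative only).
struct LeaderHint {
    status: u16,
    body: String,
}

fn current_leader_url() -> Option<String> {
    // Real code would read this from the Raft state.
    Some("http://127.0.0.1:7001".to_string())
}

fn can_accept_writes() -> bool {
    false // pretend this node is a follower
}

/// Called at the top of /insert, /set, /delete, etc.
fn reject_if_not_leader() -> Option<LeaderHint> {
    if can_accept_writes() {
        return None; // proceed with the write
    }
    let leader = current_leader_url().unwrap_or_default();
    Some(LeaderHint {
        status: 503,
        body: format!(r#"{{"error":"not_leader","leader":"{leader}"}}"#),
    })
}

fn main() {
    if let Some(resp) = reject_if_not_leader() {
        println!("{} {}", resp.status, resp.body);
    }
}
```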
Configuration & env vars
- The cluster uses environment variables to configure node identity and cluster membership. Key vars:
- `GARAK_SERVER_ID` – string ID for this node (e.g. `node1`)
- `RAFT_CLUSTER_NODES` – comma-separated list of `id@url` entries (e.g. `node1@http://host1:7001,node2@http://host2:7002`)
- `MASTER_TOKEN` – administrative token used for Raft and admin endpoints
- `PORT` – HTTP listen port
- `DATABASE_LOCATION` – filesystem root for node data (default `~/honeycomb/default`)
- `RAFT_PRIMARY_BIAS` – optional bias flag to favor this node as primary (leadership transfer behavior)
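A small sketch of parsing the `RAFT_CLUSTER_NODES` `id@url` format. The actual parsing in db/src may differ; this only illustrates the expected shape of the value:

```rust
// Sketch: splitting RAFT_CLUSTER_NODES into (node id, base URL) pairs.
use std::env;

fn parse_cluster_nodes(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter(|s| !s.trim().is_empty())
        .filter_map(|entry| {
            // Each entry is "<node-id>@<base-url>", e.g. "node1@http://host1:7001".
            let (id, url) = entry.trim().split_once('@')?;
            Some((id.to_string(), url.to_string()))
        })
        .collect()
}

fn main() {
    let raw = env::var("RAFT_CLUSTER_NODES")
        .unwrap_or_else(|_| "node1@http://127.0.0.1:7001,node2@http://127.0.0.1:7002".into());
    for (id, url) in parse_cluster_nodes(&raw) {
        println!("peer {id} -> {url}");
    }
}
```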
How to run a local 2-node cluster (quick)
- Build the release binary in `db/`:
cd db
cargo build --release
- Start node1:
DATABASE_LOCATION=/tmp/garak-node1 \
MASTER_TOKEN=test-token \
PORT=7001 \
GARAK_SERVER_ID=node1 \
RAFT_CLUSTER_NODES="node1@http://127.0.0.1:7001,node2@http://127.0.0.1:7002" \
RAFT_PRIMARY_BIAS=1 \
./target/release/garak
- Start node2:
DATABASE_LOCATION=/tmp/garak-node2 \
MASTER_TOKEN=test-token \
PORT=7002 \
GARAK_SERVER_ID=node2 \
RAFT_CLUSTER_NODES="node1@http://127.0.0.1:7001,node2@http://127.0.0.1:7002" \
./target/release/garak
- Example test scripts (already present in repo):
  - `db/scripts/test-raft-cluster.sh` — basic cluster and replication verification
  - `db/scripts/test-failover.sh` — full failover test (write, kill the primary, confirm the secondary is elected and accepts writes)
Reproducing the failover test manually
- Start both nodes as above.
- Send an insert to node1:
curl -s -X POST -H "Authorization: test-token" -H "Content-Type: application/json" \
-d '{"silo":"test","shard":"s1","store":"items","data":{"name":"failover-test","value":12345}}' \
http://127.0.0.1:7001/insert
- Confirm replication by reading from node2:
curl -s -X POST -H "Authorization: test-token" -H "Content-Type: application/json" \
-d '{"silo":"test","shard":"s1","store":"items","node_id":1}' \
http://127.0.0.1:7002/get | jq .
- Kill node1 (primary) and wait for node2 to become leader. Check:
kill -9 <node1-pid>
curl -H "Authorization: test-token" http://127.0.0.1:7002/raft/status | jq .
- Read the previously written data from node2. Also attempt a write to node2 to verify availability.
Test scripts
- `db/scripts/test-raft-cluster.sh` — quick cluster verification used during development.
- `db/scripts/test-failover.sh` — reproduces the full failover sequence and asserts the expected results. Run it from `db/`.
Troubleshooting
- If followers report `AppendEntries` errors like `Connection refused`, verify the follower's `MASTER_TOKEN` and `RAFT_CLUSTER_NODES` configuration.
- If replication seems to apply locally on the leader but not on followers:
  - Confirm `leader_commit` is being updated; the degraded-mode logic was updated to set `commit_index` when applying entries in degraded mode, so followers can catch up when connectivity returns.
- If handlers return 500s or panic, check the write handlers – they now return a leader-redirect 503 when the node cannot accept writes.
Limitations & known gaps
- Snapshotting / InstallSnapshot: the `InstallSnapshot` RPC is not wired into the cold-sync mechanism, so the Raft log can grow without compaction.
- Log compaction: there is no automatic snapshot and truncation logic yet. The Raft log is file-based and will need compaction once snapshots are introduced.
- Divergence resolution: a `resolve_divergence` / `divergence_check` mechanism exists, but more robust, automatic reconciliation may be needed for certain conflict scenarios.
- Performance: this is a minimal in-process Raft implementation intended for correctness and simplicity; a production deployment may prefer a battle-tested Raft library.
Next steps / Recommendations
- Wire the `InstallSnapshot` RPC into the existing cold-sync and snapshot mechanism.
- Implement log compaction and snapshot generation once the snapshot install path exists.
- Harden error paths and add metrics around replication latency and success rates.
- Add a CI test that runs the `test-failover.sh` script in a controlled environment (or port it to a unit test harness that simulates network partitions).
Possible documentation follow-ups:
- Convert this document to the repo's docs style and add cross-links to specific code locations.
- Add example curl snippets for each endpoint.
- Add a short troubleshooting playbook for common failures (token mismatch, unreachable nodes, file permissions).