Redundancy and Raft
This document describes the Raft-based redundancy, replication, and failover system implemented in the Garak db service. It covers architecture, data flow, configuration, HTTP endpoints, how to reproduce and validate failover, troubleshooting, and next steps.
Scope: Applies to the code in db/ (notably db/src/raft/*, db/src/oplog.rs, db/src/node.rs, db/src/handlers/*).
Quick links:
- Core implementation: db/src/raft
- Oplog: db/src/oplog.rs
- Node disk writes: db/src/node.rs
- Main server: db/src/main.rs
- Failover test script: db/scripts/test-failover.sh
- Simple cluster test: db/scripts/test-raft-cluster.sh
High-level summary
- Garak uses an internal Raft implementation to provide leader-based replication for writes.
- All client write handlers require the node to be the leader (or the cluster to be in degraded mode, for very small clusters) to accept writes.
- Writes follow: local disk write → local oplog → Raft proposal → replication to followers → commit → followers apply to disk using internal (non-proposing) write functions.
- The system supports 1-2 node clusters with a degraded mode (which allows writes when quorum cannot be reached), as well as automatic leader election and leader transfer.
Design details & data flow
1) Client request path (write)
- Client sends a write (e.g. `/insert`, `/set`, `/delete`) to any node.
- The HTTP handlers check authorization and then call into the `database`/`node` APIs.
- `node::write_node()` (or `node::delete_node()`) does:
  - Check `raft::can_accept_writes()` (returns true only for the leader or in degraded mode)
  - Call `write_node_internal()`, which writes JSON bytes into `.dat` files and updates index `.i` files (file-based storage). This is the physical disk persist.
  - Write an entry into the `oplog` file (human/diagnostic persistent log) using `oplog::write_oplog()`.
- `oplog::write_oplog()` appends a time-prefixed JSON line to the local `oplog` file and then calls `raft::propose_oplog()`.
- `raft::propose_oplog()` wraps the oplog entry into a `RaftEntryData::Oplog` and calls `RaftServer::propose()`.
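A minimal sketch of this write path, with stubbed-in internals. The signatures, error type, and stub bodies below are illustrative assumptions, not the actual db/src code:

```rust
// Sketch of the leader-gated write path described above (simplified assumptions).
fn can_accept_writes() -> bool {
    // Real logic: leader, or degraded mode on 1-2 node clusters.
    true
}

fn write_node_internal(_store: &str, _json: &[u8]) -> Result<u64, String> {
    // Real logic: append JSON bytes to a .dat file and update the .i index file.
    Ok(1)
}

fn write_oplog(_store: &str, _node_id: u64, _json: &[u8]) -> Result<(), String> {
    // Real logic: append a time-prefixed JSON line to the oplog, then propose to Raft.
    Ok(())
}

/// Public write path: gate on Raft, persist locally, then log and propose.
fn write_node(store: &str, json: &[u8]) -> Result<u64, String> {
    if !can_accept_writes() {
        return Err("not leader: client should retry against the current leader".into());
    }
    let node_id = write_node_internal(store, json)?;
    write_oplog(store, node_id, json)?;
    Ok(node_id)
}

fn main() {
    let id = write_node("items", br#"{"name":"example"}"#).unwrap();
    println!("wrote node {id}");
}
```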
2) Raft proposal & replication
- `RaftServer::propose()` constructs a `RaftLogEntry` with the current term and next index and appends it to the leader's Raft log.
- If in normal mode (not degraded), it calls `try_replicate_and_commit()`, which sends `AppendEntries` RPCs to the other nodes and waits (synchronously in this implementation) for quorum.
- `send_append_entries()` serializes an `AppendEntriesRequest` containing the `entries` vector (these entries include the oplog write payload) and POSTs it to `/raft/append_entries` on followers.
- The follower's `/raft/append_entries` handler runs `handle_append_entries()`, which:
  - Validates the term, updates leader/term if necessary, and resets the election timeout
  - Ensures log consistency (`prev_log_index`, `prev_log_term`) and deletes conflicting entries
  - Appends the new entries to its Raft log
  - Advances `commit_index` based on `leader_commit` and calls `apply_committed_entries()`
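The sketch below illustrates the AppendEntries exchange and the follower-side consistency check described above. The struct shapes and field names are assumptions and may differ from the actual db/src/raft types:

```rust
// Illustrative AppendEntries types and follower handling (not the real db/src/raft code).
#[derive(Clone)]
struct RaftLogEntry {
    term: u64,
    index: u64,
    data: Vec<u8>, // serialized RaftEntryData (e.g. an oplog write)
}

struct AppendEntriesRequest {
    term: u64,
    leader_id: String,
    prev_log_index: u64,
    prev_log_term: u64,
    entries: Vec<RaftLogEntry>,
    leader_commit: u64,
}

struct AppendEntriesResponse {
    term: u64,
    success: bool,
}

/// Follower-side handling, roughly mirroring what handle_append_entries() is described to do.
fn handle_append_entries(
    log: &mut Vec<RaftLogEntry>,
    current_term: &mut u64,
    commit_index: &mut u64,
    req: AppendEntriesRequest,
) -> AppendEntriesResponse {
    // Stale leader: reject.
    if req.term < *current_term {
        return AppendEntriesResponse { term: *current_term, success: false };
    }
    *current_term = req.term; // adopt the newer term; election-timeout reset omitted here

    // Consistency check: our log must contain prev_log_index with prev_log_term.
    let prev_ok = req.prev_log_index == 0
        || log.iter().any(|e| e.index == req.prev_log_index && e.term == req.prev_log_term);
    if !prev_ok {
        return AppendEntriesResponse { term: *current_term, success: false };
    }

    // Drop any conflicting suffix and append the leader's entries.
    log.retain(|e| e.index <= req.prev_log_index);
    log.extend(req.entries);

    // Advance commit_index toward leader_commit; apply_committed_entries() would run here.
    let last_index = log.last().map(|e| e.index).unwrap_or(0);
    *commit_index = req.leader_commit.min(last_index);

    AppendEntriesResponse { term: *current_term, success: true }
}

fn main() {
    let mut log = Vec::new();
    let (mut term, mut commit) = (0u64, 0u64);
    let resp = handle_append_entries(&mut log, &mut term, &mut commit, AppendEntriesRequest {
        term: 1, leader_id: "node1".into(), prev_log_index: 0, prev_log_term: 0,
        entries: vec![RaftLogEntry { term: 1, index: 1, data: b"write".to_vec() }],
        leader_commit: 1,
    });
    println!("accepted: {}, commit_index: {commit}", resp.success);
}
```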
3) Applying committed entries
- `apply_committed_entries()` increments `last_applied` up to `commit_index` and calls `apply_entry()` for each entry.
- `apply_entry()` inspects `RaftEntryData`:
  - For `Oplog(...)::Write` it calls `node::write_node_internal()` on the follower (not the public `write_node`), so followers write to disk without re-proposing to Raft.
  - For deletes, it calls `node::delete_node_internal()`.
  - For `CreateIndex` entries, it performs `indexing::registry::write_index()`.
- This ensures the state machine is kept consistent across nodes in the order of Raft commits.
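A compact sketch of that apply step, using simplified stand-ins for `RaftEntryData` and the internal write functions. The variant names and signatures here are illustrative, not the exact ones in db/src:

```rust
// Sketch: applying committed entries via internal (non-proposing) calls.
enum RaftEntryData {
    OplogWrite { store: String, node_id: u64, json: Vec<u8> },
    OplogDelete { store: String, node_id: u64 },
    CreateIndex { store: String, field: String },
}

fn write_node_internal(store: &str, node_id: u64, _json: &[u8]) {
    println!("apply write: {store}/{node_id}"); // real code persists to .dat/.i files
}
fn delete_node_internal(store: &str, node_id: u64) {
    println!("apply delete: {store}/{node_id}");
}
fn write_index(store: &str, field: &str) {
    println!("apply create_index: {store}.{field}");
}

/// Followers apply entries in commit order with the internal calls,
/// so an applied entry is never re-proposed back into Raft.
fn apply_entry(entry: &RaftEntryData) {
    match entry {
        RaftEntryData::OplogWrite { store, node_id, json } =>
            write_node_internal(store, *node_id, json),
        RaftEntryData::OplogDelete { store, node_id } =>
            delete_node_internal(store, *node_id),
        RaftEntryData::CreateIndex { store, field } =>
            write_index(store, field),
    }
}

fn main() {
    apply_entry(&RaftEntryData::OplogWrite {
        store: "items".into(), node_id: 1, json: b"{}".to_vec(),
    });
}
```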
Degraded Mode
- For small clusters (1-2 nodes) the implementation supports a degraded mode when quorum can't be reached.
- Behavior: the leader enters degraded mode (logs show `Entering DEGRADED MODE`), applies writes locally, and increments a `degraded_writes_count` metric.
- When connectivity is restored and heartbeats show quorum, the leader attempts to exit degraded mode and resume normal replication.
- Note: while in degraded mode the node updates its commit state to allow followers to catch up later (implementation detail: commit index updated when applying entries in degraded mode).
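A rough sketch of the degraded-mode decision, under the assumptions that quorum is a simple majority and that only 1-2 node clusters may fall back to local writes. Field and function names are illustrative, not the actual db/src/raft state:

```rust
// Sketch: when a leader may accept writes despite lacking quorum (assumed thresholds).
struct RaftState {
    cluster_size: usize,
    reachable_peers: usize, // peers that answered recent heartbeats
    is_leader: bool,
    degraded: bool,
    degraded_writes_count: u64,
}

impl RaftState {
    fn has_quorum(&self) -> bool {
        // Self counts toward quorum: strict majority of cluster_size.
        self.reachable_peers + 1 > self.cluster_size / 2
    }

    /// Leader-side check used by the write path.
    fn can_accept_writes(&mut self) -> bool {
        if !self.is_leader {
            return false;
        }
        if self.has_quorum() {
            self.degraded = false; // connectivity restored: resume normal replication
            return true;
        }
        // Only very small clusters may fall back to degraded local writes.
        if self.cluster_size <= 2 {
            if !self.degraded {
                println!("Entering DEGRADED MODE");
            }
            self.degraded = true;
            self.degraded_writes_count += 1;
            return true;
        }
        false
    }
}

fn main() {
    let mut s = RaftState {
        cluster_size: 2, reachable_peers: 0, is_leader: true,
        degraded: false, degraded_writes_count: 0,
    };
    println!("accept write: {}", s.can_accept_writes()); // true, via degraded mode
}
```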
Leader election & transfer
- Nodes run a periodic `tick()` (background thread) which checks for election timeouts.
- When an election starts, `start_election()` sends `RequestVote` RPCs; nodes grant votes only if the candidate's log is at least as up-to-date as theirs.
- When a node becomes leader, it sends initial heartbeats (AppendEntries with no-op entries).
- There is a `transfer_leadership` RPC to request that a leader step down and hand leadership to a caught-up node; this is used for the primary-bias behavior.
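The "at least as up-to-date" rule follows the standard Raft comparison: higher last log term wins, and ties compare last log index. A small illustrative sketch (the request shape is assumed, not the exact db/src/raft type):

```rust
// Sketch of the vote-granting log comparison used during elections.
struct RequestVoteRequest {
    term: u64,
    candidate_id: String,
    last_log_index: u64,
    last_log_term: u64,
}

/// A voter grants its vote only if the candidate's log is at least as
/// up-to-date as its own.
fn candidate_log_is_up_to_date(req: &RequestVoteRequest,
                               my_last_term: u64, my_last_index: u64) -> bool {
    req.last_log_term > my_last_term
        || (req.last_log_term == my_last_term && req.last_log_index >= my_last_index)
}

fn main() {
    let req = RequestVoteRequest {
        term: 5, candidate_id: "node2".into(), last_log_index: 10, last_log_term: 4,
    };
    // Our last entry is term 4, index 12 -> candidate is behind, so the vote is denied.
    println!("grant vote: {}", candidate_log_is_up_to_date(&req, 4, 12));
}
```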
HTTP endpoints (Raft & admin)
Raft endpoints (protected by the `MASTER_TOKEN` header in the existing server routing):
- `POST /raft/append_entries` → `AppendEntriesRequest` → `AppendEntriesResponse`
- `POST /raft/request_vote` → `RequestVoteRequest` → `RequestVoteResponse`
- `POST /raft/transfer_leadership` → `TransferLeadershipRequest` → `TransferLeadershipResponse`
- `POST /raft/divergence_check` → `DivergenceCheckRequest` → `DivergenceCheckResponse`
- `GET /raft/status` → JSON status (term, role, leader, commit index, degraded-mode stats)
- `GET /is_primary` → JSON `{ raft_enabled, is_primary, node_id }` (unauthenticated; lightweight primary check)
Example usage:
curl -s http://127.0.0.1:7001/is_primary | jq .
Client-visible data endpoints (authenticated via Bearer tokens, or `MASTER_TOKEN` for admin):
- `POST /insert` – insert a node
- `POST /get` – read a node
- `POST /set` – set a field
- `POST /delete` – delete a node
- `POST /inc` – increment
- `POST /add_to_set` and `POST /pull` – set operations
- `POST /create_index` and `POST /reindex` – indexing ops (index creation is proposed through Raft)
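The write endpoints above only accept requests on the leader (or in degraded mode). A framework-agnostic sketch of how a handler might refuse a non-leader write with a 503 and a leader hint follows; the response shape and helper names are hypothetical, not the actual handler code:

```rust
// Sketch: rejecting writes on a non-leader with a 503 and a leader hint (illustrative only).
struct LeaderHint {
    status: u16,
    body: String,
}

fn current_leader_url() -> Option<String> {
    // Real code would read this from the Raft state.
    Some("http://127.0.0.1:7001".to_string())
}

fn can_accept_writes() -> bool {
    false // pretend this node is a follower
}

/// Called at the top of /insert, /set, /delete, etc.
fn reject_if_not_leader() -> Option<LeaderHint> {
    if can_accept_writes() {
        return None; // proceed with the write
    }
    let leader = current_leader_url().unwrap_or_default();
    Some(LeaderHint {
        status: 503,
        body: format!(r#"{{"error":"not_leader","leader":"{leader}"}}"#),
    })
}

fn main() {
    if let Some(resp) = reject_if_not_leader() {
        println!("{} {}", resp.status, resp.body);
    }
}
```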
Configuration & env vars
- The cluster uses environment variables to configure node identity and cluster membership. Key vars:
- `GARAK_SERVER_ID` – string ID for this node (e.g. `node1`)
- `RAFT_CLUSTER_NODES` – comma-separated list of `id@url` entries (e.g. `node1@http://host1:7001,node2@http://host2:7002`)
- `MASTER_TOKEN` – administrative token used for Raft and admin endpoints
- `PORT` – HTTP listen port
- `DATABASE_LOCATION` – filesystem root for node data (default `~/honeycomb/default`)
- `RAFT_PRIMARY_BIAS` – optional bias flag to favor this node as primary (leadership transfer behavior)
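A small sketch of parsing the `RAFT_CLUSTER_NODES` `id@url` format. The actual parsing in db/src may differ; this only illustrates the expected shape of the value:

```rust
// Sketch: splitting RAFT_CLUSTER_NODES into (node id, base URL) pairs.
use std::env;

fn parse_cluster_nodes(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter(|s| !s.trim().is_empty())
        .filter_map(|entry| {
            // Each entry is "<node-id>@<base-url>", e.g. "node1@http://host1:7001".
            let (id, url) = entry.trim().split_once('@')?;
            Some((id.to_string(), url.to_string()))
        })
        .collect()
}

fn main() {
    let raw = env::var("RAFT_CLUSTER_NODES")
        .unwrap_or_else(|_| "node1@http://127.0.0.1:7001,node2@http://127.0.0.1:7002".into());
    for (id, url) in parse_cluster_nodes(&raw) {
        println!("peer {id} -> {url}");
    }
}
```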
How to run a local 2-node cluster (quick)
- Build the release binary in `db/`:
cd db
cargo build --release
- Start node1:
DATABASE_LOCATION=/tmp/garak-node1 \
MASTER_TOKEN=test-token \
PORT=7001 \
GARAK_SERVER_ID=node1 \
RAFT_CLUSTER_NODES="node1@http://127.0.0.1:7001,node2@http://127.0.0.1:7002" \
RAFT_PRIMARY_BIAS=1 \
./target/release/garak
- Start node2:
DATABASE_LOCATION=/tmp/garak-node2 \
MASTER_TOKEN=test-token \
PORT=7002 \
GARAK_SERVER_ID=node2 \
RAFT_CLUSTER_NODES="node1@http://127.0.0.1:7001,node2@http://127.0.0.1:7002" \
./target/release/garak
- Example test scripts (already present in repo):
  - `db/scripts/test-raft-cluster.sh` — basic cluster and replication verification
  - `db/scripts/test-failover.sh` — full failover test (write, kill the primary, confirm the secondary is elected and accepts writes)
Reproducing the failover test manually
- Start both nodes as above.
- Send an insert to node1:
curl -s -X POST -H "Authorization: test-token" -H "Content-Type: application/json" \
-d '{"silo":"test","shard":"s1","store":"items","data":{"name":"failover-test","value":12345}}' \
http://127.0.0.1:7001/insert
- Confirm replication by reading from node2:
curl -s -X POST -H "Authorization: test-token" -H "Content-Type: application/json" \
-d '{"silo":"test","shard":"s1","store":"items","node_id":1}' \
http://127.0.0.1:7002/get | jq .
- Kill node1 (primary) and wait for node2 to become leader. Check:
kill -9 <node1-pid>
curl -H "Authorization: test-token" http://127.0.0.1:7002/raft/status | jq .
- Read the previously written data from node2. Also attempt a write to node2 to verify availability.
Test scripts
- `db/scripts/test-raft-cluster.sh` — quick cluster verification used during development.
- `db/scripts/test-failover.sh` — reproduces the full failover sequence and asserts the expected results. Run it from `db/`.
Troubleshooting
- If followers report `AppendEntries` errors like `Connection refused`, verify the follower's `MASTER_TOKEN` and `RAFT_CLUSTER_NODES` configuration.
- If replication seems to apply locally on the leader but not on followers:
  - Confirm `leader_commit` is being updated; the degraded-mode logic was updated to set `commit_index` when applying entries in degraded mode, so followers can catch up when connectivity returns.
- If handlers return 500s or panic, check the write handlers – they now return a leader-redirect 503 when the node cannot accept writes.
Limitations & known gaps
- Snapshotting / InstallSnapshot: the `InstallSnapshot` RPC is not wired into the cold-sync mechanism, so the Raft log can grow without compaction.
- Log compaction: there is no automatic snapshot and truncation logic yet. The Raft log is file-based and will need compaction once snapshots are introduced.
- Divergence resolution: a `resolve_divergence` / `divergence_check` mechanism exists, but more robust, automatic reconciliation may be needed for certain conflict scenarios.
- Performance: this is a minimal in-process Raft implementation intended for correctness and simplicity; a production deployment may prefer a battle-tested Raft library.
Next steps / Recommendations
- Wire the `InstallSnapshot` RPC into the existing cold-sync and snapshot mechanism.
- Implement log compaction and snapshot generation once the snapshot install path exists.
- Harden error paths and add metrics around replication latency and success rates.
- Add a CI test that runs the `test-failover.sh` script in a controlled environment (or port it to a unit test harness that simulates network partitions).
Possible documentation follow-ups:
- Convert this document to the repo's docs style and add cross-links to specific code locations.
- Add example curl snippets for each endpoint.
- Add a short troubleshooting playbook for common failures (token mismatch, unreachable nodes, file permissions).