Architecture
Garak is a purpose-built document database designed for high-performance, index-driven data access patterns.
Core Concepts
Nodes
The fundamental unit of storage in Garak is a node. Each node:
- Represents a single document
- Has a unique, sequential node_id (u128)
- Is stored in blocks on disk
- Can only be mutated via its node_id
Stores
A store represents a collection of nodes, analogous to a collection or table in other databases. Stores contain:
- Node data organized in blocks
- Metadata about the store (item count, lock status)
- Associated indexes for querying
const usersStore = new Garak.Store(db, "users");
Sharding
Garak implements physical data sharding, allowing complete data isolation at the filesystem level:
const tx = db.tx("tenant-id"); // Each shard is stored separately on disk
Shard Naming Rules:
- Cannot contain . (dots)
- Cannot contain / (forward slashes)
- Cannot contain :: (double colons)
- Cannot start with _ (underscore)
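These rules are easy to check up front. The helper below is an illustrative sketch, not part of the Garak API; it simply encodes the four constraints listed above.
// Illustrative only: validates a shard name against the documented rules.
function isValidShardName(name) {
  if (name.includes(".")) return false;   // no dots
  if (name.includes("/")) return false;   // no forward slashes
  if (name.includes("::")) return false;  // no double colons
  if (name.startsWith("_")) return false; // must not start with an underscore
  return true;
}

isValidShardName("tenant-x");  // true
isValidShardName("_internal"); // false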
Shard Prefixes:
For multi-tenant scenarios where one database serves multiple products:
const db = new Garak.DB(url, token, "product-a");
const productA_TenantX = db.tx("tenant-x"); // Stored as: product-a/tenant-x
Silos
Silos are the highest level of logical division. They are used to separate multiple tenant applications that share a Garak instance. They are equivalent to a database name in MongoDB or Postgres. When authenticating, tokens can be scoped to a particular silo.
Storage Architecture
Block-Based Storage
Garak uses a block-based storage system where nodes are stored in fixed-size blocks:
- Maximum block size: 256KB
- Maximum node size: 25KB
- When a block reaches capacity, a new block is automatically created
- Each node can only exist in one block
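As a rough sketch of that capacity check (the constants come from the limits above; the targetBlock function and its block shape are hypothetical, not the actual Garak internals):
// Hypothetical sketch of the block roll-over decision.
const MAX_BLOCK_SIZE = 256 * 1024; // 256KB per block
const MAX_NODE_SIZE = 25 * 1024;   // 25KB per node

function targetBlock(currentBlock, nodeBytes) {
  if (nodeBytes.length > MAX_NODE_SIZE) {
    throw new Error("node exceeds the 25KB limit");
  }
  // When the current block cannot fit the node, a new block is created
  // (and the il file updated, as described in the write flow below).
  if (currentBlock.size + nodeBytes.length > MAX_BLOCK_SIZE) {
    return { num: currentBlock.num + 1, size: 0 };
  }
  return currentBlock;
}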
File Structure
For a given store at path {silo}/{shard}/{store}/blocks/:
blocks/
├── il # Index Locator - maps node_id ranges to block numbers
├── 0.i # Index file for block 0
├── 00000000000000000000.dat # Data file for block 0
├── 1.i # Index file for block 1
├── 00000000000000000001.dat # Data file for block 1
└── ...
Index Locator (il) File
The il file enables fast lookups by mapping node_id ranges to block numbers:
Format: pairs of u128 values (32 bytes per entry)
[lowest_node_id (16 bytes), block_num (16 bytes)]
Example:
[0, 0] # Block 0 contains nodes starting from 0
[1000, 1] # Block 1 contains nodes starting from 1000
[2500, 2] # Block 2 contains nodes starting from 2500
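A lookup against this layout is a binary search over 32-byte entries for the last entry whose lowest_node_id is less than or equal to the target. The sketch below assumes the il file has been read into a Node.js Buffer, that values are little-endian (as stated for the SMSlice format later in this document; the il file's endianness is not stated here), and that node IDs are handled as BigInt:
// Illustrative: find the block number for a node_id (BigInt) in an il buffer.
function readU128LE(buf, offset) {
  const lo = buf.readBigUInt64LE(offset);
  const hi = buf.readBigUInt64LE(offset + 8);
  return (hi << 64n) | lo;
}

function findBlock(ilBuf, nodeId) {
  let lo = 0;
  let hi = ilBuf.length / 32 - 1;
  let block = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const lowestId = readU128LE(ilBuf, mid * 32);
    if (lowestId <= nodeId) {
      block = readU128LE(ilBuf, mid * 32 + 16); // candidate block number
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return block; // block whose range contains nodeId, or null if below all ranges
}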
Index (.i) Files
Index files contain the location of each node within data files:
Format: 48 bytes per entry
- node_id: u128 (16 bytes)
- offset: u128 (16 bytes) - byte offset in .dat file
- length: u64 (8 bytes) - byte length of data
- block_num: u64 (8 bytes) - block number
Entries are sorted by node_id for binary search
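Decoding a single entry is a matter of reading the four fields at their fixed offsets. The sketch below is illustrative and assumes little-endian encoding (stated for the SMSlice format below, assumed here):
// Illustrative decode of one 48-byte .i entry at index i in a Buffer.
function readIndexEntry(iBuf, i) {
  const off = i * 48;
  const u64 = (o) => iBuf.readBigUInt64LE(o);
  return {
    nodeId: (u64(off + 8) << 64n) | u64(off),       // u128
    offset: (u64(off + 24) << 64n) | u64(off + 16), // u128, byte offset in .dat file
    length: u64(off + 32),                          // u64, byte length of data
    blockNum: u64(off + 40),                        // u64
  };
}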
Data (.dat) Files
Data files contain the raw JSON data for nodes, written sequentially:
Node data is stored as raw JSON bytes
No delimiters between nodes (positions tracked in .i file)
Maximum size: 256KB per file
Node Operations
Write Operation Flow
- Determine Block: Check current block size
  - If space available: write to current block
  - If full: create new block and update the il file
- Write Data: Append JSON data to the .dat file
- Update Index: Write/update entry in the .i file with location info
- Write Oplog: Record operation for replication
- Execute Side Effects: Update all relevant indexes
Read Operation Flow
- Locate Block: Binary search the il file to find the block number
- Find Position: Look up the node_id in the block's .i file
- Read Data: Seek to the offset in the .dat file and read length bytes
- Deserialize: Parse JSON and return
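Putting the steps together, a read might look like the sketch below. It reuses the illustrative findBlock and readIndexEntry helpers from the format sections above (neither is part of the Garak API), scans the .i file linearly for brevity where the real engine binary searches, and follows the blocks/ layout shown earlier:
// Illustrative read path for a single node (error handling omitted).
const fs = require("fs");
const path = require("path");

function readNode(blocksDir, nodeId /* BigInt */) {
  // 1. Locate Block: binary search the il file.
  const ilBuf = fs.readFileSync(path.join(blocksDir, "il"));
  const blockNum = findBlock(ilBuf, nodeId);

  // 2. Find Position: look up the node_id in the block's .i file.
  const iBuf = fs.readFileSync(path.join(blocksDir, `${blockNum}.i`));
  let entry = null;
  for (let i = 0; i < iBuf.length / 48; i++) {
    const e = readIndexEntry(iBuf, i);
    if (e.nodeId === nodeId) { entry = e; break; }
  }

  // 3. Read Data: seek to the offset in the .dat file and read `length` bytes.
  const datName = String(blockNum).padStart(20, "0") + ".dat";
  const fd = fs.openSync(path.join(blocksDir, datName), "r");
  const data = Buffer.alloc(Number(entry.length));
  fs.readSync(fd, data, 0, data.length, Number(entry.offset));
  fs.closeSync(fd);

  // 4. Deserialize: parse JSON and return.
  return JSON.parse(data.toString("utf8"));
}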
Indexing System
Garak provides several types of indexes, all maintained automatically as side effects of mutations. See the Indexes folder for further details on individual indexes.
All indexes are built using the following primitive structures:
SMSlice File Format
Each entry: 24 bytes
- key: u64 (8 bytes, little-endian) - the indexed value
- value: u128 (16 bytes, little-endian) - the node_id
Entries are sorted by key for efficient range queries
Binary search is used for lookups
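A decoder for one entry is a direct translation of that layout; this sketch assumes the slice has been read into a Buffer and that values are handled as BigInt:
// Illustrative decode of one 24-byte SMSlice entry.
function readSMSliceEntry(buf, i) {
  const off = i * 24;
  const key = buf.readBigUInt64LE(off); // u64 indexed value
  const nodeId =
    (buf.readBigUInt64LE(off + 16) << 64n) | buf.readBigUInt64LE(off + 8); // u128 node_id
  return { key, nodeId };
}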
Index Registry
All indexes are tracked in a centralized registry:
Location: {silo}/_/_index_registry.meta
Purpose: Maps field names to index definitions
Usage: Determines which indexes to update on field changes
Operation Log (Oplog)
The oplog records all write operations for replication:
Format
Each line: {timestamp};{json_operation}
Operation types:
- Write: { "Write": { silo, shard, store, node_id, data, is_new_node } }
- Delete: { "Delete": { silo, shard, store, node_id } }
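Because the timestamp and payload are separated by the first ;, parsing a line only requires splitting on that separator and JSON-parsing the remainder. The sketch below is illustrative; the timestamp value and field contents are made up, but the Write shape follows the operation type above:
// Illustrative parser for an oplog line: {timestamp};{json_operation}
function parseOplogLine(line) {
  const sep = line.indexOf(";");
  const timestamp = line.slice(0, sep);
  const op = JSON.parse(line.slice(sep + 1));
  return { timestamp, op };
}

const { op } = parseOplogLine(
  '1700000000;{"Write":{"silo":"s","shard":"t","store":"users","node_id":42,"data":{},"is_new_node":true}}'
);
// op.Write.node_id === 42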
Location
{DATABASE_LOCATION}/oplog
Usage
- Replication: Secondary nodes tail the oplog to stay in sync
- Replay: Can rebuild database state from oplog
- Auditing: Complete history of all mutations
Data Flow
Insert Operation
1. Generate sequential node_id
2. Validate data size (< 25KB)
3. Write to appropriate block
4. Update index files
5. Increment store item count
6. Write to oplog
7. Update all relevant indexes
8. Rollback on index failure
Update Operation
1. Read existing node
2. Apply field mutation
3. Determine changed fields (including nested via dot notation)
4. Write updated data
5. Write to oplog
6. Update indexes for changed fields
7. Rollback on index failure
Delete Operation
1. Read existing node (for deindexing)
2. Write zero-length entry to index
3. Write to oplog
4. Remove from all indexes (failures ignored)
Concurrency Control
Garak uses file-level locking:
- Shared locks: Multiple concurrent reads
- Exclusive locks: Single writer, no readers
- Lock granularity: Per file (index, data, metadata)
Master Lock
A master lock prevents writes during snapshots:
Location: {DATABASE_LOCATION}/.master_lock
Purpose: Coordinate snapshot creation
Check: Before every write operation
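A minimal sketch of that pre-write check, assuming the lock is represented by the presence of the .master_lock file at the location above (DATABASE_LOCATION and the function name are placeholders, not part of the Garak API):
// Illustrative: refuse to write while a snapshot holds the master lock.
const fs = require("fs");
const path = require("path");

function assertWritable(databaseLocation) {
  if (fs.existsSync(path.join(databaseLocation, ".master_lock"))) {
    throw new Error("writes are paused: snapshot in progress");
  }
}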
Performance Characteristics
Strengths
- O(1) lookups by node_id: Direct index file lookup
- O(log n) range queries: Binary search in sorted slices
- Write amplification bounded: One write to data, one to index, one to oplog
- No full collection scans: Impossible by design
- Efficient caching: Node data and indexes separately cacheable
Trade-offs
- No ad-hoc queries: Must define indexes upfront
- Index maintenance overhead: Every write updates relevant indexes
- Storage overhead: Indexes stored separately on disk
- 25KB node limit: Forces normalized data design
- Block fragmentation: Updates don't reclaim space in blocks
Multi-Tenancy
Garak provides strong isolation guarantees:
// Physical separation via shards
const tenant1Tx = db.tx("tenant-1");
const tenant2Tx = db.tx("tenant-2");
// Different filesystem locations:
// .../tenant-1/...
// .../tenant-2/...