
Architecture

Garak is a purpose-built document database designed for high-performance, index-driven data access patterns.

Core Concepts

Nodes

The fundamental unit of storage in Garak is a node. Each node:

  • Represents a single document
  • Has a unique sequential node_id (u128) 1
  • Is stored in blocks on disk
  • Can only be mutated via its node_id
1 The largest possible u64 is 18,446,744,073,709,551,615. Node IDs are stored as u128s internally, but they are often coerced to u64 or similar values externally, since the chance of a database actually reaching node IDs higher than the u64 maximum is very low. To properly represent u128 node IDs outside of Rust, in languages like JS, we'd need to use something like BigInt.
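
For example, in JavaScript a node_id above Number.MAX_SAFE_INTEGER needs a BigInt to be represented exactly (a minimal sketch; the value shown is arbitrary):

// Node IDs above Number.MAX_SAFE_INTEGER (2^53 - 1) cannot be held exactly in a JS Number.
const nodeId = 9007199254740993n;  // 2^53 + 1, as a BigInt
console.log(nodeId.toString());    // "9007199254740993" — exact
console.log(Number(nodeId));       // 9007199254740992 — rounded, precision lost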

Stores

A store represents a collection of nodes, analogous to a collection or table in other databases. Stores contain:

  • Node data organized in blocks
  • Metadata about the store (item count, lock status)
  • Associated indexes for querying
const usersStore = new Garak.Store(db, "users");

Sharding

Garak implements physical data sharding, allowing complete data isolation at the filesystem level:

const tx = db.tx("tenant-id"); // Each shard is stored separately on disk

Shard Naming Rules:

  • Cannot contain . (dots)
  • Cannot contain / (forward slashes)
  • Cannot contain :: (double colons)
  • Cannot start with _ (underscore)

Shard Prefixes:

For multi-tenant scenarios where one database serves multiple products:

const db = new Garak.DB(url, token, "product-a");
const productA_TenantX = db.tx("tenant-x"); // Stored as: product-a/tenant-x

Silos

Silos are the highest level of logical division. They are used to separate multiple tenant applications that share a Garak instance, and are equivalent to something like a database name in MongoDB or Postgres. When authenticating, tokens can be scoped to a particular silo.

Storage Architecture

Block-Based Storage

Garak uses a block-based storage system where nodes are stored in fixed-size blocks:

  • Maximum block size: 256KB 1, 2
  • Maximum node size: 25KB
  • When a block reaches capacity, a new block is automatically created
  • Each node can only exist in one block
1 This is subject to change, and Garak is architected to allow this number to change on existing databases without impact.
2 Blocks may exceed this limit: when an update grows a node such that the total block size would exceed the maximum, Garak prefers to over-size the block rather than move the node to a new block. Moving a node to a new block is not straightforward and is not implemented. It is theoretically possible in extreme cases by creating a new block, then updating the `il` file to reference that block for the individual node.
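
A minimal sketch of the sizing rule described above (the constants and function name are illustrative, not Garak's actual code):

// Decide where a node write should land under the block-sizing rules above.
const MAX_BLOCK_SIZE = 256 * 1024; // 256KB
const MAX_NODE_SIZE = 25 * 1024;   // 25KB

function targetBlock(currentBlockSize, nodeSize, isUpdate) {
  if (nodeSize > MAX_NODE_SIZE) throw new Error("node exceeds the 25KB limit");
  // Updates stay in place: the block is allowed to grow past 256KB
  // rather than move the node to another block.
  if (isUpdate) return "current";
  // New nodes go to the current block if it has room; otherwise a new
  // block is created and the il file gains an entry for it.
  return currentBlockSize + nodeSize <= MAX_BLOCK_SIZE ? "current" : "new";
}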

File Structure

For a given store at path {silo}/{shard}/{store}/blocks/:

blocks/
├── il # Index Locator - maps node_id ranges to block numbers
├── 0.i # Index file for block 0
├── 00000000000000000000.dat # Data file for block 0
├── 1.i # Index file for block 1
├── 00000000000000000001.dat # Data file for block 1
└── ...

Index Locator (il) File

The il file enables fast lookups by mapping node_id ranges to block numbers:

Format: pairs of u128 values (32 bytes per entry)
[lowest_node_id (16 bytes), block_num (16 bytes)]

Example:
[0, 0] # Block 0 contains nodes starting from 0
[1000, 1] # Block 1 contains nodes starting from 1000
[2500, 2] # Block 2 contains nodes starting from 2500
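
A minimal sketch of the lookup this enables, assuming the 32-byte entries have already been decoded into [lowest_node_id, block_num] BigInt pairs sorted by lowest_node_id (the function name is illustrative):

// Find the block that should contain nodeId: the entry with the largest
// lowest_node_id that is still <= nodeId.
function findBlockNum(ilEntries, nodeId) {
  let lo = 0, hi = ilEntries.length - 1, result = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (ilEntries[mid][0] <= nodeId) {
      result = ilEntries[mid][1]; // candidate block; keep searching to the right
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return result;
}

findBlockNum([[0n, 0n], [1000n, 1n], [2500n, 2n]], 1742n); // => 1n (block 1)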

Index (.i) Files

Index files contain the location of each node within data files:

Format: 48 bytes per entry
- node_id: u128 (16 bytes)
- offset: u128 (16 bytes) - byte offset in .dat file
- length: u64 (8 bytes) - byte length of data
- block_num: u64 (8 bytes) - block number

Entries are sorted by node_id for binary search
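
A minimal sketch of decoding one 48-byte entry with Node's Buffer, assuming little-endian encoding as documented for SMSlice entries (u128 values are read as two u64 halves and recombined as BigInts):

function readU128LE(buf, offset) {
  const lo = buf.readBigUInt64LE(offset);
  const hi = buf.readBigUInt64LE(offset + 8);
  return (hi << 64n) | lo;
}

function decodeIndexEntry(buf, offset = 0) {
  return {
    nodeId: readU128LE(buf, offset),            // node_id: u128, 16 bytes
    dataOffset: readU128LE(buf, offset + 16),   // offset: u128, 16 bytes — byte offset in .dat
    length: buf.readBigUInt64LE(offset + 32),   // length: u64, 8 bytes — byte length of data
    blockNum: buf.readBigUInt64LE(offset + 40), // block_num: u64, 8 bytes
  };
}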

Data (.dat) Files

Data files contain the raw JSON data for nodes, written sequentially:

Node data is stored as raw JSON bytes
No delimiters between nodes (positions tracked in .i file)
Maximum size: 256KB per file

Node Operations

Write Operation Flow

  1. Determine Block: Check current block size
    • If space available: write to current block
    • If full: create new block and update il file
  2. Write Data: Append JSON data to .dat file
  3. Update Index: Write/update entry in .i file with location info
  4. Write Oplog: Record operation for replication
  5. Execute Side Effects: Update all relevant indexes
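
An illustrative sketch of steps 2-4 for a brand-new node (not Garak's actual implementation; the helper names, the meta argument carrying silo/shard/store, and the millisecond timestamp are assumptions):

const fs = require("fs");
const path = require("path");

function writeU128LE(buf, value, offset) {
  buf.writeBigUInt64LE(value & 0xffffffffffffffffn, offset);
  buf.writeBigUInt64LE(value >> 64n, offset + 8);
}

function encodeIndexEntry(nodeId, dataOffset, length, blockNum) {
  const buf = Buffer.alloc(48);
  writeU128LE(buf, nodeId, 0);        // node_id: u128
  writeU128LE(buf, dataOffset, 16);   // offset in .dat: u128
  buf.writeBigUInt64LE(length, 32);   // data length: u64
  buf.writeBigUInt64LE(blockNum, 40); // block number: u64
  return buf;
}

function appendNode(blocksDir, oplogPath, meta, blockNum, nodeId, doc) {
  const json = Buffer.from(JSON.stringify(doc));
  const datFile = path.join(blocksDir, String(blockNum).padStart(20, "0") + ".dat");
  const offset = fs.existsSync(datFile) ? fs.statSync(datFile).size : 0;

  // 2. Write Data: append the raw JSON bytes to the block's .dat file
  fs.appendFileSync(datFile, json);

  // 3. Update Index: append a 48-byte entry to the block's .i file
  fs.appendFileSync(path.join(blocksDir, blockNum + ".i"),
    encodeIndexEntry(nodeId, BigInt(offset), BigInt(json.length), BigInt(blockNum)));

  // 4. Write Oplog: one {timestamp};{json_operation} line per mutation
  const op = { Write: { ...meta, node_id: Number(nodeId), data: doc, is_new_node: true } };
  fs.appendFileSync(oplogPath, `${Date.now()};${JSON.stringify(op)}\n`);
}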

Read Operation Flow

  1. Locate Block: Binary search il file to find block number
  2. Find Position: Look up node_id in .i file for block
  3. Read Data: Seek to offset in .dat file and read length bytes
  4. Deserialize: Parse JSON and return
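
Continuing from the il and .i lookups sketched earlier, a minimal sketch of steps 3 and 4 (illustrative; indexEntry is a decoded .i entry with BigInt dataOffset and length fields):

const fs = require("fs");
const path = require("path");

function readNode(blocksDir, blockNum, indexEntry) {
  // 3. Read Data: seek to the recorded offset in the block's .dat file and read `length` bytes
  const datFile = path.join(blocksDir, String(blockNum).padStart(20, "0") + ".dat");
  const fd = fs.openSync(datFile, "r");
  const buf = Buffer.alloc(Number(indexEntry.length));
  fs.readSync(fd, buf, 0, buf.length, Number(indexEntry.dataOffset));
  fs.closeSync(fd);
  // 4. Deserialize: the stored bytes are raw JSON
  return JSON.parse(buf.toString("utf8"));
}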

Indexing System

Garak provides several types of indexes, all maintained automatically as side effects of mutations. See the Indexes folder for further details on individual indexes.

All indexes are built using the following primitive structures:

SMSlice File Format

Each entry: 24 bytes
- key: u64 (8 bytes, little-endian) - the indexed value
- value: u128 (16 bytes, little-endian) - the node_id

Entries are sorted by key for efficient range queries
Binary search is used for lookups
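
A minimal sketch of encoding and decoding one 24-byte SMSlice entry (illustrative, not Garak's code; the u128 value is assumed to be laid out least-significant half first, consistent with little-endian encoding):

function encodeSMSliceEntry(key, nodeId) {                 // key: u64, nodeId: u128 (BigInts)
  const buf = Buffer.alloc(24);
  buf.writeBigUInt64LE(key, 0);                            // key: u64, little-endian
  buf.writeBigUInt64LE(nodeId & 0xffffffffffffffffn, 8);   // value: low 64 bits
  buf.writeBigUInt64LE(nodeId >> 64n, 16);                 // value: high 64 bits
  return buf;
}

function decodeSMSliceEntry(buf, offset = 0) {
  return {
    key: buf.readBigUInt64LE(offset),
    nodeId: (buf.readBigUInt64LE(offset + 16) << 64n) | buf.readBigUInt64LE(offset + 8),
  };
}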

Index Registry

All indexes are tracked in a centralized registry:

Location: {silo}/_/_index_registry.meta
Purpose: Maps field names to index definitions
Usage: Determines which indexes to update on field changes

Operation Log (Oplog)

The oplog records all write operations for replication:

Format

Each line: {timestamp};{json_operation}

Operation types:
- Write: { "Write": { silo, shard, store, node_id, data, is_new_node } }
- Delete: { "Delete": { silo, shard, store, node_id } }

Location

{DATABASE_LOCATION}/oplog

Usage

  • Replication: Secondary nodes tail the oplog to stay in sync
  • Replay: Can rebuild database state from oplog
  • Auditing: Complete history of all mutations

Data Flow

Insert Operation

1. Generate sequential node_id
2. Validate data size (< 25KB)
3. Write to appropriate block
4. Update index files
5. Increment store item count
6. Write to oplog
7. Update all relevant indexes
8. Rollback on index failure
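
An illustrative sketch of step 8 (insertNode, updateIndexes, and rollbackInsert are hypothetical helpers, not Garak's API):

// If any index update fails, the insert itself is rolled back so data and indexes stay consistent.
function insertWithRollback(store, doc) {
  const nodeId = insertNode(store, doc);   // steps 1-6: write data, index entry, oplog
  try {
    updateIndexes(store, nodeId, doc);     // step 7: maintain all relevant indexes
  } catch (err) {
    rollbackInsert(store, nodeId);         // step 8: undo the write
    throw err;
  }
  return nodeId;
}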

Update Operation

1. Read existing node
2. Apply field mutation
3. Determine changed fields (including nested via dot notation)
4. Write updated data
5. Write to oplog
6. Update indexes for changed fields
7. Rollback on index failure
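
A minimal sketch of step 3, deriving changed field paths in dot notation (illustrative; array handling is simplified):

function changedFields(before, after, prefix = "") {
  const paths = [];
  const keys = new Set([...Object.keys(before ?? {}), ...Object.keys(after ?? {})]);
  for (const key of keys) {
    const path = prefix ? `${prefix}.${key}` : key;
    const a = before?.[key], b = after?.[key];
    if (a && b && typeof a === "object" && typeof b === "object") {
      paths.push(...changedFields(a, b, path));      // recurse into nested objects
    } else if (JSON.stringify(a) !== JSON.stringify(b)) {
      paths.push(path);
    }
  }
  return paths;
}

changedFields({ name: "Kira", rank: { pips: 3 } }, { name: "Kira", rank: { pips: 4 } });
// => ["rank.pips"]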

Delete Operation

1. Read existing node (for deindexing)
2. Write zero-length entry to index
3. Write to oplog
4. Remove from all indexes (failures ignored)

Concurrency Control

Garak uses file-level locking:

  • Shared locks: Multiple concurrent reads
  • Exclusive locks: Single writer, no readers
  • Lock granularity: Per file (index, data, metadata)

Master Lock

A master lock prevents writes during snapshots:

Location: {DATABASE_LOCATION}/.master_lock
Purpose: Coordinate snapshot creation
Check: Before every write operation
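
A minimal sketch of that pre-write check (illustrative; checking for the lock file's existence is an assumption, as the exact locking mechanism is not specified here):

const fs = require("fs");
const path = require("path");

// Refuse writes while the master lock is present, i.e. while a snapshot is being created.
function assertWritable(databaseLocation) {
  if (fs.existsSync(path.join(databaseLocation, ".master_lock"))) {
    throw new Error("writes are paused: a snapshot is in progress");
  }
}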

Performance Characteristics

Strengths

  • O(1) lookups by node_id: Direct index file lookup
  • O(log n) range queries: Binary search in sorted slices
  • Write amplification bounded: One write to data, one to index, one to oplog
  • No full collection scans: Impossible by design
  • Efficient caching: Node data and indexes separately cacheable

Trade-offs

  • No ad-hoc queries: Must define indexes upfront
  • Index maintenance overhead: Every write updates relevant indexes
  • Storage overhead: Indexes stored separately on disk
  • 25KB node limit: Forces normalized data design
  • Block fragmentation: Updates don't reclaim space in blocks 1
1 There are plans to implement space reclamation in the future.

Multi-Tenancy

Garak provides strong isolation guarantees:

// Physical separation via shards
const tenant1Tx = db.tx("tenant-1");
const tenant2Tx = db.tx("tenant-2");

// Different filesystem locations:
// .../tenant-1/...
// .../tenant-2/...