"Notes on the book 'Large-Scale Distributed Storage Systems: Principles Analysis and Architectural Practice'"

"Large-Scale Distributed Storage Systems: Principles and Architecture in Practice" is a professional book that provides an in-depth introduction to distributed storage technology, systematically explaining the core principles, key technologies, and practical experiences of distributed storage systems. The book covers core concepts such as the theoretical foundation of distributed storage, consistency models, data sharding, replica management, and fault handling, and analyzes the storage system architectures of companies like Google, Amazon, and Facebook through real industrial cases. This book not only helps to deeply understand the essence and design philosophy of distributed storage but also provides practical architectural guidance and best practices for building highly available and high-performance large-scale storage systems.

Part One: Basics

Chapter 2: Single Machine Storage Systems

2.1 Hardware Basics

NUMA Architecture Optimization

"NUMA nodes can directly and quickly access local memory, and can also access memory from other NUMA nodes through NUMA interconnect modules. Accessing local memory is significantly faster than accessing remote memory."

Specific Performance Parameters
Operation Type | Latency
Access L1 Cache | 0.5 ns
Access L2 Cache | 7 ns
Memory Access | 100 ns
Send 1 MB over a Gigabit Network | 10 ms
SATA Disk Seek | 10 ms
SSD Access Latency | 0.1~0.2 ms
Storage Hierarchy Architecture
Level | DRAM | Disk | SSD
Single Machine Level | 24 GB, 100 ns, 20 GB/s | 4 TB, 10 ms, 300 MB/s | 1 TB, 0.1 ms, 300 MB/s
Rack Level | 1 TB, 300 µs, 100 MB/s | 160 TB, 11 ms, 100 MB/s | 40 TB, 2 ms, 100 MB/s
Cluster Level | 30 TB, 500 µs, 10 MB/s | 4.8 PB, 12 ms, 10 MB/s | 1.2 PB, 3 ms, 10 MB/s
(Each cell lists capacity, access latency, and bandwidth.)

2.2 Single Machine Storage Engines

2.2.1 Hash Storage Engine (Bitcask)

Data Structure Design:

"Data in Bitcask data files is written as individual write operations, with each record containing the following data items: primary key (key), value content (value), key length (key_sz), value length (value_sz), timestamp (timestamp), and crc checksum."

Memory Index Structure:

  • Hash table storage: file number (file_id), value position (value_pos), value length (value_sz)
  • Disk memory ratio: 32:1 (assuming average value size is 1KB, index size is 32 bytes)

Periodic Merging Mechanism:

"The merge operation scans all data in old data files and generates new data files. This merging essentially deletes multiple operations on the same key, retaining only the latest one."

Fast Recovery Technology:

"The index file is the result of dumping the in-memory hash index table to disk. The index file does not store specific value values, only the positions of the values."

2.2.2 B-Tree Storage Engine (MySQL InnoDB)

Page Organization Structure:

"MySQL InnoDB organizes data by pages, with each page corresponding to a node in the B+ tree. Leaf nodes store the complete data for each row, while non-leaf nodes store index information."

Buffer Management Algorithms:

  • LRU Algorithm: A linked list is formed based on the last access time of pages, evicting pages from the tail of the list.
  • LIRS Algorithm: The buffer pool is divided into two levels; data first enters the first level, and data accessed more than twice becomes hot data and enters the second level.

"The internal LRU linked list of InnoDB is divided into two parts: new sublist and old sublist, with the former occupying 5/8 and the latter occupying 3/8 by default."

2.2.3 LSM Tree Storage Engine (LevelDB)

Storage Structure Hierarchy:

  • Memory Layer: MemTable and immutable MemTable
  • Disk Layer: the CURRENT file, MANIFEST files, operation log files, SSTable files

Write Process:

  1. Write to the operation log file
  2. Apply to MemTable
  3. When MemTable reaches its limit, freeze it as an immutable MemTable
  4. A background thread dumps the immutable MemTable to SSTable
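
The four steps above map almost directly to code. The sketch below assumes a dict as the MemTable, a JSON-lines operation log, and an illustrative 4 MB freeze threshold; none of these are LevelDB's real file formats.

```python
import json
import os
import threading

MEMTABLE_LIMIT = 4 * 1024 * 1024   # illustrative freeze threshold

class MiniLSM:
    def __init__(self, dirpath):
        self.dir = dirpath
        os.makedirs(dirpath, exist_ok=True)
        self.log = open(os.path.join(dirpath, "LOG.wal"), "a")
        self.memtable, self.immutable = {}, None
        self.mem_bytes, self.sst_seq = 0, 0

    def put(self, key: str, value: str):
        # 1. append to the operation log so the write survives a crash
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        # 2. apply the write to the MemTable
        self.memtable[key] = value
        self.mem_bytes += len(key) + len(value)
        # 3. freeze the MemTable once it reaches its limit
        if self.mem_bytes >= MEMTABLE_LIMIT and self.immutable is None:
            self.immutable, self.memtable, self.mem_bytes = self.memtable, {}, 0
            # 4. dump the immutable MemTable to an SSTable in the background
            threading.Thread(target=self._dump_immutable).start()

    def _dump_immutable(self):
        self.sst_seq += 1
        path = os.path.join(self.dir, f"{self.sst_seq:06d}.sst")
        with open(path, "w") as sst:
            for key in sorted(self.immutable):          # SSTables are sorted by key
                sst.write(json.dumps([key, self.immutable[key]]) + "\n")
        self.immutable = None
```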

Read Optimization:

"LevelDB needs to read each level's SSTable files and the MemTable in memory from old to new for each query. LevelDB optimizes this by first checking the MemTable in memory since it only supports random reads of single records."

Merge Strategy:

  • Minor Compaction: Dumping memory data to SSTable
  • Major Compaction: Multi-way merge, deleting invalid records

2.3 Data Models

2.3.1 File Model

POSIX Standard Interface:

  • open/close: Open or close a file, obtaining a file descriptor
  • read/write: Read or write file data
  • opendir/closedir: Open or close a directory
  • readdir: Traverse a directory

NFS Concurrency Issues:

"If NFS clients A and B need to modify a certain file on the NFS server simultaneously, each client has a local cached copy of the file. If A modifies and submits first, followed by B, even if A and B modify different parts of the file, B's modification may overwrite A's."

2.3.2 Relational Model

SQL Query Execution Process:

  1. Retrieve all possible combinations of tuples from the relations in the FROM clause
  2. Remove tuples that do not meet the WHERE conditions
  3. Group by attributes according to GROUP BY
  4. Check each group against HAVING conditions
  5. Calculate result tuples according to SELECT specifications
  6. Sort according to ORDER BY attributes
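
These six steps describe the conceptual evaluation order, which can be mirrored on plain lists of dicts. The sketch below is only an illustration of the semantics; it is not how a real executor or optimizer processes SQL, and all parameter names are illustrative.

```python
from itertools import product

def run_query(relations, where, group_key, having, select, order_key):
    """Conceptual SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY."""
    # 1. FROM: all combinations of tuples from the listed relations
    rows = [{k: v for r in combo for k, v in r.items()}
            for combo in product(*relations)]
    # 2. WHERE: drop tuples that fail the condition
    rows = [r for r in rows if where(r)]
    # 3. GROUP BY: group the remaining tuples by the grouping attributes
    groups = {}
    for r in rows:
        groups.setdefault(group_key(r), []).append(r)
    # 4. HAVING: keep only groups that satisfy the condition
    groups = {k: g for k, g in groups.items() if having(g)}
    # 5. SELECT: compute one result tuple per group
    result = [select(k, g) for k, g in groups.items()]
    # 6. ORDER BY: sort the result tuples
    return sorted(result, key=order_key)
```

For example, passing a single orders relation with group_key on customer, having len(g) >= 2, and select returning (customer, total) corresponds to a GROUP BY customer ... HAVING COUNT(*) >= 2 query.
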
2.3.3 Key-Value Model

Basic Operations:

  • Put: Save Key-Value pairs
  • Get: Read Key-Value pairs
  • Delete: Delete Key-Value pairs

Table Model Extended Operations:

  • Insert: Insert a row of data
  • Delete: Delete a row of data
  • Update: Update the entire row or specific columns
  • Get: Read the entire row or specific columns
  • Scan: Scan a range of data

2.4 Transactions and Concurrency Control

2.4.1 Transaction ACID Implementation

Atomicity Implementation:

"The atomicity of a transaction is primarily reflected in the modification of data; that is, either all operations are executed, or none are executed."

Isolation Level Implementation:

Isolation Level | Description
Read Uncommitted (RU) | May read uncommitted data
Read Committed (RC) | Reads only committed data; subsequent reads in the same transaction may differ
Repeatable Read (RR) | Reads in the same transaction always return the same result
Serializable (S) | Transactions execute as if serially; the highest isolation level

Exception Type Analysis:

Exception Type | Description
Lost Update (LU) | First type of lost update
Dirty Read (DR) | Reading data another transaction has not yet committed
Non-Repeatable Read (NRR) | Repeated reads of the same data within one transaction return different results
Second Lost Update (SLU) | Second type of lost update
Phantom Read (PR) | Re-running the same range query within one transaction returns rows that were not there before

2.4.2 Concurrency Control Implementation

Database Lock Mechanism:

"Locks are divided into two types: read locks and write locks. Multiple read locks can be applied to the same element, but only one write lock is allowed, and write transactions will block read transactions."

Deadlock Handling Strategies:

  1. Timeout Rollback: Set a timeout for each transaction, automatically rolling back after timeout.
  2. Deadlock Detection: Detect circular dependencies and roll back certain transactions to eliminate the cycle.
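
Deadlock detection amounts to finding a cycle in the wait-for graph, where an edge T1 -> T2 means T1 is waiting for a lock held by T2. The DFS sketch below only finds such a cycle; choosing which victim transaction to roll back is a separate policy decision, and the function name is illustrative.

```python
def find_deadlock(wait_for):
    """wait_for: dict mapping a transaction to the transactions it waits on."""
    visiting, done = set(), set()

    def dfs(txn, path):
        visiting.add(txn)
        path.append(txn)
        for holder in wait_for.get(txn, []):
            if holder in visiting:                    # back edge: a cycle exists
                return path[path.index(holder):]
            if holder not in done:
                cycle = dfs(holder, path)
                if cycle:
                    return cycle
        visiting.discard(txn)
        done.add(txn)
        path.pop()
        return None

    for txn in wait_for:
        if txn not in done:
            cycle = dfs(txn, [])
            if cycle:
                return cycle                          # roll back one of these to break it
    return None

print(find_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}))  # ['T1', 'T2', 'T3']
```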

Copy-On-Write (COW) Implementation:

  1. Copy: Copy all nodes along the path from leaf to root.
  2. Modify: Perform modifications on the copied nodes.
  3. Commit: Atomically switch the root node pointer.
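
Path copying on a simple binary search tree shows the same three steps in miniature; the tree shape and names are illustrative, while real engines apply the idea to B+ tree pages.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def cow_insert(root: Optional[Node], key: int, value: str) -> Node:
    """Copy every node on the search path; untouched subtrees are shared."""
    if root is None:
        return Node(key, value)
    if key < root.key:      # copy this node, pointing at a copied left child
        return Node(root.key, root.value, cow_insert(root.left, key, value), root.right)
    if key > root.key:      # copy this node, pointing at a copied right child
        return Node(root.key, root.value, root.left, cow_insert(root.right, key, value))
    return Node(key, value, root.left, root.right)   # key exists: copy with a new value

# Commit = atomically switch the root pointer; readers holding the old root are unaffected.
old_root = cow_insert(None, 10, "a")
new_root = cow_insert(old_root, 5, "b")   # old_root is still a complete, consistent tree
```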

Multi-Version Concurrency Control (MVCC) Implementation:

"InnoDB maintains two implicit columns for each row, one storing the 'time' the row was modified and the other storing the 'time' the row was deleted. Note that InnoDB does not store absolute time but the version number of the database system corresponding to that time."

Version Check Rules:

Operation Type | Check Rule
SELECT | Row's modification version number ≤ transaction number AND (deletion version number is undefined OR deletion version number > transaction number)
INSERT | New row's modification version number = transaction number
DELETE | Row's deletion version number = transaction number
UPDATE | Copy the original row; the new row's modification version number = transaction number
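
The SELECT rule in the table can be written directly as a predicate. The field names below are illustrative, and InnoDB's real visibility check additionally consults the list of active transactions in a read view.

```python
def visible(row, txn_id):
    """A row version is visible to txn_id under the SELECT rule above."""
    created_ok = row["create_version"] <= txn_id
    not_deleted = row["delete_version"] is None or row["delete_version"] > txn_id
    return created_ok and not_deleted

# Example: a version created by transaction 5 and deleted by transaction 9
row = {"create_version": 5, "delete_version": 9}
assert visible(row, txn_id=7)       # transaction 7 still sees the row
assert not visible(row, txn_id=12)  # transaction 12 sees it as deleted
```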

2.5 Fault Recovery

2.5.1 Operation Logs

Log Type Comparison:

Log Type | Example | Description
UNDO Log | <T, X, 5> | Records the state before the transaction's modification
REDO Log | <T, X, 15> | Records the state after the transaction's modification
UNDO/REDO Log | <T, X, 5, 15> | Records both the states before and after the modification

2.5.2 Redo Logs

Write Operation Process:

  1. Write REDO logs to disk log files in append mode.
  2. Apply REDO log modification operations to memory.
  3. Return success or failure of the operation.

Constraint Rules:

"Before modifying element X in memory, ensure that the operation logs related to this modification must be flushed to disk first."

2.5.3 Optimization Measures

Batch Commit Technology:

"REDO logs are first written to the storage system's log buffer. When one of the following conditions is met, multiple transaction operations in the log buffer are flushed to disk at once:

  • The amount of data in the log buffer exceeds a certain size, such as 512KB.
  • The time since the last flush to disk exceeds a certain duration, such as 10ms."
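
A minimal group-commit loop under the two conditions quoted above; the 512 KB and 10 ms figures come from the note, while the buffering and fsync details are simplified assumptions.

```python
import os
import threading
import time

class GroupCommitLog:
    SIZE_LIMIT = 512 * 1024     # flush when the buffer exceeds 512 KB
    TIME_LIMIT = 0.010          # or when 10 ms have passed since the last flush

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.buffer = bytearray()
        self.last_flush = time.monotonic()
        self.lock = threading.Lock()

    def append(self, redo_record: bytes):
        with self.lock:
            self.buffer += redo_record
            if (len(self.buffer) >= self.SIZE_LIMIT or
                    time.monotonic() - self.last_flush >= self.TIME_LIMIT):
                self._flush_locked()

    def _flush_locked(self):
        # one write + fsync covers every transaction buffered since the last flush
        os.write(self.fd, bytes(self.buffer))
        os.fsync(self.fd)
        self.buffer.clear()
        self.last_flush = time.monotonic()
```

A real engine would also use a timer or a dedicated log-writer thread so the time-based flush fires even when no new transaction arrives.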

Checkpoint Implementation:

  1. Record "START CKPT" in the log file.
  2. Dump memory data to disk to form a checkpoint file.
  3. Record "END CKPT" in the log file.

Fault Recovery Process:

  1. Load the checkpoint file into memory.
  2. Read the "START CKPT" log replay point recorded in the checkpoint file.
  3. Replay the subsequent REDO logs.

2.6 Data Compression

2.6.1 Compression Algorithms

Huffman Coding Implementation:

"Prefix coding requires that the encoding of one character cannot be a prefix of another character. For example, if A, B, and C are encoded as: A: 0 B: 10 C: 110, then 01010 can only be translated as ABB."

LZ Series Algorithm Implementation:

"LZ77 is the first algorithm in the LZ series. For example, in the string ABCABCDABC, ABC appears three times. In the compressed information, only the first ABC needs to be saved, while the later two ABCs only need to store the position and length of the first occurrence of ABC."

Google Zippy Algorithm Optimization:

  1. Dictionary Optimization: Only save substrings of length 4; output matching information only if the matching length is ≥ 4.
  2. Data Block Division: Internally divide data into 32KB data blocks, compressing each block separately.
  3. Variable-Length Encoding: Use 2 bytes for matching lengths < 12 bytes and positions < 2048; otherwise, use more bytes.

BMDiff Algorithm Implementation:

"The BMDiff algorithm splits the data to be compressed into small segments of length b (default b=32). The dictionary of BMDiff saves the hash values of each small segment, so the hash table size required for a string of length N is N/b."

2.6.2 Columnar Storage

Storage Structure Comparison:

  • Row Storage: Complete data rows are stored in data pages, suitable for OLTP.
  • Column Storage: Values of the same column are stored together, suitable for OLAP.

Bitmap Index Implementation:

"The gender column has only two values, 'male' and 'female'. A bitmap index can be established for this column: the bitmap for 'male' is 100101, indicating that rows 1, 4, and 6 have the value 'male'; the bitmap for 'female' is 011010, indicating that rows 2, 3, and 5 have the value 'female'."


Chapter 3: Distributed Systems

3.1 Basic Concepts

3.1.1 Exception Handling

Detailed Analysis of Exception Types:

  1. Server Crash: Memory errors, power outages, etc., causing nodes to fail to operate normally.
  2. Network Exceptions: Message loss, out-of-order delivery, data errors, network partitions.
  3. Disk Failures: Disk damage, data errors.

Three-State Concept Implementation:

"In a distributed system, if a node initiates an RPC call to another node, the result of this RPC execution can have three states: 'success', 'failure', and 'timeout' (unknown state)."

3.1.2 Consistency

Strong Consistency Implementation:

"If A first writes a value to the storage system, the storage system guarantees that subsequent read operations by A, B, and C will return the latest value."

Eventual Consistency Implementation:

"If A first writes a value to the storage system, the storage system guarantees that if no subsequent write operations update the same value, the read operations by A, B, and C will 'eventually' read the latest value written by A."

Consistency Variants:

Consistency Type | Description
Read-Your-Writes Consistency | After client A writes a value, A's subsequent reads return the latest value
Session Consistency | Read-your-writes consistency is guaranteed within a single session
Monotonic Read Consistency | Once client A has read a value, subsequent reads never return an older value
Monotonic Write Consistency | Client A's write operations are completed in the order they were issued

3.3 Data Distribution

3.3.1 Hash Distribution

Traditional Hash Distribution Problems:

"When a server comes online or goes offline, the value of N changes, completely disrupting the data mapping, requiring almost all data to be redistributed."

Consistent Hashing Algorithm Implementation:

  1. Calculate the hash value for each server and map it to a circular range of 0 to 2^n.
  2. Calculate the hash value of the primary key of the object to be stored, also mapping it to the circle.
  3. Start from the data mapping position and search clockwise to find the first server node to distribute the data.

Location Information Maintenance Strategies:

Strategy Type | Location Information Maintained | Lookup Complexity
O(1) location information | Each node records only its previous and next nodes on the ring | O(N)
O(logN) location information | Each node maintains a routing table of size O(logN) | O(logN)
O(M) location information | Each node maintains the positions of all servers | O(1)

Virtual Node Concept:

"Dynamo introduces the concept of virtual nodes to distribute the data that needs to be migrated across the entire cluster, so that each server only needs to migrate 1/N of the data volume."

3.3.2 Sequential Distribution

Subtable Partitioning Strategy:

"Bigtable splits a large table into ordered ranges based on the primary key, with each ordered range being a subtable."

Two-Level Index Structure:

  • Root Table: Maintains the location information of the Meta table.
  • Meta Table: Maintains the location information of the User table.
  • User Table: The actual data table.

Subtable Splitting and Merging:

"Subtable splitting occurs when a subtable becomes too large and exceeds a certain threshold, requiring it to be split into two subtables, thus allowing for load balancing across multiple storage nodes."

3.3.3 Load Balancing

Heartbeat Mechanism:

"Working nodes send heartbeat packets (periodically sent) to the master node with information related to node load, such as CPU, memory, disk, network resource usage, read/write counts, and read/write data volumes."

Migration Strategy:

  1. Switch data shard read/write services to other nodes.
  2. Select nodes to increase replicas, obtaining replica data from the original node and keeping it synchronized.
  3. Delete replicas on the original node.

3.4 Replication

3.4.1 Replication Protocol

Strong Synchronization Replication Process:

  1. The client sends a write request to the primary replica.
  2. The primary replica synchronizes the operation log to the backup replica.
  3. The backup replica replays the operation log and notifies the primary replica upon completion.
  4. The primary replica modifies its local state and notifies the client of a successful write after all operations are complete.

Asynchronous Replication Process:

  1. The primary replica does not need to wait for the backup replica's response.
  2. The client is informed of a successful write operation as soon as the local modification succeeds.
  3. The modification operation is pushed to other replicas through an asynchronous mechanism.

NWR Protocol Implementation:

"In the NWR protocol, multiple replicas are no longer distinguished as primary and backup. The client writes data to W replicas based on a certain strategy and reads from R replicas. As long as W + R > N, it can be guaranteed that at least one of the read replicas contains the latest update."

3.4.2 Consistency and Availability Trade-offs

Oracle DataGuard Three Modes:

Mode | Features
Maximum Protection Mode | Strong synchronous replication; a write must first synchronize its operation log to at least one standby database
Maximum Performance Mode | Asynchronous replication; a write only needs to succeed on the primary database
Maximum Availability Mode | Equivalent to maximum protection mode under normal conditions; switches to maximum performance mode during network failures

3.5 Fault Tolerance

3.5.1 Common Fault Statistics
Fault Type | Frequency | Impact Range
Power Distribution Device Failure | 0.5 times/year | Most machines lose power within 5 minutes
Data Center Overheating
