Database Sharding Summary
Sharding is a database architecture strategy that horizontally partitions data across multiple servers or “shards.” Each shard contains a subset of the total data and operates as an independent database, allowing the system to distribute load and scale horizontally.
Key Concepts
- Horizontal Partitioning: Data is split based on specific values within a column (shard key)
- Shard Key Selection: Critical for even data distribution and query efficiency
- Distributed Queries: Queries may need to access multiple shards for complete results
- Data Rebalancing: Process of redistributing data when adding or removing shards
Benefits
- Improves scalability by distributing load across multiple machines
- Enhances performance by reducing index size and contention
- Increases fault tolerance when implemented with proper redundancy
- Allows for geographic distribution of data
Real-Life Examples
-
Instagram: Uses sharding based on user IDs to manage billions of photos and videos across thousands of servers.
-
MongoDB: Implements auto-sharding capabilities to distribute data across multiple machines, with automatic load balancing.
-
Google Bigtable: Shards data by row keys, enabling Google to handle petabytes of data across thousands of commodity servers.
-
Uber: Shards trip data geographically to optimize for local queries and manage their enormous real-time data processing needs.
-
Pinterest: Utilizes sharding with MySQL to handle over 100 million active users and billions of pins.
-
Shopify: Implements a multi-tenant architecture with sharded databases to support millions of online stores.
-
GitHub: Uses multiple MySQL shards to distribute repository data and handle high-volume developer activity.
Each of these implementations tailors sharding strategies to their specific workload patterns, query requirements, and scaling needs.
How to Shard (split) your data?
- What to shard by?
- Choosing a shard key
- Good
- High Cardinality
- Even distribution
- aligns with queries
- ex. userId, orderId
- Bad
- Low cardinality
- unevenly distributed
- queries require scatter-gather
- ex. bool flag like isPremium, createdAt eg. social media posts
- Good
- Example
- Choosing a shard key
- How to distribute data
- Range based -> good if data grows steadily
- Hash based -> hash(key) % numShards
- Issues with rebalancing
- Consistent hashing
- Places key and shards on virtual ring which effectively eliminates reshuffling
- Consistent hashing
- Issues with rebalancing
- Directory Based Sharding
- for each user, find the shard they belong to
- lookup table
- downside is that every request has extra latency
- creates single point of failure
- Almost never right answer in sys design
- Challenges
- Hot spots
- Celebrity problem
- Compound shard key ex. hash(userId) -> hash(userId + createdAt)
- Dedicated celebrity shard
- Celebrity problem
- Cross-shard operations
- ex. get popular posts aggreate all posts acorss shards
- cache result of expensive queries
- Denormalize data such that queries are quick
- copy data across shards so u only need to hit one db but then on writes u need to write to all the shards
- Maintaining consistency
- Ex. Alice and boba pay each other but live on diff shards
- 2 Phase commit
- In practice very slow/fragile
- Try not to do cross shard transactions
- Saga pattern
- Hot spots
- Wrap Up
- Storage
- Write throughput
- Read throughput
- Propose a shard key based on your access patterns
- Choose your distribution strategy
- Call out the trade-offs
- Address how you’ll handle growth