While the replication solutions presented can address horizontal scaling for high volumes of traffic, they do not address the needs of applications that store very large amounts of data This is because the database replicas each maintain the full set of data for the application, and there are limits to the disk space that can be provisioned to a database server.
To address this issue, an architecture that implements database sharding can be used With sharding, a single, large table is split into multiple smaller tables The process of splitting a table is referred to as partitioning. When stored across multiple servers, these smaller tables are referred to as shards The databases in your cluster each store one of the shards Together, the databases in the cluster constitute your full data set For example, if a human resources application stored employee records for 1,000 companies, but found that the dataset size was too big for a single table, then the records could be split into two shards representing 500 companies each
MySQL NDB Cluster is a solution that provides automatic sharding for MySQL The diagram below shows
an example of how a MySQL NDB Cluster would fit into the example web application. This diagram omits the load balancer, cloud firewall, and web server cluster as those remain the same:
Database Sharding for Large Data Sets
The MySQL NDB Cluster in this diagram provides three MySQL servers that accept read and write SQL commands from the web servers The number of these servers can be horizontally scaled to meet demand and provide failover These servers do not store data Instead, they update and retrieve records from a separate set of data nodes.
The data nodes store shards of the dataset and run the ndbd data node daemon process Each shard can have multiple replica nodes, and you can configure how many replicas per shard there should be. The total number of data nodes is equal to the number of shards multiplied by the number of replicas per shard In the diagram above, there are two shards, and two replicas per shard (a primary replica and a secondary replica), for a total of four data nodes Having multiple replicas per shard allows for recovery for failing nodes, and this recovery process is automatic (similar to MySQL Group Replication).
Using sharding can offer very high scaling of dataset size, but it can also make your application logic more complex. In particular, you need to carefully configure how your data is partitioned into multiple shards, because this decision impacts overall database performance Sharding requires a more in-depth understanding of your database and the underlying infrastructure.