Efficiently Splitting Secondary Index Across Shards

Nawab Iqbal
6 min read · Aug 30, 2019


In big data systems, where data is distributed across many machines, uniform distribution and colocation of “similar” data (e.g. on the same partition or host) are usually preferred. Though it is generally not possible to predict the growth of data on each partition, we can do a better job when creating secondary and full-text indexes for existing data by utilizing available statistics.

Problem

Uniform distribution of data by itself is not hard to achieve: one can distribute incoming documents in a round-robin fashion to all the machines. However, uniform distribution becomes a challenge when we also want related data to be indexed together. A common example is a multi-tenant SaaS system, where data belonging to the same customer may need to be stored together for various reasons.
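To make the tension concrete, here is a small sketch (the routing functions, shard count, and workload are hypothetical, not from the article) comparing round-robin placement, which is perfectly uniform but scatters a tenant's documents, with hashing on the tenant id, which colocates each tenant but inherits whatever skew exists in tenant sizes:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def route_round_robin(doc_index: int) -> int:
    """Round-robin placement: perfectly uniform shard sizes,
    but a tenant's documents end up scattered across all shards."""
    return doc_index % NUM_SHARDS

def route_by_tenant(tenant_id: str) -> int:
    """Hash the tenant id: all of a tenant's documents land on one
    shard, but shard sizes now mirror the skew in tenant sizes."""
    digest = hashlib.md5(tenant_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Hypothetical multi-tenant workload: one large tenant ("acme" with
# 900 documents) and ten small tenants with 10 documents each.
docs = [("acme", i) for i in range(900)] + \
       [(f"tenant{t}", i) for t in range(10) for i in range(10)]

rr_load = Counter(route_round_robin(i) for i, _ in enumerate(docs))
tenant_load = Counter(route_by_tenant(tenant) for tenant, _ in docs)

print("round-robin shard sizes:", sorted(rr_load.values()))
print("tenant-hash shard sizes:", sorted(tenant_load.values()))
```

With round-robin the four shards hold exactly 250 documents each; with tenant hashing, whichever shard receives "acme" holds at least 900 of the 1,000 documents. Colocation alone, without accounting for tenant size, recreates the skew problem.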

In the absence of any data distribution guidance, many databases and search indexes will grow a partition of data (a region in HBase, a shard in Solr, etc.) until it hits some specified size limit and then split it into smaller partitions. Without external intervention, this can result in partitions of widely varying sizes. At read time, the uneven data distribution across machines hurts full table scans, since the larger partitions take longer to finish, and point queries turn the larger partitions into hot spots. Creating a secondary index without applying this data-distribution knowledge significantly degrades write performance during the initial creation of the index…
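One way to apply the statistics mentioned above is to compute split points before the index is created, so partitions start out balanced instead of being split reactively as they grow. The sketch below (function name and the per-key counts are illustrative; systems such as HBase do accept explicit split keys at table-creation time) picks boundary keys so each shard receives roughly the same number of documents:

```python
def split_points(key_counts: dict[str, int], num_shards: int) -> list[str]:
    """Choose num_shards - 1 boundary keys from per-key document
    counts so that each resulting shard holds roughly the same
    number of documents. key_counts maps an index key (e.g. a
    tenant id) to its document count, gathered from existing data."""
    total = sum(key_counts.values())
    target = total / num_shards
    splits: list[str] = []
    running = 0
    for key in sorted(key_counts):          # walk keys in index order
        running += key_counts[key]
        if running >= target and len(splits) < num_shards - 1:
            splits.append(key)              # shard boundary after this key
            running = 0
    return splits

# Illustrative statistics: six keys with skewed document counts.
counts = {"a": 500, "b": 50, "c": 50, "d": 400, "e": 500, "f": 500}
print(split_points(counts, 4))              # boundary keys for 4 shards
```

For these counts the function returns `["a", "d", "e"]`, yielding four shards of 500 documents each, whereas naively assigning an equal number of *keys* per shard would produce shards ranging from 100 to 1,000 documents. A real implementation would also cap the size a single key can contribute, since one oversized tenant cannot be split by key boundaries alone.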
