Solr Upgrade and Our Indexing Performance

Nawab Iqbal
6 min read · May 10, 2018


I joined the Box Search team almost a year ago. After working on a few features, I was given the opportunity to work on our Solr upgrade. Our search clusters have been running on Solr 4 since it was the latest release. There is usually no immediate ROI on an upgrade, and it is tempting to postpone it in favor of features that are more visible to users. However, one big drawback of not upgrading is that new features, improvements, and fixes are only applied to recent releases. Sooner or later, everyone has to upgrade to get new features, or at the very least for bug fixes (particularly security patches). And if a company keeps building its product on deprecated versions, the cost of moving to the latest version keeps increasing, as more components will need updates during the upgrade.

I give a lot of credit to the Solr and Lucene open source community, which is extremely helpful and responsive on the mailing lists.

My initial plan was as follows:

1. Go through the custom commits we had made to Solr 4 and port them to Solr 6. (There weren’t many.) Add tests where needed.

2. Deploy to a test server and verify Solr CRUD operations.

3. If #2 works, deploy to some production-like hosts. Verify basic operations and performance with load proportional to production.

4. Deploy to the whole cluster with all customer data. Verify all functions, and monitor indexing and querying performance and relevance.

5. Reindex all clusters with Solr 6.

Some steps required verification before moving forward, and it was difficult to estimate the time needed for mitigation if something failed. For example, a lot of time was required to investigate the relevance and performance differences in step 4.

Going through our list of custom commits was easy. For most of them, I was able to find a similar location in the Solr 6 code and apply the change there. These changes only exhibited small behavioral differences in special cases.

Updating the config was monotonous, but the errors were very helpful and guided me to the correct substitutes in the newer version. For the most part, the behavior hasn’t changed drastically enough for our customers to notice (I will explain more about this in a future post). So, within a couple of days, I was able to produce a solrconfig and schema for our indexes that were compatible with Solr 6.5.1 and worked in my test environment. I couldn’t have been more optimistic!

We rely on Puppet for maintaining our Solr clusters. Our Solr 4 had been running on Java 7, so that needed to be upgraded to Java 8. I will spare you the tedious small changes that were needed to make the Solr scripts work the way the ecosystem around our Solr hosts expected. These ranged from small things like log configuration and metrics collection scripts (SOLR-9812) to relatively complex steps like Jenkins and artifact management, install destinations on the Solr hosts (binary, index, and config), and service scripts for remotely managing critical Solr operations.

Our Infrastructure

Each production Solr host is a 32-core machine with 256 GB RAM and a 3 TB FIO drive (for storing the index), plus non-flash disk space for all other purposes. We are growing massively year over year, and the number of hosts in each cluster has increased by 50% during my one year at Box, primarily due to index size growth. At the moment we use 41 machines like the above to hold our index, so our total indexing capacity is 123 TB, though we leave some headroom for segments to grow and merge, and we usually expand our clusters once we have consumed more than 80 percent of this space. Each host runs three Solr processes, with one SolrCore on each process.

After verifying Solr 6 on my test host, the next step was verification on a production-like system. I made the necessary Puppet changes to upgrade a host to JDK 8 and Solr 6. It is recommended to reindex all data when upgrading Solr, even though Solr can read indexes from the previous major version. Luckily, our team has a mostly automated workflow for batch indexing all our documents; it takes a cluster from initialization to ready-to-serve query traffic in four days. On the new host, my target was to load enough data to fill it to 75–80% disk use and then use it to verify query performance. But something was wrong. Indexing was progressing at a creeping rate instead of what one would expect from Lucene. To be fair, our system is very different from the setups used in published Lucene indexing benchmarks, and reaching 35–45 GB per hour on one Solr host has been our upper limit with the current workflow. However, I was not getting even 10% of this target.

After reading the Solr logs (besides turning on the update handler logs, enabling infoStream was helpful), I found that Solr was creating hundreds of index-writing threads within a minute of indexing. Parallel processing is great when the tasks are totally or mostly independent. However, each Lucene indexing thread opens a new segment in memory, so hundreds of parallel indexing threads mean that hundreds of segments are being written in parallel. Assuming 20 GB of memory allotted for indexing and 1,000 threads in parallel, each such segment will be about 20 MB on average. In practice I was seeing even smaller segments being flushed to disk (the in-memory structure is fluffier than the on-disk segment structure).

For faster batch indexing, we stop commits during full indexing and postpone merging until the complete indexing finishes; after that we merge the segments down to a manageable number (25 to 50) for good query performance. Before merging, our expected number of segments for 800 GB of index is around 4,000. With Solr 6, however, it would have been anywhere between 20 and 50 thousand, slowing down the indexing process. That also means more open files at a time (each segment is flushed into 10 or so files unless useCompoundFile=true in solrconfig), more disk I/O, and a longer merge phase. Most probably, the indexing or merging process would have crashed if left to run to completion, due to memory pressure, file handles, or exceeding disk space (an unmerged index is very fluffy).

After digging around, I found that Lucene’s max indexing thread limit (default value 10) had been removed a few versions ago. Elasticsearch had introduced a threadpool for managing its indexing interaction with Lucene, but there was no such substitute in Solr. I opened this JIRA: https://issues.apache.org/jira/browse/SOLR-11504
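To put rough numbers on that reasoning, here is a quick back-of-the-envelope sketch. The buffer size, thread count, and flushed-segment size are the illustrative figures from the paragraph above, not our exact production settings:

```java
public class SegmentMath {
    public static void main(String[] args) {
        double indexingBufferMB = 20 * 1024;   // ~20 GB allotted for indexing (illustrative)
        int unboundedThreads = 1000;           // roughly what we saw with no thread limit
        double indexSizeMB = 800 * 1024;       // ~800 GB of unmerged index

        // Every concurrent indexing thread holds its own in-memory segment,
        // so the buffer is split across all of them before flushing.
        double segmentMB = indexingBufferMB / unboundedThreads;   // ~20 MB per segment
        double unboundedSegments = indexSizeMB / segmentMB;       // ~40,000 segments

        // With flushes in the ~200 MB range, as in our pre-upgrade workflow,
        // the same index ends up as roughly 4,000 segments.
        double expectedSegments = indexSizeMB / 200.0;

        System.out.printf("~%.0f MB per segment, ~%.0f segments unbounded vs ~%.0f expected%n",
                segmentMB, unboundedSegments, expectedSegments);
    }
}
```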

For now, I have fixed it locally by introducing a limit on the Lucene indexing threadpool, essentially restoring the feature that was removed from Lucene. Ideally, I should contribute it back to Solr by introducing a threadpool in the UpdateHandler code.
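The idea is simple enough to sketch. The class below is a minimal illustration, not my actual patch: it uses a plain java.util.concurrent.Semaphore to cap how many threads can be inside Lucene’s IndexWriter.addDocument at once, which is effectively what the old maxIndexingThreads setting used to do.

```java
import java.io.IOException;
import java.util.concurrent.Semaphore;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

/**
 * Minimal sketch (not the actual Solr patch): bound the number of threads
 * concurrently adding documents to a Lucene IndexWriter, so that only a
 * limited number of in-memory segments exist at any time.
 */
public class BoundedIndexer {
    private final IndexWriter writer;
    private final Semaphore permits;

    public BoundedIndexer(IndexWriter writer, int maxIndexingThreads) {
        this.writer = writer;
        this.permits = new Semaphore(maxIndexingThreads);
    }

    public void addDocument(Document doc) throws IOException, InterruptedException {
        permits.acquire();              // block once the thread limit is reached
        try {
            writer.addDocument(doc);    // each concurrent caller writes its own in-memory segment
        } finally {
            permits.release();
        }
    }
}
```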

After the above change, I saw a decent improvement in indexing speed and was ready to test in production. Meanwhile, a security patch required me to upgrade from Solr 6.5 to 6.6, and then eventually to Solr 7.0.1 because of SOLR-11361. For whatever reason, SOLR-11361 kept my cores from loading properly most of the time, while for other early adopters in the community it was only a harmless nuisance. Luckily, almost all of my work on the installation scripts and Puppet carried over to Solr 7 without issues.

Disk I/O Throttle

While investigating the full-indexing speed, I went over recent Lucene changes that could impact indexing performance. I noticed that Lucene had introduced an I/O throttling mechanism so that segment writing doesn’t affect concurrent query performance. Since we cannot serve queries from a cluster that is going through a full reindex, it made sense to work around the throttle (SOLR-11200).
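SOLR-11200 covers the Solr-side plumbing; at the Lucene level the throttle lives in ConcurrentMergeScheduler. The sketch below shows the underlying knob as if you were wiring up an IndexWriterConfig by hand; it illustrates the mechanism rather than how our clusters are actually configured (in Solr this is driven from solrconfig):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;

/**
 * Sketch: disable Lucene's automatic merge I/O throttling when a node is
 * doing a full reindex and is not serving queries, so segment writing can
 * use the full disk bandwidth.
 */
public class UnthrottledConfig {
    public static IndexWriterConfig build() {
        ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
        scheduler.disableAutoIOThrottle();   // let merge writes go at full disk speed

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setMergeScheduler(scheduler);
        return config;
    }
}
```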

I also gained about a 25 to 40% improvement in indexing speed by modifying the concurrency mechanism in our full-indexer. That can be a topic for a future post.
