Solr Query Explain Plan
I have been working with Solr for a few years, and after a recent Solr upgrade I needed to figure out why some of our queries were returning results very different from our production (Solr 4 based) systems. There is a request parameter `debug=query|timing|results|true` which prints the query parser's output, the time spent in each step of the process, the score calculation, and (in the case of a distributed query) which shard each document came from.
Most of the debug output is easy to understand, but I felt the explain section needed some explanation of its own. Following is the explain section for a query which resulted in only one match. Let's try to understand it more:
explain: {
file_289380247558: " 2980.6357 = sum of: 948.12964 = max plus 0.01 times others of: 945.9705 = weight(name_exact:nawab.txt in 43) [SchemaSimilarity], result of: 945.9705 = score(doc=43,freq=1.0 = termFreq=1.0 ), product of: 60.0 = boost 19.578846 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from: 1.0 = docFreq 4.77612512E8 = docCount 0.80526584 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from: 1.0 = termFreq=1.0 1.2 = parameter k1 0.75 = parameter b 1.8854527 = avgFieldLength 3.0 = fieldLength 195.58952 = weight(name_ngram:nawab.txt in 43) [SchemaSimilarity], result of: 195.58952 = score(doc=43,freq=1.0 = termFreq=1.0 ), product of: 10.0 = boost 19.558952 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from: 1.0 = docFreq 4.68205056E8 = docCount 1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from: 1.0 = termFreq=1.0 1.2 = parameter k1 0.0 = parameter b (norms omitted for field) 20.325062 = weight(name_shingle:nawab txt in 43) [SchemaSimilarity], result of: 20.325062 = score(doc=43,freq=1.0 = termFreq=1.0 ), product of: 18.996464 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from: 2.0 = docFreq 4.4463024E8 = docCount 1.0699393 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from: 1.0 = termFreq=1.0 1.2 = parameter k1 0.75 = parameter b 4.7607 = avgFieldLength 4.0 = fieldLength 2032.5062 = max plus 0.01 times others of: 2032.5062 = weight(name_shingle:nawab txt in 43) [SchemaSimilarity], result of: 2032.5062 = score(doc=43,freq=1.0 = termFreq=1.0 ), product of: 100.0 = boost 18.996464 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from: 2.0 = docFreq 4.4463024E8 = docCount 1.0699393 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from: 1.0 = termFreq=1.0 1.2 = parameter k1 0.75 = parameter b 4.7607 
= avgFieldLength 4.0 = fieldLength "
}
file_289380247558 is the id of the document
2980.6357 is the document score for the given query.
How is that score reached? I assumed there would be tools to format the above one-liner into an easier-to-read mathematical notation, separated by arithmetic operators and nested with parentheses for precedence. There is this tool, but it didn't work for me. Hence the motivation to write this article.
Below I have formatted the above JSON "map" (albeit with only one key) with some indentation. file_289380247558 is the id of the resulting document; it is the value of whichever field you configure as the id field in your Solr schema.
Notice that Solr uses a "value = expression" format in the explain plan. We are used to seeing (boost = 60), but here we get 60.0 = boost.
file_289380247558: "
2980.6357 = sum of:
  948.12964 = max plus 0.01 times others of:
    945.9705 = weight(name_exact:nawab.txt in 43) [SchemaSimilarity], result of:
      945.9705 = score(doc=43,freq=1.0 = termFreq=1.0), product of:
        60.0 = boost
        19.578846 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
          1.0 = docFreq
          4.77612512E8 = docCount
        0.80526584 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          1.8854527 = avgFieldLength
          3.0 = fieldLength
    195.58952 = weight(name_ngram:nawab.txt in 43) [SchemaSimilarity], result of:
      195.58952 = score(doc=43,freq=1.0 = termFreq=1.0), product of:
        10.0 = boost
        19.558952 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
          1.0 = docFreq
          4.68205056E8 = docCount
        1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    20.325062 = weight(name_shingle:nawab txt in 43) [SchemaSimilarity], result of:
      20.325062 = score(doc=43,freq=1.0 = termFreq=1.0), product of:
        18.996464 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
          2.0 = docFreq
          4.4463024E8 = docCount
        1.0699393 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          4.7607 = avgFieldLength
          4.0 = fieldLength
  2032.5062 = max plus 0.01 times others of:
    2032.5062 = weight(name_shingle:nawab txt in 43) [SchemaSimilarity], result of:
      2032.5062 = score(doc=43,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        18.996464 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
          2.0 = docFreq
          4.4463024E8 = docCount
        1.0699393 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          4.7607 = avgFieldLength
          4.0 = fieldLength
"
Here is the reverse arithmetic:
2980.6357 = 948.12964 + 2032.5062
          = 945.9705 + 0.01 * (195.58952 + 20.325062) + 2032.5062
Here is the explanation for each of the above numbers:
final document score = 2980.6357
score from query fields (qf) = 948.12964
score from phrase fields (pf) = 2032.5062
tie = 0.01
Within the query fields and the phrase fields, the score is calculated by taking the max of all field matches and adding the sum of the remaining matches multiplied by tie.
Since there was only one match in the phrase fields, we do not see tie in action in that part of the calculation.
Solr documentation explains ‘tie’ as:
“The tie parameter specifies a float value (which should be something much less than 1) to use as a tiebreaker in DisMax queries. When a term from the user’s input is tested against multiple fields, more than one field may match. If so, each field will generate a different score based on how common that word is in that field (for each document relative to all other documents). The tie parameter lets you control how much the final score of the query will be influenced by the scores of the lower scoring fields compared to the highest scoring field.” Read more about tie here.
There are three matches in the query fields (name_exact, name_ngram, name_shingle) and one match in the phrase fields (name_shingle, which we also use for phrase queries).
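The "max plus tie times others" aggregation is easy to reproduce. Below is a minimal Python sketch; the function name `dismax` is mine, but the per-field scores are copied straight from the explain output above:

```python
# DisMax-style aggregation: the best-scoring field wins, and every other
# matching field contributes tie * its score.
TIE = 0.01

def dismax(field_scores, tie=TIE):
    top = max(field_scores)
    return top + tie * (sum(field_scores) - top)

# Per-field scores from the explain output above.
qf_score = dismax([945.9705, 195.58952, 20.325062])  # query fields (qf)
pf_score = dismax([2032.5062])                        # phrase fields (pf)
total = qf_score + pf_score                           # the final document score
```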
My query was nawab.txt and the matched document’s name was ‘alpha beta and nawab.txt’. Let’s dive deeper into how the query executed against one field, ‘name_exact’:
945.9705 = weight(name_exact:nawab.txt in 43) [SchemaSimilarity], result of:
  945.9705 = score(doc=43,freq=1.0 = termFreq=1.0), product of:
    60.0 = boost
    19.578846 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
      1.0 = docFreq
      4.77612512E8 = docCount
    0.80526584 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      1.8854527 = avgFieldLength
      3.0 = fieldLength
43 is Lucene’s internal id for my document. This is different from the Solr-level document id, which we saw above (file_289380247558).
Solr uses the following idf formula:
idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
The explain mentions ‘SchemaSimilarity’, probably because I am using the default BM25 similarity. The snippet below shows the BM25 tf formula with the default parameter values as used in Solr (k1 = 1.2, b = 0.75):
tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
from:
  1.0 = termFreq=1.0
  1.2 = parameter k1
  0.75 = parameter b
So, the score of one field (name_exact) on the document (which has Lucene internal id 43), for the query nawab.txt, was based on the following expression:
score = boost * idf * tfNorm
945.9705 = 60 * 19.578846 * 0.80526584
The boost was specified directly on the ‘name_exact’ field while querying (name_exact^60), so that checks out. The idf also seems reasonable.
However, ‘tfNorm’ had one component which caught my eye (fieldLength = 3). Here is the tfNorm snippet from the name_exact match:
0.80526584 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
  1.0 = termFreq=1.0
  1.2 = parameter k1
  0.75 = parameter b
  1.8854527 = avgFieldLength
  3.0 = fieldLength
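The whole name_exact calculation can be replayed in a few lines of Python. This is only a sketch of the arithmetic; every input (boost, docFreq, docCount, k1, b, the field lengths) is taken verbatim from the explain output above:

```python
import math

# Inputs as printed by the explain output for the name_exact match.
boost = 60.0
doc_freq, doc_count = 1.0, 4.77612512e8
k1, b = 1.2, 0.75
freq, field_length, avg_field_length = 1.0, 3.0, 1.8854527

# idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# BM25 tf normalization.
tf_norm = (freq * (k1 + 1)) / (
    freq + k1 * (1 - b + b * field_length / avg_field_length))

score = boost * idf * tf_norm  # ≈ 945.9705
```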
As I mentioned above, my file name is “alpha beta and nawab.txt”. My field (name_exact) doesn’t break on the dot and it filters out stop words, so the length of 3 (‘alpha’, ‘beta’, ‘nawab.txt’ after ‘and’ is removed) is correct. However, it was different from the Solr 4 fieldNorm calculation, where the length came out to 4 with the same schema and document name. This difference only affects queries on fields whose analysis drops terms, such as stop words. This bugfix in the field length calculation meant that we had to readjust our field boosts to get results ranked similarly to what we had been seeing in Solr 4.
Here is the comparable part of Solr 4’s explain plan for the same document:
3.680739 = (MATCH) weight(name_exact:nawab.txt^60.0 in 7544) [DefaultSimilarity], result of:
  3.680739 = score(doc=7544,freq=1.0 = termFreq=1.0), product of:
    0.3626161 = queryWeight, product of:
      60.0 = boost
      20.301023 = idf(docFreq=1, maxDocs=482344894)
      2.9769936E-4 = queryNorm
    10.150512 = fieldWeight in 7544, product of:
      1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
      20.301023 = idf(docFreq=1, maxDocs=482344894)
      0.5 = fieldNorm(doc=7544)
Notice that fieldNorm is 0.5, which is calculated as 1/sqrt(number of terms in the field) (see lengthNorm in ClassicSimilarity), which means the length was calculated as 4.
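Ignoring the lossy single-byte encoding Lucene uses when storing norms, ClassicSimilarity’s lengthNorm can be sketched as:

```python
import math

# ClassicSimilarity (Solr 4's DefaultSimilarity): lengthNorm = 1 / sqrt(terms).
# The real implementation also encodes the norm into a single byte, which
# loses precision; that detail is omitted here.
def length_norm(num_terms: int) -> float:
    return 1.0 / math.sqrt(num_terms)

# A field length of 4 yields the 0.5 fieldNorm seen in the Solr 4 explain.
```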