First, let's look at what happens when ELSA makes a query. ELSA uses Sphinx as its search engine, and it uses two types of Sphinx indexes. The first is a "temporary" index that ELSA initially creates for new logs; it stores the fields (technically, attributes) of the events in RAM, deduplicated, using Sphinx's default "extern" docinfo. The other is the "permanent" form, which stores the attributes using Sphinx's "inline" docinfo. With inline docinfo, the attributes are stored something like a database table: the name of the table is the keyword being searched, and the rows in the table correspond to the hits for that keyword.
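For reference, the docinfo mode is set per index in sphinx.conf. A minimal sketch, assuming hypothetical index names, sources, and paths (these are not ELSA's actual settings), might look like this:

index temp_logs
{
    source  = temp_logs_source
    path    = /var/sphinx/temp_logs
    docinfo = extern  # attributes held in RAM; the Sphinx default
}

index perm_logs
{
    source  = perm_logs_source
    path    = /var/sphinx/perm_logs
    docinfo = inline  # attributes stored alongside each keyword's document list
}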
So let's say we have log entries that look like this:
term1 term2 id=1, timestamp=x1, host=y, class=z
term1 term3 id=2, timestamp=x2, host=y, class=z
Sphinx's inline docinfo would store this as three total keywords, each with the list of attributes beneath it like a database table:
term1
id | timestamp | host | class
1 | x1 | y | z
2 | x2 | y | z
term2
id | timestamp | host | class
1 | x1 | y | z
term3
id | timestamp | host | class
2 | x2 | y | z
So when you query for +term1 +term2, Sphinx does a pseudo-SQL query like this:
SELECT * FROM term1 JOIN term2 ON term1.id = term2.id
Most terms are fairly rare, so the join is incredibly fast. However, consider a situation in which "term1" appears in hundreds of millions of events. If your query includes "term1," then the number of "rows" in the "table" for that term could be in the millions or even billions, making that JOIN extremely expensive, especially if you've asked the query to filter results to a specific time range or to do a group by.
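In the same pseudo-SQL terms, that expensive case looks roughly like this (the time filter and group-by are illustrative, not an actual Sphinx query):

SELECT host, COUNT(*)
FROM term1 JOIN term2 ON term1.id = term2.id
WHERE term1.timestamp BETWEEN x1 AND x2
GROUP BY host

Every one of those hundreds of millions of rows under "term1" has to be touched before the time filter and grouping can be applied.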
In addition to the slow querying, note that the disk space required to store the Sphinx indexes is a function of the number of attribute rows it must store in these pseudo-tables. So, a very common term will incur a massive disk cost to store its large pseudo-table.
Below is the count of the one hundred most common terms in a test dataset of ten million events. You can think of each bar as representing the number of rows in a pseudo-table, so a query for "0 -" (the two most common terms) would require joining a pseudo-table with 45,355,729 rows against another with 33,907,455 rows. Note how quickly the hit count of a given term drops off.
Stopwords
This is where Sphinx stopwords save the day. Sphinx's indexer has an option to calculate how frequently keywords occur in the data to be indexed. You invoke it by adding --buildstops <outfile> <n> --buildfreqs to the indexing command; it will find the n most frequent keywords and write them to outfile, along with a count of how many times each keyword appeared. A subsequent run of indexer, sans the stopword options, can refer to this file to ignore those n most frequent keywords. This saves a massive amount of disk space (expect savings of around 60 percent) and also guarantees that queries including these words won't take forever, because the index has no knowledge of them. (A sketch of this two-pass workflow follows the list below.)

However, this obviously means that those keywords can't be searched. To cope with this, ELSA has a configuration item in the elsa_web.conf file where you can specify a hash of stopwords. If a query attempts to search one of these keywords, then one of several things can happen:
- If some terms are stopwords and some are not, then the query will use the non-stopwords as the basis for the Sphinx search, and results will be filtered using the stopwords in ELSA.
- If all terms are stopwords, the query is run against the raw SQL database and Sphinx is not queried at all.
- If a query contains no keywords, just attributes (such as a query for just a class or a range query), the query will be run against the raw SQL database and not Sphinx.
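To make the two-pass workflow above concrete, here is a rough sketch; the index name, stopword count, and file paths are hypothetical, while the --buildstops/--buildfreqs switches and the stopwords directive come from Sphinx itself:

# First pass: find the 100 most frequent keywords and write them, with counts, to a file
indexer --config /etc/sphinx/sphinx.conf perm_logs --buildstops /etc/sphinx/stopwords.txt 100 --buildfreqs

# Point the index at that file in sphinx.conf so those keywords are ignored
index perm_logs
{
    ...
    stopwords = /etc/sphinx/stopwords.txt
}

# Second pass: index normally, sans the stopword options
indexer --config /etc/sphinx/sphinx.conf perm_logs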