What is ELASTICSEARCH
Elasticsearch is a search engine that is becoming more and more popular even in small and medium-sized companies.
Although big players like Netflix Microsoft, eBay, Facebook use it.
Its features can significantly improve the performance of any business.
This article shares 6 fundamental concepts of Elasticsearch that are worth knowing before using your systems.
In another article, we explained in brief what Elasticsearch is.
Here are the 6 non-trivial concepts useful for understanding Elasticsearch:
- Elastic Stack
- Two types of data sets
- The search score
- The model of data
- Planning of the shards
- Types of Node
Elastic Stack
Elasticsearch was initially developed as a standalone product.
Its only role was to provide a scalable search engine that could be used through any language.
So it was created with a distributed model at the core with a REST API interface for communication.
After an early adoption phase, new tools were invented to work with Elasticsearch.
KIBANA, for data visualization and analysis, and LOGSTASH, for log file collection, came along.
Currently, several tools are all developed under the care of the ELASTIC company:
- Elasticsearch: For searching,
- Kibana: Analysis and data visualization,
- Logstash: Server-side data processing pipeline,
- Beats: Single-use data loaders,
- Elastic Cloud: Hosting of Elasticsearch clusters,
- Machine Learning: Discover data patterns,
- APM: Application performance monitoring,
- Swiftype: One-click site search.
The number of tools is growing every year, enabling companies to achieve new goals and create new opportunities.
Two types of data sets
Basically, in Elasticsearch, you can index (i.e., store) any data.
But there are two classes, which substantially impact cluster configuration and management: static and temporal data sets.
Static data are datasets that can grow or change slowly, like a catalog or inventory of items.
They can be thought of as the data you store in regular databases, like blog posts, library books, orders, etc.
You index this data in Elasticsearch to allow for super-fast searches that ridicule the performance of traditional SQL databases.
On the other hand, you can store temporal datasets.
These can be events associated with a moment in time that typically grows rapidly, such as log files or metrics.
They are indexed in Elasticsearch for data analysis, pattern discovery, and system monitoring.
Depending on the type of data stored, you need to model the cluster differently.
For static data, you should choose a fixed number of indexes and shards.
They will not grow very fast, and you will always search through all the documents in the dataset.
For time-series data, you should choose rolling indexes with time limits.
It will be the recent data that will be queried most often, and eventually, you will need to delete or at least archive the obsolete documents to save space on your machines.
The search score
The primary purpose of Elasticsearch is to provide a search engine.
The goal is to provide the best matching documents.
But how does Elasticsearch know which ones are which?
For each search query, Elasticsearch calculates a relevance score.
The score is based on the tf-IDF algorithm, which stands for Term Frequency-Inverse Document Frequency.
This algorithm calculates two values:
- The first: term frequency, indicates how often a particular term is used in a document.
- The second: inverse document frequency tells how unique a given term is in all documents.
For example, if we have two documents:
To be or not to be, that is the question.
To be. I am. You are. He, she is.
The TF for the term “question” is
- for document 1: 1/10 (1 occurrence out of 10 terms)
- for document 2: 0/9 (0 occurrences out of 9 terms).
On the other hand, the IDF is calculated as a single value for an entire dataset.
It is a ratio of all documents to documents containing the searched term.
In our case it is:
log(2/1) = 0.301(2 -number of all documents, 1 -number of documents containing the term ‘question’).
Finally, the tf-idf score for both documents is calculated as the product of both values:
- document 1: 1/10 x 0.301 = 0.1 * 0.301 = 0.03
- document 2: 0/9 x 0,301 = 0 * 0,301 = 0,00
Now we see that document 1 has a relevance of value 0.03, while document 2 got 0.00.
Therefore document 1 will be published higher in the results list.
You can read the description of a practical example of using ‘search score’ in this article about Elasticsearch and WordPress.
Data Model
Elasticsearch has two performance advantages.
- It is horizontally scalable
- It is swift.
Where does this last feature come from?
It comes from the way the data is stored.
When you index a document, it goes through three steps:
- A Character filters
- A Tokenizer
- A Token filters
This is the way elastisearch normalize the document.
For example, a document like “To be or not to be, that is the question” is stored as “to be or not to be that is the questions”, and punctuation marks are removed, and all letters become lowercase.
But that’s not all.
It can also be stored as a question if the non-significant word filter is applied, which removes all common language terms like to, be, or, not, that, is, the.
There are lists of so-called ‘stop words’ in many languages.
This is the indexing part. But the same steps are applied when searching for documents.
The query is also filtered for characters, tokenized, and screened for tokens.
So Elasticsearch searches for documents with the normalized terms.
The fields in Elasticsearch are stored in an inverted index structure, and it makes the collection of matching documents very fast.
You can define specific filters for each field.
The definitions are grouped in structures called analyzers.
Depending on the goal, there are different analyzers to analyze a field.
So in a search phase, you can define which type of field you want to scan to get the desired results.
Applying these modes, Elasticsearch can provide results much faster than regular databases.
Planning of shards
One of the most frequently asked questions by Elasticsearch newbies is: how many shards and indexes should I have?
This question arises because the number of shards can only be set at the beginning of index creation.
So the answer depends on the dataset you intend to index.
The rule of thumb is that shards should consist of 20-40 GB of data.
The concept of ‘SHARDS’ comes from APACHE LUCENE (the search engine used under the hood of Elasticsearch).
Keeping in mind all the facilities and overhead Apache Lucene uses for inverted indexes and fast searches, it doesn’t make sense to have tiny shards, such as 100 MB or 1 GB.
We, like many other consultants, recommend a size of 20-40 GB.
You have to keep in mind that a shard cannot be further split and permanently resides on a single node, but,
shard of this size can easily be moved to other nodes or replicated, if needed, within a cluster.
Having this sharding capability offers a recommended tradeoff between speed and memory consumption.
Obviously, in any particular case, the performance metrics may show something different, so keep in mind that this is only a recommendation, but it is always possible to achieve other performance goals.
To know how many shards per index you should have, you can estimate it by indexing several documents in a temporary index and see how much memory they are consuming and how many of them you expect to have over a period of time (in a time series dataset) or globally (in a static dataset).
Don’t forget that even if you misconfigure the number of shards or indexes, you can always reindex the data in a new index with a different number of shards set.
Last but not least, remember that you can always run queries for multiple indexes at once.
For example, you can have rotating indexes for log-based data with daily retention and query for results for all days of the past month in one query.
Querying 30 indexes with one shard has the same performance impact as querying one index with 30 shards.
Node Types
Elasticsearch nodes can serve multiple roles.
By default, which is suitable for small clusters, they can serve all of them.
The roles we are talking about are:
- Master node
- Data node
- Ingest node
- Coordinating-only node.
Each role has its consequences:
- Master nodes are responsible for cluster-wide settings and changes, such as creating or deleting indexes, adding or removing nodes, and allocating shards to nodes.
Each cluster should consist of at least three master-eligible nodes, and in reality, it is not necessary to have more.
Of all the master-eligible nodes, one will be the master node, and its role is to perform cluster-wide actions. The other two nodes are needed exclusively for high availability.
Master nodes have low requirements on CPU, RAM, and disk storage. - Data nodes are used for data storage and search.
So they have high requirements on all resources: CPU, RAM, and disk.
The more data you have, the higher the expectations regarding resource allocation. - Import nodes (Ingest nodes) are used for preprocessing of documents before the actual indexing takes place.
They intercept bulk and index queries, apply transformations, and then transmit the documents back to the index or bulk API.
They require little disk, medium RAM, and high CPU. - Coordinating-only nodes are used as a load balancer for client queries.
They know where specific documents may reside and serve search requests only to those nodes.
They then perform “scatter & gatter” actions on the received results.
The requirements for them are little disk, medium or high RAM, and medium or high CPU.
Each node can perform one or more of the above roles.
Any node plays the coordination role.
To have a coordination-only node, you have to disable all other roles on it in the configuration.
Here’s some advice on the most common way to configure a cluster:
- Three primary nodes, not exposed to the world and maintaining cluster state and cluster settings.
- A couple of coordination-only nodes: they listen for external requests and act as intelligent load balancers for the entire cluster.
- A number of data nodes, depending on the needs of the dataset.
- A couple of import nodes if you are running an import pipeline and want to relieve other nodes of the impact of document preprocessing.
The specific numbers depend on your particular use case and should be sized based on performance testing.