Before anything else: this article is mainly for my own reference, and most of the content is copied from the Elasticsearch documentation. If you are looking for detailed documentation, this link might be a perfect place to start.
Anyway, Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications with complex search features and requirements.
Sample use cases might be:
- You run an online web store where you allow your customers to search for products that you sell. You can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them.
- You run a price alerting platform which allows price-savvy customers to specify a rule like “I am interested in buying a specific electronic gadget and I want to be notified if the price of the gadget falls below $X from any vendor within the next month”. In this case you can scrape vendor prices, push them into Elasticsearch, and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found.
- You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records).
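The price-alert use case is easier to picture with a toy model. The sketch below is plain Python and does not call Elasticsearch or its Percolator API at all; `PriceAlert`, `match_alerts`, and the sample data are made-up names for illustration. The only idea it borrows is the "reverse search": the queries are stored first, and each incoming document is matched against them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceAlert:
    """A stored customer rule: notify when `product` drops below `max_price`."""
    customer: str
    product: str
    max_price: float


def match_alerts(alerts, price_doc):
    """Return the customers whose stored rule matches the incoming price document."""
    return [
        alert.customer
        for alert in alerts
        if alert.product == price_doc["product"]
        and price_doc["price"] < alert.max_price
    ]


# Rules are registered up front (hypothetical sample data).
alerts = [
    PriceAlert("alice", "headphones", 50.0),
    PriceAlert("bob", "headphones", 30.0),
    PriceAlert("carol", "camera", 400.0),
]

# A scraped vendor price arrives later and is checked against every stored rule.
scraped = {"vendor": "example-shop", "product": "headphones", "price": 42.5}
notified = match_alerts(alerts, scraped)
print(notified)  # ['alice']
```

A real percolator would index the stored queries so that not every rule has to be checked for every document; the linear scan here only shows the direction of the matching.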
Near Real Time (NRT)
Elasticsearch is a near real time search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.
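To make the "indexed but not yet searchable" window concrete, here is a minimal sketch in plain Python. It assumes the delay comes from a periodic refresh step that makes buffered documents visible; `NearRealTimeIndex` and its methods are hypothetical names, not Elasticsearch classes, and the refresh is triggered by hand instead of by a timer.

```python
class NearRealTimeIndex:
    """Toy index: documents become searchable only after a refresh."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet visible to searches
        self._searchable = []  # visible to searches

    def index(self, doc):
        """Accept a document; it is not searchable yet."""
        self._buffer.append(doc)

    def refresh(self):
        """Make everything indexed so far visible to searches."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term):
        """Return searchable documents whose title contains `term`."""
        return [d for d in self._searchable if term in d["title"].lower()]


idx = NearRealTimeIndex()
idx.index({"title": "Wireless mouse"})
before = idx.search("mouse")  # nothing yet: no refresh has happened
idx.refresh()
after = idx.search("mouse")   # the document is visible now
print(before, after)
```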
Cluster
A cluster is a collection of one or more nodes (servers) that together hold all of your data and provide federated indexing and search capabilities across all nodes.
In a single cluster, you can have as many nodes as you want.
Node
A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities.
By default, each node is set up to join a cluster named “elasticsearch”.
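The default cluster name matters because nodes that share a name end up in the same cluster. The sketch below models only that naming rule in plain Python; it ignores network discovery entirely, and `group_nodes_by_cluster` plus the sample node settings are assumptions made up for illustration. The keys `cluster.name` and `node.name` mirror the setting names used in Elasticsearch configuration.

```python
from collections import defaultdict

DEFAULT_CLUSTER_NAME = "elasticsearch"


def group_nodes_by_cluster(nodes):
    """Group node names by the cluster name each node is configured to join."""
    clusters = defaultdict(list)
    for node in nodes:
        # A node without an explicit cluster name falls back to the default.
        cluster_name = node.get("cluster.name", DEFAULT_CLUSTER_NAME)
        clusters[cluster_name].append(node["node.name"])
    return dict(clusters)


# Hypothetical node settings: two nodes use the default, one sets its own name.
nodes = [
    {"node.name": "node-1"},
    {"node.name": "node-2"},
    {"node.name": "node-3", "cluster.name": "logging-prod"},
]
clusters = group_nodes_by_cluster(nodes)
print(clusters)
# {'elasticsearch': ['node-1', 'node-2'], 'logging-prod': ['node-3']}
```

This is also why leaving the default name in place can be risky: a node started with default settings may end up joining a cluster you did not intend it to join.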
Index
An index is a collection of documents that have somewhat similar characteristics.
For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data.
In a single cluster, you can define as many indexes as you want.
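Type
Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics are completely up to you. (Note that mapping types were deprecated in Elasticsearch 6.x and removed in later major versions, so this concept applies to older releases.)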
Document
A document is a basic unit of information that can be indexed.
For example, you can have a document for a single customer, another document for a single product, and yet another for a single order.
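To see how index, type, and document fit together, here is a small in-memory sketch in plain Python. It assumes documents are JSON objects addressed by an index name, a type name, and an id, which matches how older Elasticsearch versions address documents; `TinyStore` and the sample names are hypothetical, and nothing here talks to a real cluster.

```python
import json


class TinyStore:
    """Toy document store keyed by (index, type, id)."""

    def __init__(self):
        self._docs = {}

    def put(self, index, doc_type, doc_id, body):
        """Store a document; the JSON round trip rejects non-serializable bodies."""
        self._docs[(index, doc_type, doc_id)] = json.loads(json.dumps(body))

    def get(self, index, doc_type, doc_id):
        """Return the stored document, or None if it does not exist."""
        return self._docs.get((index, doc_type, doc_id))

    def count(self, index):
        """Count the documents stored in one index, across all of its types."""
        return sum(1 for (idx, _type, _id) in self._docs if idx == index)


store = TinyStore()
store.put("customer", "external", "1", {"name": "John Doe"})
store.put("customer", "external", "2", {"name": "Jane Doe"})
store.put("product", "catalog", "1", {"title": "Wireless mouse", "price": 19.99})

print(store.get("customer", "external", "1"))  # {'name': 'John Doe'}
print(store.count("customer"))                 # 2
```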
Shards & Replicas
- Elasticsearch provides the ability to subdivide your index into multiple pieces called shards.
- When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent “index” that can be hosted on any node in the cluster.
- Sharding is important for two primary reasons:
  - It allows you to horizontally split/scale your content volume.
  - It allows you to distribute and parallelize operations across shards (potentially on multiple nodes), thus increasing performance/throughput.
- In a network/cloud environment where failures can be expected at any time, it is very useful and highly recommended to have a failover mechanism in case a shard/node goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.
- Replication is important for two primary reasons:
  - It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
  - It allows you to scale out your search volume/throughput, since searches can be executed on all replicas in parallel.
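The two ideas above, splitting an index into shards and keeping replicas away from their primaries, can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Elasticsearch's allocator: `shard_for` and `allocate` are hypothetical helpers, `zlib.crc32` stands in for whatever hash function the real engine uses, and the round-robin placement is just the simplest scheme that satisfies the "never on the same node" rule.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Pick a shard for a document: hash of the id modulo the shard count."""
    # crc32 is only a stand-in hash chosen for this sketch.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def allocate(nodes, num_primary_shards, num_replicas):
    """Place each primary and its replicas on distinct nodes (round robin)."""
    if num_replicas >= len(nodes):
        # This sketch simply refuses when there are not enough nodes to keep
        # every copy of a shard on its own node.
        raise ValueError("need more nodes than replicas per shard")
    layout = {}
    for shard in range(num_primary_shards):
        # Consecutive nodes are distinct because num_replicas + 1 <= len(nodes).
        copies = [nodes[(shard + k) % len(nodes)] for k in range(num_replicas + 1)]
        layout[shard] = {"primary": copies[0], "replicas": copies[1:]}
    return layout


nodes = ["node-1", "node-2", "node-3"]
layout = allocate(nodes, num_primary_shards=3, num_replicas=1)
for shard, placement in layout.items():
    print(shard, placement)

# The same id always routes to the same shard.
target = shard_for("customer-42", 3)
print("customer-42 lives on shard", target, "->", layout[target])
```

With this layout, losing any single node still leaves one copy of every shard on the remaining nodes, and a search for a given shard can be served by either its primary or its replica.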