Working with Elasticsearch on Linux Using Python Client
Elasticsearch
Elasticsearch is a distributed, open-source search and analytics engine based on the Apache Lucene search library. It's designed to provide fast and scalable search and analysis capabilities for large volumes of data.
At its core, Elasticsearch is a document-oriented database that stores data in JSON format. This allows it to index and search through data quickly and efficiently. Elasticsearch uses a powerful query language called Elasticsearch Query DSL to perform complex search queries on this data.
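For instance, a Query DSL query is itself just a JSON document, so in Python it can be written as a plain dict. Here is a minimal sketch of a match query; the field name and search text are illustrative, not from any particular index:

```python
import json

# A minimal Query DSL "match" query as a Python dict.
# The field name ("title") and the search text are illustrative.
query = {
    "query": {
        "match": {
            "title": "mechanical keyboard"
        }
    }
}

# A client serializes this dict to JSON before sending it to Elasticsearch.
print(json.dumps(query))
```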
One of the key features of Elasticsearch is its distributed architecture. This means that it can automatically distribute data and search requests across multiple servers, allowing it to scale horizontally and handle large amounts of data.
Elasticsearch also provides many useful features and capabilities out-of-the-box, including full-text search, faceted search, and real-time analytics. It can be easily integrated into existing applications and systems, making it a versatile and powerful tool for a wide range of use cases.
Installing Elasticsearch
I am using Ubuntu 20.04.5 LTS, so let's first install Elasticsearch on this machine. The following instructions should also work on Ubuntu 20.04 and later releases.
First, import the Elasticsearch public GPG key into a keyring that APT can use to verify packages:
curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elastic.gpg
Next, add the Elastic source list to the sources.list.d directory, where apt will search for new sources:
echo "deb [signed-by=/usr/share/keyrings/elastic.gpg] https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
The [signed-by=/usr/share/keyrings/elastic.gpg] portion of the file instructs apt to use the key that you downloaded to verify repository and file information for Elasticsearch packages.
Next, update your package lists so APT will read the new Elastic source:
sudo apt update
Then install Elasticsearch with this command:
sudo apt install elasticsearch
Press Y when prompted to confirm installation. If you are prompted to restart any services, press ENTER to accept the defaults and continue. Elasticsearch is now installed and ready to be configured.
Configuring Elasticsearch
To configure Elasticsearch, we will edit its main configuration file, elasticsearch.yml, where most of its configuration options are stored. This file is located in the /etc/elasticsearch directory.
Use your preferred text editor to edit Elasticsearch's configuration file. Here, I'll use nano:
sudo nano /etc/elasticsearch/elasticsearch.yml
The elasticsearch.yml file provides configuration options for your cluster, node, paths, memory, network, discovery, and gateway. Most of these options are preconfigured in the file, but you can change them according to your needs. For the purposes of our single-server demonstration, we will only adjust the setting for the network host.
Elasticsearch listens for traffic from everywhere on port 9200. You will want to restrict outside access to your Elasticsearch instance to prevent outsiders from reading your data or shutting down your Elasticsearch cluster through its REST API.
To restrict access and therefore increase security, find the line that specifies network.host, uncomment it, and replace its value with localhost so it reads like this:
/etc/elasticsearch/elasticsearch.yml
. . .
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: localhost
. . .
We have specified localhost so that Elasticsearch listens only on the loopback interface and cannot be reached from other hosts. If you want it to listen on a specific interface instead, you can specify that interface's IP in place of localhost. Save and close elasticsearch.yml. If you're using nano, you can do so by pressing CTRL+X, followed by Y and then ENTER.
Start the Elasticsearch service with systemctl:
sudo systemctl start elasticsearch
Next, run the following command to enable Elasticsearch to start up every time your server boots:
sudo systemctl enable elasticsearch
Securing Elasticsearch
By default, Elasticsearch can be controlled by anyone who can access the HTTP API. This is not necessarily a security risk, because Elasticsearch listens only on the loopback interface (that is, 127.0.0.1), which can only be accessed locally. Thus, no public access is possible, and as long as all server users are trusted, security may not be a major concern.
We will now configure the firewall to allow access to the default Elasticsearch HTTP API port (TCP 9200) for a trusted remote host, such as 198.51.100.0. To allow access, type the following command:
sudo ufw allow from 198.51.100.0 to any port 9200
Once that is complete, you can enable UFW with the command:
sudo ufw enable
Finally, check the status of UFW with the following command:
sudo ufw status
If you have specified the rules correctly, you should receive output like this:
Output
Status: active

To                         Action      From
--                         ------      ----
9200                       ALLOW       198.51.100.0
22                         ALLOW       Anywhere
22 (v6)                    ALLOW       Anywhere (v6)
Testing Elasticsearch
By now, Elasticsearch should be running on port 9200. You can test it with cURL and a GET request.
curl -X GET 'http://localhost:9200'
You should receive the following response:
Output
{
  "name" : "elastic-22",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "DEKKt_95QL6HLaqS9OkPdQ",
  "version" : {
    "number" : "7.17.1",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "e5acb99f822233d62d6444ce45a4543dc1c8059a",
    "build_date" : "2022-02-23T22:20:54.153567231Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
If you receive a response similar to the one above, Elasticsearch is working properly.
Create Elasticsearch Connection and Index for DB
To make a connection with your Elasticsearch DB, add the following to your Python script:
from elasticsearch import Elasticsearch

# connect to the local cluster on the default port
es = Elasticsearch("http://localhost:9200")
By default, Elasticsearch runs on port 9200, and we are running the cluster on our local machine, so calling Elasticsearch() with no arguments connects to the same address.
Note: The above code works for the elasticsearch==7.17.4 client. For other versions, the connection syntax may differ.
Now, let's prepare our CSV file so that we can insert it into Elasticsearch. I will use Pandas to make some changes to the CSV data.
import pandas as pd

# read the CSV file from the disk
df = pd.read_csv("../file_dir/filename.csv")

# print all the columns
print(df.columns.tolist())
output
['product_url',
 'product_title',
 'product_rating',
 'product_caption',
 'product_description',
 'reviews',
 'img_links']
I have the above columns in my dataset.
Now, I will replace all null values with N/A.
df.fillna(value="N/A", inplace=True)
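As a quick illustration of this cleanup step, here it is applied to a toy DataFrame (the column names merely mirror the dataset above):

```python
import pandas as pd

# Toy frame with missing values, mimicking the product dataset.
df = pd.DataFrame({
    "product_title": ["Keyboard", None],
    "product_rating": [4.5, None],
})

# Replace every null cell with the string "N/A", as above.
df.fillna(value="N/A", inplace=True)
print(df.to_dict(orient="records"))
```

After this, no row contains NaN/None, so every field is safe to index as a string or number.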
Create Database Index for Elasticsearch
Let's create a DB index named db_name.
es.indices.create(index="db_name", ignore=400)  # ignore HTTP 400 if the index already exists
Now, let's check if the index has been created.
es.indices.exists(index="db_name")
If it has been created, this will return True.
Bulk Load Data into DB Index
Now, let's insert our Pandas DataFrame df into the index.
from elasticsearch import helpers

helpers.bulk(es, df.to_dict(orient="records"), index="db_name", timeout="300s")
df.to_dict(orient="records") converts the DataFrame into a list of dicts, one per row, which the client serializes into JSON documents. Read more on helpers in the Elasticsearch Python client documentation.
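To make the record conversion concrete, here is what to_dict(orient="records") produces on a small frame (the columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "product_title": ["Keyboard", "Mouse"],
    "product_rating": [4.5, 4.0],
})

# One dict per row -- the shape helpers.bulk expects for each document.
records = df.to_dict(orient="records")
print(records)
# [{'product_title': 'Keyboard', 'product_rating': 4.5},
#  {'product_title': 'Mouse', 'product_rating': 4.0}]
```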
This should successfully insert all of our data into the db_name index.
Search on the Elasticsearch Index
Let's search for something in our product_title field:
content_query = es.search(
    index="db_name",
    body={"query": {"match": {"product_title": "Mechanical Keyboards"}}},
    timeout="300s",
)

print(content_query)
This will display all the relevant content related to Mechanical Keyboards in JSON format.
If you want to display only particular content, then you can run something like the below:
for hit in content_query["hits"]["hits"]:
    print(hit["_source"]["product_url"])
This will only print the product_url of each hit related to Mechanical Keyboards.
Thanks for the read.
Author: Sadman Kabir Soumik
Posts in this Series
- Selfie Segmentation, Background Blurring and Removing Background From Selfie
- Building an Instagram Auto-Liker Bot - A Step-by-Step Guide
- Working with Elasticsearch on Linux Using Python Client
- Multi-class Text Classification Using Apache Spark MLlib
- Keyphrase Extraction with BERT Embeddings and Part-Of-Speech Patterns
- Rotate IP Address with Every HTTP Request to Bypass reCAPTCHA Using Tor Proxy