Working with Elasticsearch on Linux Using Python Client
Elasticsearch
Elasticsearch is a distributed, open-source search and analytics engine based on the Apache Lucene search library. It's designed to provide fast and scalable search and analysis capabilities for large volumes of data.
At its core, Elasticsearch is a document-oriented database that stores data in JSON format. This allows it to index and search through data quickly and efficiently. Elasticsearch uses a powerful query language called Elasticsearch Query DSL to perform complex search queries on this data.
One of the key features of Elasticsearch is its distributed architecture. This means that it can automatically distribute data and search requests across multiple servers, allowing it to scale horizontally and handle large amounts of data.
Elasticsearch also provides many useful features and capabilities out-of-the-box, including full-text search, faceted search, and real-time analytics. It can be easily integrated into existing applications and systems, making it a versatile and powerful tool for a wide range of use cases.
Installing Elasticsearch
I am using Ubuntu 20.04.5 LTS, so let's first install Elasticsearch on this machine. The following instructions should work on Ubuntu 20.04 and later releases.
First, download the Elasticsearch public GPG key and convert it into a format that apt can use:
curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elastic.gpg
Next, add the Elastic source list to the sources.list.d directory, where apt will search for new sources:
1echo "deb [signed-by=/usr/share/keyrings/elastic.gpg] https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
The [signed-by=/usr/share/keyrings/elastic.gpg] portion of the file instructs apt to use the key that you downloaded to verify repository and file information for Elasticsearch packages.
Next, update your package lists so APT will read the new Elastic source:
sudo apt update
Then install Elasticsearch with this command:
sudo apt install elasticsearch
Press Y when prompted to confirm the installation. If you are prompted to restart any services, press ENTER to accept the defaults and continue. Elasticsearch is now installed and ready to be configured.
Configuring Elasticsearch
To configure Elasticsearch, we will edit its main configuration file, elasticsearch.yml, where most of its configuration options are stored. This file is located in the /etc/elasticsearch directory.
Use your preferred text editor to edit Elasticsearch's configuration file. Here, I'll use nano:
sudo nano /etc/elasticsearch/elasticsearch.yml
The elasticsearch.yml file provides configuration options for your cluster, node, paths, memory, network, discovery, and gateway. Most of these options are preconfigured in the file, but you can change them according to your needs. For the purposes of our single-server demonstration, we will only adjust the setting for the network host.
Elasticsearch listens for traffic from everywhere on port 9200. You will want to restrict outside access to your Elasticsearch instance to prevent outsiders from reading your data or shutting down your Elasticsearch cluster through its REST API.
To restrict access and therefore increase security, find the line that specifies network.host, uncomment it, and replace its value with localhost so it reads like this:
/etc/elasticsearch/elasticsearch.yml
. . .
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: localhost
. . .
We have specified localhost so that Elasticsearch listens only on the loopback interface and is not reachable from other machines. If you want it to listen on a specific interface instead, you can specify its IP in place of localhost. Save and close elasticsearch.yml. If you're using nano, you can do so by pressing CTRL+X, followed by Y and then ENTER.
Start the Elasticsearch service with systemctl:
sudo systemctl start elasticsearch
Next, run the following command to enable Elasticsearch to start up every time your server boots:
sudo systemctl enable elasticsearch
Securing Elasticsearch
By default, Elasticsearch can be controlled by anyone who can access the HTTP API. This is not always a security risk because Elasticsearch listens only on the loopback interface (that is, 127.0.0.1), which can only be accessed locally. Thus, no public access is possible, and as long as all server users are trusted, security may not be a major concern.
We will now configure the firewall to allow access to the default Elasticsearch HTTP API port (TCP 9200) for the trusted remote host, generally the server you are using in a single-server setup, such as 198.51.100.0. To allow access, type the following command:
sudo ufw allow from 198.51.100.0 to any port 9200
Once that is complete, you can enable UFW with the command:
sudo ufw enable
Finally, check the status of UFW with the following command:
sudo ufw status
If you have specified the rules correctly, you should receive output like this:
Output
Status: active

To                         Action      From
--                         ------      ----
9200                       ALLOW       198.51.100.0
22                         ALLOW       Anywhere
22 (v6)                    ALLOW       Anywhere (v6)
Testing Elasticsearch
By now, Elasticsearch should be running on port 9200. You can test it with cURL and a GET request.
curl -X GET 'http://localhost:9200'
You should receive the following response:
Output
{
  "name" : "elastic-22",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "DEKKt_95QL6HLaqS9OkPdQ",
  "version" : {
    "number" : "7.17.1",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "e5acb99f822233d62d6444ce45a4543dc1c8059a",
    "build_date" : "2022-02-23T22:20:54.153567231Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
If you receive a response similar to the one above, Elasticsearch is working properly.
Create Elasticsearch Connection and Index for DB
To make a connection to your Elasticsearch instance, write the following in your Python script:
from elasticsearch import Elasticsearch

# connect to the cluster running on our local machine
es = Elasticsearch([{"host": "localhost", "port": 9200}])

# equivalently, with no arguments the client defaults to localhost:9200
# es = Elasticsearch()
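To quickly confirm that the client can actually reach the cluster, you can ping it; a minimal check (assuming the es client created above) looks like this:

# es.ping() returns True if the cluster answers; es.info() returns the same
# metadata that the earlier curl test printed
if es.ping():
    print(es.info())
else:
    print("Could not connect to Elasticsearch on localhost:9200")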
By default, Elasticsearch runs on port 9200, and we are running the cluster on our local machine.
Note: The above code works for the elasticsearch==7.17.4 client version. For other client versions, the connection syntax may differ.
Now, let's prepare our CSV file so that we can insert it into Elasticsearch. I will use Pandas to make some changes to the CSV data.
import pandas as pd

# read the CSV file from the disk.
df = pd.read_csv("../file_dir/filename.csv")

# print all the columns
print(df.columns.tolist())
Output
['product_url',
 'product_title',
 'product_rating',
 'product_caption',
 'product_description',
 'reviews',
 'img_links']
I have the above columns in my dataset.
Now, I will replace all null values with N/A.
df.fillna(value="N/A", inplace=True)
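As a quick sanity check before indexing, you can confirm that no null values remain; this is just an optional verification step:

# count the remaining nulls per column; every value should now be 0
print(df.isna().sum())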
Create Database Index for Elasticsearch
Let's create a DB index named db_name.
es.indices.create(index="db_name", ignore=400)  # ignore 400 so re-running doesn't fail if the index already exists
Now, let's check if the index has been created.
es.indices.exists(index="db_name")
If it has been created, this should return True.
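The call above lets Elasticsearch infer field mappings dynamically when documents arrive. If you want explicit control over how fields are indexed, you can create the index with a mapping instead; the field names and types below are only an illustration based on the columns in my dataset:

mapping = {
    "mappings": {
        "properties": {
            "product_title": {"type": "text"},
            "product_description": {"type": "text"},
            "product_rating": {"type": "keyword"},
            "product_url": {"type": "keyword"},
        }
    }
}

# body= is the 7.x style; newer clients also accept mappings= directly
es.indices.create(index="db_name", body=mapping, ignore=400)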
Bulk Load Data into DB Index
Now, let's insert our Pandas DataFrame df into the index.
from elasticsearch import helpers

helpers.bulk(es, df.to_dict(orient="records"), index="db_name", timeout="300s")
df.to_dict(orient="records") will convert our DataFrame into a list of dictionaries, one per row, which the bulk helper indexes as JSON documents.
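For illustration, here is what the records orientation produces on a toy DataFrame (not the real dataset):

import pandas as pd

toy = pd.DataFrame({"product_title": ["Keyboard A", "Keyboard B"], "product_rating": [4.5, 4.1]})
print(toy.to_dict(orient="records"))
# [{'product_title': 'Keyboard A', 'product_rating': 4.5},
#  {'product_title': 'Keyboard B', 'product_rating': 4.1}]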
You can read more about the bulk helpers in the Elasticsearch Python client documentation.
This should successfully insert all of our data into the db_name index.
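helpers.bulk() returns a tuple of the number of successfully indexed documents and a list of errors, so a slightly more defensive version of the load (a sketch under the same assumptions as above) could look like this:

from elasticsearch import helpers

records = df.to_dict(orient="records")

# index the rows without raising on the first failure, then report the outcome
success, errors = helpers.bulk(es, records, index="db_name", raise_on_error=False)
print(f"Indexed {success} of {len(records)} documents")
if errors:
    print("First error:", errors[0])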
Search on the Elasticsearch Index
Let's search for something in our product_title field:
content_query = es.search(
    index="db_name",
    body={"query": {"match": {"product_title": "Mechanical Keyboards"}}},
    timeout="300s",
)

print(content_query)
This will display all the relevant content related to Mechanical Keyboards in JSON format.
If you want to display only particular fields, you can run something like the following:
for hit in content_query["hits"]["hits"]:
    print(hit["_source"]["product_url"])
This will only print the product_url values related to Mechanical Keyboards.
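If you only need a few fields or a limited number of hits, you can also push that filtering into the query itself with the standard size and _source search options (the field names here are the ones from the dataset above):

content_query = es.search(
    index="db_name",
    body={
        "query": {"match": {"product_title": "Mechanical Keyboards"}},
        "size": 5,                                    # return at most 5 hits
        "_source": ["product_title", "product_url"],  # only keep these fields per hit
    },
)

for hit in content_query["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["product_title"], hit["_source"]["product_url"])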
Thanks for reading.
Author: Sadman Kabir Soumik