How to Build a Semantic Search Engine Using Elasticsearch and SBERT

Semantic search is a type of search that focuses on the meaning of words and phrases in order to provide more relevant search results. Semantic search algorithms try to understand the intent behind a user's query and return results that match the meaning of the query, rather than just matching the exact words used. This can help to improve the accuracy of search results, especially for complex or ambiguous queries.

Imagine that you are planning a trip to Paris and want to learn about the city's famous landmarks and attractions. You enter the query "What to see in Paris?" into a search engine that uses semantic search. The algorithm analyzes the meaning of the query and returns results relevant to sightseeing in Paris, such as pages about the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and other popular attractions. The results focus on things to see in Paris rather than on pages that merely happen to contain the words "What to see in Paris?"

In contrast, a traditional search engine that does not use semantic search might return a mix of results that are only loosely related to your query: pages about other cities with similar names, pages about the history of Paris, or pages that contain the phrase "What to see in Paris?" but are not actually relevant to sightseeing in the city. Semantic search improves the relevance of the results by matching on the meaning of the query rather than on its exact wording.
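To make the contrast concrete, here is a minimal sketch that scores two documents against the Paris query in two ways: by shared keywords and by cosine similarity between embedding vectors. The three-dimensional vectors are hand-picked toy values chosen only for illustration; in a real system they would be produced by a sentence-embedding model such as SBERT and would have hundreds of dimensions. The document texts and the function names are also hypothetical.

```python
import math

def keyword_overlap(query, doc):
    """Fraction of query words that also appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = "what to see in paris"
docs = {
    "landmarks": "visit the eiffel tower and the louvre museum",
    "phrase_only": "a forum post titled what to see in paris with no replies",
}

# Toy embeddings (assumption): hand-picked so that the landmarks page
# points in roughly the same direction as the query. A real model
# would compute these from the text.
query_vec = [0.9, 0.1, 0.2]
doc_vecs = {
    "landmarks": [0.8, 0.2, 0.1],
    "phrase_only": [0.2, 0.9, 0.3],
}

for name, text in docs.items():
    kw = keyword_overlap(query, text)
    cos = cosine_similarity(query_vec, doc_vecs[name])
    print(f"{name}: keyword={kw:.2f} cosine={cos:.2f}")
```

Keyword matching favors the page that repeats the query verbatim, while the embedding comparison favors the page about landmarks even though it shares no words with the query. The rest of this project builds on that second kind of comparison.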

Dataset

The dataset we are going to use to build this project is called Online Job Postings. It can be downloaded from Kaggle.

The dataset consists of 19,000 job postings that were posted through the Armenian human resource portal CareerCenter. The data was extracted from the Yahoo! mailing group https://groups.yahoo.com/neo/groups/careercenter-am. This was the only online human resource portal in the early 2000s.

The dataset contains the following columns:

 1. jobpost - The original job post
 2. date - Date it was posted in the group
 3. Title - Job title
 4. Company - Employer
 5. AnnouncementCode - Announcement code (an internal code, usually missing)
 6. Term - Full-time, part-time, etc.
 7. Eligibility - Eligibility of the candidates
 8. Audience - Who can apply
 9. StartDate - Start date of work
10. Duration - Duration of the employment
11. Location - Employment location
12. JobDescription - Job description
13. JobRequirment - Job requirements
14. RequiredQual - Required qualification
15. Salary - Salary
16. ApplicationP - Application procedure
17. OpeningDate - Opening date of the job announcement
18. Deadline - Deadline for the job announcement
19. Notes - Additional notes
20. AboutC - About the company
21. Attach - Attachments
22. Year - Year of the announcement (derived from the field date)
23. Month - Month of the announcement (derived from the field date)
24. IT - TRUE if the job is an IT job. This variable is created by a simple search of IT job titles within column “Title”.
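As a rough sketch of how these columns might be prepared for semantic search, the snippet below reads the postings with Python's built-in csv module and joins a few text columns into a single string per posting, which could later be passed to an embedding model. The choice of columns in `TEXT_COLUMNS`, the helper names, and the sample rows are assumptions made for illustration; they are not prescribed by the dataset. The file name in the final comment is also an assumption about how the Kaggle download is named.

```python
import csv
import io

# Columns to combine into the searchable text (an illustrative choice).
# "JobRequirment" is spelled as it appears in the dataset.
TEXT_COLUMNS = ("Title", "JobDescription", "JobRequirment", "RequiredQual")

def build_search_text(row):
    """Join the non-empty text columns of one posting into one string."""
    parts = [(row.get(col) or "").strip() for col in TEXT_COLUMNS]
    # Collapse internal runs of whitespace and skip empty columns.
    return " ".join(" ".join(p.split()) for p in parts if p)

def load_postings(file_obj):
    """Read postings from an open CSV file, skipping rows with no text."""
    postings = []
    for row in csv.DictReader(file_obj):
        text = build_search_text(row)
        if text:
            postings.append({
                "title": (row.get("Title") or "").strip(),
                "company": (row.get("Company") or "").strip(),
                "text": text,
            })
    return postings

# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO(
    "Title,Company,JobDescription,JobRequirment,RequiredQual\n"
    "Software Developer,Example LLC,Builds web applications.,,Knowledge of Python\n"
    ",Example LLC,,,\n"
)
postings = load_postings(sample)
print(len(postings), postings[0]["text"])

# With the downloaded file (file name is an assumption):
# with open("data job posts.csv", newline="", encoding="utf-8") as f:
#     postings = load_postings(f)
```

Taking an open file object instead of a path keeps the function easy to try on a small in-memory sample before running it on all 19,000 postings.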