Keyphrase Extraction with BERT Embeddings and Part-Of-Speech Patterns
Keyphrases are important pieces of information that can be extracted from text documents. These are words or phrases that summarize the main ideas or topics of a text, and they can be useful for a variety of applications, such as document summarization, text classification, and information retrieval. In this blog post, we will explore how keyphrases can be extracted from text documents, and discuss some of the techniques and tools that can be used for this task.
One of the most common ways to extract keyphrases from text is to use a technique called keyword extraction. This involves identifying the most important words or phrases in a text, based on their frequency, relevance, or other criteria. There are several algorithms and tools that can be used for keyword extraction, including term frequency-inverse document frequency (TF-IDF), Latent Semantic Indexing (LSI), and topic modeling. These methods can be used to identify the words or phrases that are most relevant to the topic of the text, and to filter out common words or stop words that are not useful for keyphrase extraction.
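To make the frequency-based idea concrete, here is a minimal TF-IDF sketch in plain Python. The three-document corpus and the helper names (`tokenize`, `tfidf_scores`) are made up for illustration and are not part of the dataset or the libraries used later in this post.

```python
import math
import re
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the covid boosters are available for adults",
    "the airline banned the passenger for life",
    "the health experts discuss covid boosters",
]

def tokenize(text):
    # Lowercase the text and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in docs]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_scores(tokens):
    # Score = (term frequency) * log(number of documents / document frequency).
    tf = Counter(tokens)
    n_docs = len(tokenized)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in tf.items()
    }

scores = tfidf_scores(tokenized[1])
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_terms)
```

A word like "the" appears in every document, so its score drops to zero even though it is the most frequent word in the second document. That is the filtering effect described above.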
Another approach to keyphrase extraction is to use natural language processing (NLP) techniques. NLP is a field of artificial intelligence that focuses on understanding and processing human language. It includes a wide range of techniques, such as part-of-speech tagging, syntactic parsing, and semantic analysis, that can be used to identify the structure and meaning of a text. By using NLP techniques, it is possible to extract keyphrases from a text by identifying the noun phrases, verb phrases, or other important parts of the text.
One of the most popular tools for keyphrase extraction is KeyBERT, an open-source Python library created by Maarten Grootendorst. KeyBERT does not train a language model of its own. Instead, it uses pre-trained transformer embeddings (by default, a sentence-transformers model) to find the phrases in a document that are most similar to the document as a whole. It can be applied to a wide range of texts, including news articles, scientific papers, and social media posts.
In this blog post, I will first show how to extract keyphrases using KeyBERT, and then show how we can use part-of-speech patterns to extract grammatically correct keyphrases.
Let's first load our data. The dataset can be downloaded from Kaggle. I am using Jupyter Notebook to run my code.
Install Dependencies
pip install pandas
pip install keybert
pip install keyphrase-vectorizers
Load Dataset
import pandas as pd

# set pandas options to display the full dataframe
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

df = pd.read_json('News_Category_Dataset_v3.json', lines=True)
df.head(3)
output (our dataset looks like below)
index | link | headline | category | short_description | authors | date |
---|---|---|---|---|---|---|
0 | https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9 | Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters | U.S. NEWS | Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall. | Carla K. Johnson, AP | 2022-09-23 |
1 | https://www.huffpost.com/entry/american-airlines-passenger-banned-flight-attendant-punch-justice-department_n_632e25d3e4b0e247890329fe | American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video | U.S. NEWS | He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles. | Mary Papenfuss | 2022-09-23 |
2 | https://www.huffpost.com/entry/funniest-tweets-cats-dogs-september-17-23_n_632de332e4b0695c1d81dc02 | 23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23) | COMEDY | "Until you have a dog you don't understand what could be eaten." | Elyse Wanshel | 2022-09-23 |
We only need some of these columns: we will extract keyphrases/keywords from the headline and short_description columns. First, I am going to merge both of these columns into one.
# keep only the columns we need; .copy() avoids pandas' SettingWithCopyWarning
df = df[['headline', 'short_description']].copy()

# merge headline and short description into one column
df['content'] = df['headline'] + '. ' + df['short_description']

df.head(2)
output
index | headline | short_description | content |
---|---|---|---|
0 | Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters | Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall. | Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters. Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall. |
1 | American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video | He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles. | American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video. He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles. |
From now on, I will use only the content column.
We will extract keyphrases from the first 50 rows only, as this is for demonstration purposes.
# take only the first 50 rows (copy so that adding columns later does not trigger SettingWithCopyWarning)
df = df[:50].copy()
Text Cleaning
In general, text cleaning is not necessary for transformer-based language models, as they are designed to handle a wide range of input text. However, depending on the specific application, there may be cases where some form of text cleaning can improve the performance of the model.
For example, if the input text contains a lot of noisy or irrelevant information, such as HTML tags or URLs, then removing this information could help the model focus on the relevant content and improve its performance. Additionally, if the input text is not in the correct format for the model (e.g. the text should be lowercase but the input text is not), then cleaning the text to correct this formatting could also improve the model's performance. In short, text cleaning is not always necessary for transformer-based language models, but in some cases it can be helpful. It is worth considering whether text cleaning could improve the performance of your model, and if so, implementing the appropriate cleaning steps.
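As a rough sketch of what such a cleaning step could look like, the function below strips HTML tags and URLs using only the standard library. The name `clean_text` and the sample string are my own for illustration, and the regular expressions are deliberately simple, so they will not cover every edge case.

```python
import html
import re

def clean_text(text):
    """Remove HTML tags and URLs, then collapse whitespace."""
    text = html.unescape(text)                 # turn entities such as &amp; into characters
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

sample = "<p>Boosters &amp; vaccines are <b>available</b> now.</p> https://example.com/story"
print(clean_text(sample))  # Boosters & vaccines are available now.
```

If we needed it, a function like this could be applied to the content column with `df['content'].apply(clean_text)` before extracting keyphrases.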
In our case, we are going to skip the text cleaning step.
Keyword/Keyphrase Extraction Without POS (part-of-speech) Pattern
I will use KeyBERT, which uses BERT embeddings and cosine similarity to find the sub-phrases in a document that are the most similar to the document itself.
First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document.
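The three steps above can be sketched in a few lines of plain Python. Please note the big assumption: `embed` here is a toy bag-of-words counter standing in for a real BERT embedding, so the scores are only illustrative. The helper names (`embed`, `cosine`, `candidates`, `rank`) are hypothetical and are not KeyBERT's API.

```python
import math
import re
from collections import Counter

def embed(text):
    # ASSUMPTION: toy bag-of-words vector standing in for a BERT embedding.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def candidates(text, ngram_range=(1, 3)):
    # Every word n-gram within the given range is a candidate phrase.
    tokens = re.findall(r"\w+", text.lower())
    lo, hi = ngram_range
    return sorted({
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    })

def rank(text, top_n=3):
    # Score each candidate against the whole document and keep the best ones.
    doc_vec = embed(text)
    scored = [(c, round(cosine(embed(c), doc_vec), 4)) for c in candidates(text)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

doc = "covid boosters arrive as experts weigh demand for covid boosters"
top = rank(doc)
for phrase, score in top:
    print(phrase, score)
```

The overall shape (embed the document, embed each candidate, sort by cosine similarity) is the same idea that KeyBERT applies with real transformer embeddings.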
from keybert import KeyBERT

# create a KeyBERT model object that can be used to extract keyphrases
model = KeyBERT()

def get_keyphrase_bert(text):
    # extract the top 5 (keyphrase, score) pairs from the text using the KeyBERT model
    keyphrase = model.extract_keywords(text, keyphrase_ngram_range=(1, 3), stop_words='english', top_n=5)
    return keyphrase

df['keyphrase_without_pos'] = df['content'].apply(get_keyphrase_bert)
Now if we print the dataframe, it looks like below:
index | headline | short_description | content | keyphrase_without_pos |
---|---|---|---|---|
0 | Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters | Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall. | Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters. Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall. | [(omicron targeted covid, 0.6294), (targeted covid boosters, 0.6243), (sleeves omicron targeted, 0.579), (covid boosters, 0.5739), (covid boosters health, 0.565)] |
1 | American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video | He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles. | American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video. He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles. | [(punching flight attendant, 0.6868), (flyer charged banned, 0.6703), (airlines flyer charged, 0.6263), (punching flight, 0.5982), (american airlines flyer, 0.5614)] |
The keyphrase_without_pos column has our top 5 keyphrases with their similarity scores.
Let's extract only the keyphrases without the scores, so that we can observe their structure better.
from keybert import KeyBERT

# create a KeyBERT model object that can be used to extract keyphrases
model = KeyBERT()

def get_keyphrase_bert(text):
    # extract keyphrases from the text using the KeyBERT model
    keyphrase = model.extract_keywords(text, keyphrase_ngram_range=(1, 3), stop_words='english', top_n=5)
    # keep only the phrase from each (phrase, score) pair
    return [i[0] for i in keyphrase]

df['keyphrase_without_pos'] = df['content'].apply(get_keyphrase_bert)
Let's display only the content and keyphrase_without_pos columns.
df[['content', 'keyphrase_without_pos']].head(3)
output
index | content | keyphrase_without_pos |
---|---|---|
0 | Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters. Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall. | [omicron targeted covid, targeted covid boosters, sleeves omicron targeted, covid boosters, covid boosters health] |
1 | American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video. He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles. | [punching flight attendant, flyer charged banned, airlines flyer charged, punching flight, american airlines flyer] |
2 | 23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23). "Until you have a dog you don't understand what could be eaten." | [tweets cats dogs, funniest tweets cats, tweets cats, 23 funniest tweets, cats dogs] |
If you look at the keyphrases, they are good, but some of them are not grammatically correct (for example, "sleeves omicron targeted" or "flyer charged banned").
I will show you how to apply a POS pattern with the KeyBERT model to get more grammatically correct keyphrases. Another problem with the above approach is that we need to define keyphrase_ngram_range up front. In our case, it was (1, 3), but it is hard to know which keyphrase_ngram_range will work best without rigorous experimentation.
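To see where phrases like "sleeves omicron targeted" come from, here is a rough sketch of n-gram candidate generation. It assumes candidates are built roughly the way scikit-learn's CountVectorizer builds them: tokenize, drop stop words, then slide an n-gram window over what is left. The `STOP_WORDS` set is a tiny hand-picked list for illustration (the real 'english' list is much longer), and `ngram_candidates` is a hypothetical helper, not a KeyBERT function.

```python
import re

# ASSUMPTION: a tiny hand-picked stop-word list, for illustration only.
STOP_WORDS = {"up", "for", "the", "of", "a", "an", "to", "it", "is"}

def ngram_candidates(text, ngram_range=(1, 3)):
    # Tokens of two or more word characters, so hyphens split words.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    # Stop words are dropped BEFORE n-grams are built, so words that were
    # not adjacent in the original sentence can end up in the same phrase.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]

headline = "Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters"

# The number of candidates grows with the upper bound of the range.
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    print(ngram_range, len(ngram_candidates(headline, ngram_range)))

print("sleeves omicron targeted" in ngram_candidates(headline))  # True
```

Because "For" is removed before the window slides, "Sleeves" and "Omicron-Targeted" become neighbours, which is how an ungrammatical candidate ends up in the list. The candidate count also changes with every choice of range, which is part of why picking the range is guesswork.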
Keyword/Keyphrase Extraction Using POS (part-of-speech) Pattern
To use POS patterns with our model, I am going to use the KeyphraseVectorizers library.
from keyphrase_vectorizers import KeyphraseCountVectorizer
from keybert import KeyBERT

kw_model = KeyBERT()

def get_keyPhrases_POS_BERT(text):
    # let the vectorizer pick candidates by part-of-speech pattern instead of a fixed n-gram range
    keyPhrases = kw_model.extract_keywords(docs=text, vectorizer=KeyphraseCountVectorizer(), top_n=5)
    # return only the keyphrases, without the scores
    return [phrase for phrase, score in keyPhrases]
Now, let's apply the get_keyPhrases_POS_BERT function to our content column like before, and see the results side by side.
from tqdm import tqdm
tqdm.pandas()  # registers progress_apply on pandas objects
df['keyphrase_with_pos'] = df['content'].progress_apply(get_keyPhrases_POS_BERT)
df[['content', 'keyphrase_without_pos', 'keyphrase_with_pos']].head(5)
output
index | content | keyphrase_without_pos | keyphrase_with_pos |
---|---|---|---|
0 | Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters. Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall. | [omicron targeted covid, targeted covid boosters, sleeves omicron targeted, covid boosters, covid boosters health] | [targeted covid boosters, new boosters, omicron, sleeves, doses] |
1 | American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video. He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles. | [punching flight attendant, flyer charged banned, airlines flyer charged, punching flight, american airlines flyer] | [punching flight attendant, american airlines flyer charged, aircraft, passengers, confrontation] |
2 | 23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23). "Until you have a dog you don't understand what could be eaten." | [tweets cats dogs, funniest tweets cats, tweets cats, 23 funniest tweets, cats dogs] | [funniest tweets, cats, dogs, dog, week] |
3 | The Funniest Tweets From Parents This Week (Sept. 17-23). "Accidentally put grown-up toothpaste on my toddler’s toothbrush and he screamed like I was cleaning his teeth with a Carolina Reaper dipped in Tabasco sauce." | [funniest tweets parents, toddler toothbrush screamed, toothpaste toddler, toothbrush screamed, grown toothpaste toddler] | [funniest tweets, tabasco sauce, toothpaste, teeth, parents] |
4 | Woman Who Called Cops On Black Bird-Watcher Loses Lawsuit Against Ex-Employer. Amy Cooper accused investment firm Franklin Templeton of unfairly firing her and branding her a racist after video of the Central Park encounter went viral. | [watcher loses lawsuit, cops black bird, black bird watcher, amy cooper accused, bird watcher loses] | [amy cooper, black bird, lawsuit, watcher, woman] |
If you compare keyphrase_without_pos and keyphrase_with_pos, you will see that the keyphrases/keywords in the last column read as proper noun phrases and are more grammatically correct. Moreover, we did not have to define the n-gram range explicitly.
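To give an intuition for what a POS pattern does, here is a minimal sketch of matching "zero or more adjectives followed by one or more nouns", which I believe corresponds to the library's default pattern `<J.*>*<N.*>+`. The tags below are assigned by hand for illustration; in the real pipeline they come from a POS tagger, so actual results can differ. The function name `noun_phrases` is hypothetical.

```python
# ASSUMPTION: hand-assigned Penn Treebank style tags, for illustration only.
tagged = [
    ("american", "JJ"), ("airlines", "NNS"), ("flyer", "NN"),
    ("banned", "VBN"), ("for", "IN"), ("life", "NN"),
    ("after", "IN"), ("punching", "VBG"),
    ("flight", "NN"), ("attendant", "NN"),
]

def noun_phrases(tagged_tokens):
    """Collect spans matching: zero or more adjectives, then one or more nouns."""
    phrases, adjectives, nouns = [], [], []

    def flush():
        # A span counts only if it contains at least one noun.
        if nouns:
            phrases.append(" ".join(adjectives + nouns))
        adjectives.clear()
        nouns.clear()

    for word, tag in tagged_tokens:
        if tag.startswith("N"):
            nouns.append(word)
        elif tag.startswith("J"):
            if nouns:          # an adjective after nouns starts a new span
                flush()
            adjectives.append(word)
        else:
            flush()            # any other tag ends the current span
    flush()
    return phrases

print(noun_phrases(tagged))
# ['american airlines flyer', 'life', 'flight attendant']
```

Since each candidate is a contiguous run of adjectives and nouns, its length is decided by the grammar of the sentence rather than by a fixed n-gram window.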
Code
You can find all the code in this GitHub repository.
Thanks for reading.
Author: Sadman Kabir Soumik