sharp

Reputation: 2158

Python: extract keywords row by row from csv

I am trying to extract keywords line by line from a csv file and create a keyword field. Right now I am able to get the full extraction. How do I get keywords for each row/field?

Data:

id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"

Code: This processes the entire text at once rather than row by row. Do I need to use something other than replace(r'\|', ' ')?

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

df = pd.read_csv('test-data.csv')
# print(df.head(5))

text_context = df['some_text'].str.lower().str.replace(r'\|', ' ').str.cat(sep=' ') # not put lower case?
print(text_context)
print('')
tokens=nltk.tokenize.word_tokenize(text_context)
word_dist = nltk.FreqDist(tokens)
stop_words = stopwords.words('english')
punctuations = ['(',')',';',':','[',']',',','!','?']
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
print(keywords)

Desired output:

id,some_text,new_keyword_field
1,What is the meaning of the word Himalaya?,"meaning,word,himalaya"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward","palindrome,word,phrase,sequence,reads,backward,forward"

Upvotes: 1

Views: 3459

Answers (1)

L.P. Whigley

Reputation: 126

Here is a clean way to add a new keywords column to your dataframe using pandas apply. You first define a function (get_keywords in our case), and apply then calls that function on each row or column.

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# I define the stop_words here so I don't do it every time in the function below
stop_words = set(stopwords.words('english'))  # a set makes the membership test below fast
# I've added the index_col='id' here to set your 'id' column as the index. This assumes that the 'id' is unique.
df = pd.read_csv('test-data.csv', index_col='id')  

Here we define the function that will be applied to each row using df.apply in the next step. The function get_keywords takes a row as its argument and returns a string of comma-separated keywords, as in your desired output above ("meaning,word,himalaya"). Within this function we lowercase the text, tokenize it, filter out punctuation with isalpha(), filter out our stop_words, and join the remaining keywords to form the desired output.

# This function will be applied to each row in our Pandas Dataframe
# See the docs for df.apply at: 
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
def get_keywords(row):
    some_text = row['some_text']
    lowered = some_text.lower()
    tokens = nltk.tokenize.word_tokenize(lowered)
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    keywords_string = ','.join(keywords)
    return keywords_string

Now that we have defined the function, we call df.apply(get_keywords, axis=1). This returns a pandas Series (similar to a list) with one keywords string per row. Since we want this Series to be part of our dataframe, we add it as a new column using df['keywords'] = df.apply(get_keywords, axis=1).

# applying the get_keywords function to our dataframe and saving the results
# as a new column in our dataframe called 'keywords'
# axis=1 means that we will apply get_keywords to each row and not each column
df['keywords'] = df.apply(get_keywords, axis=1)

Output: Dataframe after adding 'keywords' column
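
For anyone who wants to try the row-by-row idea without downloading the NLTK data first, here is a minimal, self-contained sketch. Several things in it are assumptions made for illustration and are not part of the answer above: the sample rows are inlined instead of read from test-data.csv, STOP_WORDS is a tiny hand-picked stand-in for stopwords.words('english'), and a regular expression replaces word_tokenize plus the isalpha() filter. It also applies the function to the single 'some_text' column with Series.apply rather than to whole rows, which should give the same result here because the function only needs that one field.

```python
import io
import re

import pandas as pd

# Stand-in for stopwords.words('english'); this tiny set is an assumption
# chosen to cover the sample rows, not NLTK's actual list.
STOP_WORDS = {"what", "is", "the", "of", "a", "or", "that", "as", "same"}

# The question's sample data, inlined instead of reading test-data.csv.
CSV_TEXT = '''id,some_text
1,"What is the meaning of the word Himalaya?"
2,"Palindrome is a word, phrase, or sequence that reads the same backward as forward"
'''


def get_keywords(text):
    # Regex stand-in for word_tokenize plus the isalpha() filter:
    # lowercase the text and keep runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return ",".join(token for token in tokens if token not in STOP_WORDS)


df = pd.read_csv(io.StringIO(CSV_TEXT), index_col="id")

# Series.apply calls get_keywords once per value, so no axis argument is needed.
df["new_keyword_field"] = df["some_text"].apply(get_keywords)

# to_csv() with no path returns the CSV text; pass a filename to write a file.
print(df.to_csv())
```

With the real NLTK stop word list and tokenizer swapped back in, the keywords may differ slightly from this sketch, since the stand-in set only covers the words in these two rows.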

Upvotes: 7
