Reputation: 838
I have a list of bigrams.
I have a pandas dataframe containing a row for each document in my corpus. What I am looking to do is get the bigrams from my list that match each document into a new column of my dataframe.
What is the best way to accomplish this task? I have been searching Stack Overflow but haven't found an answer specific to my case. I need the new column to contain every bigram found from my bigram list.
Any help would be appreciated!
The output below is what I am looking for, although in my real example I have removed stop words, so exact bigrams like the ones below aren't always found. Is there a way to do this with some sort of string contains, maybe?
import pandas as pd
import numpy as np

data = [['help me with my python pandas please'],
        ['machine learning is fun using svd with sklearn']]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Message'])

bigrams = [('python', 'pandas'),
           ('function', 'input'),
           ('help', 'jupyter'),
           ('sklearn', 'svd')]

def matcher(x):
    # Return the first bigram whose space-joined form appears in the message
    # (note: this only returns one match, not every bigram found)
    for i in bigrams:
        if ' '.join(i).lower() in x.lower():
            return i
    return np.nan

df['Match'] = df['Message'].apply(matcher)
df
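For reference, a minimal sketch of the "every bigram found" behaviour using plain substring containment (this assumes the two words of a bigram appear adjacently, so ('sklearn', 'svd') will not match "svd with sklearn" — exactly the stop-word problem mentioned above):

```python
import pandas as pd

data = [['help me with my python pandas please'],
        ['machine learning is fun using svd with sklearn']]
df = pd.DataFrame(data, columns=['Message'])

bigrams = [('python', 'pandas'), ('function', 'input'),
           ('help', 'jupyter'), ('sklearn', 'svd')]

def match_all(message):
    # Keep every bigram whose space-joined form occurs in the message
    text = message.lower()
    return [bg for bg in bigrams if ' '.join(bg) in text]

df['Match'] = df['Message'].apply(match_all)
```

The second message matches nothing here, because "svd with sklearn" is not an exact adjacent bigram.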
Upvotes: 2
Views: 2096
Reputation: 150725
This is what I would do:
# a sample, which you should've given
df = pd.DataFrame({'sentences': ['I like python pandas',
                                 'find all function input from help jupyter',
                                 'this has no bigrams']})

# the bigrams
bigrams = [('python', 'pandas'),
           ('function', 'input'),
           ('help', 'jupyter'),
           ('sklearn', 'svd')]

# create one big regex pattern:
pat = '|'.join(' '.join(x) for x in bigrams)
new_df = df.sentences.str.findall(pat)
gives you
0 [python pandas]
1 [function input, help jupyter]
2 []
Name: sentences, dtype: object
Then you can choose to unnest the list in each row.
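For instance, Series.explode (available since pandas 0.25) puts one matched bigram per row, with empty lists becoming NaN. A sketch built on the same pattern:

```python
import pandas as pd

df = pd.DataFrame({'sentences': ['I like python pandas',
                                 'find all function input from help jupyter',
                                 'this has no bigrams']})
bigrams = [('python', 'pandas'), ('function', 'input'),
           ('help', 'jupyter'), ('sklearn', 'svd')]
pat = '|'.join(' '.join(x) for x in bigrams)

# One matched bigram per row; rows with no match become NaN
exploded = df.sentences.str.findall(pat).explode()
```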
Or you can use get_dummies:
new_df.str.join(',').str.get_dummies(sep=',')
which gives you:
   function input  help jupyter  python pandas
0               0             0              1
1               1             1              0
2               0             0              0
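One caveat with this approach: str.findall treats the joined bigrams as a regular expression, so bigram words containing regex metacharacters should be escaped, e.g. with re.escape (the 'c++' bigram here is just an illustrative assumption):

```python
import re

# 'c++' contains regex metacharacters, so escape each joined bigram
bigrams = [('python', 'pandas'), ('c++', 'stl')]
pat = '|'.join(re.escape(' '.join(x)) for x in bigrams)
```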
Upvotes: 3
Reputation: 2868
flashtext can also be used to solve this problem:
import pandas as pd
from flashtext import KeywordProcessor
from nltk.corpus import stopwords

stop = stopwords.words('english')

bigram_token = ['python pandas', 'function input', 'help jupyter', 'svd sklearn']

data = [['help me with my python pandas please'],
        ['machine learning is fun using svd with sklearn']]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Message'])

kp = KeywordProcessor()
kp.add_keywords_from_list(bigram_token)

def bigram_finder(x, stop, kp):
    # Drop stop words so the bigram words become adjacent, then extract keywords
    token = x.split()
    sent = ' '.join([w for w in token if w not in stop])
    return kp.extract_keywords(sent)

df['bigram_token'] = df['Message'].apply(lambda x: bigram_finder(x, stop, kp))
# output
0 [python pandas]
1 [svd sklearn]
Name: bigram_token, dtype: object
Upvotes: 1
Reputation: 151
Well, here's my solution, featuring bigram term detection in cleaned utterances (sentences).
It can easily be generalized to n-grams, and it takes stop words into account.
You can tune the stop-word set and the target n-gram length (target_depth).
Please note that this implementation is recursive.
import pandas as pd
import re
from nltk.corpus import stopwords

data = [
    ['help me with my python pandas please'],
    ['machine learning is fun using svd with sklearn'],
    ['please use |svd| with sklearn, get help on JupyteR!']
]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Message'])

bigrams = [
    ('python', 'pandas'),
    ('function', 'input'),
    ('help', 'jupyter'),
    ('svd', 'sklearn')
]

stop_words = set(stopwords.words('english'))
sep = ' '

def _cleanup_token(w):
    """ Cleanup a token by stripping special chars """
    return re.sub('[^A-Za-z0-9]+', '', w)

def _preprocessed_tokens(x):
    """ Preprocess a sentence: lowercase, split, clean each token. """
    return list(map(lambda w: _cleanup_token(w), x.lower().split(sep)))

def _match_bg_term_in_sentence(bg, x, depth, target_depth=2):
    """ Recursively match the terms of bigram bg in sentence x. """
    if depth == target_depth:
        return True  # the whole bigram was matched

    term = bg[depth].lower()
    pp_tokens = _preprocessed_tokens(x)

    if term in pp_tokens:
        bg_idx = pp_tokens.index(term)
        if depth > 0 and any([token not in stop_words for token in pp_tokens[0:bg_idx]]):
            return False  # a non-stop word separates the terms: no bigram
        x = sep.join(pp_tokens[bg_idx+1:])
        return _match_bg_term_in_sentence(bg, x, depth+1, target_depth=target_depth)
    else:
        return False

def matcher(x):
    """ Return the list of bigrams matched in sentence x """
    matchs = []
    for bg in bigrams:
        if _match_bg_term_in_sentence(bg, x, depth=0, target_depth=2):
            matchs.append(bg)
    return matchs

df['Match'] = df['Message'].apply(matcher)
print(df.head())
We actually obtain these results:
Match
0 [(python, pandas)]
1 [(svd, sklearn)]
2 [(help, jupyter), (svd, sklearn)]
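To illustrate the n-gram generalization, here is a self-contained variant of the same recursive idea. The hand-picked stop-word set is an assumption (standing in for NLTK's English list so the sketch runs without downloads), and target_depth defaults to the n-gram's own length:

```python
import re

# Hand-picked stop words (assumption; NLTK's English list would be used in practice)
stop_words = {'me', 'with', 'my', 'is', 'fun', 'using', 'please', 'on', 'get', 'use'}
sep = ' '

def _preprocessed_tokens(x):
    # Lowercase, split, and strip special characters from each token
    return [re.sub('[^A-Za-z0-9]+', '', w) for w in x.lower().split(sep)]

def match_ngram(ng, x, depth=0, target_depth=None):
    # Recursively match successive n-gram terms, allowing only stop words between them
    if target_depth is None:
        target_depth = len(ng)
    if depth == target_depth:
        return True  # every term was matched
    pp_tokens = _preprocessed_tokens(x)
    term = ng[depth].lower()
    if term not in pp_tokens:
        return False
    idx = pp_tokens.index(term)
    if depth > 0 and any(t not in stop_words for t in pp_tokens[:idx]):
        return False  # a non-stop word separates the terms
    return match_ngram(ng, sep.join(pp_tokens[idx + 1:]), depth + 1, target_depth)
```

The same function handles trigrams (or longer n-grams) unchanged, since target_depth follows the length of the tuple passed in.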
Hope this helps!
Upvotes: 1