Reputation: 17
I have two pandas dataframes: one (movie_review_df) with a single column of open-text movie reviews, and another (movie_ngrams_df) with the most common ngrams (top 5 for ngram = 1 and top 5 for ngram = 2) found within movie_review_df.
I would essentially like to write a function that iterates over every row of the word/word phrase column in my movie_ngrams_df and uses each value as a lookup to find the reviews that contain that word or word phrase.
Imagine my movie_ngrams_df has 2 rows across 2 columns:
1) The word 'love' in column a (ngram_wordphrase) and 'one' in column b (ngram_group)
2) The phrase 'too long' in column a and 'two' in column b
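For concreteness, a toy version of the two frames could be built like this. The review strings are made up for illustration and are not from my real data; the column names are the ones described above.

```python
import pandas as pd

# Hypothetical sample reviews, invented purely for illustration
movie_review_df = pd.DataFrame({
    "reviews": [
        "I love this movie",
        "Great cast but far too long",
        "Would love it if it were not too long",
        "Boring",
    ]
})

# The two example ngram rows described above
movie_ngrams_df = pd.DataFrame({
    "ngram_wordphrase": ["love", "too long"],
    "ngram_group": ["one", "two"],
})
```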
I think a function that uses a loop and a .str.contains() call will work, but I can't seem to wrap my head around it.
This is somewhat how I would want it to work.
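import pandas as pd

def ngram_lookup(ngrams, reviews_df):
    # For each ngram, keep the reviews that contain it and tag them with that ngram
    matches = [reviews_df[reviews_df['reviews'].str.contains(word, regex=False)].assign(ngram_wordphrase=word)
               for word in ngrams]
    return pd.concat(matches, ignore_index=True)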
I want a function that searches every review in movie_review_df and pulls out the reviews that contain the word 'love'. The output should be a new df (ngram_detail_df) in which each row holds the word_phrase (e.g. 'love') in column a and the full text of one review containing it in column b. Each word_phrase will therefore likely be listed multiple times in column a.
THEN (you knew it was coming) I want to be able to do the same thing for the next word_phrase in our movie_ngrams_df which was 'too long'. I want to append these new 'too long' results to the results returned from our 'love' search so that at the end, we have just one df containing the top word_phrases and each movie review where that word/word_phrase is present.
Upvotes: 0
Views: 176
Reputation: 234
What about something like this?
words = movie_ngrams_df["ngram_wordphrase"].array
ngram_detail_df = movie_review_df.copy()
for word in words:
    # One boolean column per ngram: True where the review contains it
    ngram_detail_df[word] = ngram_detail_df["reviews"].apply(lambda x: word in x)
# Reshape to long form: one row per (review, ngram) pair
ngram_detail_df = ngram_detail_df.melt(id_vars=["reviews"])
ngram_detail_df = ngram_detail_df[ngram_detail_df["value"] == True]
ngram_detail_df = ngram_detail_df.loc[:, ["reviews", "variable"]]
ngram_detail_df = ngram_detail_df.rename(columns={"variable": "ngram"})
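If you would rather not create one column per ngram, another option might be a cross join followed by a filter. This is only a sketch under a few assumptions: pandas 1.2 or later (for how="cross"), a match meaning a plain case-sensitive substring (so 'love' also hits 'lovely'), and sample reviews that are invented here for illustration.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration
movie_review_df = pd.DataFrame({
    "reviews": [
        "I love this movie",
        "Great cast but far too long",
        "Would love it if it were not too long",
        "Boring",
    ]
})
movie_ngrams_df = pd.DataFrame({
    "ngram_wordphrase": ["love", "too long"],
    "ngram_group": ["one", "two"],
})

# Pair every ngram with every review (how="cross" needs pandas >= 1.2)
pairs = movie_ngrams_df.merge(movie_review_df, how="cross")

# Keep only the pairs where the ngram is a literal substring of the review
mask = [
    ngram in review
    for ngram, review in zip(pairs["ngram_wordphrase"], pairs["reviews"])
]
ngram_detail_df = pairs.loc[mask, ["ngram_wordphrase", "reviews"]].reset_index(drop=True)
```

The intermediate frame has (number of ngrams) x (number of reviews) rows, which should be fine for ten ngrams but is worth keeping in mind for much larger lookup lists.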
Upvotes: 1