Reputation: 19
I have two DataFrames with movie review data from two different platforms (id, title, review, etc). All rows about a particular movie need to be removed from one DataFrame if that movie has not been reviewed in the other DataFrame. Here's an example:
import pandas as pd
data1 = [[1, 'Great movie!', 'Spiderman'], [1, 'Not my preference', 'Spiderman'], [2, 'Just average...', 'Captain America'], [4, 'Tolerable', 'Avengers']]
data2 = [[1, 'Did not think much of this', 'Spiderman'], [2, 'Great in my opinion!', 'Captain America'], [3, 'Could not finish', 'Batman Returns']]
df1 = pd.DataFrame(data1, columns = ['id', 'review', 'movie title'])
df2 = pd.DataFrame(data2, columns = ['id', 'review', 'movie title'])
df1.insert(3, "isValid", pd.Series(df1.id.isin(df2.id).values.astype(bool)))
df1 = df1[df1.isValid != False]
I'm wondering if there's a more efficient way to do this?
Thanks in advance for any help!
Upvotes: 1
Views: 2497
Reputation: 796
If you want to have the information in df1 of 'isValid'
you can do this:
df1["isValid"] = df1.id.isin(df2.id)
new_df = df1.loc[df1.isValid == True]
id review movie title isValid
0 1 Great movie! Spiderman True
1 1 Not my preference Spiderman True
2 2 Just average... Captain America True
But if you don't care about 'isValid'
and just used it in your answer for selection you can simply do this:
new_df = df1.loc[df1.id.isin(df2.id)]
id review movie title
0 1 Great movie! Spiderman
1 1 Not my preference Spiderman
2 2 Just average... Captain America
Upvotes: 3
Reputation: 79388
you are looking for merge
function. This will drop all the ones not seen from both df1
and df2
.
df1.merge(df2,on=["id","movie title"])
Out:
id review_x movie title review_y
0 1 Great movie! Spiderman Did not think much of this
1 1 Not my preference Spiderman Did not think much of this
2 2 Just average... Captain America Great in my opinion!
your df1
id the id ,review_x ,movie title
and df2
is id, movie title review_y
Upvotes: 0