Reputation: 258
I am looking for some assistance with pandas. I am trying to compare the "email" column for any matching and then concatenate the extra columns into a separate csv.
I have the following:
file1.csv
email, user_id
[email protected], 1
[email protected], 24
[email protected], 12
[email protected], 31
file2.csv
email, group_id
[email protected], g1
[email protected], g1
[email protected], g2
[email protected], g3
output.csv:
email, user_id, group_id
[email protected], 1, g1
[email protected], 24, g1
[email protected], 12, g2
I appreciate any assistance provided. Thank You.
Upvotes: 1
Views: 322
Reputation: 186
As far as I understood, you were trying to achieve something like that:
import pandas as pd
csv1 = pd.DataFrame({"email":["[email protected]", "[email protected]", "[email protected]", "[email protected]"],"user_id":[1,24,12,31]})
csv2 = pd.DataFrame({"email":["[email protected]", "[email protected]", "[email protected]", "[email protected]"],"group_id":["g1","g1","g2","g3"]})
csv3 = pd.DataFrame({"email":["[email protected]", "[email protected]", "[email protected]"],"user_id":[1,24,12],"group_id":["g1","g1","g2"]})
list_with_dataframes = [csv2,csv3]
result = csv1
for dataframe in list_with_dataframes:
result = result.merge(dataframe, how = "outer")
print(result)
And that gives an output:
email user_id group_id
0 [email protected] 1.0 g1
1 [email protected] 24.0 g1
2 [email protected] 12.0 g2
3 [email protected] 31.0 NaN
4 [email protected] NaN g3
I hope this is what you are looking for. All matching data is retained, all missing data is changed with NaN. If you want only fully matched data to be retained simply replace "outer" with "inner" in this line:
result = result.merge(dataframe, how = "inner")
That gives an output:
email user_id group_id
0 [email protected] 1 g1
1 [email protected] 24 g1
2 [email protected] 12 g2
Upvotes: 2
Reputation: 442
Try Merge
import pandas as pd
f1 = pd.read_csv('f1.csv')
f2 = pd.read_csv('f2.csv')
out = pd.merge(f1,f2,on='email',how='inner')
print(out)
out.to_csv("final.csv", index=False)
The output is
email user_id group_id
0 [email protected] 1 g1
1 [email protected] 24 g1
2 [email protected] 12 g2
Upvotes: 3
Reputation: 1441
Yep, this is as simple as doing a merge:
output = pd.merge(left=file1, right=file2, on='email', how='inner')
Setting 'how' to 'inner' ensures that only the matches in both files are kept
Upvotes: 1