Reputation: 1728
I have 2 pandas series of dictionaries as follows:
series_1 = [{'id': 'testProd_1', 'q1': 'Foo1', 'q2': 'Bar1'},
            {'id': 'testProd_2', 'q1': 'Foo2', 'q2': 'Bar2'},
            {'id': 'testProd_3', 'q1': 'Foo3', 'q2': 'Bar3'},
            {'id': 'testProd_5', 'q1': 'Foo5', 'q2': 'Bar5'}]
series_2 = [{'q1': 'Foo1', 'q2': 'Bar1'},
            {'q1': 'Foo2', 'q2': 'Bar2'},
            {'q1': 'Foo3', 'q2': 'Bar3'},
            {'q1': 'Foo4', 'q2': 'Bar4'},
            {'q1': 'Foo5', 'q2': 'Bar{5}'}]
I am trying to compare the two pandas series and provide the id from series_1 to all matching series_2 dicts.
expected_result = [{'id': 'testProd_1', 'q1': 'Foo1', 'q2': 'Bar1'},
                   {'id': 'testProd_2', 'q1': 'Foo2', 'q2': 'Bar2'},
                   {'id': 'testProd_3', 'q1': 'Foo3', 'q2': 'Bar3'},
                   {'id': 'testProd_5', 'q1': 'Foo5', 'q2': 'Bar{5}'}]
Series.equals doesn't work since one series has an additional key-value pair ('id') in each dict. Do I have to loop through each individual entry? What is the most efficient approach to get the expected_result?
I am working with 2 large datasets where I am trying to link the id from one series to the other. The data is basically the same, but sometimes the values in some of the key-value pairs have erroneous characters (ex: {5}, (5), {ex.5}).
Any suggestions?
Thanks
Upvotes: 2
Views: 417
Reputation: 1728
Thanks for all the feedback.
I used a combination of the answers above to come to the solution that worked for me.
series_2 had too many q1 and q2 values with erroneous characters (ex: '{', '.', '}', etc.) as well as a mix of upper and lower case.
I first used apply with a function that makes the values all lowercase and removes the special characters using replace.
# Creates a uniform value string
def getTrueString(valString):
    trueString = valString.lower()
    remove_specialChrs = [' ', '{', '}', 'ex.']
    for char in remove_specialChrs:
        trueString = trueString.replace(char, '')
    return trueString.strip()
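As a quick sanity check, the cleaner maps the messy variants from the question onto one canonical string (a self-contained copy of the function above):

```python
# Canonicalise a value string: lowercase, then strip spaces,
# braces, and 'ex.' fragments so the variants compare equal.
def getTrueString(valString):
    trueString = valString.lower()
    remove_specialChrs = [' ', '{', '}', 'ex.']
    for char in remove_specialChrs:
        trueString = trueString.replace(char, '')
    return trueString.strip()

print(getTrueString('Bar5'))      # bar5
print(getTrueString('Bar{5}'))    # bar5
print(getTrueString('Bar{ex.5}')) # bar5
```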
From there, I applied it to my 2 series (assume that I converted them into DataFrames):
series_1['trueString'] = series_1['valString'].apply(getTrueString)
series_2['trueString'] = series_2['valString'].apply(getTrueString)
Now, since the trueString is clean (lower case and all special characters removed), I then used pandas merge as suggested by Scott Boston and Daneolog in the posts above.
joined_data = pd.merge(series_2, series_1, on='trueString', how='left')
The resulting DataFrame shows all matches where trueString is the same; rows without a match stay empty. This is because I went with a left join (you can also use right and swap the two input frames) instead of an inner join, since I wanted to see all series_2 data regardless of whether an id was found.
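Putting it together, a minimal sketch of the whole pipeline (the column name valString and the toy data here are assumed for illustration, not the actual schema):

```python
import pandas as pd

# Same cleaner as above: lowercase, strip spaces/braces/'ex.' fragments.
def getTrueString(valString):
    trueString = valString.lower()
    for char in [' ', '{', '}', 'ex.']:
        trueString = trueString.replace(char, '')
    return trueString.strip()

series_1 = pd.DataFrame([{'id': 'testProd_1', 'valString': 'Foo1'},
                         {'id': 'testProd_5', 'valString': 'Foo5'}])
series_2 = pd.DataFrame([{'valString': 'Foo1'},
                         {'valString': 'Foo4'},
                         {'valString': 'Foo{5}'}])

series_1['trueString'] = series_1['valString'].apply(getTrueString)
series_2['trueString'] = series_2['valString'].apply(getTrueString)

# Left join keeps every series_2 row; unmatched rows get NaN for id.
joined_data = pd.merge(series_2, series_1, on='trueString', how='left')
print(joined_data[['valString_x', 'id']])
```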
Hope this helps.
Upvotes: 0
Reputation: 153460
You can use pandas like this:
pd.DataFrame(series_1)[['id','q1']].merge(pd.DataFrame(series_2), on=['q1']).to_dict('records')
Output:
[{'id': 'testProd_1', 'q1': 'Foo1', 'q2': 'Bar1'},
{'id': 'testProd_2', 'q1': 'Foo2', 'q2': 'Bar2'},
{'id': 'testProd_3', 'q1': 'Foo3', 'q2': 'Bar3'},
{'id': 'testProd_5', 'q1': 'Foo5', 'q2': 'Bar{5}'}]
pandas will create a cartesian product for a one-to-many or many-to-many join, hence you will see all the combinations:
df1.merge(df2, on=['q1'])
Output:
id q1 q2_x q2_y
0 testProd_1 Foo1 Bar1 Bar1
1 testProd_2 Foo2 Bar2 Bar2
2 testProd_3 Foo3 Bar3 Bar3
3 testProd_5 Foo5 Bar5 Bar{5}
4 testProd_5 Foo5 Bar5 Bar{6}
5 testProd_6 Foo5 Bar6 Bar{5}
6 testProd_6 Foo5 Bar6 Bar{6}
To avoid the duplicates, you can create a cumcount so that row one in df1 joins to row one in df2, like this:
df1m = df1.assign(mergekey=df1.groupby('q1').cumcount())
df2m = df2.assign(mergekey=df2.groupby('q1').cumcount())
df1m.merge(df2m, on=['q1','mergekey'])
Output:
id q1 q2_x mergekey q2_y
0 testProd_1 Foo1 Bar1 0 Bar1
1 testProd_2 Foo2 Bar2 0 Bar2
2 testProd_3 Foo3 Bar3 0 Bar3
3 testProd_5 Foo5 Bar5 0 Bar{5}
4 testProd_6 Foo5 Bar6 1 Bar{6}
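For reference, a self-contained version of the cumcount trick, with df1 and df2 reconstructed from the outputs shown above:

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['testProd_1', 'testProd_2', 'testProd_3',
                           'testProd_5', 'testProd_6'],
                    'q1': ['Foo1', 'Foo2', 'Foo3', 'Foo5', 'Foo5'],
                    'q2': ['Bar1', 'Bar2', 'Bar3', 'Bar5', 'Bar6']})
df2 = pd.DataFrame({'q1': ['Foo1', 'Foo2', 'Foo3', 'Foo4', 'Foo5', 'Foo5'],
                    'q2': ['Bar1', 'Bar2', 'Bar3', 'Bar4', 'Bar{5}', 'Bar{6}']})

# cumcount numbers repeated q1 values 0, 1, 2, ... within each group,
# so the first Foo5 in df1 pairs only with the first Foo5 in df2.
df1m = df1.assign(mergekey=df1.groupby('q1').cumcount())
df2m = df2.assign(mergekey=df2.groupby('q1').cumcount())
out = df1m.merge(df2m, on=['q1', 'mergekey'])
print(out)
```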
Upvotes: 1
Reputation: 286
So it seems like what you want to use is merge. From what I understood, you want to find the inner join of the two dataframes on the 'q1' key. If so, then merge is definitely the right function for you. It's used in the following way:
series_join = series_1.merge(series_2, on='q1')
With this, it'll find the intersection of q1 and only select matching data pairs. If you do want to join on both q1 and q2, you can simply pass in an array here (although this won't give your desired output, since Bar5 can't be compared to Bar{5}, unfortunately):
series_join = series_1.merge(series_2, on=['q1', 'q2'])
As far as clearing erroneous values from your data so that they can be compared in this way, I would recommend first doing a cleaning step, since the main merge step does not have much customization regarding how you're comparing data values.
The output will include a set of duplicate columns, but you can simply ignore those columns anyway:
id q1 q2_x q2_y
0 testProd_1 Foo1 Bar1 Bar1
1 testProd_2 Foo2 Bar2 Bar2
2 testProd_3 Foo3 Bar3 Bar3
3 testProd_5 Foo5 Bar5 Bar{5}
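As for that cleaning step, one possible sketch (assuming the erroneous characters are limited to braces, parentheses, whitespace, and 'ex.' fragments) is a vectorised str.replace before the merge:

```python
import pandas as pd

df1 = pd.DataFrame([{'id': 'testProd_5', 'q1': 'Foo5', 'q2': 'Bar5'}])
df2 = pd.DataFrame([{'q1': 'Foo5', 'q2': 'Bar{5}'}])

# Normalise q2 before joining: lowercase, then strip 'ex.' fragments,
# braces, parentheses, and whitespace with a single regex.
for df in (df1, df2):
    df['q2_clean'] = df['q2'].str.lower().str.replace(r'ex\.|[{}()\s]', '',
                                                      regex=True)

joined = df1.merge(df2, on=['q1', 'q2_clean'])
print(joined[['id', 'q1', 'q2_y']])
```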
EDIT: Keeping duplicates
The default functionality of merge is that it will keep all duplicate keys in both tables. The issue with manipulating duplicates here is that pandas doesn't know which row is the intended lookup row, so it'll simply create a pair for each combination. As in the following example (series 1, 2, then joined):
id q1 q2
0 testProd_1 Foo1 Bar1
1 testProd_2 Foo2 Bar2
2 testProd_3 Foo3 Bar3
3 testProd_5 Foo5 Bar5
4 testProd_6 Foo5 Bar6
q1 q2
0 Foo1 Bar1
1 Foo2 Bar2
2 Foo3 Bar3
3 Foo4 Bar4
4 Foo5 Bar{5}
5 Foo5 Bar{6}
id q1 q2_y
0 testProd_1 Foo1 Bar1
1 testProd_2 Foo2 Bar2
2 testProd_3 Foo3 Bar3
3 testProd_5 Foo5 Bar{5} <<< [3 testProd_5 Foo5 Bar5] + [4 Foo5 Bar{5}]
4 testProd_5 Foo5 Bar{6} <<< [3 testProd_5 Foo5 Bar5] + [5 Foo5 Bar{6}]
5 testProd_6 Foo5 Bar{5} <<< [4 testProd_6 Foo5 Bar6] + [4 Foo5 Bar{5}]
6 testProd_6 Foo5 Bar{6} <<< [4 testProd_6 Foo5 Bar6] + [5 Foo5 Bar{6}]
Therefore, there isn't a simple way to say "pick the first row of the 2nd table", but what you can do is simply remove duplicates in the 2nd table beforehand with a function like drop_duplicates.
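For example, dropping duplicate join keys in the second table first (keep='first' is the default and keeps the first occurrence):

```python
import pandas as pd

df2 = pd.DataFrame({'q1': ['Foo5', 'Foo5', 'Foo1'],
                    'q2': ['Bar{5}', 'Bar{6}', 'Bar1']})

# Keep only the first row per join key so a later merge stays one-to-one.
df2_unique = df2.drop_duplicates(subset='q1', keep='first')
print(df2_unique)
```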
Upvotes: 1