Ray

Reputation: 1728

How do you compare two pandas series based on specific key value pair?

I have 2 pandas series of dictionaries as follows:

series_1 = [{'id': 'testProd_1', 'q1':'Foo1', 'q2': 'Bar1'},
            {'id': 'testProd_2', 'q1':'Foo2', 'q2': 'Bar2'},
            {'id': 'testProd_3', 'q1':'Foo3', 'q2': 'Bar3'},
            {'id': 'testProd_5', 'q1':'Foo5', 'q2': 'Bar5'}
 ]
series_2 = [{'q1':'Foo1', 'q2': 'Bar1'},
            {'q1':'Foo2', 'q2': 'Bar2'}, 
            {'q1':'Foo3', 'q2': 'Bar3'}, 
            {'q1':'Foo4', 'q2': 'Bar4'}, 
            {'q1':'Foo5', 'q2': 'Bar{5}'}]

I am trying to compare the two pandas series and provide the id from series_1 to all matching series_2 dicts.

expected_result = [{'id': 'testProd_1', 'q1':'Foo1', 'q2': 'Bar1'},
                   {'id': 'testProd_2', 'q1':'Foo2', 'q2': 'Bar2'},
                   {'id': 'testProd_3', 'q1':'Foo3', 'q2': 'Bar3'},
                   {'id': 'testProd_5', 'q1':'Foo5', 'q2': 'Bar{5}'}]

Series.equals doesn't work since one series has an additional key-value pair ('id') in each dict. Do I have to loop through each individual entry? What is the most efficient approach to get the expected_result?

I am working with 2 large datasets where I am trying to link the id from one series to another. The data is basically the same, but sometimes the values in some of the key-value pairs contain erroneous characters (ex: {5}, (5), {ex.5}).

Any suggestions?

Thanks

Upvotes: 2

Views: 417

Answers (3)

Ray

Reputation: 1728

Thanks for all the feedback.

I used a combination of the answers above to come to the solution that worked for me.

series_2 had many q1 and q2 values with erroneous characters (ex: '{', '.', '}', etc.) as well as a mix of upper and lower case.

I first used apply to lowercase the values and strip the special characters with replace.

# Creates a uniform value string
def getTrueString(valString):

    trueString = valString.lower()
    remove_specialChrs = [' ', '{', '}', 'ex.']

    for char in remove_specialChrs:
        trueString = trueString.replace(char, '')

    return trueString.strip()

From there, I applied it to my two series (assume I had converted them into DataFrames):

series_1['trueString'] = series_1['valString'].apply(getTrueString)
series_2['trueString'] = series_2['valString'].apply(getTrueString)

Now, since trueString is clean (lowercase, all special characters removed), I used pandas merge as suggested by Scott Boston and Daneolog in the answers above.

joined_data = pd.merge(series_2, series_1, on='trueString', how='left')

The resulting DataFrame shows all matches where trueString is the same; rows with no match stay empty. I went with a left join (you can also use a right join and swap the two input frames) instead of an inner join because I wanted to see all series_2 data regardless of whether an id was found.
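Putting the steps together, here is a minimal runnable sketch. The toy data and the valString column name are assumptions for illustration; substitute your own frames and column names.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
series_1 = pd.DataFrame({'id': ['testProd_5'], 'valString': ['Bar5']})
series_2 = pd.DataFrame({'valString': ['Bar{5}']})

# Creates a uniform value string: lowercase, special characters removed.
def getTrueString(valString):
    trueString = valString.lower()
    for char in [' ', '{', '}', 'ex.']:
        trueString = trueString.replace(char, '')
    return trueString.strip()

series_1['trueString'] = series_1['valString'].apply(getTrueString)
series_2['trueString'] = series_2['valString'].apply(getTrueString)

# Left join keeps every series_2 row, matched or not.
joined_data = pd.merge(series_2, series_1, on='trueString', how='left')
```

Here 'Bar5' and 'Bar{5}' both normalize to 'bar5', so the id carries across despite the erroneous characters.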

Hope this helps.

Upvotes: 0

Scott Boston

Reputation: 153460

You can use pandas like this:

pd.DataFrame(series_1)[['id','q1']].merge(pd.DataFrame(series_2), on=['q1']).to_dict('records')

Output:

[{'id': 'testProd_1', 'q1': 'Foo1', 'q2': 'Bar1'},
 {'id': 'testProd_2', 'q1': 'Foo2', 'q2': 'Bar2'},
 {'id': 'testProd_3', 'q1': 'Foo3', 'q2': 'Bar3'},
 {'id': 'testProd_5', 'q1': 'Foo5', 'q2': 'Bar{5}'}]

Update with new data in question

pandas will create a Cartesian product for a one-to-many or many-to-many join, hence you will see all the combinations.

df1.merge(df2, on=['q1'])

Output:

           id    q1  q2_x    q2_y
0  testProd_1  Foo1  Bar1    Bar1
1  testProd_2  Foo2  Bar2    Bar2
2  testProd_3  Foo3  Bar3    Bar3
3  testProd_5  Foo5  Bar5  Bar{5}
4  testProd_5  Foo5  Bar5  Bar{6}
5  testProd_6  Foo5  Bar6  Bar{5}
6  testProd_6  Foo5  Bar6  Bar{6}

Without Duplicates

To avoid the duplicates, you can create a cumcount so that row one in df1 joins to row one in df2, like this:

df1m = df1.assign(mergekey=df1.groupby('q1').cumcount())
df2m = df2.assign(mergekey=df2.groupby('q1').cumcount())
df1m.merge(df2m, on=['q1','mergekey'])

Output:

           id    q1  q2_x  mergekey    q2_y
0  testProd_1  Foo1  Bar1         0    Bar1
1  testProd_2  Foo2  Bar2         0    Bar2
2  testProd_3  Foo3  Bar3         0    Bar3
3  testProd_5  Foo5  Bar5         0  Bar{5}
4  testProd_6  Foo5  Bar6         1  Bar{6}
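To see the cumcount trick in isolation, here is a runnable sketch using just the duplicated Foo5 rows (toy data assumed from the example above):

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['testProd_5', 'testProd_6'],
                    'q1': ['Foo5', 'Foo5'],
                    'q2': ['Bar5', 'Bar6']})
df2 = pd.DataFrame({'q1': ['Foo5', 'Foo5'],
                    'q2': ['Bar{5}', 'Bar{6}']})

# cumcount numbers repeated keys 0, 1, 2, ... within each q1 group,
# so the first Foo5 in df1 pairs only with the first Foo5 in df2.
df1m = df1.assign(mergekey=df1.groupby('q1').cumcount())
df2m = df2.assign(mergekey=df2.groupby('q1').cumcount())
result = df1m.merge(df2m, on=['q1', 'mergekey'])
```

Instead of the 4-row Cartesian product, this yields exactly two rows, one per positional pair.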

Upvotes: 1

Daneolog

Reputation: 286

So it seems like what you want is merge. From what I understood, you want the inner join of the two DataFrames on the 'q1' key. If so, merge is definitely the right function. It's used in the following way:

series_join = series_1.merge(series_2, on='q1')

With this, it'll find the intersection on q1 and select only matching data pairs. If you do want to join on both q1 and q2, you can simply pass in a list here (although this won't give your desired output, since Bar5 can't be matched to Bar{5}, unfortunately):

series_join = series_1.merge(series_2, on=['q1', 'q2'])

As far as clearing erroneous values from your data so that they can be compared in this way, I would recommend first doing a cleaning step, since the main merge step does not have much customization regarding how you're comparing data values.
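One way to do such a cleaning step is pandas' vectorized string methods. The regex pattern below is an assumption based on the examples in the question ({5}, (5), {ex.5}); adjust it to whatever junk your data actually contains:

```python
import pandas as pd

df = pd.DataFrame({'q2': ['Bar1', 'Bar{5}', 'Bar(5)', 'Bar{ex.5}']})

# Strip braces, parentheses, and the literal 'ex.' marker, then lowercase,
# so variants of the same value compare equal.
df['q2_clean'] = (df['q2']
                  .str.replace(r'[{}()]|ex\.', '', regex=True)
                  .str.lower())
```

All four variants above normalize to the same value ('bar1', 'bar5', ...), so a plain merge on the cleaned column then works.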

The output will include a set of duplicate columns, but you can simply ignore those columns anyway:

           id    q1  q2_x    q2_y
0  testProd_1  Foo1  Bar1    Bar1
1  testProd_2  Foo2  Bar2    Bar2
2  testProd_3  Foo3  Bar3    Bar3
3  testProd_5  Foo5  Bar5  Bar{5}


EDIT: Keeping duplicates

The default functionality of merge is that it will keep all duplicate keys in both tables. The issue with manipulating duplicates here is that pandas doesn't know which row is the intended lookup row, so it'll simply create a pair for each combination. As in the following example (series 1, 2, then joined):

           id    q1    q2
0  testProd_1  Foo1  Bar1
1  testProd_2  Foo2  Bar2
2  testProd_3  Foo3  Bar3
3  testProd_5  Foo5  Bar5
4  testProd_6  Foo5  Bar6
     q1      q2
0  Foo1    Bar1
1  Foo2    Bar2
2  Foo3    Bar3
3  Foo4    Bar4
4  Foo5  Bar{5}
5  Foo5  Bar{6}
           id    q1    q2_y
0  testProd_1  Foo1    Bar1
1  testProd_2  Foo2    Bar2
2  testProd_3  Foo3    Bar3
3  testProd_5  Foo5  Bar{5} <<< [3  testProd_5  Foo5  Bar5] + [4  Foo5  Bar{5}]
4  testProd_5  Foo5  Bar{6} <<< [3  testProd_5  Foo5  Bar5] + [5  Foo5  Bar{6}]
5  testProd_6  Foo5  Bar{5} <<< [4  testProd_6  Foo5  Bar6] + [4  Foo5  Bar{5}]
6  testProd_6  Foo5  Bar{6} <<< [4  testProd_6  Foo5  Bar6] + [5  Foo5  Bar{6}]

Therefore, there isn't a simple way to say "pick the first row of the 2nd table", but what you can do is simply remove duplicates in the 2nd table beforehand with a function like drop_duplicates.
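For example, keeping only the first row per q1 key in the second table before merging (toy data assumed):

```python
import pandas as pd

df2 = pd.DataFrame({'q1': ['Foo5', 'Foo5'],
                    'q2': ['Bar{5}', 'Bar{6}']})

# keep='first' retains the first occurrence of each q1 key,
# turning the subsequent merge into a one-to-one join on q1.
df2_dedup = df2.drop_duplicates(subset='q1', keep='first')
```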

Upvotes: 1
