lostsoul29
lostsoul29

Reputation: 756

groupby multiple value columns

I need to do a fuzzy groupby where a single record can be in one or more groups.

I have a DataFrame like this:

test = pd.DataFrame({'score1' : pandas.Series(['a', 'b', 'c', 'd', 'e']), 'score2' : pd.Series(['b', 'a', 'k', 'n', 'c'])})

Output:

  score1  score2
0   a       b
1   b       a
2   c       k
3   d       n
4   e       c

I wish to have groups like this: enter image description here

The group keys should be the union of the unique values between score1 and score2. Record 0 should be in groups a and b because it contains both score values. Similarly record 1 should be in groups b and a; record 2 should be in groups c and k and so on.

I've tried doing a groupby on two columns like this:

In [192]: score_groups = pd.groupby(['score1', 'score2'])

However I get the group keys as tuples - (1, 2), (2, 1), (3, 8), etc, instead of unique group keys where records can be in multiple groups. The output is shown below:

In [192]: score_groups.groups

Out[192]: {('a', 'b'): [0],
           ('b', 'a'): [1],
           ('c', 'k'): [2],
           ('d', 'n'): [3],
           ('e', 'c'): [4]}

Also, I need the indexes preserved because I'm using them for another operation later. Please help!

Upvotes: 2

Views: 6461

Answers (3)

lostsoul29
lostsoul29

Reputation: 756

Using Stefan's help, I solved it like this.

In (283): frame1 = test[['score1']]
          frame2 = test[['score2']]
          frame2.rename(columns={'score2': 'score1'}, inplace=True)

          test = pandas.concat([frame1, frame2])

          test

Out[283]:   
   score1
0   a
1   b
2   c
3   d
4   e
0   b
1   a
2   k
3   n
4   c

Notice the duplicate indices. The indexes have been preserved, which is what I wanted. Now, lets get to business - the group by operation.

In (283): groups = test.groupby('score1')

          groups.get_group('a') # Get group with key a

Out[283]: 
    score1
0   a
1   a

In (283): groups.get_group('b') # Get group with key b

Out[283]: 
    score1
1   b
0   b

In (283): groups.get_group('c') # Get group with key c

Out[283]: 
    score1
2   c
4   c

In (283): groups.get_group('k') # Get group with key k

Out[283]: 
    score1
2   k

I'm baffled by how pandas retrieves rows with the correct index even though they are duplicated. As I understand, the group by operation uses an inverted index data structure to store the references (indexes) to rows. Any insights would be greatly appreciated. Anyone who answers this will have their answer accepted :)

Upvotes: 0

Stefan
Stefan

Reputation: 42885

Combine the two columns in a single column using e.g. pd.concat():

s = pd.concat([test['score1'], test['score2'].rename(columns={'score2': 'score1'})]).reset_index()
s.columns = ['val', 'grp']

   val grp
0    0   a
1    1   b
2    2   c
3    3   d
4    4   e
5    0   b
6    1   a
7    2   k
8    3   n
9    4   c

And then .groupby() on 'grp' and collect 'val' in a list:

s = s.groupby('grp').apply(lambda x: x.val.tolist())

a    [0, 1]
b    [1, 0]
c    [2, 4]
d       [3]
e       [4]
k       [2]
n       [3]

or, if you prefer dict:

s.to_dict()

{'e': [4], 'd': [3], 'n': [3], 'k': [2], 'a': [0, 1], 'c': [2, 4], 'b': [1, 0]}

Or, to the same effect in a single step, skipping renaming the columns:

test.unstack().reset_index(-1).groupby(0).apply(lambda x: x.level_1.tolist())

a    [0, 1]
b    [1, 0]
c    [2, 4]
d       [3]
e       [4]
k       [2]
n       [3]

Upvotes: 1

Alex
Alex

Reputation: 6933

Reorganizing your data for ease of manipulation (having multiple value columns for the same data will always cause you headaches).

import pandas as pd

test = pd.DataFrame({'score1' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']), 'score2' : pd.Series([2, 1, 8, 9, 3], index=['a', 'b', 'c', 'd', 'e'])})

test['name'] = test.index
result = pd.melt(test, id_vars=['name'], value_vars=['score1', 'score2'])

  name variable  value
0    a   score1      1
1    b   score1      2
2    c   score1      3
3    d   score1      4
4    e   score1      5
5    a   score2      2
6    b   score2      1
7    c   score2      8
8    d   score2      9
9    e   score2      3

Now you have only one column for your value and it's easy to groupby score or select by your name column:

   hey = result.groupby('value')
   hey.groups
   #below are the indices that you care about
   {1: [0, 6], 2: [1, 5], 3: [2, 9], 4: [3], 5: [4], 8: [7], 9: [8]}

Upvotes: 0

Related Questions