Reputation: 756
I need to do a fuzzy groupby
where a single record can be in one or more groups.
I have a DataFrame
like this:
test = pd.DataFrame({'score1' : pandas.Series(['a', 'b', 'c', 'd', 'e']), 'score2' : pd.Series(['b', 'a', 'k', 'n', 'c'])})
Output:
score1 score2
0 a b
1 b a
2 c k
3 d n
4 e c
I wish to have groups like this:
The group keys should be the union of the unique values between score1
and score2
. Record 0
should be in groups a
and b
because it contains both score values. Similarly record 1
should be in groups b
and a
; record 2
should be in groups c
and k
and so on.
I've tried doing a groupby on two columns like this:
In [192]: score_groups = pd.groupby(['score1', 'score2'])
However I get the group keys as tuples - (1, 2), (2, 1), (3, 8), etc, instead of unique group keys where records can be in multiple groups. The output is shown below:
In [192]: score_groups.groups
Out[192]: {('a', 'b'): [0],
('b', 'a'): [1],
('c', 'k'): [2],
('d', 'n'): [3],
('e', 'c'): [4]}
Also, I need the indexes preserved because I'm using them for another operation later. Please help!
Upvotes: 2
Views: 6461
Reputation: 756
Using Stefan's help, I solved it like this.
In (283): frame1 = test[['score1']]
frame2 = test[['score2']]
frame2.rename(columns={'score2': 'score1'}, inplace=True)
test = pandas.concat([frame1, frame2])
test
Out[283]:
score1
0 a
1 b
2 c
3 d
4 e
0 b
1 a
2 k
3 n
4 c
Notice the duplicate indices. The indexes have been preserved, which is what I wanted. Now, lets get to business - the group by operation.
In (283): groups = test.groupby('score1')
groups.get_group('a') # Get group with key a
Out[283]:
score1
0 a
1 a
In (283): groups.get_group('b') # Get group with key b
Out[283]:
score1
1 b
0 b
In (283): groups.get_group('c') # Get group with key c
Out[283]:
score1
2 c
4 c
In (283): groups.get_group('k') # Get group with key k
Out[283]:
score1
2 k
I'm baffled by how pandas retrieves rows with the correct index even though they are duplicated. As I understand, the group by operation uses an inverted index data structure to store the references (indexes) to rows. Any insights would be greatly appreciated. Anyone who answers this will have their answer accepted :)
Upvotes: 0
Reputation: 42885
Combine the two columns
in a single column
using e.g. pd.concat()
:
s = pd.concat([test['score1'], test['score2'].rename(columns={'score2': 'score1'})]).reset_index()
s.columns = ['val', 'grp']
val grp
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
5 0 b
6 1 a
7 2 k
8 3 n
9 4 c
And then .groupby()
on 'grp'
and collect 'val'
in a list
:
s = s.groupby('grp').apply(lambda x: x.val.tolist())
a [0, 1]
b [1, 0]
c [2, 4]
d [3]
e [4]
k [2]
n [3]
or, if you prefer dict
:
s.to_dict()
{'e': [4], 'd': [3], 'n': [3], 'k': [2], 'a': [0, 1], 'c': [2, 4], 'b': [1, 0]}
Or, to the same effect in a single step, skipping renaming the columns:
test.unstack().reset_index(-1).groupby(0).apply(lambda x: x.level_1.tolist())
a [0, 1]
b [1, 0]
c [2, 4]
d [3]
e [4]
k [2]
n [3]
Upvotes: 1
Reputation: 6933
Reorganizing your data for ease of manipulation (having multiple value columns for the same data will always cause you headaches).
import pandas as pd
test = pd.DataFrame({'score1' : pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']), 'score2' : pd.Series([2, 1, 8, 9, 3], index=['a', 'b', 'c', 'd', 'e'])})
test['name'] = test.index
result = pd.melt(test, id_vars=['name'], value_vars=['score1', 'score2'])
name variable value
0 a score1 1
1 b score1 2
2 c score1 3
3 d score1 4
4 e score1 5
5 a score2 2
6 b score2 1
7 c score2 8
8 d score2 9
9 e score2 3
Now you have only one column for your value and it's easy to groupby score or select by your name column:
hey = result.groupby('value')
hey.groups
#below are the indices that you care about
{1: [0, 6], 2: [1, 5], 3: [2, 9], 4: [3], 5: [4], 8: [7], 9: [8]}
Upvotes: 0