emax
emax

Reputation: 7245

Grouping entries in pandas dataframe

Hello I have a dataframe like this

df = pd.DataFrame( {'Item':['A','A','A','B','B','C','C','C','C'], 
'b':[Tom,John,Paul,Tom,Frank,Tom, John, Richard, James]})
df 
Item Name
A    Tom
A    John
A    Paul
B    Tom 
B    Frank
C    Tom
C    John
C    Richard
C    James

For each people I want the list of the people with same item and them time

df1 
Name              People                          Times
Tom     [John, Paul, Frank, Richard, James]       [2,1,1,1,1]
John    [Tom, Richard, James]                     [2,1,1]
Paul    [Tom, John]                               [1,1]
Frank   [Tom]                                     [1]
Richard [Tom, John, James]                        [1,1,1]
James   [Tom, John, Richard]                      [1,1,1]

So far I have tried this to count the different people for the different items

df.groupby("Item").agg({ "Name": pd.Series.nunique})
      Name
Item    
A      3
B      2
C      4

and

df.groupby("Name").agg({ "Item": pd.Series.nunique})
        Item
Name    
Frank   1
James   1
John    2
Paul    1
Richard 1
Tom     3

Upvotes: 1

Views: 94

Answers (2)

Alexander
Alexander

Reputation: 109510

ct = pd.crosstab(df.Name, df.Item)

d = {Name: [(name, val) 
            for name, val in ct.loc[ct.index != Name, ct.ix[Name] == 1]
            .sum(axis=1).iteritems() if val] 
     for Name in df.Name.unique()}

>>> pd.DataFrame({'Name': d.keys(), 
                  'People': [[t[0] for t in d[name]] for name in d], 
                  'times': [[t[1] for t in d[name]] for name in d]})
      Name                               People            times
0  Richard                   [James, John, Tom]        [1, 1, 1]
1    James                 [John, Richard, Tom]        [1, 1, 1]
2      Tom  [Frank, James, John, Paul, Richard]  [1, 1, 2, 1, 1]
3    Frank                                [Tom]              [1]
4     Paul                          [John, Tom]           [1, 1]
5     John          [James, Paul, Richard, Tom]     [1, 1, 1, 2]

EXPLANATION

The crosstab gets you the location of each name by Item type.

>>> ct
Item     A  B  C
Name            
Frank    0  1  0
James    0  0  1
John     1  0  1
Paul     1  0  0
Richard  0  0  1
Tom      1  1  1

This table is then shrunk. For each name on which the key is being build, that name is removed from the table and only columns where that name appears are chosen.

Using 'John' as an example:

>>> ct.loc[ct.index != 'John', ct.ix['John'] == 1]
Item     A  C
Name         
Frank    0  0
James    0  1
Paul     1  0
Richard  0  1
Tom      1  1

This result is then summed along the rows to yield the results for John:

Name
Frank      0
James      1
Paul       1
Richard    1
Tom        2
dtype: int64

These results are then iterated over to pack them into tuple pairs and to remove the case where the value is zero (e.g. Frank above).

>>> [(name, val) for name, val in 
     ct.loc[ct.index != 'John', ct.ix['John'] == 1].sum(axis=1).iteritems() if val]
[('James', 1), ('Paul', 1), ('Richard', 1), ('Tom', 2)]

This action is performed for each name using a dictionary comprehension.

>>> d
{'Frank': [('Tom', 1)],
 'James': [('John', 1), ('Richard', 1), ('Tom', 1)],
 'John': [('James', 1), ('Paul', 1), ('Richard', 1), ('Tom', 2)],
 'Paul': [('John', 1), ('Tom', 1)],
 'Richard': [('James', 1), ('John', 1), ('Tom', 1)],
 'Tom': [('Frank', 1), ('James', 1), ('John', 2), ('Paul', 1), ('Richard', 1)]}

This dictionary is then used to create the desired dataframe using a nested list comprehension to unpack the tuple pairs.

Upvotes: 1

jezrael
jezrael

Reputation: 862406

You can use numpy.unique for count items in lists:

print df
  Item     Name
0    A      Tom
1    A     John
2    A     Paul
3    B      Tom
4    B    Frank
5    C      Tom
6    C     John
7    C  Richard
8    C    James

#merge M:N by column Item
df1 = pd.merge(df, df, on=['Item'])

#remove duplicity - column Name_x == Name_y
df1 = df1[~(df1['Name_x'] == df1['Name_y'])]
#print df1

#create lists
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print df1
    Name_x                                     Name_y
0    Frank                                      [Tom]
1    James                       [Tom, John, Richard]
2     John           [Tom, Paul, Tom, Richard, James]
3     Paul                                [Tom, John]
4  Richard                         [Tom, John, James]
5      Tom  [John, Paul, Frank, John, Richard, James]
#get count by np.unique
df1['People'] = df1['Name_y'].apply(lambda a: np.unique((a), return_counts =True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique((a), return_counts =True)[1])
#remove column Name_y
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x':'Name'})
print df1
      Name                               People            times
0    Frank                                [Tom]              [1]
1    James                 [John, Richard, Tom]        [1, 1, 1]
2     John          [James, Paul, Richard, Tom]     [1, 1, 1, 2]
3     Paul                          [John, Tom]           [1, 1]
4  Richard                   [James, John, Tom]        [1, 1, 1]
5      Tom  [Frank, James, John, Paul, Richard]  [1, 1, 2, 1, 1]

Upvotes: 1

Related Questions