Reputation: 7245
Hello I have a dataframe like this
df = pd.DataFrame({'Item': ['A','A','A','B','B','C','C','C','C'],
                   'Name': ['Tom','John','Paul','Tom','Frank','Tom','John','Richard','James']})
df
Item Name
A Tom
A John
A Paul
B Tom
B Frank
C Tom
C John
C Richard
C James
For each person, I want the list of people who share an item with them, and how many items they share:
df1
Name People Times
Tom [John, Paul, Frank, Richard, James] [2,1,1,1,1]
John [Tom, Richard, James] [2,1,1]
Paul [Tom, John] [1,1]
Frank [Tom] [1]
Richard [Tom, John, James] [1,1,1]
James [Tom, John, Richard] [1,1,1]
So far I have tried this to count the distinct names per item
df.groupby("Item").agg({ "Name": pd.Series.nunique})
Name
Item
A 3
B 2
C 4
and
df.groupby("Name").agg({ "Item": pd.Series.nunique})
Item
Name
Frank 1
James 1
John 2
Paul 1
Richard 1
Tom 3
Upvotes: 1
Views: 94
Reputation: 109510
ct = pd.crosstab(df.Name, df.Item)
d = {Name: [(name, val)
            for name, val in ct.loc[ct.index != Name, ct.loc[Name] == 1]
                               .sum(axis=1).items() if val]
     for Name in df.Name.unique()}
>>> pd.DataFrame({'Name': d.keys(),
'People': [[t[0] for t in d[name]] for name in d],
'times': [[t[1] for t in d[name]] for name in d]})
Name People times
0 Richard [James, John, Tom] [1, 1, 1]
1 James [John, Richard, Tom] [1, 1, 1]
2 Tom [Frank, James, John, Paul, Richard] [1, 1, 2, 1, 1]
3 Frank [Tom] [1]
4 Paul [John, Tom] [1, 1]
5 John [James, Paul, Richard, Tom] [1, 1, 1, 2]
EXPLANATION
The crosstab gets you the location of each name by Item type.
>>> ct
Item A B C
Name
Frank 0 1 0
James 0 0 1
John 1 0 1
Paul 1 0 0
Richard 0 0 1
Tom 1 1 1
This table is then shrunk. For each name for which a key is being built, that name's row is removed from the table and only the columns where that name appears are kept.
Using 'John' as an example:
>>> ct.loc[ct.index != 'John', ct.loc['John'] == 1]
Item A C
Name
Frank 0 0
James 0 1
Paul 1 0
Richard 0 1
Tom 1 1
This result is then summed along the rows to yield the results for John:
Name
Frank 0
James 1
Paul 1
Richard 1
Tom 2
dtype: int64
These results are then iterated over to pack them into tuple pairs and to remove the case where the value is zero (e.g. Frank above).
>>> [(name, val) for name, val in
...  ct.loc[ct.index != 'John', ct.loc['John'] == 1].sum(axis=1).items() if val]
[('James', 1), ('Paul', 1), ('Richard', 1), ('Tom', 2)]
This action is performed for each name using a dictionary comprehension.
>>> d
{'Frank': [('Tom', 1)],
'James': [('John', 1), ('Richard', 1), ('Tom', 1)],
'John': [('James', 1), ('Paul', 1), ('Richard', 1), ('Tom', 2)],
'Paul': [('John', 1), ('Tom', 1)],
'Richard': [('James', 1), ('John', 1), ('Tom', 1)],
'Tom': [('Frank', 1), ('James', 1), ('John', 2), ('Paul', 1), ('Richard', 1)]}
This dictionary is then used to create the desired dataframe using a nested list comprehension to unpack the tuple pairs.
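For reference, here is the whole approach as one runnable sketch on current pandas, where the deprecated `.ix` indexer is replaced by `.loc` and `iteritems` by `items` (variable names are otherwise kept as above):

```python
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank',
                            'Tom', 'John', 'Richard', 'James']})

# Cross-tabulate: one row per name, one column per item,
# 1 where that name occurs with that item
ct = pd.crosstab(df.Name, df.Item)

# For each name: drop their own row, keep only the columns of their items,
# sum across items, and keep the non-zero (name, count) pairs
d = {name: [(other, val)
            for other, val in ct.loc[ct.index != name, ct.loc[name] == 1]
                                .sum(axis=1).items() if val]
     for name in df.Name.unique()}

res = pd.DataFrame({'Name': list(d),
                    'People': [[t[0] for t in d[n]] for n in d],
                    'times': [[t[1] for t in d[n]] for n in d]})
```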
Upvotes: 1
Reputation: 862406
You can use numpy.unique to count the items in the lists:
print(df)
Item Name
0 A Tom
1 A John
2 A Paul
3 B Tom
4 B Frank
5 C Tom
6 C John
7 C Richard
8 C James
# merge M:N on column Item
df1 = pd.merge(df, df, on=['Item'])
# remove rows where a name is paired with itself (Name_x == Name_y)
df1 = df1[df1['Name_x'] != df1['Name_y']]
# create lists
df1 = df1.groupby('Name_x')['Name_y'].apply(list).reset_index()
print(df1)
Name_x Name_y
0 Frank [Tom]
1 James [Tom, John, Richard]
2 John [Tom, Paul, Tom, Richard, James]
3 Paul [Tom, John]
4 Richard [Tom, John, James]
5 Tom [John, Paul, Frank, John, Richard, James]
import numpy as np

# get counts with np.unique
df1['People'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[1])
# drop helper column Name_y and rename
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x': 'Name'})
print(df1)
Name People times
0 Frank [Tom] [1]
1 James [John, Richard, Tom] [1, 1, 1]
2 John [James, Paul, Richard, Tom] [1, 1, 1, 2]
3 Paul [John, Tom] [1, 1]
4 Richard [James, John, Tom] [1, 1, 1]
5 Tom [Frank, James, John, Paul, Richard] [1, 1, 2, 1, 1]
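The same self-merge idea can also be finished without numpy, by counting each (Name_x, Name_y) pair directly with `groupby(...).size()`; a minimal sketch (the chained `query` style is just a stylistic choice, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank',
                            'Tom', 'John', 'Richard', 'James']})

# Self-merge on Item pairs every name with its co-occurring names,
# then count how often each (Name_x, Name_y) pair appears
counts = (df.merge(df, on='Item')
            .query('Name_x != Name_y')
            .groupby(['Name_x', 'Name_y'])
            .size())

# Collapse the pair counts into one row per name with two list columns
out = pd.DataFrame({
    'People': counts.groupby(level=0)
                    .apply(lambda s: list(s.index.get_level_values('Name_y'))),
    'times': counts.groupby(level=0).apply(list),
}).reset_index().rename(columns={'Name_x': 'Name'})
```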
Upvotes: 1