Reputation: 4214
Is there a better way to do what the code below does in a (slow!) loop?
Using an input DataFrame, I want to convert it to a list of products each user has consumed. But this list will be up to the millions long and this seems quite inefficient (unless I use cython). Any ideas how to make this more python-happy? Thanks!
a = pd.DataFrame({'user_id':['a', 'a', 'b', 'c', 'c', 'c'], 'prod_id':['p1', 'p2', 'p1', 'p2', 'p3', 'p7']})
print "Input Dataframe:\n", a
print '\nDesired Output:'
# Build desired output:
uniqIDs = a.user_id.unique()
for id in uniqIDs:
prod_list = list(a[a.user_id == id].prod_id.values)
s = id + '\t'
for x in prod_list:
s += x + '\t'
print s # This will get saved to a TAB DELIMITED file
Gives this output (which is exactly what I desire):
Input Dataframe:
prod_id user_id
0 p1 a
1 p2 a
2 p1 b
3 p2 c
4 p3 c
5 p7 c
Desired Output:
a p1 p2
b p1
c p2 p3 p7
Upvotes: 1
Views: 118
Reputation: 353389
You could use groupby
:
>>> a = pd.DataFrame({'user_id':['a', 'a', 'b', 'c', 'c', 'c'], 'prod_id':['p1', 'p2', 'p1', 'p2', 'p3', 'p7']})
>>> a
prod_id user_id
0 p1 a
1 p2 a
2 p1 b
3 p2 c
4 p3 c
5 p7 c
>>> a.groupby("user_id")["prod_id"].unique()
user_id
a [p1, p2]
b [p1]
c [p2, p3, p7]
dtype: object
Hmm.
Getting the output in the exact format you want is surprisingly troublesome, because I can't see any way to override pandas
' desire to escape things. IOW, it's easy to .apply('\t'.join)
to the resulting series manually, but it's hard to save it using \t
as a separator too.
So here's another approach:
>>> df = pd.DataFrame({k: g.reset_index(drop=True)
for k,g in a.groupby("user_id")["prod_id"]}).T
>>> df.to_csv("prod.csv", sep="\t", header=False)
>>> !cat prod.csv
a p1 p2
b p1
c p2 p3 p7
If you really want, you could strip the extra tabs at the end.
Upvotes: 3