zbinsd
zbinsd

Reputation: 4214

Python - From Pandas to Sparse Output Format

Is there a better way to do what the code below does in a (slow!) loop?

Using an input DataFrame, I want to convert it to a list of products each user has consumed. But this list will be up to the millions long and this seems quite inefficient (unless I use cython). Any ideas how to make this more python-happy? Thanks!

a = pd.DataFrame({'user_id':['a', 'a', 'b', 'c', 'c', 'c'], 'prod_id':['p1', 'p2', 'p1', 'p2', 'p3', 'p7']})

print "Input Dataframe:\n", a
print '\nDesired Output:'

# Build desired output:
uniqIDs = a.user_id.unique()

for id in uniqIDs:

    prod_list = list(a[a.user_id == id].prod_id.values)        

    s = id + '\t'
    for x in prod_list:
        s += x + '\t'

    print s # This will get saved to a TAB DELIMITED file

Gives this output (which is exactly what I desire):

Input Dataframe:
  prod_id user_id
0      p1       a
1      p2       a
2      p1       b
3      p2       c
4      p3       c
5      p7       c

Desired Output:
a   p1  p2  
b   p1  
c   p2  p3  p7

Upvotes: 1

Views: 118

Answers (1)

DSM
DSM

Reputation: 353389

You could use groupby:

>>> a = pd.DataFrame({'user_id':['a', 'a', 'b', 'c', 'c', 'c'], 'prod_id':['p1', 'p2', 'p1', 'p2', 'p3', 'p7']})
>>> a
  prod_id user_id
0      p1       a
1      p2       a
2      p1       b
3      p2       c
4      p3       c
5      p7       c
>>> a.groupby("user_id")["prod_id"].unique()
user_id
a              [p1, p2]
b                  [p1]
c          [p2, p3, p7]
dtype: object

Hmm.

Getting the output in the exact format you want is surprisingly troublesome, because I can't see any way to override pandas' desire to escape things. IOW, it's easy to .apply('\t'.join) to the resulting series manually, but it's hard to save it using \t as a separator too.

So here's another approach:

>>> df = pd.DataFrame({k: g.reset_index(drop=True) 
                       for k,g in a.groupby("user_id")["prod_id"]}).T
>>> df.to_csv("prod.csv", sep="\t", header=False)
>>> !cat prod.csv
a   p1  p2  
b   p1      
c   p2  p3  p7

If you really want, you could strip the extra tabs at the end.

Upvotes: 3

Related Questions