Keithx

Reputation: 3148

Data loading using arrays in Python

I have data in the following format in a .txt file:

UserId   WordID
  1       20
  1       30
  1       40
  2       25
  2       16
  3       56
  3       44
  3       12

What I'm looking for is a function that groups the rows by UserId and returns a list of WordIDs for each user:

[[20, 30, 40], [25, 16], [56, 44, 12]]

What I'm trying to do is:

def loadSet(path='/data/file.txt'):
  datset={}
  for line in open(path+'/file.txt'):
    (userid,wordid)=line.split('\t')
    dataset.setdefault(user,{})
    dataset[userid][wordid]=float(wordid)
    return dataset

But I can't get it to work. Could you please advise on the right approach?

Upvotes: 2

Views: 71

Answers (3)

B. M.

Reputation: 18628

If performance is a concern, numpy is, as is often the case, the better choice:

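import numpy as np
import pandas as pd

df = pd.read_csv('file.txt', sep=r'\s+')  # assumes whitespace- or tab-separated columns
def numpyway():
    u, v = df.values.T
    ind = np.argsort(u, kind='mergesort')  # stable sort to preserve order
    # split wherever the sorted UserId changes
    return np.split(v[ind], np.where(np.diff(u[ind]))[0] + 1)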


In [12]: %timeit numpyway() # on 8000 lines
10000 loops, best of 3: 250 µs per loop

If 'UserId' is already sorted, it is about three times faster still.
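
As a rough sketch of the sorted case: if the ids are already grouped, the argsort step can presumably be skipped and the values split directly at each change of id. The helper name split_sorted and the inline sample arrays are illustrative assumptions, not part of the timing above.

```python
import numpy as np

def split_sorted(u, v):
    """Split v into groups at each change in u.

    Assumes u is already sorted (equal ids are contiguous).
    """
    # indices where the id changes, shifted by one to mark group starts
    boundaries = np.flatnonzero(np.diff(u)) + 1
    return np.split(v, boundaries)

# sample data from the question
u = np.array([1, 1, 1, 2, 2, 3, 3, 3])
v = np.array([20, 30, 40, 25, 16, 56, 44, 12])

groups = [g.tolist() for g in split_sorted(u, v)]
print(groups)
```

The final tolist() call is only there to turn the numpy arrays into plain Python lists.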

Upvotes: 0

jezrael

Reputation: 862611

I think you can use groupby, apply tolist to each group, and then take values:

print(df.groupby('UserId')['WordID'].apply(lambda x: x.tolist()).values)
[[20, 30, 40] [25, 16] [56, 44, 12]]

Or apply list directly (thank you, B. M.):

print(df.groupby('UserId')['WordID'].apply(list).values)
[[20, 30, 40] [25, 16] [56, 44, 12]]
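
Note that .values returns a numpy object array rather than a nested list. If a plain list of lists is needed, as in the question, one option is to call tolist() on the grouped result instead. A minimal self-contained sketch, with the DataFrame built inline from the sample data (the variable name nested is just illustrative):

```python
import pandas as pd

# sample data from the question, built inline instead of read from a file
df = pd.DataFrame({
    'UserId': [1, 1, 1, 2, 2, 3, 3, 3],
    'WordID': [20, 30, 40, 25, 16, 56, 44, 12],
})

# one list per user, collected into a plain Python list
nested = df.groupby('UserId')['WordID'].apply(list).tolist()
print(nested)
```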

Timings:

df = pd.concat([df]*1000).reset_index(drop=True)

In [358]: %timeit df.groupby('UserId')['WordID'].apply(list).values
1000 loops, best of 3: 1.22 ms per loop

In [359]: %timeit df.groupby('UserId')['WordID'].apply(lambda x: x.tolist()).values
1000 loops, best of 3: 1.23 ms per loop

Upvotes: 1

M.T

Reputation: 5231

While you might prefer to do this in pandas, depending on your purpose, the numpy way would be:

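import numpy as np

# skiprows=1 skips the header line; unpack=True returns one array per column
userid, wordid = np.loadtxt('/data/file.txt', skiprows=1, unpack=True)
# example use:
mylist = []
for uid in np.unique(userid):
    mylist.append(wordid[userid == uid])  # boolean mask selects this user's words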
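
Since np.loadtxt returns float arrays by default, the groups above contain floats. If plain lists of ints are wanted, a pure-Python sketch along the lines of the attempt in the question might look like this. The function name load_groups and its skip_header parameter are hypothetical, and the sketch assumes two whitespace-separated columns with one header line.

```python
def load_groups(lines, skip_header=True):
    """Group WordID values by UserId, keeping users in first-seen order."""
    groups = {}  # dicts preserve insertion order in Python 3.7+
    rows = iter(lines)
    if skip_header:
        next(rows, None)  # drop the "UserId WordID" line
    for line in rows:
        fields = line.split()  # splits on tabs or spaces
        if len(fields) != 2:
            continue  # ignore blank or malformed lines
        userid, wordid = fields
        groups.setdefault(userid, []).append(int(wordid))
    return list(groups.values())

# sample data from the question
sample = """UserId   WordID
  1       20
  1       30
  1       40
  2       25
  2       16
  3       56
  3       44
  3       12
""".splitlines()

print(load_groups(sample))
```

Because the function takes any iterable of lines, an open file object can be passed in directly, for example inside a with open(path) as f: block.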

Upvotes: 0
