Issue with reading text file into numpy array using pandas reader

Question

I have a massive text file; a dummy version looks like this after skipping the headers:

1444455        7        8        12 52 45 68 70

1356799        3        3        45 34 23 22 11

I would like to read this into a NumPy array, but np.loadtxt is really slow. The file is named data.txt. Right now I am using:

u=pd.read_csv('data.txt',dtype=np.float16,header=3).values

I have played with the parameters to no avail. If I leave out dtype, each row of my array comes back as a single long string of numbers. When I pass dtype, I get the error: invalid literal for float(). I suspect the two kinds of separators in the text file (tabs and single spaces) are also causing confusion. How can I get this into a NumPy array of shape (2, 8)?

Could any of you pros help? Thanks

jezrael · Accepted Answer

Since the separator is whitespace, it seems you need delim_whitespace=True in read_csv, together with header=None. Then cast to float:

u=pd.read_csv('data.txt', delim_whitespace=True, header=None).astype(float).values

print (u)
[[  1.44445500e+06   7.00000000e+00   8.00000000e+00   1.20000000e+01
    5.20000000e+01   4.50000000e+01   6.80000000e+01   7.00000000e+01]
 [  1.35679900e+06   3.00000000e+00   3.00000000e+00   4.50000000e+01
    3.40000000e+01   2.30000000e+01   2.20000000e+01   1.10000000e+01]]
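
A self-contained sketch of the same read, for anyone who wants to try it without the file. It assumes the two sample rows from the question stand in for data.txt (loaded through io.StringIO, with a tab after the first field as a guess at the mixed layout), and it uses sep=r"\s+", which recent pandas releases recommend in place of delim_whitespace=True (deprecated, as far as I know, since pandas 2.2):

```python
import io

import numpy as np
import pandas as pd

# Two sample rows from the question. The tab after the first field and the
# runs of spaces elsewhere are an assumed stand-in for the mixed separators.
sample = (
    "1444455\t7        8        12 52 45 68 70\n"
    "1356799\t3        3        45 34 23 22 11\n"
)

# sep=r"\s+" treats any run of tabs and/or spaces as a single delimiter.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

# Convert to a NumPy array, casting to float64 in the same step.
u = df.to_numpy(dtype=np.float64)

print(u.shape)  # (2, 8)
print(u.dtype)  # float64
```

With the real file, replace io.StringIO(sample) with 'data.txt' and add skiprows if there are header lines to skip.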

Note that the resulting values are numpy.float64:

u=pd.read_csv('data.txt', delim_whitespace=True, header=None).astype(float)

print (type(u.loc[0,0]))

If you use dtype=np.float16, the first column becomes inf, because its values exceed the largest finite float16 value (65504):

u=pd.read_csv('data.txt', dtype=np.float16, delim_whitespace=True, header=None).values
print (u)
[[ inf   7.   8.  12.  52.  45.  68.  70.]
 [ inf   3.   3.  45.  34.  23.  22.  11.]]
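
The inf values can be reproduced with NumPy alone. This is a minimal sketch using only the first-column values from the question; the float32 comparison is an addition for illustration, not something the original answer tested:

```python
import numpy as np

# Largest finite value a float16 can hold.
print(float(np.finfo(np.float16).max))  # 65504.0

# First-column values from the question's sample rows.
first_col = np.array([1444455, 1356799])

# Both values exceed the float16 range, so the cast overflows to inf.
# errstate silences the overflow warning some NumPy versions emit here.
with np.errstate(over="ignore"):
    as_f16 = first_col.astype(np.float16)
print(np.isinf(as_f16).all())  # True

# float32 stores every integer up to 2**24 = 16777216 exactly,
# so these values survive the cast unchanged.
as_f32 = first_col.astype(np.float32)
print((as_f32 == first_col).all())  # True
```

So if memory is the reason for wanting a smaller dtype, float32 may be a safer choice for data in this range than float16.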
