Reputation: 6496
I have data of the form:
#---------------------
# Data
#---------------------
p q r y 1 y 2 y 3 y 4
2 8 14 748 748 748 790
2 9 22 262 245 252 328
1 5 19 512 514 511 569
2 7 19 748 748 748 805
3 11 13 160 168 108 164
2 7 20 788 788 788 848
1 4 15 310 310 310 355
3 12 17 230 210 213 218
And I am trying to generate array B by using np.genfromtxt(), using the code:
import numpy as np
A = open('data.dat', "r")
line = A.readline()
while line.startswith('#'):
    line = A.readline()
A_header = line.split("\t")
A_header[-1] = A_header[-1].strip()
B = np.genfromtxt('data.dat', comments='#', delimiter='\t', names = A_header, dtype = None, unpack = True).transpose()
print B
print B['y_1']
I have two questions:
Why doesn't np.genfromtxt() remove the data header while importing? After the import, array B still contains the header p, q, ... y 3, y 4.
Why do we have to use underscores in the header names, e.g. y_1, y_2, etc.? Why can't we use the names as they are: y 1, y 2 ... y 4?
Upvotes: 1
Views: 8044
Reputation: 19770
For what it's worth, pandas.read_table reads this file easily.
import pandas
B = pandas.read_table('data.dat', comment='#')
print B['y 1'] # Note the space is retained in the column name
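And if you still want a NumPy structured array at the end, a quick sketch (to_records is plain pandas API, nothing specific to this file):
recs = B.to_records(index=False)  # record array; the column names, spaces included, become field names
print(recs['y 1'])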
Upvotes: 0
Reputation: 231385
Your format is fighting a couple of assumptions that genfromtxt is making:
1) you have both comment lines and a header line (without the # character)
2) your column names have spaces, which genfromtxt insists on converting to _ (or some other valid character).
If I create a text file from your sample, and replace blanks with tabs (which is a pain, especially since my editors are set to replace tabs with spaces), this works:
In [330]: np.genfromtxt('stack29451030.txt',delimiter='\t',dtype=None,skip_header=3,names=True)
Out[330]:
array([(2, 8, 14, 748, 748, 748, 790), (2, 9, 22, 262, 245, 252, 328)],
dtype=[('p', '<i4'), ('q', '<i4'), ('r', '<i4'), ('y_1', '<i4'), ('y_2', '<i4'), ('y_3', '<i4'), ('y_4', '<i4')])
I played with replace_space=' '. It looks like it only uses replacements that produce valid Python variable and attribute names, so 'y_1' is fine but 'y 1' is not. I don't see a way around this using parameters.
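For instance, a sketch of that experiment (replace_space='x' is just an arbitrary valid character picked for illustration; the expected names are my reading of the name validator, not a verified run):
names = np.genfromtxt('stack29451030.txt', delimiter='\t', dtype=None,
                      skip_header=3, names=True, replace_space='x').dtype.names
# expect something like ('p', 'q', 'r', 'yx1', 'yx2', 'yx3', 'yx4') -
# a valid replacement character survives, but a plain space never does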
comments and names don't cooperate in your case: it can skip the comment lines, but then it will read the names line as data.
In [350]: np.genfromtxt('stack29451030.txt',delimiter='\t',dtype=None,comments='#')
Out[350]:
array([['p', 'q', 'r', 'y 1', 'y 2', 'y 3', 'y 4'],
['2', '8', '14', '748', '748', '748', '790'],
['2', '9', '22', '262', '245', '252', '328']],
dtype='|S3')
It can handle a names line like #p q r y1 y2 y3 y4, ignoring the #, but then it doesn't skip the earlier comment lines. So if you could remove either the comment lines or the header line, it could read the file; with both present, it looks like you have to use something other than comments.
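For example, a small sketch of that commented-header case, with the data passed in as a list of lines purely for illustration (genfromtxt accepts any iterable of lines):
lines = ["#p\tq\tr\ty1\ty2\ty3\ty4",
         "2\t8\t14\t748\t748\t748\t790",
         "2\t9\t22\t262\t245\t252\t328"]
A = np.genfromtxt(lines, delimiter='\t', dtype=None, names=True)
# the leading '#' is stripped from the names line, so A.dtype.names
# should come out as ('p', 'q', 'r', 'y1', 'y2', 'y3', 'y4')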
This looks like the cleanest load: explicitly skip the first 3 lines, accept the header line, and then use jedwards's idea to replace the _.
In [396]: A=np.genfromtxt('stack29451030.txt',delimiter='\t',dtype=None,skip_header=3,names=True)
In [397]: A.dtype.names = [n.replace('_', ' ') for n in A.dtype.names]
In [398]: A
Out[398]:
array([(2, 8, 14, 748, 748, 748, 790), (2, 9, 22, 262, 245, 252, 328)],
dtype=[('p', '<i4'), ('q', '<i4'), ('r', '<i4'), ('y 1', '<i4'), ('y 2', '<i4'), ('y 3', '<i4'), ('y 4', '<i4')])
If you don't know how many comment lines there are, this generator can filter them out:
with open('stack29451030.txt') as f:
    g = (line for line in f if not line.startswith('#'))
    A = np.genfromtxt(g, delimiter='\t', names=True, dtype=None)
genfromtxt accepts input from any iterable, whether a file, a list of lines, or a generator like this.
Upvotes: 1
Reputation: 30210
Instead of opening the file twice, what about:
import numpy as np
with open('input.txt', "r") as data:
    while True:
        line = data.readline()
        if not line.startswith('#'): break
    header = [e for e in line.strip().split('\t') if e]
    print(header)
    B = np.genfromtxt(data, names=header, dtype=None, delimiter='\t')
print B
print B['y_1']
Output:
# header
['p', 'q', 'r', 'y 1', 'y 2', 'y 3', 'y 4']
# B
[(2, 8, 14, 748, 748, 748, 790) (2, 9, 22, 262, 245, 252, 328)
(1, 5, 19, 512, 514, 511, 569) (2, 7, 19, 748, 748, 748, 805)
(3, 11, 13, 160, 168, 108, 164) (2, 7, 20, 788, 788, 788, 848)
(1, 4, 15, 310, 310, 310, 355) (3, 12, 17, 230, 210, 213, 218)]
# B['y_1']
[748 262 512 748 160 788 310 230]
Instead of passing the filename to np.genfromtxt, here you pass data, the open file handle, as the input. Otherwise you get into a weird situation where skip_header doesn't really work as you'd expect, because it also counts the comment lines. So you'd have to say skip_header=4 (3 comment lines + 1 header line) when what makes sense is skip_header=1.
So this approach first "throws out" the comment lines, then extracts the headers from the next line, and finally passes the remaining lines to the np.genfromtxt function with the associated header names.
A few notes:
unpack=True and .transpose() cancel each other out, so the effect of using both is the same as using neither. So use neither.
And if you really want to access the fields using names with spaces (instead of underscores), you can always rename the fields after you generate the ndarray:
B.dtype.names = [n.replace('_', ' ') for n in B.dtype.names]
print B['y 1'] # [748 262 512 748 160 788 310 230]
Upvotes: 3