Tom Kurushingal

Reputation: 6496

Why doesn't np.genfromtxt() remove header while importing in Python?

I have data of the form:

#---------------------
# Data
#---------------------
p   q   r   y 1 y 2 y 3 y 4
2   8   14  748 748 748 790
2   9   22  262 245 252 328
1   5   19  512 514 511 569
2   7   19  748 748 748 805
3   11  13  160 168 108 164
2   7   20  788 788 788 848
1   4   15  310 310 310 355
3   12  17  230 210 213 218

And I am trying to generate array B using np.genfromtxt() with this code:

import numpy as np
A = open('data.dat', "r")
line = A.readline()
while line.startswith('#'):
    line = A.readline()
A_header = line.split("\t")
A_header[-1] = A_header[-1].strip()
B = np.genfromtxt('data.dat', comments='#', delimiter='\t', names = A_header, dtype = None, unpack = True).transpose()
print B
print B['y_1']

I have two questions:

  1. Why doesn't np.genfromtxt() remove the data header while importing? When the data is imported, array B still contains the header row p, q, ... y 3, y 4.

  2. Why do we have to use underscores in the header names, e.g. y_1, y_2, etc.? Why can't we use the names as they are: y 1, y 2 ... y 4?

Upvotes: 1

Views: 8044

Answers (3)

chthonicdaemon

Reputation: 19770

For what it's worth, pandas.read_table reads this file easily.

import pandas
B = pandas.read_table('data.dat', comment='#')

print B['y 1']  # Note the space is retained in the column name
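And if you need a NumPy array again afterwards, the DataFrame converts back readily. A self-contained sketch (the inline string stands in for data.dat; read_csv with sep='\t' and comment='#' is equivalent to read_table here):

```python
import io
import pandas as pd

# Inline stand-in for data.dat: comment lines, a header row, two data rows.
raw = (
    "#---------------------\n"
    "# Data\n"
    "#---------------------\n"
    "p\tq\tr\ty 1\ty 2\ty 3\ty 4\n"
    "2\t8\t14\t748\t748\t748\t790\n"
    "2\t9\t22\t262\t245\t252\t328\n"
)

# comment='#' drops the comment lines; the first remaining line becomes the header.
B = pd.read_csv(io.StringIO(raw), sep="\t", comment="#")
print(B["y 1"].tolist())       # column names keep their spaces

# Convert back to a NumPy structured array; field names keep the spaces too.
A = B.to_records(index=False)
print(A["y 1"])
```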

Upvotes: 0

hpaulj

Reputation: 231385

Your format is fighting a couple of assumptions that genfromtxt is making:

1) you have both comment lines and a header line (without # character)

2) your column names have spaces, which genfromtxt insists on converting to _ (or some other valid character).

If I create a text file from your sample, and replace blanks with tabs (which is a pain, especially since my editors are set to replace tabs with spaces), this works:

In [330]: np.genfromtxt('stack29451030.txt',delimiter='\t',dtype=None,skip_header=3,names=True)
Out[330]: 
array([(2, 8, 14, 748, 748, 748, 790), (2, 9, 22, 262, 245, 252, 328)], 
      dtype=[('p', '<i4'), ('q', '<i4'), ('r', '<i4'), ('y_1', '<i4'), ('y_2', '<i4'), ('y_3', '<i4'), ('y_4', '<i4')])

I played with replace_space=' '. It looks like genfromtxt only accepts replacements that produce valid Python variable and attribute names, so 'y_1' is fine, but 'y 1' is not. I don't see a way around this using parameters.

comments and names don't cooperate in your case: genfromtxt can skip the comment lines, but then it reads the names line as data.

In [350]: np.genfromtxt('stack29451030.txt',delimiter='\t',dtype=None,comments='#')
Out[350]: 
array([['p', 'q', 'r', 'y 1', 'y 2', 'y 3', 'y 4'],
       ['2', '8', '14', '748', '748', '748', '790'],
       ['2', '9', '22', '262', '245', '252', '328']], 
      dtype='|S3')

It can handle a names line like #p q r y1 y2 y3 y4, ignoring the #, but then it doesn't skip the earlier comment lines. So if you could remove either the comment lines or the header line, it could read the file; with both present, it looks like you have to use something other than comments.

This looks like the cleanest load - explicitly skip the first 3 lines, accept the header line, and then use jedwards's idea to replace the _.

In [396]: A=np.genfromtxt('stack29451030.txt',delimiter='\t',dtype=None,skip_header=3,names=True)

In [397]: A.dtype.names = [n.replace('_', ' ') for n in A.dtype.names]

In [398]: A
Out[398]: 
array([(2, 8, 14, 748, 748, 748, 790), (2, 9, 22, 262, 245, 252, 328)], 
      dtype=[('p', '<i4'), ('q', '<i4'), ('r', '<i4'), ('y 1', '<i4'), ('y 2', '<i4'), ('y 3', '<i4'), ('y 4', '<i4')])

If you don't know how many comment lines there are, this generator can filter them out:

with open('stack29451030.txt') as f:
    g = (line for line in f if not line.startswith('#'))
    A = np.genfromtxt(g, delimiter='\t', names=True, dtype=None)

genfromtxt accepts input from any iterable, whether a file, a list of lines, or a generator like this.
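For instance, here is a short self-contained variant with the lines inlined as a list (the data is a trimmed stand-in for the file):

```python
import numpy as np

# Trimmed inline stand-in: one comment line, a header line, two data rows.
lines = [
    "# a comment line",
    "p\tq\ty_1",
    "2\t8\t748",
    "1\t5\t512",
]

# Generator that drops the comment lines before genfromtxt sees them.
g = (ln for ln in lines if not ln.startswith("#"))
A = np.genfromtxt(g, delimiter="\t", names=True, dtype=None)
print(A["y_1"])    # the surviving first line became the field names
```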

Upvotes: 1

jedwards

Reputation: 30210

Instead of opening the file twice, what about:

import numpy as np

with open('input.txt', "r") as data:
    while True:
        line = data.readline()
        if not line.startswith('#'): break

    header = [e for e in line.strip().split('\t') if e]
    print(header)

    B = np.genfromtxt(data, names=header, dtype=None, delimiter='\t')

print B
print B['y_1']

Output:

# header
['p', 'q', 'r', 'y 1', 'y 2', 'y 3', 'y 4']

# B
[(2, 8, 14, 748, 748, 748, 790) (2, 9, 22, 262, 245, 252, 328)
 (1, 5, 19, 512, 514, 511, 569) (2, 7, 19, 748, 748, 748, 805)
 (3, 11, 13, 160, 168, 108, 164) (2, 7, 20, 788, 788, 788, 848)
 (1, 4, 15, 310, 310, 310, 355) (3, 12, 17, 230, 210, 213, 218)]

# B['y_1']
[748 262 512 748 160 788 310 230]

Instead of passing the filename to np.genfromtxt, here you pass data, the open file object, which yields the remaining lines.

Otherwise, you get into a weird situation where skip_header doesn't really work as expected, because it counts comment lines too. So you'd have to say skip_header=4 (3 comment lines + 1 header line) when what makes sense is skip_header=1.

So this approach first "throws out" the comment lines, then extracts the headers from the next line, and finally passes the remaining lines to np.genfromtxt with the associated names.

A few notes:

  • unpack=True and transpose() cancel each other out, so using both has the same effect as using neither. Use neither.

  • And if you really want to access the fields using names with spaces (instead of underscores) you can always rename the fields after you generate the ndarray:

    B.dtype.names = [n.replace('_', ' ') for n in B.dtype.names]
    print B['y 1']  # [748 262 512 748 160 788 310 230]
    
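The first note can be checked directly; a minimal sketch with a two-row inline table:

```python
import io
import numpy as np

raw = "1\t2\n3\t4\n"

# Plain read vs. unpack=True followed by transpose(): the two should agree.
plain = np.genfromtxt(io.StringIO(raw), delimiter="\t")
both = np.genfromtxt(io.StringIO(raw), delimiter="\t", unpack=True).transpose()
print(np.array_equal(plain, both))
```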

Upvotes: 3
