Markus

Reputation: 189

NumPy genfromtxt TypeError: data type not understood error

I would like to read in this file (test.txt)

01.06.2015;00:00:00;0.000;0;-9.999;0;8;0.00;18951;(SPECTRUM)ZERO(/SPECTRUM)
01.06.2015;00:01:00;0.000;0;-9.999;0;8;0.00;18954;(SPECTRUM)ZERO(/SPECTRUM)
01.06.2015;00:02:00;0.000;0;-9.999;0;8;0.00;18960;(SPECTRUM)ZERO(/SPECTRUM)
01.06.2015;09:23:00;0.327;61;25.831;39;29;0.18;19006;01.06.2015;09:23:00;0.327;61;25.831;39;29;0.18;19006;(SPECTRUM);;;;;;;;;;;;;;1;1;;;1;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;1;;;;;;;;;;;;(/SPECTRUM)
01.06.2015;09:24:00;0.000;0;-9.999;0;29;0.00;19010;(SPECTRUM)ZERO(/SPECTRUM)

I tried the NumPy function genfromtxt() (see the code excerpt below).

import numpy as np
col_names = ["date", "time", "rain_intensity", "weather_code_1", "radar_ref", "weather_code_2", "val6", "rain_accum", "val8", "val9"]
types = ["object", "object", "float", "uint8", "float", "uint8", "uint8", "float", "uint8","|S10"]
# Read in the file with np.genfromtxt
mydata = np.genfromtxt("test.txt", delimiter=";", names=col_names, dtype=types)

When I execute the code, I get the following error:

    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #4 (got 79 columns instead of 10)

I think the difficulty comes from the last column (val9), which contains many ; characters.
The delimiter and the characters inside the last column are the same.

How can I read the file without an error? Is it possible to skip the last column, or to replace the ; only in the last column?

Upvotes: 0

Views: 2160

Answers (2)

hpaulj

Reputation: 231325

usecols can be used to ignore excess delimiters, e.g.

In [546]: np.genfromtxt([b'1,2,3',b'1,2,3,,,,,,'], dtype=None,
    delimiter=',', usecols=np.arange(3))
Out[546]: 
array([[1, 2, 3],
       [1, 2, 3]])
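
Applied to the semicolon-delimited data from the question, the same idea might look like the sketch below. The two rows are shortened stand-ins for lines of test.txt (the spectrum block in the second one is abbreviated), and requesting only the seven numeric columns so that a plain float array comes back is an assumption, not something from the original post.

```python
import numpy as np

# Shortened stand-ins for rows of test.txt. The second row has extra ';'
# inside its spectrum block, so it splits into more than 10 fields.
rows = [
    "01.06.2015;00:00:00;0.000;0;-9.999;0;8;0.00;18951;(SPECTRUM)ZERO(/SPECTRUM)",
    "01.06.2015;09:23:00;0.327;61;25.831;39;29;0.18;19006;(SPECTRUM);;;1;1;;;(/SPECTRUM)",
]

# usecols picks fields by position, so the extra trailing fields are ignored.
# Columns 2..8 are the numeric ones (rain_intensity through val8).
numeric = np.genfromtxt(rows, delimiter=";", usecols=range(2, 9), dtype=float)

print(numeric.shape)    # rows x selected columns
print(numeric[:, -1])   # the val8 column
```

To keep the date and time columns as well, usecols could start at 0 together with a structured dtype or dtype=None.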

Upvotes: 0

SiHa

Reputation: 8412

From the NumPy documentation:

invalid_raise : bool, optional
If True, an exception is raised if an inconsistency is detected in the number of columns. If False, a warning is emitted and the offending lines are skipped.

mydata = np.genfromtxt("test.txt", delimiter=";", names=col_names, dtype=types, invalid_raise = False)

Note that there were errors in your code, which I have corrected (delimiter was misspelled, and the types list was referred to as dtypes in the function call).
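
To see the effect of invalid_raise=False in isolation, here is a minimal sketch on a made-up three-row input (the rows are hypothetical, not taken from test.txt). The middle row has too many fields, so it is skipped and a warning is emitted.

```python
import numpy as np

# Hypothetical three-row input; the middle row has too many fields.
rows = ["1;2;3", "4;5;6;;;", "7;8;9"]

# With invalid_raise=False the malformed row is dropped (with a warning)
# instead of aborting the whole read.
kept = np.genfromtxt(rows, delimiter=";", dtype=float, invalid_raise=False)

print(kept)
```

Only the first and third rows should survive, which is why this option loses the whole problem line rather than just its last column.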

Edit: From your comment, I see I slightly misunderstood. You want to skip the last column, not the last row.

Take a look at the following code. I have defined a generator that only returns the first ten elements of each row. This will allow genfromtxt() to complete without error and you now get column #3 from all rows.

Note, though, that you will still lose some data. If you look carefully, you will see that the problem line is actually two lines concatenated together, with garbage where the other lines have ZERO, so the second line is still lost. You could modify the generator to parse each line and handle this case differently, but I'll leave that as a fun exercise :)

import numpy as np

def filegen(filename):
    # Yield each line cut down to its first ten ';'-separated fields
    with open(filename, 'r') as infile:
        for line in infile:
            yield ';'.join(line.split(';')[:10])

col_names = ["date", "time", "rain_intensity", "weather_code_1", "radar_ref", "weather_code_2", "val6", "rain_accum", "val8", "val9"]
dtypes = ["object", "object", "float", "uint8", "float", "uint8", "uint8", "float", "uint8","|S10"]
# Read in the file with np.genfromtxt
mydata = np.genfromtxt(filegen('test.txt'), delimiter=";", names=col_names, dtype=dtypes)

Output

[('01.06.2015', '00:00:00', 0.0, 0, -9.999, 0, 8, 0.0, 7, '(SPECTRUM)')
 ('01.06.2015', '00:01:00', 0.0, 0, -9.999, 0, 8, 0.0, 10, '(SPECTRUM)')
 ('01.06.2015', '00:02:00', 0.0, 0, -9.999, 0, 8, 0.0, 16, '(SPECTRUM)')
 ('01.06.2015', '09:23:00', 0.327, 61, 25.831, 39, 29, 0.18, 62, '01.06.2015')
 ('01.06.2015', '09:24:00', 0.0, 0, -9.999, 0, 29, 0.0, 66, '(SPECTRUM)')]
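
One possible take on the exercise, offered as a sketch under assumptions rather than a solution checked against the full file: keep the first nine fields, then turn the whole (SPECTRUM)...(/SPECTRUM) block into a single tenth field by replacing its inner ; with commas. The helper name normalize_line, the comma as replacement character, and the choice to discard the repeated nine fields of the problem line are all assumptions. Separately, the val8 values in the output above (7, 10, 16, 62, 66) are what 18951, 18954, 18960, 19006 and 19010 become when wrapped into uint8, so the sketch lets genfromtxt infer the column types with dtype=None instead.

```python
import numpy as np

MARKER = "(SPECTRUM)"

def normalize_line(line):
    """Return the line as ten ';'-separated fields (hypothetical helper)."""
    idx = line.find(MARKER)
    if idx == -1:
        return line.strip()            # no spectrum block: leave the line alone
    head = line[:idx].split(";")[:9]   # first nine fields; a repeated copy is dropped
    spectrum = line[idx:].strip().replace(";", ",")  # one field with no ';' inside
    return ";".join(head + [spectrum])

# Shortened stand-ins for rows of test.txt (spectrum block abbreviated).
raw = [
    "01.06.2015;00:00:00;0.000;0;-9.999;0;8;0.00;18951;(SPECTRUM)ZERO(/SPECTRUM)",
    "01.06.2015;09:23:00;0.327;61;25.831;39;29;0.18;19006;"
    "01.06.2015;09:23:00;0.327;61;25.831;39;29;0.18;19006;(SPECTRUM);;;1;1;;;(/SPECTRUM)",
]
col_names = ["date", "time", "rain_intensity", "weather_code_1", "radar_ref",
             "weather_code_2", "val6", "rain_accum", "val8", "val9"]

clean = [normalize_line(r) for r in raw]
# dtype=None asks genfromtxt to infer each column's type from the data.
mydata = np.genfromtxt(clean, delimiter=";", names=col_names,
                       dtype=None, encoding="utf-8")

print(mydata["val8"])
print(mydata["val9"][1])

# uint8 holds only 0..255; larger integers wrap around when cast.
print(np.array([18951, 19006]).astype(np.uint8))
```

For a real file, the same helper could be used inside a generator like filegen above, yielding normalize_line(line) for each line.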

Upvotes: 2
