maurobio

Reputation: 1577

Reading a text file with numeric columns of different lengths

I have a text file, generated by a FORTRAN program, in the following rather strange (and surely annoying) format:

3.4502    1.5959    0.2160    0.9423    0.1098    1.2463   -2.8673    0.8803
3.5724    1.8022    0.3423    1.0801    2.4177   -0.2012   -0.1142   -0.2061
2.6028    2.6395    0.2959    0.8280    2.0526   -0.0721   -1.1345    0.0110
2.5628    0.0000    0.0539    0.0000   -0.4520    1.3030   -3.0792    1.0428
1.1823    1.4084    0.2315    1.1359    1.5945    3.2098    1.6739    0.0713
0.0296    1.3689    0.0000    1.0425   -0.4525    1.3043   -2.9785    1.0428
2.4825    1.6460    0.2573    2.4801    3.4533    1.5960    0.3609    0.9574
2.2358    0.8858    0.1344    0.5376    3.1102   -0.8025    0.1282   -0.8398
0.0000    1.4078    1.5464    1.0526    3.9754    3.7823    0.3376    0.1303
                                        3.3068    2.5148    0.2390   -0.3816
                                       -0.4672    1.3604    2.0157    1.0405
                                        4.4009    2.9969    0.8777    3.6270
                                        3.0271    4.1610    0.2094    3.0105
                                       -0.4889    1.3888    3.1442    1.0423
                                        6.0767    1.7731    0.6439    2.3744
                                        5.9313    1.3423    0.2204    1.0397
                                        4.4335    2.9075   -0.0328   -0.4526
                                        4.8670    2.6906    0.1088    0.0275
                                        2.5303    3.3157   -0.2649    0.9895
                                        4.3957    3.4142    0.3900    0.4282
                                        3.3185    1.4058    0.2024    3.3997
                                        0.9097    1.3423    0.2388    1.1809
                                        1.3302    1.6167    0.2009    1.0491
                                        2.4382   -0.1739    0.4722    3.5331
                                        1.8617    1.4082    0.2140    0.6741

I want to read the first four and the last four columns separately, storing them in NumPy arrays. Using numpy.genfromtxt, I easily got the data from the first four columns:

object_scores = numpy.genfromtxt("results.out", usecols=(0,1,2,3), max_rows=9)

But when I attempted to do the same for the other four columns:

descriptor_scores = numpy.genfromtxt("results.out", usecols=(4,5,6,7), max_rows=25)

I got a long list of error messages that seem to be related to the missing cells in the first four columns:

 ValueError: Some errors were detected !
     Line #10 (got 4 columns instead of 1)
     Line #11 (got 4 columns instead of 1)
     Line #12 (got 4 columns instead of 1)
     Line #13 (got 4 columns instead of 1)
     Line #14 (got 4 columns instead of 1)
     Line #15 (got 4 columns instead of 1)
     Line #16 (got 4 columns instead of 1)
     Line #17 (got 4 columns instead of 1)
     Line #18 (got 4 columns instead of 1)
     Line #19 (got 4 columns instead of 1)
     Line #20 (got 4 columns instead of 1)
     Line #21 (got 4 columns instead of 1)
     Line #22 (got 4 columns instead of 1)
     Line #23 (got 4 columns instead of 1)
     Line #24 (got 4 columns instead of 1)
     Line #25 (got 4 columns instead of 1)

Any hints or suggestions on how to solve this problem?

Upvotes: 1

Views: 1070

Answers (4)

Grismar

Reputation: 31319

Although it's certainly different from delimited formats like .csv (and thus may be annoying to some), Fortran and similar languages often use fixed width formats like this example. This is because they perform very well for larger files and they often directly match how the data is represented in memory, which makes it easier to code for in those languages.

I'm not sure your example contains the full data (StackOverflow may be getting rid of some whitespace for you). But I expect that, when you read the file directly, each column will be exactly 10 characters wide and you could read it like this:

def convert(s):
    try:
        return float(s)
    except ValueError:
        return None


data = []
size = 10
with open('input.data', 'r') as f:
    for line in f:
        line = line.rstrip('\n')  # drop the EOL before slicing
        # cut the line into fixed-width fields of `size` characters
        data.append([convert(line[i:i + size]) for i in range(0, len(line), size)])

Others have noticed that the width of the columns appears to vary, but I think this is just an artifact of you copying the data into your question - it seems highly likely that the fields are actually all the same width in the source data file.
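To get from that list of rows to the two arrays the question asks for, something like the following might work. It is only a sketch under the same assumption as above (every field exactly 10 characters wide); the sample lines and the make_line helper are invented here for illustration and do not come from the real file.

```python
import numpy as np

def convert(s):
    # blank or malformed fields become None
    try:
        return float(s)
    except ValueError:
        return None

def make_line(values, blanks=0):
    # hypothetical helper: `blanks` empty fields, then each value
    # right-aligned in a 10-character field
    return " " * (10 * blanks) + "".join("%10.4f" % v for v in values)

# made-up sample: one full row and one row with only the last four fields
lines = [
    make_line([3.4502, 1.5959, 0.2160, 0.9423, 0.1098, 1.2463, -2.8673, 0.8803]),
    make_line([3.3068, 2.5148, 0.2390, -0.3816], blanks=4),
]

size = 10
data = [[convert(line[i:i + size]) for i in range(0, len(line), size)]
        for line in lines]

# rows whose first field is blank carry only descriptor values
object_scores = np.array([row[:4] for row in data if row[0] is not None])
descriptor_scores = np.array([row[4:] for row in data])
```

Testing with "is not None" rather than truthiness matters here, because a legitimate 0.0000 in the first column would otherwise be treated as missing.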

Upvotes: 0

hpaulj

Reputation: 231335

With the data copied and pasted into a file:

In [85]: data = np.genfromtxt('stack54544789.py', delimiter=[10]*8)
In [86]: data
Out[86]: 
array([[3.4502, 1.5959, 0.216 , 0.9423, 0.1098,    nan, 2.8673, 0.8803],
       [3.5724, 1.8022, 0.3423, 1.0801,    nan,    nan,    nan, 0.2061],
       [2.6028, 2.6395, 0.2959, 0.828 ,    nan,    nan, 1.1345, 0.011 ],
       [2.5628, 0.    , 0.0539,    nan, 0.452 ,    nan, 3.0792, 1.0428],
       [1.1823, 1.4084, 0.2315, 1.1359, 1.5945, 3.2098, 1.6739, 0.0713],
       ...
       [   nan,    nan,    nan,    nan, 1.3302, 1.6167, 0.2009, 1.0491],
       [   nan,    nan,    nan,    nan,    nan, 0.1739, 0.4722, 3.5331],
       [   nan,    nan,    nan,    nan, 1.8617, 1.4082, 0.214 , 0.6741],
       [   nan,    nan,    nan,    nan,    nan,    nan,    nan,    nan]])

That almost looks right; I think the extra nan come from minus signs that land at the end of the preceding field. Making the first field one character narrower realigns them:

In [87]: data = np.genfromtxt('stack54544789.py', delimiter=[9]+[10]*7)
In [88]: data
Out[88]: 
array([[ 3.4502,  1.5959,  0.216 ,  0.9423,  0.1098,  1.2463, -2.8673,
         0.8803],
       [ 3.5724,  1.8022,  0.3423,  1.0801,  2.4177, -0.2012, -0.1142,
        -0.2061],
       [ 2.6028,  2.6395,  0.2959,  0.828 ,  2.0526, -0.0721, -1.1345,
         0.011 ],
       [ 2.5628,  0.    ,  0.0539,  0.    , -0.452 ,  1.303 , -3.0792,
         1.0428],
       ...
       [    nan,     nan,     nan,     nan,  2.4382, -0.1739,  0.4722,
         3.5331],
       [    nan,     nan,     nan,     nan,  1.8617,  1.4082,  0.214 ,
         0.6741],
       [    nan,     nan,     nan,     nan,     nan,     nan,     nan,
            nan]])
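From there, the two blocks the question asks for can be separated by slicing the columns and dropping the rows that are entirely nan. A minimal sketch, assuming the field widths found above; the sample lines and the make_line helper are made up for illustration:

```python
import numpy as np

def make_line(values, blanks=0):
    # hypothetical helper mimicking the pasted layout: values right-aligned
    # in 10-character fields, with the first 4 padding characters cut off
    text = " " * (10 * blanks) + "".join("%10.4f" % v for v in values)
    return text[4:]

# made-up sample: one full row and two rows with only the last four fields
lines = [
    make_line([3.4502, 1.5959, 0.2160, 0.9423, 0.1098, 1.2463, -2.8673, 0.8803]),
    make_line([3.3068, 2.5148, 0.2390, -0.3816], blanks=4),
    make_line([-0.4672, 1.3604, 2.0157, 1.0405], blanks=4),
]

# a list of integers as delimiter is read as fixed field widths
data = np.genfromtxt(lines, delimiter=[9] + [10] * 7)

objects = data[:, :4]
descriptors = data[:, 4:]

# keep only the rows that actually hold values in each block
object_scores = objects[~np.isnan(objects).all(axis=1)]
descriptor_scores = descriptors[~np.isnan(descriptors).all(axis=1)]
```

Filtering on all-nan rows also discards a trailing blank line like the last row in the output above.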

Upvotes: 0

gmds

Reputation: 19885

If the file's format is always the same, this will do:

import numpy as np

def squash(obj):
    return [[float(element) for element in column if element.strip() != ''] for column in obj]

with open('results.out') as f:
    data = f.read()

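# splitlines() avoids the trailing empty string that split('\n') leaves behind,
# which would make zip() below return no columns at all
lines = [line for line in data.splitlines() if line.strip()]

number_width = 6
number_spacing = 4

# each slice also takes the spacing before the number, so a leading minus sign is kept
result = squash(zip(*[[line[max(i - number_spacing, 0):i + number_width]
                       for i in range(0, len(line), number_width + number_spacing)]
                      for line in lines]))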

first_four_cols = np.array(result[0:4]).T
last_four_cols = np.array(result[4:]).T

Upvotes: 0

rene-d

Reputation: 343

Unfortunately, the columns don't seem to have the same width (10 for the first four fields, then 11). If that is the case, the delimiter= option of numpy.genfromtxt, which also accepts a list of field widths, could help you.

Here is an alternate solution to read the 4 fields starting at column 37:

descriptor_scores = numpy.genfromtxt([s[37:] for s in open("results.out")], usecols=(0,1,2,3), max_rows=25)
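The same slicing idea can cover both blocks at once. A rough sketch, assuming the layout pasted in the question (the descriptor block begins after character 37); the sample lines and the make_line helper are invented for illustration:

```python
import numpy as np

def make_line(values, blanks=0):
    # hypothetical helper mimicking the pasted layout: values right-aligned
    # in 10-character fields, with the first 4 padding characters cut off
    text = " " * (10 * blanks) + "".join("%10.4f" % v for v in values)
    return text[4:]

# made-up sample: two full rows and two rows with only the last four fields
lines = [
    make_line([3.4502, 1.5959, 0.2160, 0.9423, 0.1098, 1.2463, -2.8673, 0.8803]),
    make_line([2.5628, 0.0000, 0.0539, 0.0000, -0.4520, 1.3030, -3.0792, 1.0428]),
    make_line([3.3068, 2.5148, 0.2390, -0.3816], blanks=4),
    make_line([-0.4672, 1.3604, 2.0157, 1.0405], blanks=4),
]

split_at = 37

# left part: skip the rows where the object-score fields are blank
left = [s[:split_at] for s in lines if s[:split_at].strip()]
# right part: every row has descriptor scores
right = [s[split_at:] for s in lines]

# each part now splits cleanly on whitespace
object_scores = np.genfromtxt(left)
descriptor_scores = np.genfromtxt(right)
```

Because each half is whitespace-delimited once the other half is cut away, no usecols or max_rows bookkeeping is needed in this variant.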

Upvotes: 1
