WhySoSerious

Reputation: 194

Reading irregular column data into Python 3.x using pandas or numpy

Below is my piece of code.

import numpy as np

filename1=open(f)
xf = np.loadtxt(filename1, dtype=float)

Below is my data file.

0.14200E+02 0.18188E+01 0.44604E-03
0.14300E+02 0.18165E+01 0.45498E-03
0.14400E+02-0.17694E+01 0.44615E+03
0.14500E+02-0.17226E+01 0.43743E+03
0.14600E+02-0.16767E+01 0.42882E+03
0.14700E+02-0.16318E+01 0.42033E+03
0.14800E+02-0.15879E+01 0.41196E+03

As one can see, some negative values take up the space that would otherwise separate two columns. This causes numpy to raise:

ValueError: Wrong number of columns at line 3

This is just a small snippet of my code. I want to read this data using numpy or pandas. Any suggestions would be great.
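A self-contained reproduction of the problem, assuming two of the sample lines above are pasted into a string. The `sample` and `error` names and the use of `io.StringIO` are only for illustration, and the exact wording of the error message differs between NumPy versions:

```python
import io

import numpy as np

# Two lines from the data file: the second has a minus sign glued
# to the preceding value, so it splits into 2 tokens instead of 3.
sample = (
    "0.14200E+02 0.18188E+01 0.44604E-03\n"
    "0.14400E+02-0.17694E+01 0.44615E+03\n"
)

try:
    np.loadtxt(io.StringIO(sample), dtype=float)
    error = None
except ValueError as exc:
    # NumPy reports an inconsistent number of columns.
    error = exc

print(type(error).__name__, error)
```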

Edit 1:

@ZarakiKenpachi I used your suggestion of sep=' |-', but it gives me an extra 4th column with NaN values.

Edit 2:

@Serge Ballesta Nice suggestion, but all of these involve some kind of pre-processing. I want a built-in function in pandas or numpy that does this.

Edit 3:

Important note: there are also negative signs in the exponents, as in 0.4373E-03.

Thank you

Upvotes: 1

Views: 179

Answers (3)

Serge Ballesta

Reputation: 149115

np.loadtxt can read from a (byte string) generator, so you can filter the input file while loading it to add an extra space before a minus sign:

...
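import re
import numpy as np

def fix_minus(fd):
    # Match only a "-" that directly follows a digit; the lookbehind keeps the digit.
    rx = re.compile(rb'(?<=\d)-')
    for line in fd:
        yield rx.sub(b' -', line)

with open(f, 'rb') as fd:
    xf = np.loadtxt(fix_minus(fd), dtype=float)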

This does not require preloading everything into memory, so it is expected to be memory efficient.

The regex is needed to avoid changing something like 0.16545E-012.

In my tests with 10k lines, this should be at most 10% slower than loading everything into memory, but it will require far less memory.
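A minimal sketch of the generator approach that can be run without a data file. It assumes two sample lines from the question held in an `io.BytesIO` object, and it uses a lookbehind pattern so that the digit before the minus sign is kept; the names `MINUS_AFTER_DIGIT`, `spaced_lines` and `sample` are hypothetical:

```python
import io
import re

import numpy as np

# A minus sign that directly follows a digit. The "-" in "E-03"
# follows the letter "E", so it does not match.
MINUS_AFTER_DIGIT = re.compile(rb'(?<=\d)-')


def spaced_lines(fd):
    """Yield each line with a space inserted before a glued minus sign."""
    for line in fd:
        yield MINUS_AFTER_DIGIT.sub(b' -', line)


# Stand-in for the real file, opened in binary mode.
sample = (
    b"0.14200E+02 0.18188E+01 0.44604E-03\n"
    b"0.14400E+02-0.17694E+01 0.44615E+03\n"
)

xf = np.loadtxt(spaced_lines(io.BytesIO(sample)), dtype=float)
print(xf.shape)
```

Lines are rewritten one at a time as np.loadtxt consumes them, which is why the whole file never has to be held in memory as text.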

Upvotes: 2

tituszban

Reputation: 5152

You can preprocess your data to add an extra space before the - signs. While there are many ways of doing this, the best approach in my opinion (to avoid adding whitespace at the start of a line) is to use the regex function re.sub:

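import re
import numpy as np

with open(f) as file:
    raw_data = file.readlines()

# Insert a space before every "-" that directly follows a digit.
# re.sub works on one string at a time, so apply it line by line;
# the lookbehind (?<=\d) keeps the digit in place.
processed_data = [re.sub(r'(?<=\d)-', ' -', line) for line in raw_data]

xf = np.loadtxt(processed_data, dtype=float)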

This replaces every - preceded by a digit with ' -' (a space followed by the minus sign).
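To see the effect on a single line, here is a small sketch. The sample line is assembled from values in the question, and the lookbehind pattern `(?<=\d)` is an assumption chosen so that the digit before the minus sign is not consumed by the match:

```python
import re

# One line with a glued minus sign and a negative exponent.
line = "0.14400E+02-0.17694E+01 0.44604E-03"

# Only the "-" after the digit "2" matches; the "-" in "E-03" follows "E".
fixed = re.sub(r'(?<=\d)-', ' -', line)

print(fixed)          # 0.14400E+02 -0.17694E+01 0.44604E-03
print(fixed.split())  # three separate fields
```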

Upvotes: 2

Shishir Naresh

Reputation: 763

Try the code below:

import re
import numpy as np

with open('app.txt') as f:
    data = f.read()

data_mod = []
for line in data.splitlines():
    # Add a space before any "-" that directly follows a digit;
    # the "-" in an exponent such as E-03 follows "E" and is left alone.
    data_mod.append(re.sub(r'(?<=\d)-', ' -', line))

# Write the cleaned lines to a new file.
with open('mod_text.txt', 'w') as f:
    for line in data_mod:
        f.write(line + "\n")

filename1 = 'mod_text.txt'
xf = np.loadtxt(filename1, dtype=float)

You have to pre-process the data using a regex. After that, you can load the data as required.
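One possible exception, if the file is strictly fixed-width: the sample suggests this, but the question does not confirm it. In that case np.genfromtxt accepts a sequence of field widths as its delimiter argument, so no separate pre-processing step is needed. The widths 11, 12, 12 below are read off the sample lines and are an assumption about the real file; `io.StringIO` only stands in for it:

```python
import io

import numpy as np

# Sample lines copied from the question; StringIO stands in for the real file.
sample = (
    "0.14200E+02 0.18188E+01 0.44604E-03\n"
    "0.14400E+02-0.17694E+01 0.44615E+03\n"
    "0.14500E+02-0.17226E+01 0.43743E+03\n"
)

# Assumed layout: an 11-character field, then two 12-character fields,
# each of which includes its leading space or minus sign.
xf = np.genfromtxt(io.StringIO(sample), delimiter=[11, 12, 12], dtype=float)
print(xf)
```

This would break on any line whose columns are not aligned to those widths, so the regex approach is safer when the layout is uncertain.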

I hope this helps.

Upvotes: 0
