Reputation: 21
I have a data file, some lines of which look like this. The data is space-separated, but the number of spaces between columns is not always the same...
AAA B C D E F G H I J
AAA B C D E F G H I J
AAA B C D E F G H I J
I used
AAA, B, C, D, E, F, G, H, I, J = line.split()
to read data.
Recently I obtained new data that is sometimes missing columns D and/or I and/or J.
The lines look similar to:
AAA B C D E F G H I J
AAA B C E F G H J
AAA B C E F G H
All the data important to me are in columns B, E, F and G. I can't use line.split() because the fields shift to the left whenever a column is missing, so the values end up in the wrong variables. Is it possible to rewrite the script to read all cases of input data? Any suggestions?
Upvotes: 1
Views: 111
Reputation: 21
Thanks to your answers I've found a solution to my problem. Because the data is formatted in rigid fixed-width columns (e.g. %8.3f), I think the following code is the only way I can read the variable input data. I don't know if it is the best solution.
data = """AAA B C D E F G H I J
AAA B C E F G H I J
AAA B C E F G H """
# Slice every field at fixed character offsets; the offsets must match the real column widths.
for line in data.splitlines():
    aaa = line[0:2].strip()
    b = line[4:6].strip()
    c = line[7:10].strip()
    d = line[11:14].strip()
    e = line[15:16].strip()
    f = line[17:20].strip()
    g = line[21:26].strip()
    h = line[27:32].strip()
    i = line[37:38].strip()
    j = line[39:40].strip()
    print(b, e, f, g)
output:
B E F G
B E F G
B E F G
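The same fixed-offset idea can be written with the column layout kept in one table, so the offsets are easy to adjust. This is only a minimal sketch: the names `COLUMNS` and `parse_fixed_width` and the offsets themselves are assumptions made up for a toy layout, not the layout of the real file.

```python
# Hypothetical layout: (name, start, end) character offsets for each column.
# These offsets fit the toy sample below and must be adapted to the real file.
COLUMNS = [
    ("aaa", 0, 3),
    ("b", 4, 5),
    ("c", 6, 7),
    ("d", 8, 9),
    ("e", 10, 11),
    ("f", 12, 13),
    ("g", 14, 15),
    ("h", 16, 17),
    ("i", 18, 19),
    ("j", 20, 21),
]


def parse_fixed_width(line, columns=COLUMNS):
    """Slice one line at fixed offsets; blank or absent fields become None."""
    record = {}
    for name, start, end in columns:
        field = line[start:end].strip()  # slicing past the end yields ''
        record[name] = field or None
    return record


sample = (
    "AAA B C D E F G H I J\n"
    "AAA B C   E F G H   J\n"
    "AAA B C   E F G H\n"
)

rows = [parse_fixed_width(line) for line in sample.splitlines()]
for row in rows:
    print(row["b"], row["e"], row["f"], row["g"])
```

Because slicing a string past its end returns an empty string, short lines (missing I and J) need no special handling.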
Upvotes: 0
Reputation: 8510
If the amount of space between the data is fixed, and missing data is just a space, you can do this:
>>> s="AAA  B  C     E  F  G  H     J "
>>> s.split("  ")
['AAA', 'B', 'C', '', ' E', 'F', 'G', 'H', '', ' J ']
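The split leaves stray spaces around some values, so a small follow-up step is to strip every field and then pick the wanted columns by position. A rough sketch, assuming a separator of exactly two spaces and a missing value written as a single space; `SEPARATOR` and `split_fixed_separator` are names invented for this illustration.

```python
SEPARATOR = "  "  # assumption: exactly two spaces between columns


def split_fixed_separator(line, sep=SEPARATOR):
    """Split on the fixed separator and strip leftover spaces from each field."""
    return [field.strip() for field in line.rstrip("\n").split(sep)]


# Build a sample line where the missing D and I are written as a single space.
sample = SEPARATOR.join(["AAA", "B", "C", " ", "E", "F", "G", "H", " ", "J"])

fields = split_fixed_separator(sample)
b, e, f, g = fields[1], fields[4], fields[5], fields[6]
print(fields)
print(b, e, f, g)
```

Missing values come back as empty strings at their usual positions, so B, E, F and G keep the same indices on every line.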
EDIT
Assuming that the space between two consecutive values is constant throughout the file, I came up with the following.
Take this file as an example: missing.txt
AAA B C D E F G H I J
AAA B C D E F G H I J
AAA B C E F G H J
AAA B C E F G H
100 2 3 4 5 6 7 8 9 10
100 2 3 5 6 7 8 9 10
100 2 3 5 6 7 8 10
100 2 3 5 6 7 8
100.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1
100.1 2.1 3.1 5.1 6.1 7.1 8.1 9.1 10.1
100.1 2.1 3.1 5.1 6.1 7.1 8.1 10.1
100.1 2.1 3.1 5.1 6.1 7.1 8.1
hello this is a example of a normal file right?
hello this is example of a normal file right?
hello this is example of a normal right?
hello this is example of a normal
And with this function:
def read_data_line(path_file, data_size=10, line_format=None, temp_char="@", ignore=True):
    """Generator to read data_size values per line from a file that may have some missing

    path_file:   path to the file
    line_format: list with the space between 2 consecutive values
    temp_char:   character that this function will use as placeholder for
                 the missing data during processing
    data_size:   number of values expected per line of the file
    ignore:      in case 'line_format' is not given, ignore all
                 lines that don't have the correct format; otherwise
                 the first line is expected to have the correct
                 format, to be used as a model for the rest of the file

    Expected format of the content of the file:
    A B C D E F G H I J
    with A,B,...,J strings without spaces or 'temp_char', or numbers

    This function assumes that the space between 2 consecutive
    values is constant throughout the file

    usage
    >>> datos = list(read_data_line("/some_folder/some_file.txt"))
    or
    >>> for line in read_data_line("/some_folder/some_file.txt"):
            print(line)"""
    with open(path_file, "r") as data_raw:  # this is the usual way of managing files
        for line in data_raw:  # here you read each line of the file one by one
            datos = line.split()
            if not line_format and len(datos) == data_size:  # I have all the data, and I assume this structure is the norm
                line = line.strip()
                for d in datos:
                    line = line.replace(d, temp_char, 1)
                line_format = [len(x) for x in line.split(temp_char)[1:-1]]
            if len(datos) < data_size:  # missing data
                if line_format:
                    for t in line_format:
                        line = line.replace(" " * t, temp_char, 1)
                    datos = list(map(str.strip, line.split(temp_char)))
                else:
                    if ignore:
                        continue
                    raise RuntimeError("Impossible to determine the structure of the file")
            yield datos
output
>>> for x in read_data_line("missing.txt"):
...     print(x)
['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
['AAA', 'B', 'C', '', 'E', 'F', 'G', 'H', '', 'J']
['AAA', 'B', 'C', '', 'E', 'F', 'G', 'H']
['']
['100', '2', '3', '4', '5', '6', '7', '8', '9', '10']
['100', '2', '3', '', '5', '6', '7', '8', '9', '10']
['100', '2', '3', '', '5', '6', '7', '8', '', '10']
['100', '2', '3', '', '5', '6', '7', '8', '', '']
['']
['100.1', '2.1', '3.1', '4.1', '5.1', '6.1', '7.1', '8.1', '9.1', '10.1']
['100.1', '2.1', '3.1', '', '5.1', '6.1', '7.1', '8.1', '9.1', '10.1']
['100.1', '2.1', '3.1', '', '5.1', '6.1', '7.1', '8.1', '', '10.1']
['100.1', '2.1', '3.1', '', '5.1', '6.1', '7.1', '8.1', '', '']
['']
['hello', 'this', 'is', 'a', 'example', 'of', 'a', 'normal', 'file', 'right?']
['hello', 'this', 'is', '', 'example', 'of', 'a', 'normal', 'file', 'right?']
['hello', 'this', 'is', '', 'example', 'of', 'a', 'normal', '', 'right?']
['hello', 'this', 'is', '', 'example', 'of', 'a', 'normal', '', '']
>>>
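Once every line is a list like the ones printed above, pulling out only columns B, E, F and G is a matter of indexing. The sketch below assumes the values are numeric and that those columns sit at positions 1, 4, 5 and 6; `to_number` and `select_columns` are hypothetical helper names, and the rows are copied from the output above instead of being read from a file.

```python
def to_number(text):
    """Return float(text), or None for an empty (missing) field."""
    return float(text) if text else None


def select_columns(row, indices=(1, 4, 5, 6)):
    """Pick B, E, F, G by position; return None if the row is too short."""
    if len(row) <= max(indices):
        return None
    return [to_number(row[i]) for i in indices]


# Rows as produced by the parsing step, including a blank line ([''])
rows = [
    ['100.1', '2.1', '3.1', '', '5.1', '6.1', '7.1', '8.1', '', '10.1'],
    ['100.1', '2.1', '3.1', '', '5.1', '6.1', '7.1', '8.1', '', ''],
    [''],
]

selected = [select_columns(row) for row in rows]
for item in selected:
    print(item)
```

Rows that are too short to contain all four columns, such as the blank separator lines, come back as None and can be filtered out afterwards.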
Hope that solves your problem.
Upvotes: 1
Reputation: 127
If the number of spaces between your values is consistent (two in this example) and a missing value simply leaves its slot empty, you can still do something very similar:
a, _, b, _, c, _, d, _, e = "A  B  C    E".split(' ')
Here you put a _ for each empty string that splitting on a single space produces between two values. Alternatively, split on the full separator between values and do what you had done before (this example is for three spaces between each datum):
AAA, B, C, D, E, F, G, H, I, J = line.split('   ')
A missing value will be filled with '', which is the result of there being two separators '   ' side by side. Note that the unpacking only works when every line still yields the same number of fields.
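Tuple unpacking raises ValueError as soon as a line is short, for example when the trailing I and J are absent. One possible workaround is to pad the list to the expected length first. This is only a sketch, assuming ten columns and a three-space separator; `N_COLUMNS`, `SEPARATOR` and `split_and_pad` are illustrative names.

```python
N_COLUMNS = 10
SEPARATOR = "   "  # assumption: exactly three spaces between columns


def split_and_pad(line, sep=SEPARATOR, n_columns=N_COLUMNS):
    """Split on the separator and pad with '' so unpacking always succeeds."""
    fields = line.rstrip("\n").split(sep)
    fields += [""] * (n_columns - len(fields))  # no-op when already long enough
    return fields[:n_columns]


# Sample line with D missing (empty slot) and the trailing I and J absent.
line = SEPARATOR.join(["AAA", "B", "C", "", "E", "F", "G", "H"])

AAA, B, C, D, E, F, G, H, I, J = split_and_pad(line)
print(B, E, F, G)
```

After padding, every missing value is an empty string, so a check such as `if not D:` is enough to detect it.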
Upvotes: 0
Reputation: 14997
You can use pandas' or NumPy's CSV-reading capabilities, provided missing values are marked in the file with a placeholder such as '-':
import numpy as np
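data = np.genfromtxt(
    'data.txt',
    dtype=None,          # infer a type per column (the AAA column is text)
    encoding='utf-8',    # read text columns as str rather than bytes
    missing_values='-',  # a single string applies to all columns
    names=['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
)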
print(data['AAA'])
Or pandas:
import pandas as pd
data = pd.read_csv(
    'data.txt',
    sep=r'\s+',  # split on runs of whitespace
    na_values='-',
    names=['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
)
print(data['AAA'])
Upvotes: 1