Reading columns with missing items in Python

I have a data file, some lines of which look like this. The data is space-separated, but the number of spaces between columns is not constant:

AAA  B      C    D E    F    G    H    I  J  
AAA  B      C    D E    F    G    H    I  J  
AAA  B      C    D E    F    G    H    I  J  

I used

AAA,B,C,D,E,F,G,H,I,J = line.split()  

to read data.

Recently I obtained new data in which columns D and/or I and/or J are sometimes missing.
The lines look similar to:

AAA  B    C    D E    F    G    H    I  J  
AAA  B    C      E    F    G    H       J  
AAA  B    C      E    F    G    H            

All the data important to me are in columns B, E, F and G. I can't use line.split() because the number of items per line changes, so they no longer line up with the variables on the left. Is it possible to rewrite the script so that it reads all these cases of input data? Any suggestions?

Upvotes: 1

Views: 111

Answers (4)

Thanks to your answers I've found a solution to my problem. Because the data is formatted in rigid, fixed-width columns (e.g. %8.3f), I think the following code is the only way I can read the variable input data. I don't know whether it is the best solution.

data_raw = (
    "AAA   B  C   D E   F     G     H     I J  \n"
    "AAA   B  C     E   F     G     H     I J  \n"
    "AAA   B  C     E   F     G     H        \n"
)
for line in data_raw.splitlines():
    # slice each field by its fixed character positions; a blank field becomes ''
    aaa = line[0:3].strip()
    b = line[4:7].strip()
    c = line[7:10].strip()
    d = line[11:14].strip()
    e = line[15:16].strip()
    f = line[17:20].strip()
    g = line[21:26].strip()
    h = line[27:32].strip()
    i = line[37:38].strip()
    j = line[39:40].strip()
    print(b, e, f, g)

output:

B E F G  
B E F G
B E F G
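
The same slicing can also be driven by a table of column widths, so the offsets live in one place. This is only a minimal sketch: the widths in `FIELDS` are an assumption chosen to fit the sample lines above (not necessarily the real file), and `FIELDS` and `parse_fixed_width` are names made up for illustration.

```python
from itertools import accumulate

# Assumed layout: (name, width) pairs. The widths fit the sample lines above
# and are an assumption about the real file.
FIELDS = [("AAA", 3), ("B", 4), ("C", 3), ("D", 4), ("E", 2),
          ("F", 4), ("G", 6), ("H", 6), ("I", 6), ("J", 2)]

def parse_fixed_width(line, fields=FIELDS):
    """Slice a line by column widths; blank or truncated fields come back as ''."""
    ends = list(accumulate(width for _, width in fields))
    starts = [0] + ends[:-1]
    return {name: line[s:e].strip()
            for (name, _), s, e in zip(fields, starts, ends)}

# Build a sample line with D, I and J missing, using the same widths
# (values are right-aligned inside their columns).
values = ["AAA", "B", "C", "", "E", "F", "G", "H", "", ""]
sample = "".join(v.rjust(width) for v, (_, width) in zip(values, FIELDS))

row = parse_fixed_width(sample)
print(row["B"], row["E"], row["F"], row["G"])  # B E F G
```

Because slicing past the end of a string just returns '', lines whose trailing blanks were trimmed are handled as well.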

Upvotes: 0

Copperfield

Reputation: 8510

If the amount of space between the items is fixed, and a missing item is just a space, you can do this:

>>> s="AAA    B    C         E    F    G    H         J  "
>>> s.split("    ")
['AAA', 'B', 'C', '', ' E', 'F', 'G', 'H', '', ' J  ']

EDIT

Assuming that the spacing between two consecutive items is constant throughout the file, I came up with this.

Using this file as an example: missing.txt

AAA  B      C    D E    F    G    H    I  J  
AAA  B      C    D E    F    G    H    I  J  
AAA  B      C      E    F    G    H       J  
AAA  B      C      E    F    G    H 

100  2      3    4 5    6    7    8    9  10 
100  2      3      5    6    7    8    9  10 
100  2      3      5    6    7    8       10 
100  2      3      5    6    7    8        

100.1  2.1      3.1    4.1 5.1    6.1    7.1    8.1    9.1  10.1 
100.1  2.1      3.1      5.1    6.1    7.1    8.1    9.1  10.1 
100.1  2.1      3.1      5.1    6.1    7.1    8.1       10.1 
100.1  2.1      3.1      5.1    6.1    7.1    8.1         

hello  this      is    a example    of    a    normal    file  right?
hello  this      is      example    of    a    normal    file  right?
hello  this      is      example    of    a    normal       right?
hello  this      is      example    of    a    normal        

and with this function

def read_data_line(path_file, data_size=10, line_format=None, temp_char="@", ignore=True):
    """Generator that reads data_size items per line from a file that may have some missing

       path_file:   path to the file
       line_format: list with the number of spaces between each pair of
                    consecutive items
       temp_char:   character that this function uses as a placeholder for
                    the missing data during processing
       data_size:   number of items expected per line of the file
       ignore:      when 'line_format' is not given, skip incomplete lines
                    until a complete one is found; if False, the first
                    line is expected to be complete so it can be used as
                    a model for the rest of the file

       Expected format of the content of the file:
       A  B      C    D E    F    G    H    I  J

       with A,B,...,J being numbers or strings that contain neither spaces nor 'temp_char'

       This function assumes that the spacing between two consecutive
       items is constant throughout the file

       usage

       >>> datos = list(read_data_line("/some_folder/some_file.txt"))

       or

       >>> for line in read_data_line("/some_folder/some_file.txt"):
               print(line)"""
    with open(path_file,"r") as data_raw: #this is the usual way of managing files
        for line in data_raw: #here you read each line of the file one by one
            datos = line.split()
            if not line_format and len(datos)==data_size: #I have all the data, and I assume this structure is the norm
                line = line.strip()
                for d in datos:
                    line = line.replace(d,temp_char,1)
                line_format = [ len(x) for x in line.split(temp_char)[1:-1] ]
            if len(datos) < data_size: # missing data
                if line_format:
                    for t in line_format:
                        line = line.replace(" "*t,temp_char,1)
                    datos = list(map(str.strip,line.split(temp_char)))
                else:
                    if ignore:
                        continue
                    raise RuntimeError("Impossible to determine the structure of the file")
            yield datos

output

>>> for x in read_data_line("missing.txt"):
    print(x)


['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
['AAA', 'B', 'C', '', 'E', 'F', 'G', 'H', '', 'J']
['AAA', 'B', 'C', '', 'E', 'F', 'G', 'H']
['']
['100', '2', '3', '4', '5', '6', '7', '8', '9', '10']
['100', '2', '3', '', '5', '6', '7', '8', '9', '10']
['100', '2', '3', '', '5', '6', '7', '8', '', '10']
['100', '2', '3', '', '5', '6', '7', '8', '', '']
['']
['100.1', '2.1', '3.1', '4.1', '5.1', '6.1', '7.1', '8.1', '9.1', '10.1']
['100.1', '2.1', '3.1', '', '5.1', '6.1', '7.1', '8.1', '9.1', '10.1']
['100.1', '2.1', '3.1', '', '5.1', '6.1', '7.1', '8.1', '', '10.1']
['100.1', '2.1', '3.1', '', '5.1', '6.1', '7.1', '8.1', '', '']
['']
['hello', 'this', 'is', 'a', 'example', 'of', 'a', 'normal', 'file', 'right?']
['hello', 'this', 'is', '', 'example', 'of', 'a', 'normal', 'file', 'right?']
['hello', 'this', 'is', '', 'example', 'of', 'a', 'normal', '', 'right?']
['hello', 'this', 'is', '', 'example', 'of', 'a', 'normal', '', '']
>>> 

Hope that solves your problem.
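
A variant of the same idea, as a rough sketch: take the positions of the fields in one complete line as a template and slice every other line at those positions, instead of replacing runs of spaces. It assumes the columns are left-aligned and that no value runs past the start of the next column. `column_starts`, `split_by_template` and `make_line` are illustrative names, not part of the function above, and the widths only mimic the first block of missing.txt.

```python
import re

def column_starts(template_line):
    """Start offset of every field in a line known to be complete."""
    return [m.start() for m in re.finditer(r"\S+", template_line)]

def split_by_template(line, starts):
    """Slice line at the template offsets; missing fields become ''."""
    ends = starts[1:] + [None]  # the last field runs to the end of the line
    return [line[s:e].strip() for s, e in zip(starts, ends)]

def make_line(values, widths=(5, 7, 5, 2, 5, 5, 5, 5, 3, 3)):
    """Helper for this example only: lay values out in left-aligned columns."""
    return "".join(v.ljust(w) for v, w in zip(values, widths))

full = make_line(["AAA", "B", "C", "D", "E", "F", "G", "H", "I", "J"])
partial = make_line(["AAA", "B", "C", "", "E", "F", "G", "H", "", "J"])

starts = column_starts(full)
print(split_by_template(partial, starts))
# ['AAA', 'B', 'C', '', 'E', 'F', 'G', 'H', '', 'J']
```

Unlike the generator above, this always returns one entry per template column, so the fields keep their positions even when the trailing items are missing.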

Upvotes: 1

doykle

Reputation: 127

If you have a consistent number of spaces between your items and a missing item is replaced with a space (as in the example), you can still do something very similar:

a, b, c, d, _, e = "A B C   E".split(' ')

Here d comes out as '', and the space standing in for the missing item produces one extra empty string, which _ absorbs. Or, if the missing item is not replaced with anything, split on the number of spaces between items and do what you did before (this example assumes there are 3 spaces between each datum):

AAA,B,C,D,E,F,G,H,I,J = line.split('   ') 

The missing item will come out as '', which is the result of two '   ' separators sitting side by side.
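
A quick check of the second case, as a sketch that assumes exactly three spaces between items and nothing written in place of a missing item (the sample line is made up for illustration):

```python
SEP = "   "  # assumed separator: exactly three spaces between items

# Join with the separator so the spacing is exact; D is missing (empty).
line = SEP.join(["AAA", "B", "C", "", "E"])

fields = line.split(SEP)
print(fields)  # ['AAA', 'B', 'C', '', 'E']

aaa, b, c, d, e = fields
print(repr(d))  # ''
```

This only works while the gap between items really is constant. If the values are padded to fixed column widths, the gap changes with the length of each value and slicing by position is the safer route.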

Upvotes: 0

MaxNoe

Reputation: 14997

You can use NumPy's or pandas' CSV reading capabilities (the examples assume missing items are marked with '-'):

import numpy as np
data = np.genfromtxt(
    'data.txt',
    missing_values=['-', ],
    names=['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
)
print(data['AAA'])

Or pandas:

import pandas as pd
data = pd.read_csv(
    'data.txt', 
    sep=r'\s+',
    na_values='-',
    names=['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
)

print(data['AAA'])
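
If the missing items are simply blank, as in the question, rather than marked with '-', pandas' fixed-width reader `pd.read_fwf` may be a better fit than a whitespace-separated read. A minimal sketch, assuming the column widths below (chosen to mimic the layout of the sample lines in the question; adjust them to the real file). `make_line` is a helper made up just to build the example input.

```python
import io
import pandas as pd

names = ['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
widths = [5, 5, 5, 2, 5, 5, 5, 5, 3, 3]  # assumed column widths

def make_line(values):
    """Example-only helper: lay values out in left-aligned fixed-width columns."""
    return "".join(v.ljust(w) for v, w in zip(values, widths))

text = "\n".join([
    make_line(['AAA', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']),
    make_line(['AAA', 'B', 'C', '', 'E', 'F', 'G', 'H', '', 'J']),
    make_line(['AAA', 'B', 'C', '', 'E', 'F', 'G', 'H', '', '']),
])

# Blank fields are read as NaN; header=None because the data has no header row.
data = pd.read_fwf(io.StringIO(text), widths=widths, header=None, names=names)
print(data[['B', 'E', 'F', 'G']])
```

For a real file, pass its path instead of the `io.StringIO` object.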

Upvotes: 1
