rs_

Reputation: 453

Fast parsing of variable-length lines with a regular structure in Python

I want to parse a txt file into a pandas DataFrame. The file has info in the following format,

Note that the number of items N is different for each line.

# some useless header lines
< ID1 , rest of the line I am not interested in>
< item1.val1 item1.val2 item1.val3 item2.val1 item2.val2 item2.val3 .... itemN.val1 itemN.val2 itemN.val3>
< ID2 , rest of the line I am not interested in>
< item1.val1 item1.val2 item1.val3 item2.val1 item2.val2 item2.val3 .... itemN.val1 itemN.val2 itemN.val3>
...

At present I am using this approach,

    import pandas as pd

    def process_line_info(txt, id) -> pd.DataFrame:
        entries = txt.split()
        if len(entries) % 3 != 0:
            print('Error with line, not a triple, quitting')
            exit(-1)

        num_items = len(entries) // 3  # already checked to be a multiple of 3
        df = pd.DataFrame(columns=['val1', 'val2', 'val3', 'id'])
        for i in range(num_items):
            val1 = entries[3 * i]
            val2 = entries[3 * i + 1]
            val3 = entries[3 * i + 2]
            if val3 != '-1':  # entries are strings, so compare with the string '-1'
                row = pd.Series(
                    {
                        'val1': val1,
                        'val2': val2,
                        'val3': val3,
                        'id': id,
                    }
                )
                df = df.append(row, ignore_index=True)  # row-by-row append; removed in pandas 2.0
        return df

    file = open(file_path)
    line = file.readline()
    list_dfs = []
    while line:
        if line[0] == '#':  # skip the header lines
            line = file.readline()
            continue
        id = line.split()[0]  # the ID is the first token of the ID line
        line = file.readline()  # the data line that follows it
        list_dfs.append(process_line_info(line, id))
        line = file.readline()
    file.close()
    all_df = pd.concat(list_dfs)

The dataframe will look something like this,

ID val1 val2 val3
0  234  345  112
0  111  333  232
1  ..
1  ..
1  ..
1  ..
2  ..
2  ..
3  .. 
3  ..
3  ..
3  ..
3  ..
.....
.....
N  ..

Is there a way to do this more efficiently? This is painfully slow.
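
Most of the cost here is likely the repeated df.append, which copies the whole accumulated frame on every call. A minimal sketch of the same parsing with a cheaper accumulation, collecting plain dicts and building the frame once at the end (the all_rows list is hypothetical glue for the read loop):

    import pandas as pd

    def process_line_info(txt, id) -> list:
        # Same triple parsing as above, but return plain dicts.
        entries = txt.split()
        rows = []
        for i in range(len(entries) // 3):
            val1, val2, val3 = entries[3 * i: 3 * i + 3]
            if val3 != '-1':
                rows.append({'val1': val1, 'val2': val2, 'val3': val3, 'id': id})
        return rows

    # In the read loop, extend one shared list per data line:
    #     all_rows.extend(process_line_info(line, id))
    # then build the DataFrame once at the end:
    #     all_df = pd.DataFrame(all_rows)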

Here's an example,

# some random txt
# some more random txt
1 0.851773 0.0165051 0.503764 -0.142941 -0.737434 1.02973 3.74354 
2362.39 248.498 58396 1784.7 268.254 59027 1784.7 268.254 -1
2 0.851773 0.0165051 0.503764 -0.142941 -0.737434 1.02973 3.74354 
1190.83 663.957 23056 1258.77 640.354 59070
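
With the -1 triple dropped, as in the code above, this example should parse to something like,

ID val1    val2    val3
1  2362.39 248.498 58396
1  1784.7  268.254 59027
2  1190.83 663.957 23056
2  1258.77 640.354 59070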

Upvotes: 0

Views: 84

Answers (2)

Arty

Reputation: 16747

I've used numpy in my solution, as I'm more familiar with it. I expect my code to be faster; try it on your big data.

import io, numpy as np, pandas as pd

with open('input.txt', 'r', encoding = 'utf-8') as f:
    lines = f.read().splitlines()

# Drop blank lines and '#' comments so that ID lines and data lines
# alternate strictly; the assert checks that assumption.
lines = [line.strip() for line in lines]
lines = [line for line in lines if line and not line.startswith('#')]
assert len(lines) % 2 == 0, len(lines)
# Pair each ID line with the data line after it, parsing the whole
# data line in one np.loadtxt call instead of value by value.
lines = [
    (
        int(id_line.partition(' ')[0]),
        np.loadtxt(io.StringIO(data_line), dtype = np.float64),
    )
    for id_line, data_line in zip(lines[0::2], lines[1::2])
]
# Repeat each ID once per triple, reshape each flat data line into an
# (N, 3) block, and build the whole DataFrame in a single call.
ids = np.concatenate([np.full((v.size // 3,), k, dtype = np.int64) for k, v in lines])
nums = np.concatenate([v.reshape((v.size // 3, 3)) for k, v in lines])
df = pd.DataFrame(index = ids, data = nums, columns = ['val1', 'val2', 'val3'])

print(df)
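
One difference from the question's code: the -1 triples are kept here. If you need the same filtering, it can be applied to the finished frame in a single vectorized step, e.g.

df = df[df['val3'] != -1]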

Upvotes: 1

Abdennour Benhamida

Reputation: 1

Well, if you just need to go from a txt file to a DataFrame, you can use pandas.read_csv("path") and you're all set.
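
As written, though, read_csv will not handle this file on its own: the lines have different numbers of fields, and ID lines alternate with data lines. A rough sketch of what it takes just to load the file (the width 20 is an assumption and must be at least the field count of the longest line):

import pandas as pd

MAX_COLS = 20  # assumed upper bound on fields per line
raw = pd.read_csv('input.txt', comment='#', header=None,
                  sep=r'\s+', names=range(MAX_COLS))
# Short rows are padded with NaN; ID rows and data rows still
# alternate and would need to be separated and reshaped into
# triples, as in the question's code.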

Upvotes: 0
