Reputation: 14886
I am working on a project which involves big data stored in .txt files. My program is running a little slowly, and I think one reason is that it parses the file in an inefficient manner.
FILE SAMPLE:
X | Y | Weight
--------------
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
1 6 1
1 7 1
1 8 1
1 9 1
1 10 1
PARSER CODE:
def _parse(pathToFile):
    with open(pathToFile) as f:
        myList = []
        for line in f:
            s = line.split()
            x, y, w = [int(v) for v in s]
            obj = CoresetPoint(x, y, w)
            myList.append(obj)
    return myList
This function is invoked NumberOfRows/N
times, as I only parse a small chunk of data at a time and process it until no lines are left; the driver loop looks roughly like the sketch below. My .txt
file is several gigabytes.
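(Simplified sketch of that driver; N and process are placeholders, and the parse is inlined here instead of calling _parse:)
from itertools import islice

N = 100000  # chunk size (placeholder)
with open(pathToFile) as f:
    while True:
        lines = list(islice(f, N))  # next N lines, fewer at end of file
        if not lines:
            break
        points = [CoresetPoint(*map(int, line.split())) for line in lines]
        process(points)  # placeholder for the per-chunk processing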
I can see that I iterate NumberOfLines
times in the loop, which is a huge bottleneck. That leads me to my question:
Question:
What is the right approach to parsing a file, what would be the most efficient way to do so, and would organizing the data differently in the .txt
file speed up the parser? If so, how should I organize the data
inside the file?
Upvotes: 0
Views: 654
Reputation: 747
In Python there is a library for this called Pandas. Import the data with Pandas in the following way:
import pandas as pd

# The sample file is whitespace-delimited with two header lines to skip.
df = pd.read_csv('<pathToFile>.txt', sep=r'\s+', skiprows=2, names=['X', 'Y', 'Weight'])
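If you still need the list of CoresetPoint objects from your parser, a minimal sketch (CoresetPoint is the class from the question, assumed to be importable):
# read_csv does the text parsing in C; only the object construction stays
# at Python level, iterating rows as lightweight named tuples.
myList = [CoresetPoint(row.X, row.Y, row.Weight)
          for row in df.itertuples(index=False)]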
In case the file is too big to be loaded into memory all at once, you can loop through the data in chunks and load them one at a time. Here is a pretty good blog post that can help you do that; Pandas also supports this directly through the chunksize parameter of read_csv, as sketched below.
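A minimal sketch of that chunked approach, assuming the same file layout as above (the chunk size and process function are placeholders):
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames, so only one
# chunk is held in memory at a time.
for chunk in pd.read_csv('<pathToFile>.txt', sep=r'\s+', skiprows=2,
                         names=['X', 'Y', 'Weight'], chunksize=100000):
    process(chunk)  # placeholder for your per-chunk processing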
Upvotes: 1