Reputation: 14886
I am working on a project which involves big data stored in .txt files. My program is running a little slowly, and I think one reason is that it parses the file in an inefficient manner.
FILE SAMPLE:
X | Y | Weight
--------------
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
1 6 1
1 7 1
1 8 1
1 9 1
1 10 1
PARSER CODE:
def _parse(pathToFile):
    with open(pathToFile) as f:
        myList = []
        for line in f:
            s = line.split()
            x, y, w = [int(v) for v in s]
            obj = CoresetPoint(x, y, w)
            myList.append(obj)
    return myList
This function is invoked NumberOfRows/N
times, as I only parse a small chunk of data at a time and process it until no lines are left; the driver loop looks roughly like the sketch below. My .txt
file is several gigabytes.
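(Simplified sketch of that driver; N and process are placeholders, and the parse is inlined here instead of calling _parse:)
from itertools import islice

N = 100000  # chunk size (placeholder)
with open(pathToFile) as f:
    while True:
        lines = list(islice(f, N))  # next N lines, fewer at end of file
        if not lines:
            break
        points = [CoresetPoint(*map(int, line.split())) for line in lines]
        process(points)  # placeholder for the per-chunk processing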
I can see that I iterate NumberOfLines
times in the loop, which is a huge bottleneck. That leads me to my question:
Question:
What is the right approach to parsing a file, what would be the most efficient way to do so, and would organizing the data differently in the .txt
file speed up the parser? If so, how should I organize the data
inside the file?
Upvotes: 0
Views: 654
Reputation: 747
In Python there is a library for this called Pandas. Import the data with Pandas in the following way:
import pandas as pd

# The sample file is whitespace-delimited with two header lines to skip.
df = pd.read_csv('<pathToFile>.txt', sep=r'\s+', skiprows=2, names=['X', 'Y', 'Weight'])
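If you still need the list of CoresetPoint objects from your parser, a minimal sketch (CoresetPoint is the class from the question, assumed to be importable):
# read_csv does the text parsing in C; only the object construction stays
# at Python level, iterating rows as lightweight named tuples.
myList = [CoresetPoint(row.X, row.Y, row.Weight)
          for row in df.itertuples(index=False)]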
In case the file is too big to be loaded into memory all at once, you can loop through the data in chunks and load them one at a time. Here is a pretty good blog post that can help you do that; Pandas also supports this directly through the chunksize parameter of read_csv, as sketched below.
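A minimal sketch of that chunked approach, assuming the same file layout as above (the chunk size and process function are placeholders):
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames, so only one
# chunk is held in memory at a time.
for chunk in pd.read_csv('<pathToFile>.txt', sep=r'\s+', skiprows=2,
                         names=['X', 'Y', 'Weight'], chunksize=100000):
    process(chunk)  # placeholder for your per-chunk processing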
Upvotes: 1