Reputation: 2441
I am using Python to parse incoming comma-separated strings and then do some calculations on the data. Each string is about 800 characters long with 120 comma-separated fields, and there are about 1.2 million such strings to process.
for v in item.values():
    l.extend(get_fields(v.split(',')))
# process l
get_fields uses operator.itemgetter() to extract around 20 fields out of 120.
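For illustration, get_fields is roughly of this form (the actual ~20 indices are not shown in the question; the ones below are just placeholders):

from operator import itemgetter

# placeholder indices; the real code picks about 20 of the 120 fields
_pick = itemgetter(0, 3, 7, 15, 42)

def get_fields(parts):
    # return the selected fields of one split record as a list
    return list(_pick(parts))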
This entire operation takes about 4-5 minutes, excluding the time to bring in the data. Later in the program I insert these rows into an SQLite in-memory table for further use, but 4-5 minutes just for parsing and building a list is too slow for my project.
I run this processing in around 6-8 threads.
Would switching to C/C++ help?
Upvotes: 4
Views: 6800
Reputation: 63709
Are you loading a dict with your file records? Probably better to process the data directly:
datafile = open("file_with_1point2million_records.dat")
# uncomment the next line to skip over a header record
# next(datafile)
l = sum((get_fields(v.split(',')) for v in datafile), [])
This avoids creating any intermediate data structure and only accumulates the desired values as returned by get_fields.
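If sum() with a list start value turns out to be slow at this scale (it copies the growing result list on every addition), one possible variation is to flatten with itertools.chain instead; datafile and get_fields are the same as above:

from itertools import chain

# same flattening as the sum() version, without repeatedly copying the result list
l = list(chain.from_iterable(get_fields(v.split(',')) for v in datafile))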
Upvotes: 3
Reputation: 879113
Your program might be slowing down while trying to allocate enough memory for 1.2M strings. In other words, the speed problem might not be in the string parsing/manipulation but in the l.extend. To test this hypothesis, you could put a print statement in the loop:
for v in item.values():
    print('got here')
    l.extend(get_fields(v.split(',')))
If the print statements get slower and slower, you can probably conclude that l.extend is the culprit. In this case, you may see a significant speed improvement if you can move the processing of each line into the loop.
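A rough sketch of that restructuring, where process_fields stands in for whatever work (for example, the SQLite inserts) is currently done after the loop:

for v in item.values():
    fields = get_fields(v.split(','))
    # handle each record right away instead of accumulating one giant list
    process_fields(fields)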
PS: You should probably be using the csv module to take care of the parsing in a more high-level manner, but I don't think that will affect the speed very much.
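For example, assuming the records are read from a file (the filename and loop body here are placeholders):

import csv

with open('records.dat') as f:
    for row in csv.reader(f):      # row is already a list of the 120 fields
        fields = get_fields(row)
        # ... process fields ...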
Upvotes: 2