Reputation: 145
I have Python scripts to filter massive data in a CSV file. The requirement asks for scalability with respect to run time and memory usage.
I wrote two scripts, and both of them work fine for filtering the data. For scalability, I decided to use a Python generator, because it iterates over the data and doesn't hold much of it in memory.
When I compared the running times of the two scripts, I found the following:
Script 1 - uses a generator - takes more time - 0.0155925750732 s
import re
import sympy

def each_sentence(text):
    match = re.match(r'[0-9]+', text)          # leading number of the line
    num = int(text[match.start():match.end()])
    if sympy.isprime(num) == False:
        yield text.strip()                     # keep lines whose number is not prime

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        for text in each_sentence(line):
            print(text)
Script 2 - uses split() without a generator - takes less time - 0.00619888305664 s
import sympy

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        array = line.split(',')
        num = int(array[0])                    # first cell of the row
        if sympy.isprime(num) == False:
            print(line.strip())
To meet the requirement, do I need to use a Python generator? Any suggestions or recommendations?
Upvotes: 2
Views: 67
Reputation: 985
Split your analysis into two discrete regular expression results: a small result with 10 values, and a large result with 10,000,000 values. This question is about the average len() of match as much as it is about the len() of csvfile.
With the small result:
- The 1st code block will have a slower run time and relatively low memory usage.
- The 2nd code block will have a faster run time and also relatively low memory usage.

With the large result:
- The 1st code block will have a slower run time but very little memory usage.
- The 2nd code block will have a faster run time but very large memory usage.
If you are supposed to build a function with run time and memory in mind, then a generator (yield) is definitely the way to go when the problem requires a solution that scales across result sizes.
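To make the memory contrast concrete, here is a minimal sketch (my own illustration, not part of the comparison above; it filters a synthetic range() rather than the CSV data) showing that a materialized list grows with the result size while a generator stays tiny:

import sys

n = 10_000_000

# Materializing every result keeps all of them alive at once.
as_list = [i for i in range(n) if i % 2 == 0]
# The generator only stores its iteration state, not the results.
as_gen = (i for i in range(n) if i % 2 == 0)

print(sys.getsizeof(as_list))  # tens of megabytes for ~5,000,000 items
print(sys.getsizeof(as_gen))   # a few hundred bytes, independent of n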
Another scalability question: what if the re.match result is None? I would slightly modify the code as below:
import re
import sympy

def each_sentence(text):
    match = re.match(r'[0-9]+', text)
    if match is not None:                      # skip lines with no leading number
        num = int(text[match.start():match.end()])
        if sympy.isprime(num) == False:
            yield text.strip()

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        for text in each_sentence(line):
            print(text)
Upvotes: 1
Reputation: 86
To meet the requirement, do I need to use a Python generator?
No, you don't. Script 1 doesn't make sense: the generator is always executed just once per line and yields at most one result on its first iteration.
Any suggestions or recommendations?
You need to learn about three things: complexity, parallelization and caching.
Complexity basically means: "If I double the size of the input data (the CSV file), do I need twice the time? Four times? Or what?"
Parallelization means attacking a problem in a way that makes it easy to add more resources for solving it.
Caching is important. Things get much faster if you don't have to re-create everything all the time and can instead re-use what you have already generated.
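As a sketch of the caching idea (my own example, and it only pays off if the same numbers actually recur in the file), functools.lru_cache can memoize the primality test around Script 2's loop:

import functools
import sympy

@functools.lru_cache(maxsize=None)
def cached_isprime(num):
    # Each distinct num is tested once; repeats are answered from the cache.
    return sympy.isprime(num)

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        num = int(line.split(',')[0])  # assumes well-formed rows, as in Script 2
        if not cached_isprime(num):
            print(line.strip())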
The main loop, for line in csvfile:, already scales very well, since it reads the file one line at a time, unless the CSV file contains extremely long lines.
Script 2 contains a bug: if the first cell in a line is not an integer, then int(array[0]) will raise a ValueError.
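One way to guard against that, as a sketch (whether you skip, log, or abort on malformed lines depends on the actual requirement):

import sympy

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        try:
            num = int(line.split(',')[0])
        except ValueError:
            continue  # first cell is not an integer; skip the line
        if not sympy.isprime(num):
            print(line.strip())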
The isprime function is probably the "hotspot" in your code, so you can try to parallelize it with multiple threads or sub-processes.
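A sketch of that idea with sub-processes (my own illustration, assuming the primality test dominates the per-line cost; for cheap lines the inter-process overhead can outweigh the gain):

import sympy
from multiprocessing import Pool

def filter_line(line):
    # Return the stripped line if its leading number is not prime, else None.
    try:
        num = int(line.split(',')[0])
    except ValueError:
        return None
    return line.strip() if not sympy.isprime(num) else None

if __name__ == "__main__":
    with open("./file_testing.csv") as csvfile, Pool() as pool:
        # imap streams lines to the workers and yields results in order.
        for result in pool.imap(filter_line, csvfile, chunksize=1000):
            if result is not None:
                print(result)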
Upvotes: 1