Yan

Reputation: 145

Python - scalability with respect to run time and memory usage is important

I have Python scripts to filter massive data in a CSV file. The requirement asks me to consider scalability with respect to run time and memory usage.

I wrote two scripts, and both of them work fine for filtering the data. Regarding scalability, I decided to use a Python generator, because it uses an iterator and doesn't keep much data in memory.

When I compared the running times of the two scripts, I found the following:

Script 1 - uses a generator - takes more time - 0.0155925750732 s

import re
import sympy

def each_sentence(text):
    # Yield the stripped line when its leading number is not prime.
    match = re.match(r'[0-9]+', text)
    num = int(text[match.start():match.end()])
    if sympy.isprime(num) == False:
        yield text.strip()

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        for text in each_sentence(line):
            print(text)

Script 2 - uses split and no generator - takes less time - 0.00619888305664 s

import sympy

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        array = line.split(',')
        num = int(array[0])
        if sympy.isprime(num) == False:
            print(line.strip())

To meet the requirement, do I need to use a Python generator? Or do you have any suggestions or recommendations?

Upvotes: 2

Views: 67

Answers (2)

Daniel Scott

Reputation: 985

Split your analysis into two discrete regular-expression results: a small result with 10 values, and a large result with 10,000,000 values. This question is as much about the average len() of the match as it is about the len() of the csvfile.

With a small re result - 10 bytes

1st code block will have slower run time, and relatively low memory usage.

2nd code block will have faster run time, and also relatively low memory usage.

With a large re result - 10,000,000 bytes

1st code block will have slower run time, and very little memory usage.

2nd code block will have faster run time, and very large memory usage.

Bottom line:

If you are supposed to build a function with run time and memory in mind, then a generator function using yield is definitely the way to go when the problem requires a solution that scales across different result sizes.
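To make the memory contrast concrete, here is a minimal sketch (not part of the original scripts; huge_text is a hypothetical input string) comparing an eager list of regex matches with a lazy iterator of matches:

import re

huge_text = "12 34 56 ..."  # imagine a very long string containing many numbers

# Eager: re.findall builds the complete list of matches in memory at once.
all_matches = re.findall(r'[0-9]+', huge_text)

# Lazy: re.finditer yields one match object at a time, so memory stays
# small no matter how many matches the input contains.
for match in re.finditer(r'[0-9]+', huge_text):
    num = int(match.group())  # handle each number as it is produced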

Another question on scalability: what if the re result is None? I would slightly modify the code as below:

import re
import sympy

def each_sentence(text):
    match = re.match(r'[0-9]+', text)
    if match is not None:
        num = int(text[match.start():match.end()])
        if sympy.isprime(num) == False:
            yield text.strip()

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        for text in each_sentence(line):
            print(text)

Upvotes: 1

Gusi Gao

Reputation: 86

To meet the requirement, do I need to use a Python generator?

No, you don't. Script 1 doesn't make much sense: the generator is always executed once and yields at most one result on its first iteration, so wrapping it in a for loop gains nothing.
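For comparison, here is a sketch of what Script 1 effectively does once the generator machinery is removed; filter_sentence is a name chosen here for illustration, and a plain function that returns the filtered line (or None) behaves the same way:

import re
import sympy

def filter_sentence(text):
    # Same logic as each_sentence(), but as a plain function:
    # it returns at most one value per line, just like the generator yields.
    match = re.match(r'[0-9]+', text)
    if match is not None:
        num = int(match.group())
        if not sympy.isprime(num):
            return text.strip()
    return None

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        result = filter_sentence(line)
        if result is not None:
            print(result)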

Any suggestions or recommendations?

You need to learn about three things: complexity, parallelization and caching.

  • Complexity basically means: "If I double the size of the input data (the CSV file), do I need twice the time? Four times? Or what?"

  • Parallelization means attacking a problem in a way that makes it easy to add more resources for solving it.

  • Caching is important. Things get much faster if you don't have to re-create everything all the time and can instead re-use what you have already generated (see the caching sketch after this list).
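As a caching example (a sketch, assuming the same leading numbers appear on many lines; is_prime_cached is just an illustrative name), you can memoize the primality test so each distinct value is only checked once:

import sympy
from functools import lru_cache

@lru_cache(maxsize=None)
def is_prime_cached(num):
    # sympy.isprime is comparatively expensive; cache its result per number
    # so repeated values from the CSV are only tested once.
    return sympy.isprime(num)

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        num = int(line.split(',')[0])
        if not is_prime_cached(num):
            print(line.strip())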

The main loop, for line in csvfile:, already scales very well unless the CSV file contains extremely long lines.

Script 2 contains a bug: if the first cell in a line is not an integer, then int(array[0]) will raise a ValueError.
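One possible way to guard against that (a sketch that simply skips lines whose first cell is not an integer):

import sympy

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        array = line.split(',')
        try:
            num = int(array[0])
        except ValueError:
            continue  # first cell is not an integer; skip this line
        if not sympy.isprime(num):
            print(line.strip())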

The isprime function is probably the "hotspot" in your code, so you can try to parallelize it with multiple threads or sub-processes.
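A minimal sketch of that idea with multiprocessing (the chunk size is an arbitrary choice here, and filter_line is an illustrative helper, not part of the original scripts):

import re
import sympy
from multiprocessing import Pool

def filter_line(line):
    # Return the stripped line if its leading number is not prime, else None.
    match = re.match(r'[0-9]+', line)
    if match is not None and not sympy.isprime(int(match.group())):
        return line.strip()
    return None

if __name__ == "__main__":
    with open("./file_testing.csv") as csvfile, Pool() as pool:
        # The isprime calls are the CPU-bound part, so spread lines
        # across worker processes and collect the results in order.
        for result in pool.imap(filter_line, csvfile, chunksize=1000):
            if result is not None:
                print(result)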

Upvotes: 1
