Yan

Reputation: 145

Python - scalability with respect to run time and memory usage is important

I have Python scripts to filter massive data in a CSV file. The requirement asks me to consider scalability with respect to run time and memory usage.

I wrote two scripts, and both of them work fine for filtering the data. Regarding scalability, I decided to use a Python generator, because it uses an iterator and doesn't keep much data in memory.

When I compared the running times of the two scripts, I found the following:

Script 1 - uses a generator - takes more time - 0.0155925750732 s

import re
import sympy

def each_sentence(text):
    # Yield the stripped line when its leading number is not prime.
    match = re.match(r'[0-9]+', text)
    num = int(text[match.start():match.end()])
    if sympy.isprime(num) == False:
        yield text.strip()

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        for text in each_sentence(line):
            print(text)

Script 2 - uses split and no generator - takes less time - 0.00619888305664 s

import sympy

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        array = line.split(',')
        num = int(array[0])
        if sympy.isprime(num) == False:
            print(line.strip())

To meet the requirement, do I need to use a Python generator? Or do you have any suggestions or recommendations?

Upvotes: 2

Views: 67

Answers (2)

Daniel Scott

Reputation: 985

Split your analysis into two discrete regular-expression results: a small result with 10 values, and a large result with 10,000,000 values. This question is as much about the average len() of the match as it is about the len() of the csvfile.

With a small re result - 10 bytes

1st code block will have slower run time, and relatively low memory usage.

2nd code block will have faster run time, and also relatively low memory usage.

With a large re result - 10,000,000 bytes

1st code block will have slower run time, and very little memory usage.

2nd code block will have faster run time, and very large memory usage.

Bottom line:

If you are supposed to build a function with run time and memory in mind, then a generator function using yield is definitely the way to go when the problem requires a solution that scales across different result sizes.
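To make the memory contrast concrete, here is a minimal sketch (not part of the original scripts; huge_text is a hypothetical input string) comparing an eager list of regex matches with a lazy iterator of matches:

import re

huge_text = "12 34 56 ..."  # imagine a very long string containing many numbers

# Eager: re.findall builds the complete list of matches in memory at once.
all_matches = re.findall(r'[0-9]+', huge_text)

# Lazy: re.finditer yields one match object at a time, so memory stays
# small no matter how many matches the input contains.
for match in re.finditer(r'[0-9]+', huge_text):
    num = int(match.group())  # handle each number as it is produced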

Another question on scalability: what if the re result is None? I would slightly modify the code as below:

import re
import sympy

def each_sentence(text):
    match = re.match(r'[0-9]+', text)
    if match is not None:
        num = int(text[match.start():match.end()])
        if sympy.isprime(num) == False:
            yield text.strip()

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        for text in each_sentence(line):
            print(text)

Upvotes: 1

Gusi Gao

Reputation: 86

To meet the requirement, do I need to use a Python generator?

No, you don't. Script 1 doesn't make much sense: the generator is always executed once and yields at most one result on its first iteration, so wrapping it in a for loop gains nothing.
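For comparison, here is a sketch of what Script 1 effectively does once the generator machinery is removed; filter_sentence is a name chosen here for illustration, and a plain function that returns the filtered line (or None) behaves the same way:

import re
import sympy

def filter_sentence(text):
    # Same logic as each_sentence(), but as a plain function:
    # it returns at most one value per line, just like the generator yields.
    match = re.match(r'[0-9]+', text)
    if match is not None:
        num = int(match.group())
        if not sympy.isprime(num):
            return text.strip()
    return None

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        result = filter_sentence(line)
        if result is not None:
            print(result)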

Any suggestions or recommendations?

You need to learn about three things: complexity, parallelization and caching.

  • Complexity basically means: "If I double the size of the input data (the CSV file), do I need twice the time? Four times? Or what?"

  • Parallelization means attacking a problem in a way that makes it easy to add more resources for solving it.

  • Caching is important. Things get much faster if you don't have to re-create everything all the time and can instead re-use what you have already generated (see the caching sketch after this list).
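As a caching example (a sketch, assuming the same leading numbers appear on many lines; is_prime_cached is just an illustrative name), you can memoize the primality test so each distinct value is only checked once:

import sympy
from functools import lru_cache

@lru_cache(maxsize=None)
def is_prime_cached(num):
    # sympy.isprime is comparatively expensive; cache its result per number
    # so repeated values from the CSV are only tested once.
    return sympy.isprime(num)

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        num = int(line.split(',')[0])
        if not is_prime_cached(num):
            print(line.strip())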

The main loop, for line in csvfile:, already scales very well unless the CSV file contains extremely long lines.

Script 2 contains a bug: if the first cell in a line is not an integer, then int(array[0]) will raise a ValueError.
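One possible way to guard against that (a sketch that simply skips lines whose first cell is not an integer):

import sympy

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        array = line.split(',')
        try:
            num = int(array[0])
        except ValueError:
            continue  # first cell is not an integer; skip this line
        if not sympy.isprime(num):
            print(line.strip())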

The isprime function is probably the "hotspot" in your code, so you can try to parallelize it with multiple threads or sub-processes.
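A minimal sketch of that idea with multiprocessing (the chunk size is an arbitrary choice here, and filter_line is an illustrative helper, not part of the original scripts):

import re
import sympy
from multiprocessing import Pool

def filter_line(line):
    # Return the stripped line if its leading number is not prime, else None.
    match = re.match(r'[0-9]+', line)
    if match is not None and not sympy.isprime(int(match.group())):
        return line.strip()
    return None

if __name__ == "__main__":
    with open("./file_testing.csv") as csvfile, Pool() as pool:
        # The isprime calls are the CPU-bound part, so spread lines
        # across worker processes and collect the results in order.
        for result in pool.imap(filter_line, csvfile, chunksize=1000):
            if result is not None:
                print(result)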

Upvotes: 1
