Pierre

Reputation: 23

Python3 script consumes too much CPU

I have a simple script that subtracts the first number in the list het from every number in the list het_noHet, then subtracts the second number in het from every number in het_noHet, and so on.

Then, for each set of subtractions between the two lists, it keeps the smallest result greater than 0, i.e. d[i] = min(w[w>0]).

The script I wrote is:

het = [12,40,50]
het_noHet = [5, 12, 22, 30, 40, 70]
import csv
def extraction_csv(csv_file):

        file_csv = []
        with open(csv_file, 'r') as f:
                reader = csv.reader(f, delimiter=",")
                for rows in reader:

                        file_csv.append(''.join(rows))

        return (file_csv)

het = extraction_csv("het.csv")

het_noHet = extraction_csv("het_nothet.csv")

dico = {}
for i in het_noHet:
        for y in het:
                if abs(int(y)-int(i)) > 0:
                        if i not in dico.keys():
                                dico[i]=[(abs(int(y)-int(i)))]
                        elif isinstance(dico[i], list):
                                dico[i].append(abs(int(y)-int(i)))
                        else:
                                dico[i]= [dico[i],(abs(int(y)-int(i)))]


test = [min(x) for x in dico.values()]
print (test)
[7, 28, 10, 10, 10, 20]

The result is correct. However, the real het list contains 171K numbers and het_noHet contains 530K numbers.

I am allowing the maximum memory per CPU that I can (32), but I keep getting this error message: /var/spool/slurmd/job7755164/slurm_script: line 16: 216342 Killed slurmstepd: error: Exceeded step memory limit at some point.

Does anyone have an idea how to improve the script so that it uses less memory?

Many thanks in advance.

Upvotes: 0

Views: 138

Answers (1)

MSalters

Reputation: 180305

"Exceeded step memory limit" points to a memory problem rather than a CPU problem.

It seems to me that dico has as many values as there are in het_noHet, and that each value in dico corresponds exactly to one entry in het_noHet. Hence, you calculate 530K lists of 171K items.

This means you need memory for all these lists, because you delay calling min(x) to the very end. Don't do that. Process one entry in het_noHet at a time, calculate one list of 171K items, call min on that list, and append one item to test. Then repeat this loop 530K times.
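A minimal sketch of that loop, using the small example lists from the question. The function name min_positive_distances is made up for illustration. It goes one small step further than described above: a generator expression is passed to min, so not even the single 171K-item list is stored. Entries of het_noHet whose differences are all zero are skipped, which matches what the original dico logic does.

```python
def min_positive_distances(het, het_noHet):
    """For each entry of het_noHet, return the smallest non-zero |y - i| over het."""
    het = [int(y) for y in het]  # convert once, not inside the inner loop
    result = []
    for i in het_noHet:
        i = int(i)
        # Generator expression: the differences are consumed one at a time.
        # default=None covers the case where every difference is zero.
        best = min((abs(y - i) for y in het if y != i), default=None)
        if best is not None:
            result.append(best)
    return result


het = [12, 40, 50]
het_noHet = [5, 12, 22, 30, 40, 70]
test = min_positive_distances(het, het_noHet)
print(test)  # [7, 28, 10, 10, 10, 20]
```

Peak memory is now roughly the two input lists plus the result list, instead of one list of differences per entry of het_noHet.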

This reduces your memory by about 99.9995%.

For performance, NumPy could keep het and het_noHet in much smaller memory allocations. Assuming int32, het would take about 700 kB (171K items of 4 bytes each), well within the L3 cache of your CPU and possibly even L2. (het_noHet won't need cache since it's processed one item at a time, and that one item will be in a CPU register.)
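A possible NumPy version of the same one-entry-at-a-time loop, again on the small example lists. This is a sketch under assumptions: the values are assumed to fit in int32, and the function name min_positive_distances_np is hypothetical. Reading the real CSV files into arrays is left out because it depends on the file layout.

```python
import numpy as np


def min_positive_distances_np(het, het_noHet):
    """Smallest non-zero |het - i| for each i in het_noHet, one entry at a time."""
    het = np.asarray(het, dtype=np.int32)              # assumption: values fit in int32
    het_noHet = np.asarray(het_noHet, dtype=np.int32)
    result = []
    for i in het_noHet:
        w = np.abs(het - i)   # one temporary array the size of het
        w = w[w > 0]          # same filter as d[i] = min(w[w>0])
        if w.size:            # skip entries where every difference is zero
            result.append(int(w.min()))
    return result


test = min_positive_distances_np([12, 40, 50], [5, 12, 22, 30, 40, 70])
print(test)  # [7, 28, 10, 10, 10, 20]
```

Only het and one temporary array of the same size are live during each iteration, so the working set stays small regardless of how long het_noHet is.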

Upvotes: 1
