Reputation: 23
I have a simple script that subtracts the first number of the list het from all the numbers of the list het_noHet, then subtracts the second number of het from all the numbers of het_noHet, and so on.
Then, for each entry of het_noHet, it keeps the minimum difference > 0:
formula: d[i] = min(w[w > 0])
The script that I wrote is:
het = [12, 40, 50]                 # example values; replaced below by reading from CSV
het_noHet = [5, 12, 22, 30, 40, 70]

import csv

def extraction_csv(csv_file):
    # read the file and return its rows as a flat list of strings
    file_csv = []
    with open(csv_file, 'r') as f:
        reader = csv.reader(f, delimiter=",")
        for rows in reader:
            file_csv.append(''.join(rows))
    return file_csv

het = extraction_csv("het.csv")
het_noHet = extraction_csv("het_nothet.csv")

dico = {}
for i in het_noHet:
    for y in het:
        if abs(int(y) - int(i)) > 0:
            if i not in dico:
                dico[i] = [abs(int(y) - int(i))]
            elif isinstance(dico[i], list):
                dico[i].append(abs(int(y) - int(i)))
            else:
                dico[i] = [dico[i], abs(int(y) - int(i))]

test = [min(x) for x in dico.values()]
print(test)
[7, 28, 10, 10, 10, 20]
The result is correct. However, in the real data het has about 171K numbers and het_noHet about 530K numbers.
I am requesting the maximum memory per CPU that I can (32), but at some point I keep getting the error message /var/spool/slurmd/job7755164/slurm_script: line 16: 216342 Killed slurmstepd: error: Exceeded step memory limit.
I was wondering if someone has an idea for improving the script so that it uses less memory?
Many thanks in advance.
Upvotes: 0
Views: 138
Reputation: 180305
"Exceeded step memory limit" is not so much a CPU problem as it is a memory problem.
It seems to me that dico has as many values as there are entries in het_noHet, and that each value in dico corresponds exactly to one entry in het_noHet. Hence, you calculate 530K lists of 171K items each.
This means you need memory for all of these lists at once, because you delay calling min(x) until the very end. Don't do that. Process one entry of het_noHet at a time, calculate one list of 171K items, call min on that list, and append one item to test. Then repeat this loop 530K times.
This reduces your memory by about 99.9995%.
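A minimal sketch of that restructuring, reusing the extraction_csv helper and file names from the question (the int conversion and the "> 0" filter stay as they were):

het = [int(y) for y in extraction_csv("het.csv")]
het_noHet = [int(i) for i in extraction_csv("het_nothet.csv")]

test = []
for i in het_noHet:
    # one list of differences at a time; it is reduced to a single number
    # and discarded before the next entry of het_noHet is processed
    w = [abs(y - i) for y in het if y != i]
    if w:                    # skip entries with no difference > 0, as the dict version did
        test.append(min(w))
print(test)

Only one 171K-item list exists at any moment, plus the growing list test of 530K minima.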
For performance, using NumPy could keep het and het_noHet in much smaller memory allocations. Assuming int32, het would become about 700 kB, well within range of the L3 cache on your CPU and possibly even L2. (het_noHet won't need cache since it's processed one item at a time, and that one item will be in a CPU register.)
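A hedged sketch of that NumPy variant (it assumes the CSV files hold one number per line, matching what extraction_csv produces above; np.loadtxt and the dtype may need adjusting to the real file layout):

import numpy as np

# ~171K values * 4 bytes as int32 is roughly 700 kB, small enough to stay cache-resident
het = np.loadtxt("het.csv", dtype=np.int32)
het_noHet = np.loadtxt("het_nothet.csv", dtype=np.int32)

test = []
for i in het_noHet:
    w = np.abs(het - i)      # vectorised differences against the whole het array
    w = w[w > 0]             # same "> 0" filter as the formula d[i] = min(w[w > 0])
    if w.size:
        test.append(int(w.min()))

The inner work then becomes a few vectorised passes over a small array instead of 171K Python-level iterations per entry.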
Upvotes: 1