Reputation: 499
I am trying to run the code below on a big file, "mydata.dat", which is 3 GB in size and has 161991000
lines. The code is for calculating the distance between two points using DensityPeakCluster. The number of points is 18000.
A sample of the file looks like this:
1 2 26.23
1 3 44.49
1 4 47.17
and so on until
1 18000 23.5
then
2 3 25.2
2 4 15.2
until 2 18000 0.25
and so on until 17999 18000 0.25
Block 1 of the code:
from collections import defaultdict

class Graph(defaultdict):
    def __init__(self, input_file, sep=" ", header=False, undirect=True):
        super(Graph, self).__init__(dict)
        self.edges_num = 0
        with open(input_file) as f:
            if header:
                f.readline()  # skip the header line
            for line in f:
                line = line.strip().split(sep)
                self[line[0]][line[1]] = float(line[2])
                self.edges_num += 1
                if undirect:
                    # store the reverse direction as well
                    self[line[1]][line[0]] = float(line[2])
                    self.edges_num += 1

    def edges(self):
        edges_list = []
        for node1 in self:
            for node2 in self[node1]:
                edges_list.append((node1, node2))
        return edges_list
Block 2 of the code (the full code is too long to post here):
    def edges_weight(self):
        weight_list = []
        for edge in self.edges():
            node1, node2 = edge
            weight_list.append([node1, node2, self[node1][node2]])
        weight_list = sorted(weight_list, key=lambda x:x[2])
        return weight_list

    def get_weight(self, node1, node2):
        return self[node1][node2]

    def get_weights(self):
        weights = []
        for edge in self.edges():
            weights.append(self.get_weight(edge[0], edge[1]))
        return weights

if __name__=="__main__":
    input_file = "./data/mydata.dat"
    percent = 2.0
    output_file = "./data/results"
    G = Graph(input_file)
    position = round(G.number_of_edges()*percent/100)
    dc = G.edges_weight()[position][2]
    print("average percentage of neighbours (hard coded): {}".format(percent))
    print("Computing Rho with gaussian kernel of radius: {}".format(dc))
    nodes = G.nodes()
    for i in range(G.number_of_nodes()-1):
        for j in range(i+1, G.number_of_nodes()):
            node_i = nodes[i]
            node_j = nodes[j]
            dist_ij = G.get_weight(node_i, node_j)
What happened:
1- The process got killed,
so I tried to change how the file is read:
bigfile = open(input_file,'r')
tmp_lines = bigfile.readlines(1024*1024)
for line in tmp_lines:
    line = line.strip().split(sep)
    self[line[0]][line[1]] = float(line[2])
    self.edges_num += 1
    if undirect:
        self[line[1]][line[0]] = float(line[2])
        self.edges_num += 1
2- But then I got this error:
dist_ij = G.get_weight(node_i, node_j) in get_weight
return self[node1][node2]
KeyError: '6336'
3- I tried Google Colab, but it didn't work because its 12 GB of RAM was not enough. I asked about buying more RAM, but the underlying problem remains: my code doesn't manage memory well, so there is too little RAM left for processing. I'm stuck on this problem and don't know what to do.
**1- How should I deal with a file as big as this one? What approach should I use to handle this size?
2- If I use NumPy to load the file, can that decrease memory usage?**
Upvotes: 1
Views: 101
Reputation: 3328
The most straightforward answer is to not load the whole file at once. You can even process it one line at a time. For example, suppose you wanted the sum of the third column:
filename = 'file.dat'
# float, not int: the third column holds values such as 26.23
lines = (float(line.split(' ')[2]) for line in open(filename))
print(sum(lines))
Here we did not load all the lines into memory. Instead, we opened the file and created a Python generator. The generator holds the expression "float(line.split(' ')[2])" and evaluates it only when the next line is requested. Those requests come from sum(), which pulls one value at a time and keeps a running total, so no more than one line (plus a small read buffer) is held in memory at any moment. The point is that memory use stays small and constant no matter how large the file is.
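The same can be done for just a piece of the file. For example, to sum the values on every line that involves point 0:
filename = 'file.dat'
lines = (line.split(' ') for line in open(filename))
zeros = (line for line in lines if line[0]=='0' or line[1]=='0')
# the split fields are strings, so convert c before summing
print(sum(float(c) for a,b,c in zeros))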
This can, of course, be slower than loading some or all of the file into memory. You also have to consider how many times you want to iterate over the file like this. It is best to iterate over the lines only a few times, gathering all the calculations you need in each pass. You then probably want to save those results, because every additional pass over the file takes more time.
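As a minimal sketch of gathering several results in one pass (the function name summarize_distances and the choice of statistics are only illustrative assumptions, not something from the question), the loop below computes the count, sum, minimum and maximum of the third column while reading each line once:

```python
import math

def summarize_distances(lines):
    """Single pass over 'i j distance' lines; returns count, total, min, max."""
    count = 0
    total = 0.0
    smallest = math.inf
    largest = -math.inf
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        d = float(parts[2])
        count += 1
        total += d
        smallest = min(smallest, d)
        largest = max(largest, d)
    return count, total, smallest, largest

# Works on any iterable of lines; for the real data, pass an open file:
#     with open('file.dat') as f:
#         stats = summarize_distances(f)
sample = ["1 2 26.23", "1 3 44.49", "1 4 47.17"]
count, total, smallest, largest = summarize_distances(sample)
print(count, smallest, largest)
```

Because the function accepts any iterable of lines, it can be tried on a short list first and then pointed at the open file without changes.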
When considering loading the file into memory, double-check exactly what you want to load and how. For example, do you need the values 1 2 from the line 1 2 26.23? If not, strip them out so the data takes up less memory. For example:
import numpy as np
filename = 'file.dat'
values = (float(line.split(' ')[2]) for line in open(filename))
X = np.fromiter(values,dtype='float32',count=161991000)
By specifying the count, we told NumPy exactly how much memory to allocate in advance (instead of re-growing the array every time it needs more room). With that count and a dtype of float32 (4 bytes per value), the data takes up 161991000 × 4 = 647964000 bytes, about 648 MB of RAM. So be careful not to write any operations that duplicate this data. If you write something that makes 5 copies of it, that will eat up RAM quickly.
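If the two point ids are stripped out like this, the distance for a given pair can still be found from its position in the array, provided the file really lists every pair i < j exactly once, in the order shown in the question, with ids running from 1 to n. That is an assumption worth verifying on the actual file. A minimal sketch: condensed_index is a hypothetical helper (not part of NumPy), and a short list with made-up values stands in for X, which would be indexed the same way.

```python
def condensed_index(i, j, n):
    """Position of pair (i, j) in the flat array, for 1-based ids in 1..n."""
    if i == j:
        raise ValueError("no distance is stored for a point and itself")
    if i > j:
        i, j = j, i          # distances are symmetric
    i, j = i - 1, j - 1      # switch to 0-based ids
    return n * i + j - (i + 1) * (i + 2) // 2

# Tiny stand-in for the real data: n = 4 points, 6 pairs in file order
# (1,2) (1,3) (1,4) (2,3) (2,4) (3,4); the values are made up
n = 4
X = [26.23, 44.49, 47.17, 25.2, 15.2, 0.25]
assert len(X) == n * (n - 1) // 2

d_24 = X[condensed_index(2, 4, n)]
d_42 = X[condensed_index(4, 2, n)]
print(d_24, d_42)
```

With n = 18000, the last pair (17999, 18000) maps to index 161990999, the final slot of an array with 161991000 entries, which is a quick sanity check on the formula.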
I think this gives you an idea of how to manage memory. :-)
Upvotes: 1