How can I handle a file with 161 million lines?

Question

I am trying to run the code below on a large file, "mydata.dat", which is 3 GB and has 161991000 lines. The code uses DensityPeakCluster and works with the distance between each pair of points. There are 18000 points. A sample of the file looks like this:

1 2 26.23
1 3 44.49
1 4 47.17

and so on until

1 18000 23.5

then

2 3 25.2
2 4 15.2

until 2 18000 0.25, and so on, until the last line, 17999 18000 0.25
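
As a sanity check on the sizes involved: 18000 points give 18000 × 17999 / 2 unordered pairs, which matches the 161991000 lines in the file. The sketch below does that arithmetic and estimates how much memory the same distances would need as plain numeric arrays. The array layouts are assumptions for comparison only; the posted code does not use them.

```python
n_points = 18000

# Each unordered pair (i, j) with i < j appears once in the file.
n_pairs = n_points * (n_points - 1) // 2

BYTES_FLOAT32 = 4
BYTES_FLOAT64 = 8

# Hypothetical layouts: full square matrix vs. upper triangle only.
dense_float64_gb = n_points * n_points * BYTES_FLOAT64 / 1e9
dense_float32_gb = n_points * n_points * BYTES_FLOAT32 / 1e9
condensed_float32_gb = n_pairs * BYTES_FLOAT32 / 1e9

print(n_pairs)  # 161991000
print(round(dense_float64_gb, 2), round(dense_float32_gb, 2),
      round(condensed_float32_gb, 2))
```

By contrast, a dict of dicts with both directions stored holds about 324 million entries, each a separate Python float plus string keys and dict overhead, which is likely to need far more memory than any of these arrays.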

The first block of the code is:

from collections import defaultdict

class Graph(defaultdict):
    def __init__(self, input_file, sep=" ", header=False, undirect=True):
        # Each node maps to a dict of {neighbour: weight}.
        super(Graph, self).__init__(dict)
        self.edges_num = 0
        with open(input_file) as f:
            if header:
                f.readline()  # skip the header line
            for line in f:
                line = line.strip().split(sep)
                self[line[0]][line[1]] = float(line[2])
                self.edges_num += 1
                if undirect:
                    # Store the reverse direction as well.
                    self[line[1]][line[0]] = float(line[2])
                    self.edges_num += 1

    def edges(self):
        # Return every (node1, node2) pair stored in the graph.
        edges_list = []
        for node1 in self:
            for node2 in self[node1]:
                edges_list.append((node1, node2))
        return edges_list

The second block of the code is below (the full code is too long to post here):

    def edges_weight(self):
        # Return [node1, node2, weight] for every edge, sorted by weight.
        weight_list = []
        for edge in self.edges():
            node1, node2 = edge
            weight_list.append([node1, node2, self[node1][node2]])
        weight_list = sorted(weight_list, key=lambda x: x[2])
        return weight_list

    def get_weight(self, node1, node2):
        return self[node1][node2]

    def get_weights(self):
        # Return the weight of every edge, in the order given by edges().
        weights = []
        for edge in self.edges():
            weights.append(self.get_weight(edge[0], edge[1]))
        return weights

if __name__ == "__main__":

    input_file = "./data/mydata.dat"
    percent = 2.0
    output_file = "./data/results"

    G = Graph(input_file)
    position = round(G.number_of_edges()*percent/100)
    dc = G.edges_weight()[position][2]
    print("average percentage of neighbours (hard coded): {}".format(percent))
    print("Computing Rho with gaussian kernel of radius: {}".format(dc))
    nodes = G.nodes()
    for i in range(G.number_of_nodes()-1):
        for j in range(i+1, G.number_of_nodes()):
            node_i = nodes[i]
            node_j = nodes[j]
            dist_ij = G.get_weight(node_i, node_j)
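
The cutoff `dc` above is the value found at a given position in the sorted list of edge weights. Below is a minimal sketch of the same selection on a flat NumPy array, using `np.partition` so that the full array never has to be sorted. The function name `cutoff_distance` and the `weights` array are hypothetical, and the array is assumed to hold the same weights that `G.edges_weight()` would return. The clamp on `position` is an added safeguard that the posted code does not have.

```python
import numpy as np

def cutoff_distance(weights, percent):
    """Return the weight at position round(len * percent / 100) in sorted order."""
    weights = np.asarray(weights)
    position = int(round(weights.size * percent / 100))
    position = min(position, weights.size - 1)  # added safeguard, not in the post
    # np.partition puts the element that would sit at `position` in a fully
    # sorted array at that index, without sorting everything else.
    return float(np.partition(weights, position)[position])

# Small made-up sample reusing distances from the question.
weights = np.array([26.23, 44.49, 47.17, 23.5, 25.2, 15.2, 0.25], dtype=np.float32)
dc = cutoff_distance(weights, percent=50.0)
print(dc)
```

This avoids building a Python list with one `[node1, node2, weight]` entry per edge just to read a single value from it.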

What happened:

1- The process was killed, so I tried to read the file in chunks instead:

        bigfile = open(input_file,'r')
        tmp_lines = bigfile.readlines(1024*1024)
        for line in tmp_lines:
            line = line.strip().split(sep)
            self[line[0]][line[1]] = float(line[2])
            self.edges_num += 1
            if undirect:
               self[line[1]][line[0]] = float(line[2])
               self.edges_num += 1

2- but then I got this error:

 dist_ij = G.get_weight(node_i, node_j) in get_weight
    return self[node1][node2]
 KeyError: '6336'
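
One possible cause of this `KeyError` (an inference from the snippet in step 1, not something confirmed in the post): `readlines(1024*1024)` is called only once, and a single call returns roughly the first megabyte of lines rather than the whole file, so later nodes such as `'6336'` may never be loaded. The sketch below shows that behaviour on a small temporary file. The file contents and the hint of 100 are made up for the demonstration.

```python
import os
import tempfile

# Write a small stand-in file with 1000 short lines.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    for i in range(1000):
        tmp.write("1 {} 0.5\n".format(i + 2))
    path = tmp.name

# A single readlines(hint) call stops once about `hint` characters are read.
with open(path) as f:
    first_chunk = f.readlines(100)

# Calling it repeatedly until it returns an empty list covers the whole file.
total = 0
with open(path) as f:
    while True:
        chunk = f.readlines(100)
        if not chunk:
            break
        total += len(chunk)

os.remove(path)
print(len(first_chunk), total)
```

Reading in chunks does not by itself lower peak memory if every edge still ends up in the dict. The original `for line in f` loop already reads one line at a time.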

3- I tried Google Colab, but it did not work because its 12 GB of RAM was not enough. I asked about buying more RAM, but the underlying problem remains: the code does not manage memory well, so too little RAM is left for processing. I am stuck and do not know what to do.

**1- My problem is how to deal with a file this big. What approach should I use to handle a file of this size?

2- If I use NumPy to load the file, can that decrease memory usage?**
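
On the second question, one option is to stream the file line by line into a preallocated NumPy array instead of a dict of dicts. The sketch below assumes the node ids are the integers 1 to `n_points` and that every line has the form `i j distance`. `load_distance_matrix` is a hypothetical helper, not part of DensityPeakCluster. For 18000 points, a `float32` matrix takes 18000 × 18000 × 4 bytes, about 1.3 GB.

```python
import os
import tempfile
import numpy as np

def load_distance_matrix(path, n_points, sep=" ", dtype=np.float32):
    """Stream 'i j distance' lines into a symmetric n_points x n_points array."""
    dist = np.zeros((n_points, n_points), dtype=dtype)
    with open(path) as f:
        for line in f:  # reads one line at a time
            parts = line.strip().split(sep)
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            # Node ids are assumed to be 1-based integers.
            i, j = int(parts[0]) - 1, int(parts[1]) - 1
            dist[i, j] = dist[j, i] = float(parts[2])
    return dist

# Tiny demonstration file with 4 points (values are illustrative).
rows = ["1 2 26.23", "1 3 44.49", "1 4 47.17", "2 3 25.2", "2 4 15.2", "3 4 0.25"]
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("\n".join(rows) + "\n")
    demo_path = tmp.name

dist = load_distance_matrix(demo_path, n_points=4)
os.remove(demo_path)
print(dist.shape, dist[0, 1], dist[1, 0])
```

A lookup such as `G.get_weight(node_i, node_j)` then becomes `dist[i, j]` with integer indices. A pure-Python loop over 162 million lines is likely to be slow, but its memory use stays close to the size of the array.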
