user3254544

Reputation: 91

Memory error in Python

from __future__ import division
import dataProcess
import csv,re
from collections import OrderedDict
import itertools
#######################################################################################
#                   Pruning of N-grams depending upon the frequency of tags          #
#######################################################################################

for k in range(2,8):
    filename="Dataset/Cross/N_gram_Features_Pruned/"+str(k)+"_gram.txt"
    filewrite=open(filename,"w")
    CSV_tag_reader=csv.reader(open("Dataset/Cross/N_grams_recored/"+str(k)+"_gram.csv","r"),delimiter=',')
    header_data=CSV_tag_reader.next()
    table = [row for row in CSV_tag_reader]
    values=[]
    result_tag=[]
    for j in range(0,len(header_data)):
        sum1=0
        avg1=0
        for i in range (0,3227):
            sum1=sum1+int(table[i][j])
    ##    print "************************************************************"
    ##    print sum1
        avg1=sum1/3227
    ##    print avg1
        if(avg1>=0.3):
            result_tag.append(header_data[j])
    print len(header_data)
    print len(result_tag)
    print "************************************************************"
    filewrite.write(str(result_tag))

My code counts the frequency of particular words in 3227 samples of data. I have recorded the frequencies of about 277436 words across the 3227 samples, so imagine a csv file with 3227 rows and 277436 columns. For each word I sum the frequencies down its column and find the average, but I am getting a MemoryError when I run this code. How can I solve it?

Error:
Traceback (most recent call last):
  File "N_gram_pruning.py", line 15, in <module>
    table = [row for row in CSV_tag_reader]
MemoryError

My csv file looks like this:

f1  f2  f3  f4 ..... f277436    (header row)
0   9   1   4        70
56  2   66  8        23
(3227 rows...)

Upvotes: 0

Views: 672

Answers (2)

Brave Sir Robin

Reputation: 1046

To find the average for each column, there is no need to load the whole thing into memory. Do something like

with open(filename) as f:
    csvreader = csv.reader(f)
    tags = next(csvreader)          # header row with the feature names
    sums = [0] * len(tags)
    for count, row in enumerate(csvreader, 1):
        # csv.reader yields strings, so convert before summing
        sums = [x + int(y) for x, y in zip(sums, row)]
avgs = [x / float(count) for x in sums]
result_tags = [h for (h, a) in zip(tags, avgs) if a >= 0.3]
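
Note the int(y) conversion: csv.reader yields every value as a string, so summing the rows directly would raise a TypeError. Dividing by float(count) also keeps the averages exact under Python 2's integer division.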

Upvotes: 0

whereswalden

Reputation: 4959

The problem is that you're reading the entire file into memory. To avoid this, you may have to restructure your algorithm. Since you're operating on every column individually, the operations on each column are independent. Therefore, if you transpose your csv file so that each original column becomes a row, you can read it line by line and iterate over those rows rather than reading everything into memory, as sketched below.
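
A minimal sketch of the second half of that approach, assuming the transposed file has already been written out (the filename 2_gram_transposed.csv and its layout, a feature name followed by its 3227 counts, are hypothetical):

import csv

result_tags = []
with open("2_gram_transposed.csv") as f:   # hypothetical transposed file
    for row in csv.reader(f):
        tag, counts = row[0], row[1:]      # one original column per row
        # Only this single row is ever held in memory.
        avg = sum(int(c) for c in counts) / float(len(counts))
        if avg >= 0.3:
            result_tags.append(tag)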

Alternatively, you could rewind the file with file.seek() and make a separate pass for each column, though it'll be very slow.
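
A minimal sketch of that approach, reusing the question's path for k=2: it rewinds with f.seek(0) and makes one full pass over the file per column, so with ~277436 columns it scans the whole file that many times.

import csv

result_tags = []
with open("Dataset/Cross/N_grams_recored/2_gram.csv") as f:
    header = next(csv.reader(f))
    for j, tag in enumerate(header):
        f.seek(0)               # rewind: one full pass per column
        reader = csv.reader(f)
        next(reader)            # skip the header row
        total = 0
        rows = 0
        for row in reader:
            total += int(row[j])
            rows += 1
        if total / float(rows) >= 0.3:
            result_tags.append(tag)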

Upvotes: 1
