Reputation: 91
from __future__ import division
import dataProcess
import csv,re
from collections import OrderedDict
import itertools
#######################################################################################
# Pruning of N-grams depending upon the frequency of tags #
#######################################################################################
for k in range(2, 8):
    filename = "Dataset/Cross/N_gram_Features_Pruned/" + str(k) + "_gram.txt"
    filewrite = open(filename, "w")
    CSV_tag_reader = csv.reader(open("Dataset/Cross/N_grams_recored/" + str(k) + "_gram.csv", "r"), delimiter=',')
    header_data = CSV_tag_reader.next()
    table = [row for row in CSV_tag_reader]
    values = []
    result_tag = []
    for j in range(0, len(header_data)):
        sum1 = 0
        avg1 = 0
        for i in range(0, 3227):
            sum1 = sum1 + int(table[i][j])
        avg1 = sum1 / 3227
        if avg1 >= 0.3:
            result_tag.append(header_data[j])
    print len(header_data)
    print len(result_tag)
    print "************************************************************"
    filewrite.write(str(result_tag))
My code counts the frequency of each word across 3227 data samples. I have recorded the frequencies of about 277,436 words, so the CSV file has 3227 rows and roughly 277,436 columns. I read each column, sum its frequencies, and compute the average, but I get a MemoryError when I run this code. How can I solve it?
Error:
Traceback (most recent call last):
  File "N_gram_pruning.py", line 15, in <module>
    table = [row for row in CSV_tag_reader]
MemoryError
My CSV file looks like this:
f1    f2    f3    f4   ...   f277436    (header row)
0     9     1     4          70
56    2     66    8          23
(3227 rows in total)
Upvotes: 0
Views: 672
Reputation: 1046
To find the average of each column, there is no need to load the whole file into memory. Do something like this:
with open(filename) as f:
    csvreader = csv.reader(f)
    tags = next(csvreader)
    sums = [0] * len(tags)
    for count, row in enumerate(csvreader, 1):
        # csv.reader yields strings, so convert before summing
        sums = [x + int(y) for x, y in zip(sums, row)]
    avgs = [x / count for x in sums]
    result_tags = [h for (h, a) in zip(tags, avgs) if a >= 0.3]
Upvotes: 0
Reputation: 4959
The problem is that you're reading the entire file into memory. To avoid this, you may have to restructure your algorithm. You're operating on every column individually, so the operations on each column are independent. Therefore, if you transpose your CSV file so that each column becomes a line, you can iterate over those lines rather than reading them all into memory.
Alternatively, you could use file.seek(), though it'll be very slow.
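If re-reading the file a few times is acceptable, you can also restructure without a transpose step: process a manageable slice of columns per pass, so only that slice's running sums are held in memory. A rough sketch (the `column_averages` name and `CHUNK` size are made up for illustration):

```python
import csv

CHUNK = 10000  # columns per pass; tune to available memory

def column_averages(path, chunk=CHUNK):
    """Average each column of a CSV of integers, re-reading the
    file once per chunk of columns (trades time for memory)."""
    averages = []
    with open(path) as f:
        header = next(csv.reader(f))
    for start in range(0, len(header), chunk):
        stop = min(start + chunk, len(header))
        sums = [0] * (stop - start)
        rows = 0
        with open(path) as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            for row in reader:
                rows += 1
                for i, val in enumerate(row[start:stop]):
                    sums[i] += int(val)
        averages.extend(s / rows for s in sums)
    return header, averages
```

With ~277k columns and a chunk of 10,000, that is about 28 passes over the file, but the memory footprint stays at one row plus one chunk of sums.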
Upvotes: 1