Reputation: 8297
I need to use the Google ngram corpus (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html), which records how often each n-gram appeared in books, year by year.
File format: each file is compressed tab-separated data, and each line has the following format:
ngram TAB year TAB match_count TAB volume_count NEWLINE
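For illustration, a single line in this format splits cleanly on tabs (the sample values here are made up):
line = 'Ajax and Achilles ?\t1855\t3\t2\n'  # hypothetical sample row
ngram, year, match_count, volume_count = line.rstrip('\n').split('\t')
print(ngram)              # 'Ajax and Achilles ?'
print(int(year))          # 1855
print(int(match_count))   # total occurrences in books from that year
print(int(volume_count))  # number of distinct volumes it appeared in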
I wrote code to retrieve the frequency of an input ngram:
file = r'D:\Chrome Downloads\googlebooks-eng-all-4gram-20120701-aj\googlebooks-eng-all-4gram-20120701-aj'
counter = 0
freq = 0
with open(file, 'rt', encoding='UTF8') as input:
    for line in input:
        counter += 1        # without this the break below never fires
        if counter == 150:  # give up after the first 150 lines
            break
        if 'Ajax and Achilles ?' == line.strip().split('\t')[0]:
            print(line.strip().split('\t'))
            freq += int(line.strip().split('\t')[2])
print('Frequency :', freq)
This worked well only because 'Ajax and Achilles ?' appears near the top of the corpus (the counter cuts the scan short). When I search for an ngram that appears later in the file, it takes forever: to get the frequency of an arbitrary n-gram I have to look through the whole corpus regardless.
So I was thinking of merging the rows for each ngram, ignoring the year and summing up the frequencies.
Is this a valid idea? If so, how can I do it programmatically?
If not, what is a better way of doing this?
Upvotes: 0
Views: 90
Reputation: 73498
You split each line multiple times, and, of course, reading the entire file for every ngram you want to check is not ideal. Why not write the total frequency of each ngram out to another file once? Since this Google file of yours is presumably enormous, you probably cannot collect all the counts into a single data structure before writing them out. But relying on the file already being sorted by ngram, you can write the new file without loading the whole corpus at once:
from csv import reader, writer
from itertools import groupby
from operator import itemgetter

get_ngram = itemgetter(0)

with open(file, 'rt', encoding='UTF8') as input, \
     open('freq.txt', 'w', encoding='UTF8', newline='') as output:
    r = reader(input, delimiter='\t')
    w = writer(output, delimiter='\t')
    for ngram, rows in groupby(r, key=get_ngram):
    # for i, (ngram, rows) in enumerate(groupby(r, key=get_ngram)):
    #   the i and enumerate are just so the loop is not too silent ...
        freq = sum(int(row[2]) for row in rows)
        w.writerow((ngram, freq))
        # if not i % 10000:  # ... and give you some idea what's happening
        #     print(f'Processing ngram {i}')
The csv classes just take over the CSV parsing and writing part. The csv.reader is a lazy iterator over lists of strings. groupby groups the rows produced by the csv reader by the first token, using the key parameter with an appropriate function; the itemgetter is there just to avoid a clunky key=lambda x: x[0]. groupby produces pairs of the key value and an iterator over the consecutive elements that share that value. The loop then sums the frequencies of each group's rows and writes only the ngram and its total frequency to the file using the csv.writer.
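To make the grouping concrete, here is a minimal, self-contained sketch of what groupby yields on a few made-up, already-sorted rows:
from itertools import groupby
from operator import itemgetter

rows = [                                        # hypothetical sample rows,
    ['Ajax and Achilles ?', '1855', '3', '2'],  # already sorted by ngram
    ['Ajax and Achilles ?', '1860', '5', '4'],
    ['Ajax and Agamemnon ,', '1860', '2', '1'],
]
for ngram, group in groupby(rows, key=itemgetter(0)):
    print(ngram, sum(int(row[2]) for row in group))
# Ajax and Achilles ? 8
# Ajax and Agamemnon , 2
Once freq.txt exists, looking up a single ngram is a scan of a much smaller file. A sketch, assuming the merged file keeps the corpus's sorted order (groupby preserves input order) and that that order matches Python's string comparison, so the scan can stop early:
from csv import reader

def lookup(ngram, path='freq.txt'):
    """Return the total frequency of ngram from the merged file, or 0."""
    with open(path, 'rt', encoding='UTF8', newline='') as f:
        for key, freq in reader(f, delimiter='\t'):
            if key == ngram:
                return int(freq)
            if key > ngram:  # sorted file: the ngram cannot appear later
                return 0
    return 0

print(lookup('Ajax and Achilles ?'))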
Upvotes: 1