Dawn17

Reputation: 8297

Programmatically merge rows of a huge file for NLP

I need to use the Google ngram corpus (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html), which records how frequently each n-gram appeared in books, year by year.

File format: each of the files is compressed tab-separated data, and each line has the following format:

ngram TAB year TAB match_count TAB volume_count NEWLINE.
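
For illustration, a few lines of such a file would look like this (the counts here are made up):

Ajax and Achilles ?	1908	3	2
Ajax and Achilles ?	1910	1	1
Ajax and Achilles ?	1913	2	2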

I wrote some code to retrieve the frequency of my input ngram:

The code is:

file = r'D:\Chrome Downloads\googlebooks-eng-all-4gram-20120701-aj\googlebooks-eng-all-4gram-20120701-aj'

counter = 0
freq = 0
with open(file, 'rt', encoding='UTF8') as infile:
    for line in infile:
        if counter == 150:  # bail out early once enough matching rows were seen
            break
        fields = line.strip().split('\t')  # split each line only once
        if fields[0] == 'Ajax and Achilles ?':
            print(fields)
            freq += int(fields[2])
            counter += 1

print('Frequency:', freq)

This worked well only because 'Ajax and Achilles ?' appears near the top of the corpus (the counter stops the scan early). When I search for an ngram that appears later in the file, it takes forever.

The problem with using this corpus to get the frequency of an n-gram is that I have to scan through the whole corpus every time.

So, I was thinking of merging the rows, ignoring the year, and summing up the frequencies.

Is this a valid idea? If so, how can I do this programmatically?

If not, what is a better way of doing this?

Upvotes: 0

Views: 90

Answers (1)

user2390182

Reputation: 73498

You split each line multiple times, and, of course, reading the entire file for every ngram you want to check is not ideal. Why don't you write the total frequency for each ngram out to another file? Guessing that this Google file of yours is enormous, you probably cannot easily collect the counts into a single data structure before writing them out. But relying on the file already being sorted by ngram, you can write the new file without loading the whole corpus at once:

from csv import reader, writer
from itertools import groupby
from operator import itemgetter

get_ngram = itemgetter(0)

with open(file, 'rt', encoding='UTF8') as infile, \
        open('freq.txt', 'w', encoding='UTF8', newline='') as outfile:
    r = reader(infile, delimiter='\t')
    w = writer(outfile, delimiter='\t')
    for i, (ngram, rows) in enumerate(groupby(r, key=get_ngram)):
        freq = sum(int(row[2]) for row in rows)
        w.writerow((ngram, freq))
        if not i % 10000:  # just so the loop is not too silent ...
            print('Processing ngram', i)  # ... and gives you an idea of progress

The csv classes just take over the parsing and writing. csv.reader is a lazy iterator over lists of strings. groupby groups the rows produced by the reader by their first field, using the key parameter with an appropriate function; itemgetter(0) is used just to avoid the clunkier key=lambda x: x[0]. groupby yields pairs of the key value and an iterator over the consecutive elements that share that value. The loop then sums the frequencies of each group's rows and writes only the ngram and its total frequency to the output file via csv.writer.
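
Once freq.txt exists, lookups become cheap. As a minimal sketch (assuming the file was produced by the code above), you could load it into a dict once and then query any ngram in constant time:

from csv import reader

with open('freq.txt', 'rt', encoding='UTF8') as f:
    # one pass to build an in-memory ngram -> total frequency mapping
    frequencies = {ngram: int(freq) for ngram, freq in reader(f, delimiter='\t')}

print(frequencies.get('Ajax and Achilles ?', 0))

If even the aggregated file turns out to be too large to hold in memory, you could instead binary-search it on disk (it is still sorted by ngram) or load it into a lightweight database such as sqlite3.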

Upvotes: 1
