Most frequent ngrams in a CSV using nltk

Question

I have a CSV file with over a million tweets. I have sanitized the data, and I want to find the most frequent two-, three-, and four-word phrases that occur across the file.

I am importing the CSV as a list. It is important that bigrams and trigrams are generated within each line of the list, i.e. for a list: 'Sally is great', 'Bob is not'

The bigrams should be 'Sally is', 'is great', 'Bob is', 'is not'

And not 'great Bob' (i.e. rows should not be concatenated).

Here's the code:

#import necessary packages
import csv

#read csv
with open("small_sample.csv", 'r') as f:
    reader = csv.reader(f, delimiter=',')
    dfl = list(reader)

#import ngrams function
from nltk import ngrams
#store bigrams in string_bigrams
string_bigrams=''
n=2
for line in dfl:
    string_bigrams += ngrams(str(line).split(),n)

Edit: Since I cannot use += with a generator object, and converting the ngrams to a string does not give the required results, I used itertools.chain to add to the generator object.

Updated code:

for line in dfl:
    string_bigrams = itertools.chain(string_bigrams, ngrams(str(line).split(), n))

However, the output here has '[' and ']' attached to the words, i.e. if the list is saved as ['Sally is great', 'Bob is not'], string_bigrams returns

("['Sally", 'is')
('is', "great']")
("['Bob", 'is')
('is', "not']")

Expected output is

('Sally', 'is')
('is', 'great')
('Bob', 'is')
('is', 'not')

Why are the [] appended?

alexis · Accepted Answer

("['Sally", 'is')
('is', "great']")
Why are the [] appended?

It's not just the brackets; you also have stray quotes. This clearly comes from applying str to a list, which novice Python programmers often do to paper over an error instead of figuring out where it's coming from.

Where it's coming from must be this: your "csv file" doesn't actually have columns; it just has one message per line. But the csv module always returns the contents of each row as a list of columns, meaning that the variable line is a one-element list that looks like this:

['Sally is great']
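
To see this for yourself, here is a minimal sketch that reproduces the behaviour. It assumes a two-line, comma-free input; the in-memory sample and the names rows, bad_tokens and good_tokens are made up for illustration and are not from the question.

```python
import csv
import io

# Hypothetical stand-in for the file: one tweet per line, no commas.
sample = io.StringIO("Sally is great\nBob is not\n")

rows = list(csv.reader(sample, delimiter=','))
# Each row is a list of columns, even when there is only one column.
first_row = rows[0]

# str() on a list gives its repr, brackets and quotes included, so
# splitting on whitespace leaves that punctuation stuck to the words.
bad_tokens = str(first_row).split()

# Picking out the single column first gives the clean words.
good_tokens = first_row[0].split()

print(rows)
print(bad_tokens)
print(good_tokens)
```

The second printed list is where the "['Sally" and "great']" tokens in your output come from.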

To fix the problem, initialize string_bigrams to an empty list and change this

string_bigrams += ngrams(str(line).split(),n)

to this:

string_bigrams.extend( ngrams(line[0].split(), n) )

And never, ever apply str to a list again.
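
Since the end goal is the most frequent phrases, a collections.Counter may be more convenient than a flat list. The following is only a sketch under some assumptions: the rows are hypothetical sample data shaped like csv.reader output, and line_ngrams is a plain-Python stand-in I wrote so the example runs without nltk installed. With nltk available, ngrams(tokens, n) should be usable in its place.

```python
from collections import Counter

def line_ngrams(tokens, n):
    """Return an iterator of n-word tuples (hypothetical stand-in for nltk's ngrams)."""
    return zip(*(tokens[i:] for i in range(n)))

# Hypothetical rows, shaped like csv.reader output for a one-column file.
rows = [['Sally is great'], ['Bob is not'], ['Sally is not']]

n = 2
counts = Counter()
for row in rows:
    if not row:  # csv.reader returns an empty list for a blank line
        continue
    # n-grams are built one row at a time, so they never span two tweets.
    counts.update(line_ngrams(row[0].split(), n))

top = counts.most_common(2)
print(top)
```

Changing n to 3 or 4 gives the longer phrases. One caveat: if a tweet contains an unquoted comma, csv.reader will split it into several columns, and row[0] will only hold the first part.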
