lilipunk

Reputation: 183

Most frequent ngrams in a CSV using nltk

I have a CSV file with over a million tweets. I have sanitized the data and want to find the most frequent 2-, 3-, and 4-word phrases that occur across the file.

I am importing the CSV as a list. It is important that bigrams / trigrams are generated within each line of the list, i.e. for a list: 'Sally is great', 'Bob is not'

the bigrams should be 'Sally is', 'is great', 'Bob is', 'is not'

and not 'great Bob' (i.e. rows should not be concatenated).

Here's the code:

#import necessary packages
import csv
import itertools
#read csv
with open("small_sample.csv", 'r') as f:
    reader = csv.reader(f, delimiter=',')
    dfl = list(reader)

#import ngrams function
from nltk import ngrams
#store bigrams in string_bigrams
string_bigrams=''
n=2
for line in dfl:
    string_bigrams += ngrams(str(line).split(),n)

Edit: Since I cannot use += with a generator object, and converting the ngrams to a string does not give the required results, I used itertools.chain to add to the generator object.

Updated code:

for line in dfl:
    string_bigrams = itertools.chain(string_bigrams, ngrams(str(line).split(), n))

However, the output here has '[' and ']' attached to the words, i.e. if the list is saved as ['Sally is great','Bob is not'], string_bigrams returns

("['Sally", 'is')
('is', "great']")
("['Bob", 'is')
('is', "not']")

Expected output is

('Sally', 'is')
('is', 'great')
('Bob', 'is')
('is', 'not')

Why are the [] appended?

Upvotes: 1

Views: 2138

Answers (1)

alexis

Reputation: 50220

("['Sally", 'is')
('is', "great']")

Why are the [] appended?

It's not just the brackets; you also have stray quotes. This clearly comes from applying str to a list, which novice Python programmers often do to paper over an error instead of figuring out where it's coming from.

Where it's coming from must be this: Your "csv file" doesn't actually have columns, it's just got one message per line. But the csv module always returns the contents of each row as a list of columns, meaning that the variable line is a one-element list that looks like this:

['Sally is great']
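To see this for yourself, here is a small sketch that feeds two lines to csv.reader and then applies str to one row. The in-memory sample (io.StringIO) is an assumption standing in for your file.

```python
import csv
import io

# Hypothetical stand-in for the CSV file: one message per line, no columns.
sample = io.StringIO("Sally is great\nBob is not\n")

rows = list(csv.reader(sample, delimiter=','))
print(rows)  # [['Sally is great'], ['Bob is not']]

# str() on a row gives the list's printed form, brackets and quotes included,
# so splitting it leaves that punctuation stuck to the first and last words.
tokens = str(rows[0]).split()
print(tokens)  # ["['Sally", 'is', "great']"]
```

The brackets and quotes in your output are just the printed form of each one-element list.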

To fix the problem, initialize string_bigrams to an empty list and change this

string_bigrams += ngrams(str(line).split(),n)

to this:

string_bigrams.extend( ngrams(line[0].split(), n) ) 

And never, ever apply str to a list again.
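Putting it together, here is a minimal sketch of the whole task, including the "most frequent" part of the question. The helper name count_ngrams, the in-memory sample data, and the use of collections.Counter are my assumptions, not something from your code. The fallback definition of ngrams is only there so the sketch runs without nltk installed; for this purpose it yields the same tuples.

```python
import csv
import io
from collections import Counter

try:
    from nltk import ngrams
except ImportError:
    # Assumed stand-in for nltk.ngrams: yields tuples of n consecutive tokens.
    def ngrams(tokens, n):
        tokens = list(tokens)
        return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(rows, n):
    """Count n-grams per row, so n-grams never span two rows."""
    counts = Counter()
    for row in rows:
        if not row:  # csv.reader yields [] for blank lines
            continue
        # row is a one-element list; row[0] is the message text itself
        counts.update(ngrams(row[0].split(), n))
    return counts


# Hypothetical sample standing in for the file of tweets.
sample = io.StringIO("Sally is great\nBob is not\nSally is here\n")
bigram_counts = count_ngrams(csv.reader(sample), 2)
print(bigram_counts.most_common(1))  # [(('Sally', 'is'), 2)]
```

For a real file, pass csv.reader(f) from inside the with block instead of the sample. Note that row[0] assumes the message is the whole line: if a tweet contains an unquoted comma, csv.reader will split it into several columns, and if the file really has no columns you could skip the csv module and iterate over the file's lines directly.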

Upvotes: 1
