Bhargav Panth

Reputation: 311

Applying nltk.FreqDist after splitting a CSV

I've been trying to work with a dataset that uses | as a delimiter and \n for new lines:

a | b | c
c | e | f

I have been trying to split each row with row.split('|') and apply nltk.FreqDist(rec) to each resulting piece.

Here's my source code

import nltk
import csv
from nltk.util import ngrams

with open('CG_Attribute.csv', 'r') as f:
    for row in f:
        splitSet = row.split('|')
        for rec in splitSet:
            # token = nltk.word_tokenize(rec)
            result = nltk.FreqDist(rec)
            print(result)

The output that I am getting is as follows

<FreqDist with 14 samples and 22 outcomes>
<FreqDist with 8 samples and 9 outcomes>
<FreqDist with 1 samples and 1 outcomes>
<FreqDist with 26 samples and 44 outcomes>
<FreqDist with 6 samples and 8 outcomes>

What I am expecting is

[('a',1),('b',1),('c',2),('e',1),('f',1)]

Can anyone point out where I am screwing up? Any suggestions would help :)

PS - I even tried the csv module, but had no luck

Upvotes: 0

Views: 752

Answers (1)

Ilia Kurenkov

Reputation: 647

You seem to be missing a couple of steps along the way, sir.

When you iterate over the rows in the file, splitting them by "|", your result is actually a sequence of lists:

row1: ["a ", " b ", " c "]
row2: ["c ", " e ", " f "]

What I think you want (correct me if I'm wrong) is to stitch these lists into one big one so that you can count frequencies of items in the whole file. You can do this with something like the following:

with open('CG_Attribute.csv') as f:
    tokens = [token for row in f for token in row.split("|")]
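
With the two sample rows, tokens would come out as something like ['a ', ' b ', ' c\n', 'c ', ' e ', ' f'] (note the leftover whitespace and newlines; more on stripping those at the end).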

Now that you have all your words in one list, you can count their frequencies. Based on the output data you describe, I actually think nltk.FreqDist is overkill for this and you should be just fine with collections.Counter.

from collections import Counter
token_counts = Counter(tokens)
# if using python 2
token_count_tuples = token_counts.items()

Note that since FreqDist inherits from Counter, you can easily substitute it in the snippet above in case you still really want to use it.
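
For instance, a minimal sketch of that substitution (assuming the tokens list from the snippet above):

import nltk

fdist = nltk.FreqDist(tokens)
token_count_tuples = fdist.items()  # drop-in replacement for Counter
# fdist.most_common() gives the same pairs sorted by frequency, if you prefer that order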

If you're using Python 3, Counter.items() returns a view, not a list, so you have to explicitly convert it:

token_count_tuples = list(token_counts.items())

Et voilà, you have your tokens paired up with their respective counts!

One final note: you may have to call str.strip() on your tokens, because I don't think splitting by "|" will remove the whitespace between the words and the delimiters. But that depends on what your real data looks like and whether you want to take spaces into account or not.
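
For example, a minimal end-to-end sketch with the stripping included (assuming the file contains exactly the two sample rows from the question):

from collections import Counter

with open('CG_Attribute.csv') as f:
    tokens = [token.strip() for row in f for token in row.split("|")]

token_count_tuples = list(Counter(tokens).items())
print(token_count_tuples)
# e.g. [('a', 1), ('b', 1), ('c', 2), ('e', 1), ('f', 1)]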

Upvotes: 3
