OC2PS

Reputation: 1079

Count word frequency of all words in a file

I have a text file, from which I have removed symbols and stop words.

I have also tokenized it (broken it into a list of all words) in case operations are easier with a list.

I would like to create a .csv file with frequencies of all words (long format) in descending order. How could I go about it?

I have thought about looping through the list thus:

import pandas

longData = pandas.DataFrame([], index=[], columns=['Frequency'])
for word in tokenizedFile:
    if word in longData.index:
        longData.loc[word, 'Frequency'] += 1
    else:
        # DataFrame.append returned a copy and was removed in pandas 2.0,
        # so add the new row by assigning to .loc instead
        longData.loc[word] = 1

but that seems pretty inefficient and wasteful.

Upvotes: 1

Views: 573

Answers (4)

bvt

Reputation: 1

If anyone is still struggling with this, you could try the following method:

import pandas as pd

# lowercase each token; a list has no .lower() method, so do it word by word
df = pd.DataFrame({"words": [word.lower() for word in tokenizedFile]})
value_count = df["words"].value_counts()  # count of each word, sorted descending
# store the words and their respective counts in a new dataframe
# value_count.keys() are the words, value_count.values are the counts
vocabulary_df = pd.DataFrame({"words": value_count.keys(), "count": value_count.values})

What this does:

  1. Takes the list of words (tokenizedFile) and converts every word to lowercase, then creates a dataframe with a single column named words that holds all the words from the file.
  2. The value_count variable stores the number of times each word appears in df, using the value_counts method available on a pandas Series. By default the result is sorted in descending order of count.
  3. The final line creates vocabulary_df, a new dataframe that stores each word and its count (value_count itself is a Series). Here, value_count.keys() holds the words and value_count.values holds the count of each word.
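
The three steps above can also be chained and turned into the CSV text the question asks for. This is only a sketch: the sample tokenizedFile list is made up, and the helper name word_frequencies and the column names word and count are my own choices, not something from the question.

```python
import pandas as pd

# Hypothetical stand-in for the tokenized file from the question
tokenizedFile = ["Apple", "banana", "apple", "cherry", "apple", "Banana"]

def word_frequencies(tokens):
    """Return a long-format DataFrame of word counts, most frequent first."""
    counts = pd.Series(tokens, dtype="object").str.lower().value_counts()
    # rename_axis labels the index (the words); reset_index turns it into a column
    return counts.rename_axis("word").reset_index(name="count")

vocabulary_df = word_frequencies(tokenizedFile)
# With no path argument, to_csv returns the CSV content as a string
csv_text = vocabulary_df.to_csv(index=False)
print(csv_text)
```

To write a file instead, pass a path of your choosing as the first argument to to_csv.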

Hopefully, this will be helpful to someone along the line. :)

Upvotes: 0

jxc

Reputation: 13998

You can use Series.str.extractall() and Series.value_counts(). Assume file.txt is the path to the text file from which symbols and stop words have already been removed:

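import pandas as pd

# read the file into a one-column dataframe, one row per line;
# the default column label is the integer 0
# (recent pandas versions reject sep='\n' in read_csv, so read the lines directly)
with open('file.txt') as f:
    df = pd.DataFrame(f.read().splitlines())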

# extract words into rows and then do value_counts()
words_count = df[0].str.extractall(r'(\w+)')[0].value_counts()

The result, words_count, is a Series, which you can convert to a dataframe with:

df_new = words_count.to_frame('words_count')
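
For comparison, the same extract-then-count idea can be sketched with only the standard library. This swaps str.extractall() for re.findall() and value_counts() for collections.Counter, so it is a different technique from the one above; the helper name count_words_in_lines and the sample lines are illustrative assumptions.

```python
import re
from collections import Counter

def count_words_in_lines(lines):
    """Count every run of word characters across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # same \w+ pattern as the extractall() call above
        counts.update(re.findall(r"\w+", line))
    return counts

# Made-up sample lines standing in for the contents of file.txt
sample_lines = ["red green red", "blue red green"]
counts = count_words_in_lines(sample_lines)
print(counts.most_common())  # (word, count) pairs, highest count first
```

With a real file, an open file object can be passed in place of sample_lines, since iterating over it yields one line at a time.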

Upvotes: 0

If your text is a list of strings like the ones below:

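from sklearn.feature_extraction import text

texts = [
    'this is the first text',
    'this is the second text',
    'and this is the last text the have two word text',
]

# instantiate the vectorizer
cv = text.CountVectorizer()

# learn the vocabulary, then build the document-term count matrix
cv.fit(texts)
vectors = cv.transform(texts).toarray()

# sum each column to get the total count per word
# (get_feature_names_out requires scikit-learn >= 1.0)
word_counts = dict(zip(cv.get_feature_names_out(), vectors.sum(axis=0)))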

You will need to explore CountVectorizer's parameters further.

Upvotes: 0

Sandalaphon

Reputation: 154

Counter would be good here:

    from collections import Counter
    import pandas as pd

    c = Counter(tokenizedFile)
    # list() hands pandas plain lists rather than dict views
    longData = pd.DataFrame(list(c.values()), index=list(c.keys()), columns=['Frequency'])
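
To get from the Counter to the descending-order .csv output the question asks for, something like the following might work. It is a sketch using only the standard library: most_common() returns the pairs sorted from highest to lowest count, and the function name write_frequencies, the sample tokens, and the header names are assumptions of mine.

```python
import csv
import io
from collections import Counter

def write_frequencies(tokens, out):
    """Write 'Word,Frequency' rows to a file-like object, most frequent first."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Word", "Frequency"])
    # most_common() with no argument returns every (word, count) pair, sorted by count
    writer.writerows(Counter(tokens).most_common())

# Hypothetical token list; an in-memory buffer stands in for the real file
tokenizedFile = ["cat", "dog", "cat", "bird", "cat", "dog"]
buffer = io.StringIO()
write_frequencies(tokenizedFile, buffer)
print(buffer.getvalue())
```

For a real file, open it with newline="" in write mode and pass the file object in place of the buffer.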

Upvotes: 1
