Reputation: 1079
I have a text file, from which I have removed symbols and stop words.
I have also tokenized it (broken it into a list of all words) in case operations are easier with a list.
I would like to create a .csv file with the frequencies of all words (long format) in descending order. How could I go about it?
I have thought about looping through the list thus:
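import pandas

longData = pandas.DataFrame([], index=[], columns=['Frequency'])
for word in tokenizedFile:
    if word in longData.index:
        # word already seen: increment its count
        longData.loc[word] = longData.loc[word] + 1
    else:
        # DataFrame.append returned a new frame instead of modifying longData
        # (and was removed in pandas 2.0), so add the new row by assignment
        longData.loc[word] = 1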
but that seems pretty inefficient and wasteful.
Upvotes: 1
Views: 573
Reputation: 1
If anyone is still struggling with this, you could try the following method:
import pandas as pd

# tokenizedFile is a list, so each word is lowercased individually
df = pd.DataFrame({"words": [word.lower() for word in tokenizedFile]})
value_count = df["words"].value_counts()  # getting the count of all the words
# storing the words and their respective counts in a new dataframe
# value_count.keys() are the words, value_count.values are the counts
vocabulary_df = pd.DataFrame({"words": value_count.keys(), "count": value_count.values})
What this does is:
It creates a dataframe from the list (tokenizedFile), converting all the words to lowercase. The dataframe has a single column titled words, and its data is all the words from the file.
The value_count variable stores the number of times each word appears in the df dataframe, using the value_counts method. By default, the result is sorted in descending order of count.
vocabulary_df then stores all the words and their counts in a new dataframe (value_count itself is a Series). Here, value_count.keys() has the words, and value_count.values has the count of each word.
Hopefully, this will be helpful to someone along the line. :)
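As a minimal end-to-end sketch of the steps above, assuming tokenizedFile is a plain Python list of strings (the sample tokens and the csv_text name are made up for illustration, not part of the question):

```python
import pandas as pd

# Hypothetical sample tokens standing in for the question's tokenizedFile
tokenizedFile = ["Apple", "banana", "apple", "cherry", "banana", "apple"]

# Lowercase every token, then count; value_counts() sorts by count, descending
words = pd.Series([w.lower() for w in tokenizedFile], name="words")
value_count = words.value_counts()

# Long format: one row per word with its count
vocabulary_df = pd.DataFrame(
    {"words": value_count.index, "count": value_count.values}
)

# With no path argument, to_csv() returns the CSV content as a string
csv_text = vocabulary_df.to_csv(index=False)
```

Passing a path instead, for example vocabulary_df.to_csv("word_frequencies.csv", index=False), should write the same content to a file (the filename is only an example).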
Upvotes: 0
Reputation: 13998
You can use Series.str.extractall() and Series.value_counts(). Assume file.txt is the path to the text file with symbols and stop words already removed:
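import pandas as pd

# read the file into a one-column dataframe, one row per line; the default column label is the integer 0
# (recent pandas versions reject sep='\n' in read_csv, so the lines are read directly)
with open('file.txt') as f:
    df = pd.DataFrame(f.read().splitlines())
# extract words into rows and then do value_counts()
words_count = df[0].str.extractall(r'(\w+)')[0].value_counts()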
The result, words_count, is a Series, which you can convert to a dataframe with:
df_new = words_count.to_frame('words_count')
Upvotes: 0
Reputation: 109
If your text is a list of strings like the ones below:
from sklearn.feature_extraction import text
texts = [
    'this is the first text',
    'this is the second text',
    'and this is the last text the have two word text'
]
# instantiate the vectorizer
cv = text.CountVectorizer()
cv.fit(texts)
vectors = cv.transform(texts).toarray()
You will need to explore CountVectorizer's parameters further.
Upvotes: 0
Reputation: 154
Counter would be good here:
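from collections import Counter
import pandas as pd

c = Counter(tokenizedFile)
# list() turns the dict views into plain lists for the DataFrame constructor
longData = pd.DataFrame(list(c.values()), index=list(c.keys()), columns=['Frequency'])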
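If pandas is not strictly needed, the descending-order CSV the question asks for can probably be produced with the standard library alone. The sketch below is one way to do it; the sample tokens, the write_frequencies helper, and the column headers are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample tokens standing in for the question's tokenizedFile
tokenizedFile = ["apple", "banana", "apple", "cherry", "banana", "apple"]

c = Counter(tokenizedFile)


def write_frequencies(counter, out):
    """Write Word,Frequency rows to a file-like object, most frequent first."""
    writer = csv.writer(out)
    writer.writerow(["Word", "Frequency"])
    # most_common() yields (word, count) pairs in descending order of count
    writer.writerows(counter.most_common())


# An in-memory buffer keeps the example self-contained
buffer = io.StringIO()
write_frequencies(c, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, the same helper can be given a real file, for example open("word_frequencies.csv", "w", newline="") inside a with block (the filename is only an example).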
Upvotes: 1