I have the following code to extract sentences out of a directory of text files.
# -*- coding: utf-8 -*-
import os
from nltk.tokenize import sent_tokenize
import pandas as pd

directory_in_str = "E:\\Extracted\\"
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    with open(os.path.join(directory_in_str, filename), encoding="utf8") as f_in:
        for line in f_in:
            sentences = sent_tokenize(line)
I would like to build up a pandas dataframe and append the sentences to that dataframe so that I can build a frequency count of the n-grams in the sentences as per How to find ngram frequency of a column in a pandas dataframe?
That is to say, I need to append the sentences to df = pd.DataFrame([], columns=['description'])
so that I can then do:
from sklearn.feature_extraction.text import CountVectorizer

word_vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])
What would be the code to append the sentences to the df DataFrame?
Your extraction code needs a slight change: declare sentences outside the loop and keep extending it as needed.
sentences = []
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    with open(os.path.join(directory_in_str, filename), encoding="utf8") as f_in:
        for line in f_in:
            sentences.extend(sent_tokenize(line))
Once done, simply initialise your df like this (note the lowercase column name, so it matches the df['description'] lookup in your CountVectorizer code):

df = pd.DataFrame({'description': sentences})