I have the following code to extract sentences out of a directory of text files.
# -*- coding: utf-8 -*-
import os
from nltk.tokenize import sent_tokenize
import pandas as pd

directory_in_str = "E:\\Extracted\\"
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    with open(os.path.join(directory_in_str, filename), encoding="utf8") as f_in:
        for line in f_in:
            sentences = sent_tokenize(line)
I would like to build up a pandas dataframe and append the sentences to that dataframe so that I can build a frequency count of the n-grams in the sentences as per How to find ngram frequency of a column in a pandas dataframe?
That is to say, I need to append the sentences to df = pd.DataFrame([], columns=['description'])
so that I can then do:
from sklearn.feature_extraction.text import CountVectorizer

word_vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])
What would be the code to append the sentences to the df DataFrame?
Your extraction code needs a slight change: declare sentences outside the loop and keep extending it as needed.
sentences = []
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    with open(os.path.join(directory_in_str, filename), encoding="utf8") as f_in:
        for line in f_in:
            sentences.extend(sent_tokenize(line))
Once done, simply initialise your df like this (note the lowercase column name, so it matches the df['description'] lookup in your CountVectorizer code):

df = pd.DataFrame({'description': sentences})