Superdooperhero

Reputation: 8096

How to append strings to a pandas dataframe?

I have the following code to extract sentences out of a directory of text files.

# -*- coding: utf-8 -*-
import os

from nltk.tokenize import sent_tokenize
import pandas as pd

directory_in_str = "E:\\Extracted\\"
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    with open(os.path.join(directory_in_str, filename), encoding="utf8") as f_in:
        for line in f_in:
            sentences = sent_tokenize(line)

I would like to build up a pandas DataFrame and append the sentences to it, so that I can compute a frequency count of the n-grams in the sentences, as described in "How to find ngram frequency of a column in a pandas dataframe?"

That is to say, I need to append the sentences to df = pd.DataFrame([], columns=['description']) so that I can then do:

from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
frequencies = sum(sparse_matrix).toarray()[0]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])

What would be the code to append the sentences to the df DataFrame?

Upvotes: 0

Views: 4040

Answers (1)

cs95

Reputation: 402553

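Your extraction code needs a slight change. As written, sentences is reassigned on every line, so only the sentences from the last line read survive. Declare the list once, outside the loops, and keep extending it as you go.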

sentences = []
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    with open(os.path.join(directory_in_str, filename), encoding="utf8") as f_in:
        for line in f_in:
            sentences.extend(sent_tokenize(line))

Once done, simply initialise your df like this (the column is named 'description' so that the df['description'] lookup in your n-gram code works):

df = pd.DataFrame({'description' : sentences})
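
For reference, here is a minimal end-to-end sketch of the same idea that runs without your files. Note the assumptions: split_sentences is a hypothetical regex stand-in for sent_tokenize (so the sketch does not need the NLTK data download), sentences_from_directory is a helper name made up for illustration, and the two sample files are created in a temporary directory instead of E:\Extracted\.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def split_sentences(text):
    # Hypothetical stand-in for nltk.tokenize.sent_tokenize (assumption):
    # split after ., ! or ? when followed by whitespace, dropping empty pieces.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentences_from_directory(directory):
    # Collect every sentence from every file into ONE list,
    # declared once outside the loops and extended line by line.
    sentences = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with open(path, encoding="utf8") as f_in:
            for line in f_in:
                sentences.extend(split_sentences(line))
    return sentences


# Sample input files (made up for this sketch).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("First sentence. Second sentence!\n", encoding="utf8")
    Path(tmp, "b.txt").write_text("Third one?\n\nFourth.\n", encoding="utf8")

    # Build the DataFrame once, from the finished list.
    df = pd.DataFrame({"description": sentences_from_directory(tmp)})

print(df)
```

Building the list first and constructing the DataFrame once at the end is generally cheaper than growing a DataFrame row by row, since each row-wise append copies the existing data.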

Upvotes: 1
