Roger

Reputation: 395

Convert string to dataframe, separated by period

I have a string that comes from an article with a few hundred sentences. I want to convert the string to a dataframe, with each sentence as a row. For example,

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'

I hope it becomes:

This is a book, to which I found exciting.
I bought it for my cousin.
He likes it.

As a Python newbie, this is what I tried:

from io import StringIO
import pandas as pd
data_csv = StringIO(data)
data_df = pd.read_csv(data_csv, sep = ".")

With the code above, all sentences become column names. I actually want them in rows of a single column.

Upvotes: 0

Views: 1391

Answers (4)

DeepSpace

Reputation: 81614

Don't use read_csv. Just split by '.' and use the standard pd.DataFrame:

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
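# strip() removes the leading space left after each '.', and drops the empty trailing piece
data_df = pd.DataFrame([sentence.strip() for sentence in data.split('.') if sentence.strip()],
                       columns=['sentences'])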
print(data_df)

#                                     sentences
#  0  This is a book, to which I found exciting
#  1                  I bought it for my cousin
#  2                                He likes it

Keep in mind that this will break if any sentence contains a floating-point number such as 3.14, because the decimal point is also treated as a separator. In that case you will need to change the format of your string (e.g. use '\n' instead of '.' to separate sentences).
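If changing the input format is not an option, one possible workaround is to split only on whitespace that follows sentence-ending punctuation, so a decimal point inside a number is left alone. This is a minimal sketch, not part of the original answer: the helper name split_sentences and the sample text with a price are made up for illustration, and the heuristic still splits wrongly after abbreviations such as "Dr.".

```python
import re

def split_sentences(text):
    """Split text on whitespace that directly follows '.', '!' or '?'.

    A '.' inside a number such as 3.14 is followed by a digit, not
    whitespace, so it does not trigger a split.
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

# Hypothetical sample text containing a floating-point number
data = 'The book costs 3.14 dollars. I bought it for my cousin. He likes it.'
sentences = split_sentences(data)
print(sentences)

# The resulting list can then be passed to pandas, for example:
# data_df = pd.DataFrame({'sentences': sentences})
```

Unlike the plain split('.'), this keeps the closing punctuation attached to each sentence.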

Upvotes: 5

cdwoelk

Reputation: 101

What you are trying to do is called sentence tokenization. The easiest way is to use a text-mining library such as NLTK:

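import pandas as pd
# sent_tokenize needs NLTK's sentence tokenizer data, downloaded once via nltk.download
from nltk.tokenize import sent_tokenize
pd.DataFrame(sent_tokenize(data))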

Otherwise you could simply try something like:

pd.DataFrame(data.split('. '))

However, this will fail if you run into sentences like this:

problem = 'Tim likes to jump... but not always!'
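To make the failure concrete, here is a small sketch comparing the naive split with a slightly stricter regular expression that only splits when the next sentence starts with a capital letter. The function name split_before_capital and the extended sample string are assumptions added for illustration; this is still a heuristic (it will split after "Dr. Smith", for instance), not a replacement for a real tokenizer.

```python
import re

def split_before_capital(text):
    """Split on whitespace that follows '.', '!' or '?' and precedes a capital letter."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

# Hypothetical extension of the problem string with a second sentence
problem = 'Tim likes to jump... but not always! He rests on Sundays.'

naive = problem.split('. ')
print(naive)    # the ellipsis is cut in the middle, and '! ' is not treated as a boundary

better = split_before_capital(problem)
print(better)   # the ellipsis stays inside the first sentence
```

The lowercase "but" after the ellipsis is what prevents the second version from splitting there.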

Upvotes: 0

jpp

Reputation: 164693

You can achieve this via a list comprehension:

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'

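# rstrip('.') drops the final period first, so the last sentence does not end up with two
df = pd.DataFrame({'sentence': [i+'.' for i in data.rstrip('.').split('. ')]})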

print(df)

#                                      sentence
# 0  This is a book, to which I found exciting.
# 1                  I bought it for my cousin.
# 2                                He likes it.

Upvotes: 1

gyx-hh

Reputation: 1431

This is a quick solution, but it solves your issue:

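from io import StringIO  # read_csv expects a path or file-like object, not the raw text
data_df = pd.read_csv(StringIO(data), sep=".", header=None).T.dropna()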

Upvotes: 1
