Reputation: 395
I have a string that comes from an article with a few hundred sentences. I want to convert the string to a dataframe, with each sentence as a row. For example,
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
I hope it becomes:
This is a book, to which I found exciting.
I bought it for my cousin.
He likes it.
As a Python newbie, this is what I tried:
from io import StringIO
import pandas as pd
data_csv = StringIO(data)
data_df = pd.read_csv(data_csv, sep = ".")
With the code above, all sentences become column names. I actually want them in rows of a single column.
Upvotes: 0
Views: 1391
Reputation: 81614
Don't use read_csv. Just split by '.' and use the standard pd.DataFrame:
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
# Strip each piece so sentences do not keep the leading space left by the split
data_df = pd.DataFrame([s.strip() for s in data.split('.') if s.strip()],
                       columns=['sentences'])
print(data_df)
# sentences
# 0 This is a book, to which I found exciting
# 1 I bought it for my cousin
# 2 He likes it
Keep in mind that this will break if any of the sentences contain floating point numbers. In that case you will need to change the format of your string (e.g. use '\n' instead of '.' to separate sentences).
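If changing the input format is not an option, one possible workaround for the floating point caveat is to split only where a full stop is followed by whitespace. The sketch below is an illustration rather than part of the original answer; the helper name split_sentences and the sample text are made up for the example.

```python
import re

def split_sentences(text):
    # Split on whitespace that directly follows a full stop.
    # The dot in "3.14" is followed by a digit, so it is left alone.
    parts = re.split(r'(?<=\.)\s+', text.strip())
    return [part for part in parts if part]

sentences = split_sentences('Pi is about 3.14. It is irrational. I like it.')
print(sentences)
```

The resulting list can be passed to pd.DataFrame(sentences, columns=['sentences']) as before. Note that this keeps the trailing full stop on each sentence, and it will still split after abbreviations such as "Mr." when they are followed by a space.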
Upvotes: 5
Reputation: 101
What you are trying to do is called tokenizing sentences. The easiest way would be to use a Text-Mining library such as NLTK for it:
from nltk.tokenize import sent_tokenize
pd.DataFrame(sent_tokenize(data))
Otherwise you could simply try something like:
pd.DataFrame(data.split('. '))
However, this will fail if you run into sentences like this:
problem = 'Tim likes to jump... but not always!'
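If installing NLTK is not possible, a rough regex heuristic can cope with this particular case: split only when the punctuation is followed by whitespace and then a capital letter. This is a sketch under that assumption, not a replacement for a real tokenizer; the names SENTENCE_END and split_on_capital are hypothetical, and the second sentence was added to the example so there is something to split.

```python
import re

# Sentence boundary heuristic: '.', '!' or '?', then whitespace,
# then an uppercase ASCII letter starting the next sentence.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def split_on_capital(text):
    return [s for s in SENTENCE_END.split(text.strip()) if s]

problem = 'Tim likes to jump... but not always! He prefers to walk.'
result = split_on_capital(problem)
print(result)
```

The ellipsis is not treated as a boundary because the next word starts with a lowercase letter. The heuristic still breaks on abbreviations followed by a capitalised word (for example "Dr. Smith"), which is where sent_tokenize does a better job.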
Upvotes: 0
Reputation: 164693
You can achieve this via a list comprehension:
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
# Remove the final '.' before splitting so the last sentence does not end up with two
df = pd.DataFrame({'sentence': [i+'.' for i in data.rstrip('.').split('. ')]})
print(df)
# sentence
# 0 This is a book, to which I found exciting.
# 1 I bought it for my cousin.
# 2 He likes it.
Upvotes: 1
Reputation: 1431
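This is a quick solution, but it solves your issue. read_csv expects a path or a file-like object, so wrap the string in StringIO first, and drop the empty row produced by the trailing '.':
from io import StringIO
data_df = pd.read_csv(StringIO(data), sep=".", header=None).T.dropna()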
Upvotes: 1