Reputation: 9345
I have a Pandas DataFrame with a text column containing HTML. I want to keep just the text, i.e. strip the tags. This is what I tried:
from bs4 import BeautifulSoup
result_df['text'] = BeautifulSoup(result_df['text']).get_text()
However, I end up getting this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What am I doing incorrectly?
Thanks!
Upvotes: 2
Views: 6234
Reputation: 21
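import pandas as pd
from bs4 import BeautifulSoup
df = pd.read_csv("./testlog.tsv", index_col=0, delimiter='\t')
# Parse each review on its own and keep only its text content
df['review'] = [BeautifulSoup(html, 'html.parser').get_text() for html in df['review']]
This removes the HTML tags from the review column of the DataFrame loaded from testlog.tsv. The file on disk is not changed unless you write the DataFrame back out.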
Upvotes: 1
Reputation: 21643
Alternatively, you could use apply, although I doubt it makes much difference.
>>> import pandas as pd
>>> data = {'a': ['<div><span>something</span></div>', '<a href="nowhere.org">erowhon</a>']}
>>> df = pd.DataFrame(data)
>>> df
                                   a
0  <div><span>something</span></div>
1  <a href="nowhere.org">erowhon</a>
>>> import bs4
>>> df['a'] = df['a'].apply(lambda x: bs4.BeautifulSoup(x, 'lxml').get_text())
>>> df
           a
0  something
1    erowhon
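A small variation on the same apply approach, offered as a sketch rather than a required change: it uses 'html.parser', the parser that ships with Python, so lxml does not need to be installed, and it passes separator and strip to get_text() so that text from adjacent tags is not glued together. The helper name strip_tags and the sample data are made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup


def strip_tags(html: str) -> str:
    """Return the text content of one HTML string (hypothetical helper)."""
    # 'html.parser' is bundled with Python, so no extra parser package is needed.
    soup = BeautifulSoup(html, "html.parser")
    # separator=" " puts a space between pieces of text from different tags;
    # strip=True trims whitespace around each piece.
    return soup.get_text(separator=" ", strip=True)


df = pd.DataFrame({"a": ["<p>Hello</p><p>world</p>",
                         '<a href="nowhere.org">erowhon</a>']})
df["a"] = df["a"].apply(strip_tags)
print(df["a"].tolist())  # ['Hello world', 'erowhon']
```

With a plain get_text() the first row would come out as 'Helloworld', which may or may not be what you want.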
Upvotes: 5
Reputation: 1337
Try this:
from bs4 import BeautifulSoup
# Build one BeautifulSoup object per cell instead of passing the whole column
result_df['text'] = [BeautifulSoup(text, 'html.parser').get_text() for text in result_df['text']]
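As for why the original call fails: BeautifulSoup expects a single markup string (or a file-like object), while result_df['text'] is a whole Series. The ValueError is the message pandas raises when a Series is evaluated as a single True/False value, which presumably happens somewhere inside the BeautifulSoup constructor. Parsing cell by cell avoids that. The sketch below does the same thing with map and, as an extra assumption about your data, skips cells that are not strings (for example missing values). The helper name html_to_text and the sample rows are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup


def html_to_text(value):
    """Strip tags from one cell; hypothetical helper, not part of bs4 or pandas."""
    # Missing values and other non-string cells are returned unchanged,
    # because BeautifulSoup only accepts markup strings or file-like objects.
    if not isinstance(value, str):
        return value
    return BeautifulSoup(value, "html.parser").get_text()


# Made-up sample data with one missing entry.
result_df = pd.DataFrame({"text": ["<b>bold</b> move", None, "<p>plain</p>"]})
result_df["text"] = result_df["text"].map(html_to_text)
print(result_df)
```

If the column is guaranteed to contain only strings, the isinstance check can be dropped and the list comprehension above is equivalent.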
Upvotes: 14