nid

Reputation: 155

How to remove HTML from pandas dataframe without list comprehension

Problem definition

The goal is to strip the HTML tags from each row and save the cleaned text back into the dataframe.

The dataframe is defined as:

import pandas as pd

test = pd.DataFrame(data=["<p> test 1 </p>", "<p> random text </p>"], columns=["text"])

I already found this elegant answer [1], which solves the problem. However, out of curiosity, I want to achieve the same result using a for loop.

Solution with list comprehension:

from bs4 import BeautifulSoup

test['text'] = [BeautifulSoup(text, "lxml").get_text() for text in test['text']]

Attempts with a for loop, building up the solution step by step:

First attempt:

This code has the variable text iterate over every element of the column test['text'] and prints each one. So far so good.

for text in test['text']:
    print(text)

Second attempt:

This code does the same thing, but prints the text with the tags stripped.

for text in test['text']:
    soup = BeautifulSoup(text, "lxml")
    print(soup.get_text())

Third attempt:

Why is the result of this code a dataframe whose values are all "random text"?

test = pd.DataFrame(data=["<p> test 1 </p>", "<p> random text </p>"], columns=["text"])

for text in test['text']:
    soup = BeautifulSoup(text, "lxml")
    test["text"] = soup.get_text()

In the first iteration, the loop variable text holds the first element of the column, which is "<p> test 1 </p>". The code turns it into a soup and assigns the extracted text to the column "text" of the dataframe test. The same thing should happen in the second iteration. Yet all that happens is that the value from the last iteration is broadcast over the whole column.

I think my last line of code actually broadcasts the same value to all rows of the dataframe. But how do I modify only the row that the variable text came from in a given iteration?
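A minimal sketch of the difference, assuming the only goal is to see why the column ends up uniform: assigning a single string to a whole column sets every row, while writing through .at with a row label changes one cell. The dataframe demo and its values are hypothetical and chosen for illustration; no HTML parsing is involved.

```python
import pandas as pd

# Hypothetical two-row frame, standing in for the dataframe from the question.
demo = pd.DataFrame(data=["a", "b"], columns=["text"])

# Assigning a scalar to a whole column broadcasts it to every row.
demo["text"] = "x"
broadcast_result = demo["text"].tolist()

# Writing through .at with a row label changes only that one cell.
demo = pd.DataFrame(data=["a", "b"], columns=["text"])
for idx, value in demo["text"].items():
    demo.at[idx, "text"] = value.upper()
per_row_result = demo["text"].tolist()

print(broadcast_result)
print(per_row_result)
```

In the third attempt, the same per-row write would presumably look like test.at[idx, "text"] = soup.get_text(), with idx taken from test["text"].items().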

The whole post might look weird but I was thinking and testing while writing the post. I might find the solution myself and update the post. But I might stay stuck and need another perspective. Thank you for your time.

[1]: Pandas: Trouble Stripping HTML Tags From DataFrame Column

Upvotes: 0

Views: 3265

Answers (1)

Igor Dragushhak

Reputation: 637

You can use a regular expression to remove the tags.

import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
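To apply this to the dataframe from the question, one option is Series.apply, sketched below. The .str.strip() call is an assumption, added to drop the spaces left behind where the tags were. A regex like this is only adequate for simple markup; unlike BeautifulSoup, it does not decode entities or cope with malformed HTML.

```python
import re

import pandas as pd

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

test = pd.DataFrame(data=["<p> test 1 </p>", "<p> random text </p>"], columns=["text"])

# Apply the function to every value of the column, then trim surrounding spaces.
test["text"] = test["text"].apply(remove_tags).str.strip()

print(test["text"].tolist())
```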

Upvotes: 3
