Hamideh

Reputation: 697

Removing html tags in pandas

I am using the pandas library on Python 3.5.1. How can I remove HTML tags from field values? My input and output were attached as an image.

My code returned an error:

import pandas as pd

code=[1,2,3]
overview =['<p>Environments subject.</p>',
          '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
          '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">']
# '<p class="SSPBodyText" style="padding: 0cm; text-align: justify;">The subject.</p>']
df= pd.DataFrame(overview,code)

df.columns = ['overview']
df['overview_copy'] = df['overview']

# print(df)

tags_list = ['<p>' ,'</p>' , '<p*>',
             '<ul>','</ul>',
             '<li>','</li>',
             '<br>',
             '<strong>','</strong>',
             '<span*>','</span>',
             '<a href*>','</a>',
             '<em>','</em>']

for tag in tags_list:
#     df['overview_copy'] = df['overview_copy'].str.replace(tag, '')
  df['overview_copy'].replace(to_replace=tag, value='', regex=True, inplace=True)
print(df)

Upvotes: 9

Views: 21374

Answers (3)

Bill the Lizard

Reputation: 405745

Note that if the column of data with HTML tags is available as a list, it is much faster to remove the tags before you create the dataframe. (This will not always be possible when loading data from an external source.) Even for this small example, it is consistently about 10 times faster.

import re
import pandas as pd
from timeit import default_timer as timer

code = [1, 2, 3]
overview = ['<p>Environments subject.</p>',
          '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
          '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">']
# '<p class="SSPBodyText" style="padding: 0cm; text-align: justify;">The subject.</p>']
df = pd.DataFrame({'overview': overview, 'code': code})

start = timer()
overview = [re.sub(r'<[^<]+?>', '', text) for text in overview]
end = timer()
re_sub_time = end - start
print("re_sub time:", re_sub_time)

start = timer()
df['overview_copy'] = df['overview'].str.replace(r'<[^<>]*>', '', regex=True)
# df['overview_copy'] = df['overview'].str.replace(r'<[^<]+?>', '', regex=True)
end = timer()
str_replace_time = end - start
print("Pandas str.replace time:", str_replace_time)

print("Ratio:", str_replace_time / re_sub_time)

Note that the speed improvement is not due to the slight difference in regular expressions used in the other examples. I tested both regular expressions, and stripping tags is faster in the list with either regex.

Output:

re_sub time: 8.690000000000087e-05
Pandas str.replace time: 0.0010488999999999082
Ratio: 12.070195627156476
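
A single timer() reading of a sub-millisecond call can be noisy, so here is a rough sketch of the same comparison using timeit.repeat and taking the best of several runs. The helper names (strip_in_list, strip_in_pandas, TAG_RE) are my own for illustration, and the numbers will vary by machine and pandas version.

```python
import re
import timeit

import pandas as pd

overview = [
    '<p>Environments subject.</p>',
    '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
    '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">',
]
df = pd.DataFrame({'overview': overview})

# Hypothetical helper names; the pattern is the one used above.
TAG_RE = re.compile(r'<[^<>]*>')


def strip_in_list():
    # Strip tags from the plain Python list.
    return [TAG_RE.sub('', text) for text in overview]


def strip_in_pandas():
    # Strip tags from the DataFrame column.
    return df['overview'].str.replace(r'<[^<>]*>', '', regex=True)


# Best of 5 repeats, 200 calls each, to reduce one-off overhead.
list_time = min(timeit.repeat(strip_in_list, number=200, repeat=5))
pandas_time = min(timeit.repeat(strip_in_pandas, number=200, repeat=5))

print("list time:", list_time)
print("pandas time:", pandas_time)
print("Ratio:", pandas_time / list_time)
```

Both functions return the same cleaned strings, so only the per-call overhead differs.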

Upvotes: 0

Wiktor Stribiżew

Reputation: 626758

The pandas way is to use Series.str.replace:

df['overview_copy'] = df['overview_copy'].str.replace(r'<[^<>]*>', '', regex=True)

Details:

  • < - a < char
  • [^<>]* - zero or more chars other than < and >, as many as possible
  • > - a > char.

See the regex demo.

Pandas output:

>>> df['overview_copy']
1               Environments subject.
2     property ;markets and exchange;
3                                    
Name: overview_copy, dtype: object
>>> 
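
To see what the pattern does outside of pandas, here is a minimal sketch with the standard re module on the sample strings from the question. TAG_RE is just a name I picked. The last line shows a caveat worth knowing: plain text that happens to contain both < and > is treated as a tag too.

```python
import re

# Same pattern as above: "<", any run of chars except "<" and ">", then ">".
TAG_RE = re.compile(r'<[^<>]*>')

overview = [
    '<p>Environments subject.</p>',
    '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
    '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">',
]

cleaned = [TAG_RE.sub('', text) for text in overview]
for text in cleaned:
    print(repr(text))
# 'Environments subject.'
# ' property ;markets and exchange;'
# ''

# A lone "<" with no closing ">" is left untouched ...
print(repr(TAG_RE.sub('', 'a < b')))            # 'a < b'
# ... but "<" followed later by ">" is removed like a tag.
print(repr(TAG_RE.sub('', '1 < 2 and 3 > 2')))  # '1  2'
```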

Upvotes: 15

Pobe

Reputation: 2793

Like so: re.sub(r'<[^<]+?>', '', text)

You can find a detailed answer there.
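
If it helps, the re.sub call can be wrapped in a small helper so it is safe to apply to a column that may contain missing values. This is only a sketch: strip_tags and TAG_RE are names I made up, and passing non-strings through unchanged is my assumption about what you would want.

```python
import re

# Precompile the pattern from the one-liner above.
TAG_RE = re.compile(r'<[^<]+?>')


def strip_tags(value):
    """Remove tag-like substrings; return non-string values unchanged."""
    if not isinstance(value, str):
        # e.g. None or NaN in a column with missing data
        return value
    return TAG_RE.sub('', value)


print(strip_tags('<p>Environments subject.</p>'))  # Environments subject.
print(strip_tags(None))                            # None
```

With a DataFrame it could then be applied as df['overview'].map(strip_tags).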

Upvotes: 8
