Reputation: 697
I am using the pandas library with Python 3.5.1. How can I remove HTML tags from the values in a column? Here are my input and output:
My code returned an error:
import pandas as pd
code=[1,2,3]
overview = ['<p>Environments subject.</p>',
            '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
            '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">']
# '<p class="SSPBodyText" style="padding: 0cm; text-align: justify;">The subject.</p>']
df= pd.DataFrame(overview,code)
df.columns = ['overview']
df['overview_copy'] = df['overview']
# print(df)
tags_list = ['<p>', '</p>', '<p*>',
             '<ul>', '</ul>',
             '<li>', '</li>',
             '<br>',
             '<strong>', '</strong>',
             '<span*>', '</span>',
             '<a href*>', '</a>',
             '<em>', '</em>']
for tag in tags_list:
    # df['overview_copy'] = df['overview_copy'].str.replace(tag, '')
    df['overview_copy'].replace(to_replace=tag, value='', regex=True, inplace=True)
print(df)
Upvotes: 9
Views: 21374
Reputation: 405745
Note that if the column of data containing HTML tags is already in a Python list, it is much faster to remove the tags before you create the DataFrame. (This will not always be possible when loading data from an external source.) Even for this small example, it is consistently about 10 times faster.
import re
import pandas as pd
from timeit import default_timer as timer
code = [1, 2, 3]
overview = ['<p>Environments subject.</p>',
            '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
            '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">']
# '<p class="SSPBodyText" style="padding: 0cm; text-align: justify;">The subject.</p>']
df = pd.DataFrame({'overview': overview, 'code': code})
start = timer()
overview = [re.sub(r'<[^<]+?>', '', text) for text in overview]
end = timer()
re_sub_time = end - start
print("re_sub time:", re_sub_time)
start = timer()
df['overview_copy'] = df['overview'].str.replace(r'<[^<>]*>', '', regex=True)
# df['overview_copy'] = df['overview'].str.replace(r'<[^<]+?>', '', regex=True)
end = timer()
str_replace_time = end - start
print("Pandas str.replace time:", str_replace_time)
print("Ratio:", str_replace_time / re_sub_time)
Note that the speed improvement is not caused by the slight difference between the two regular expressions used above (the one in re.sub and the one in str.replace). I tested both regular expressions, and stripping tags in the list is faster with either one.
Output:
re_sub time: 8.690000000000087e-05
Pandas str.replace time: 0.0010488999999999082
Ratio: 12.070195627156476
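A single timer() measurement at this scale (tens of microseconds) can be noisy, so here is a minimal sketch of the same comparison using timeit.repeat and taking the best of several runs. The helper names strip_in_list and strip_in_pandas and the repeat counts are my own choices for illustration, not part of the code above, and the exact ratio will vary by machine and pandas version.

```python
import re
import timeit

import pandas as pd

# Same pattern for both approaches, so only the mechanism differs.
TAG_RE = re.compile(r'<[^<>]*>')

overview = [
    '<p>Environments subject.</p>',
    '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
    '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">',
]
df = pd.DataFrame({'overview': overview})


def strip_in_list():
    # Plain-Python approach: clean the strings before building a DataFrame.
    return [TAG_RE.sub('', text) for text in overview]


def strip_in_pandas():
    # pandas approach: clean a column that already lives in a DataFrame.
    return df['overview'].str.replace(r'<[^<>]*>', '', regex=True)


# Run each approach 200 times per measurement and keep the best of 5
# measurements, which reduces the effect of one-off timing noise.
list_time = min(timeit.repeat(strip_in_list, number=200, repeat=5))
pandas_time = min(timeit.repeat(strip_in_pandas, number=200, repeat=5))

print("list comprehension:", list_time)
print("Series.str.replace:", pandas_time)
print("ratio:", pandas_time / list_time)
```

Both functions should produce the same cleaned strings; only the time taken is expected to differ.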
Upvotes: 0
Reputation: 626758
The Pandas way is using Series.str.replace:
df['overview_copy'] = df['overview_copy'].str.replace(r'<[^<>]*>', '', regex=True)
Details:
< - a < char
[^<>]* - zero or more chars other than < and >, as many as possible
> - a > char.
See the regex demo.
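To see what the pattern actually matches, here is a small sketch using only the standard re module on the sample strings from the question. The variable names are just for illustration. The last line also shows a limitation that I believe is worth keeping in mind: plain text containing a literal < followed later by a > is treated as a tag too.

```python
import re

TAG_RE = re.compile(r'<[^<>]*>')

samples = [
    '<p>Environments subject.</p>',
    '<ul><li> property ;</li></ul><ul><li>markets and exchange;</li></ul>',
    '<p class="MsoNormal" style="margin: 0cm 0cm 0pt;">',
]

for text in samples:
    # findall lists every substring the pattern treats as a tag;
    # sub shows what is left after removing them.
    print(TAG_RE.findall(text), '->', repr(TAG_RE.sub('', text)))

# Caveat: the pattern knows nothing about HTML, so a literal
# "< ... >" span in ordinary text is removed as well.
caveat = TAG_RE.sub('', '1 < 2 and 3 > 2')
print(repr(caveat))
```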
Pandas output:
>>> df['overview_copy']
1 Environments subject.
2 property ;markets and exchange;
3
Name: overview_copy, dtype: object
>>>
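As for why entries such as '<p*>' in the question's tags_list do not strip tags with attributes: with regex=True the * is a regex quantifier, not a shell-style wildcard, so '<p*>' means "a <, then zero or more p characters, then a >". A minimal sketch with the standard re module, using the commented-out sample row from the question:

```python
import re

text = ('<p class="SSPBodyText" style="padding: 0cm; '
        'text-align: justify;">The subject.</p>')

# '<p*>' matches things like '<>', '<p>' or '<pp>', but not a <p ...> tag
# with attributes, and not '</p>', so the string is left unchanged.
glob_style = re.sub(r'<p*>', '', text)
print(glob_style == text)

# '<p[^<>]*>' matches '<p' followed by anything up to the closing '>',
# so the opening tag with attributes is removed (the '</p>' remains).
opening_only = re.sub(r'<p[^<>]*>', '', text)
print(opening_only)

# The general pattern from this answer removes both tags.
all_tags = re.sub(r'<[^<>]*>', '', text)
print(all_tags)
```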
Upvotes: 15