Reputation: 18047
I scrape a table from a web page and would like to rebuild it with all the HTML tags removed. Here is the source code.
import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.text)
table = soup.find('table')
for row in table.find_all('tr'):
    for col in row.find_all('td'):
        # remove all the different tags (attempts so far)
        # col.replace_with('')
        # col.decompose()
        # col.extract()
        col = col.contents
How can I remove all the different tags? Take the following cell as an example, which includes the tags a, br and td.
<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>
My expected result is:
Signal et Communication
Ingénierie Réseaux et Télécommunications
Upvotes: 5
Views: 742
Reputation: 474191
You are asking about get_text():
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.
td = soup.find("td")
td.get_text()
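As a minimal sketch of how get_text() behaves on the cell from the question: by default it keeps the whitespace found between the tags, while the separator and strip arguments give cleaner output. The " | " separator and the variable names are arbitrary choices for illustration, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# The cell from the question, written as a plain string.
cell = (
    '<td><a href="http://www.irit.fr/SC">Signal et Communication</a>\n'
    '<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>\n'
    '</td>'
)
td = BeautifulSoup(cell, "html.parser").td

# Default call: all text under the tag, original whitespace included.
raw = td.get_text()

# strip=True trims each text fragment and drops empty ones;
# separator is placed between the remaining fragments.
clean = td.get_text(separator=" | ", strip=True)

# stripped_strings yields the same trimmed fragments one by one.
lines = list(td.stripped_strings)

print(repr(raw))
print(clean)
print(lines)
```

If each link text is needed as its own item, stripped_strings is probably the more convenient of the two.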
Note that .string would return None in this case, since td has multiple children:
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> soup = BeautifulSoup(u"""
... <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
... <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
... </td>
... """)
>>>
>>> td = soup.td
>>> print(td.string)
None
>>> print(td.get_text())
Signal et Communication
Ingénierie Réseaux et Télécommunications
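To tie this back to the loop in the question, here is one possible sketch that rebuilds the table as plain text, row by row. The inline markup stands in for the downloaded page, the second cell and the helper name table_to_text_rows are hypothetical, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for response.text.
html = """
<table>
  <tr>
    <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
    <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
    </td>
    <td><b>Hypothetical cell</b></td>
  </tr>
</table>
"""

def table_to_text_rows(markup):
    """Return the first table as a list of rows, each a list of cell texts."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        # get_text drops the tags; strip=True trims each text fragment,
        # and the fragments within one cell are joined with a newline.
        cells = [td.get_text("\n", strip=True) for td in tr.find_all("td")]
        rows.append(cells)
    return rows

rows = table_to_text_rows(html)
for row in rows:
    print(row)
```

With a real page, response.text from the question's code would be passed in place of the sample string.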
Upvotes: 5