Reputation: 5938
So I wrote some code to extract only what's within the <p> tags of some HTML code. Here is my code:
soup = BeautifulSoup(my_string, 'html')
no_tags=' '.join(el.string for el in soup.find_all('p', text=True))
It works how I want it to for most of the examples it is run on, but I have noticed that in examples such as
<p>hello, how are you <code>other code</code> my name is joe</p>
it returns nothing. I suppose this is because there are other tags within the <p> tags. So just to be clear, what I would want it to return is
hello, how are you my name is joe
Can someone help me out with how to deal with such examples?
Upvotes: 0
Views: 52
Reputation: 14841
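Your guess is correct. According to the BeautifulSoup documentation, .string returns None when a tag has more than one child, which is the case in your example. Because find_all('p', text=True) only matches tags whose .string is set, that <p> is skipped entirely.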
Now, you have a few options. The first is to use .contents and iterate over it recursively, checking the value of .string on each child you visit.
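As a rough sketch of that recursive walk (not the only way to write it): the helper name collect_text and its skip parameter are made up for illustration, and it tests each child's type rather than reading .string, which amounts to the same check. It assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def collect_text(node, skip=()):
    """Recursively gather text under node, ignoring tags named in skip."""
    parts = []
    for child in node.contents:
        if isinstance(child, NavigableString):
            # A bare piece of text: keep it as a plain str.
            parts.append(str(child))
        elif isinstance(child, Tag) and child.name not in skip:
            # A nested tag we want to keep: descend into it.
            parts.extend(collect_text(child, skip))
    return parts

html = "<p>hello, how are you <code>other code</code> my name is joe</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Join the pieces, then collapse the double space left where <code> was.
text = " ".join("".join(collect_text(p, skip=("code",))).split())
print(text)  # hello, how are you my name is joe
```

Passing skip=() instead would keep the text inside <code> as well.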
This approach can be a hassle in the long run. Fortunately, BeautifulSoup 4 offers a generator called .strings, which lets you do exactly what you want in an easy way.
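A minimal sketch of .strings on the example from the question, assuming the built-in "html.parser" backend. Note that .strings also yields the text inside nested tags, so to get the output asked for, the second half filters on each string's .parent; that filter is one possible way to drop the <code> text, not something the library does for you.

```python
from bs4 import BeautifulSoup

html = "<p>hello, how are you <code>other code</code> my name is joe</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# .strings yields every text node under the tag, nested tags included.
all_text = "".join(p.strings)
print(all_text)  # hello, how are you other code my name is joe

# Keep only the strings whose immediate parent is the <p> itself,
# then collapse the double space left where <code> used to be.
direct = "".join(s for s in p.strings if s.parent is p)
direct_text = " ".join(direct.split())
print(direct_text)  # hello, how are you my name is joe
```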
Finally, if you know the text is going to be simple and you want an easy solution, you can also use a regular expression and replace every match of /<[^>]*>/ with an empty string. You must, however, be aware of the consequences.
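To illustrate the regular-expression route and two of its consequences, here is a small sketch using only the standard library. The second input string is a made-up example of markup that the pattern handles badly.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")

html = "<p>hello, how are you <code>other code</code> my name is joe</p>"

# Only the tags are removed; the text inside <code> stays in the result.
stripped = TAG_RE.sub("", html)
print(stripped)  # hello, how are you other code my name is joe

# A '>' inside an attribute value ends the match too early,
# leaving part of the tag behind in the output.
tricky = '<a title="a > b">link</a>'
broken = TAG_RE.sub("", tricky)
print(repr(broken))  # ' b">link'
```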
Upvotes: 2