Reputation: 779
I have a list, shown below, that came from Beautiful Soup.
soup = BeautifulSoup(page.content, 'html.parser')
area = soup.select("td strong")
For example
area=[
<strong><span style="font-size:1.4em;">120 Beats Per Minute (15)</span><br/><br/>Cinema</strong>,
<strong><span style="font-size:1.4em;">A Little Night Music</span><br/><br/>Theatre</strong>,
<strong><span style="font-size:1.4em;">A Wrinkle in Time (PG)</span><br/><br/>Cinema</strong>
]
I need to discard everything except the trailing text (Cinema, Theatre).
I've come up with the expression below, but I can't apply it to the list:
x[x.find('<br/><br/>')+10:].replace('</strong>','')
Any ideas how I can apply this expression to each item in the list to build a new list? I've tried this:
clean_area=[]
for x in area:
    clean_area.append(x[x.find('<br/><br/>')+10:].replace('</strong>',''))
But I get this error: TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
Upvotes: 1
Views: 75
Reputation: 36
I was answering your first post about an hour ago but you removed it.
I'm not sure if this is the best way to do it but here is what I came up with:
text = [
    """<strong><span style="font-size:1.4em;">120 Beats Per Minute (15)</span><br/><br/>Cinema</strong>""",
    """<strong><span style="font-size:1.4em;">A Little Night Music</span><br/><br/>Theatre</strong>""",
    """<strong><span style="font-size:1.4em;">A Wrinkle in Time (PG)</span><br/><br/>Cinema</strong>"""
]
text = ''.join(text)  # Converting list of strings to one string
start = "<br/><br/>"  # Start indication
end = "</"  # End indication
clean_area = []
index = 0
while index < len(text):
    index = text.find(start, index)
    if index == -1:
        break
    clean_area.append(text[index+len(start):text.find(end, index)])
    index += len(start)
print(clean_area)
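As a side note on the error in the question: the items in area are Beautiful Soup Tag objects rather than strings, so x.find('<br/><br/>') calls Tag.find, which looks for a tag with that name and returns None. A minimal sketch of a per-item version is below. It assumes every fragment contains the `<br/><br/>` marker and ends with `</strong>`; the helper name `text_after_breaks` is made up for illustration. Calling str() first means it should accept either plain strings or Tag objects.

```python
def text_after_breaks(fragment, marker="<br/><br/>", closing="</strong>"):
    """Return the text that follows `marker`, or None if the marker is absent.

    `text_after_breaks` is a hypothetical helper, not part of any library.
    """
    html = str(fragment)  # a bs4 Tag becomes its HTML markup here
    _, found, tail = html.partition(marker)
    if not found:
        return None
    return tail.replace(closing, "").strip()


fragments = [
    '<strong><span style="font-size:1.4em;">120 Beats Per Minute (15)</span><br/><br/>Cinema</strong>',
    '<strong><span style="font-size:1.4em;">A Little Night Music</span><br/><br/>Theatre</strong>',
    '<strong><span style="font-size:1.4em;">A Wrinkle in Time (PG)</span><br/><br/>Cinema</strong>',
]

clean_area = [text_after_breaks(f) for f in fragments]
print(clean_area)  # ['Cinema', 'Theatre', 'Cinema']
```

Handling each item separately keeps the result aligned with the input list, which joining everything into one string does not guarantee if an item lacks the marker.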
Upvotes: 1
Reputation: 779
I could only get this working with 2 passes. I'm sure it's not the best way but it at least works.
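soup = BeautifulSoup(result.content, "html.parser")
# First pass: remove every <span> so only the category text is left
for x in soup.findAll("span"):
    x.decompose()
area = soup.select("td strong")
# Second pass: re-parse the stringified list so its tags can be walked
a = str(area)
soup2 = BeautifulSoup(a, "html.parser")
tr = []
for tag in soup2.find_all(True):
    tr.append(tag.text)
# find_all(True) yields each <strong> followed by its two <br/> tags,
# so every third entry is the text of a <strong>
clean_area = []
for i in tr[::3]:
    clean_area.append(i)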
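For what it's worth, a single pass may be enough if the wanted text is always the last string inside each `<strong>`. The sketch below assumes that, and uses a small hand-written table (the `html` variable is a stand-in for the real page, which I don't have) so it can run on its own. It relies on Tag.stripped_strings and leaves the tree unmodified.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; only the structure matters.
html = """
<table><tr>
<td><strong><span style="font-size:1.4em;">120 Beats Per Minute (15)</span><br/><br/>Cinema</strong></td>
<td><strong><span style="font-size:1.4em;">A Little Night Music</span><br/><br/>Theatre</strong></td>
<td><strong><span style="font-size:1.4em;">A Wrinkle in Time (PG)</span><br/><br/>Cinema</strong></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
area = soup.select("td strong")

clean_area = []
for tag in area:
    # stripped_strings yields each text node with surrounding whitespace removed
    pieces = list(tag.stripped_strings)
    if pieces:  # skip a <strong> that contains no text at all
        clean_area.append(pieces[-1])

print(clean_area)  # ['Cinema', 'Theatre', 'Cinema']
```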
Upvotes: 0
Reputation: 2211
What you want to use is decompose(), which removes any tags you do not want from the tree.
In this case that is the span, so:
for x in soup.findAll("span"):
    x.decompose()
print(soup.text)
returns
Cinema, Theatre
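Note that soup.text gives the text of the whole document as one string, so on a real page it would likely include more than these values. If a list is needed, one option is to decompose the spans inside each selected `<strong>` and read the text tag by tag. This is only a sketch: the `html` variable is a made-up stand-in for the actual page, and it assumes each `<strong>` holds nothing but the span, the two breaks, and the wanted text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; only the structure matters.
html = """
<table><tr>
<td><strong><span style="font-size:1.4em;">120 Beats Per Minute (15)</span><br/><br/>Cinema</strong></td>
<td><strong><span style="font-size:1.4em;">A Little Night Music</span><br/><br/>Theatre</strong></td>
<td><strong><span style="font-size:1.4em;">A Wrinkle in Time (PG)</span><br/><br/>Cinema</strong></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

clean_area = []
for strong in soup.select("td strong"):
    # Remove the title span from this <strong> only, not from the whole page
    for span in strong.find_all("span"):
        span.decompose()
    # What is left is the two <br/> tags and the category text
    clean_area.append(strong.get_text(strip=True))

print(clean_area)  # ['Cinema', 'Theatre', 'Cinema']
```

Keep in mind that decompose() modifies the parsed tree, so the titles are gone from soup afterwards.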
Upvotes: 1