Chace Mcguyer

Reputation: 415

Python3 BeautifulSoup returning concatenated strings

I am trying to pull a list of actors from this HTML once I find it:

actors_anchor = soup.find('a', href=re.compile('Actor&p'))
parent_tag = actors_anchor.parent
next_td_tag = parent_tag.find_next('td')

next_td_tag

<font size="2">Wes Bentley<br><a href="/people/chart/
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert        
Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font>

The problem is that when I pull the text it returns a single string with no whitespace between names

print(next_td_tag.get_text())
# this returns:
'Wes BentleyBryce Dallas HowardRobert RedfordKarl Urban'

I need to get these names into a list where each name is a separate element, like ['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban'].

Any suggestions would be much appreciated.

Upvotes: 1

Views: 560

Answers (3)

宏杰李

Reputation: 12168

import bs4

html = '''<font size="2">Wes Bentley<br><a href="/people/chart/
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert        
Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font>'''

soup = bs4.BeautifulSoup(html, 'lxml')

text = soup.get_text(separator='|')  # join the text nodes with a separator
# 'Wes Bentley|Bryce Dallas Howard|Robert        \nRedford|Karl Urban'
# then split on the separator and collapse the whitespace inside each name
split_text = [' '.join(name.split()) for name in text.split('|')]
# ['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']

# do it in one line
list_text = [' '.join(name.split()) for name in soup.get_text(separator='|').split('|')]

Or use the strings generator to avoid manually splitting the string into a list:

[' '.join(i.split()) for i in soup.strings]
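
If the markup also has whitespace-only text nodes between tags, get_text accepts strip=True alongside the separator. A minimal sketch, assuming a small hypothetical snippet rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with whitespace-only text nodes between the tags
html = '<td>\n  Wes Bentley<br>\n  <a href="#">Bryce Dallas Howard</a>\n</td>'
soup = BeautifulSoup(html, 'html.parser')

# strip=True trims each text node and skips the ones that end up empty
joined = soup.get_text(separator='|', strip=True)
names = joined.split('|')
print(names)
```

This only works if the separator can never appear inside a name, so pick a character that is unlikely to occur in the data.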

Upvotes: 0

furas

Reputation: 142661

You can use stripped_strings to get all the strings as a list:

html = '''<td><font size="2">Wes Bentley<br><a href="/people/chart/
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font></td>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

next_td_tag = soup.find('td')

print(list(next_td_tag.stripped_strings))

Result:

['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']

stripped_strings is a generator, so you can iterate over it with a for loop or collect all elements with list().
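
Note that the HTML in the question has a line break inside "Robert Redford", and stripped_strings only trims the ends of each string. A minimal sketch that also collapses internal whitespace, assuming the markup as originally posted (wrapped in a td) and a hypothetical helper named clean_names:

```python
from bs4 import BeautifulSoup

# HTML as posted in the question, including the newline inside "Robert ... Redford"
html = '''<td><font size="2">Wes Bentley<br><a href="/people/chart/
?view=Actor&amp;id=brycedallashoward.htm">Bryce Dallas Howard</a><br><a
href="/people/chart/?view=Actor&amp;id=robertredford.htm">Robert        
Redford</a><br><a href="/people/chart/ view=Actor&amp;id=karlurban.htm">Karl Urban</a></br></br></br></font></td>'''


def clean_names(tag):
    """Return the tag's non-empty strings with runs of whitespace collapsed.

    clean_names is a hypothetical helper for illustration, not part of bs4.
    """
    return [' '.join(s.split()) for s in tag.stripped_strings]


soup = BeautifulSoup(html, 'html.parser')
names = clean_names(soup.find('td'))
print(names)
```

Calling split() with no arguments splits on any run of whitespace, so the join leaves exactly one space between the words of each name.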

Upvotes: 1

alecxe

Reputation: 473873

Locate all a elements inside the found td:

[a.get_text() for a in next_td_tag.find_all('a')]

This, though, would not cover the "Wes Bentley" text, which is a bare text node that is not wrapped in an a element.

We can approach it differently and locate all the text nodes instead:

next_td_tag.find_all(text=True)

You might need to clean it up and remove the "empty" items:

texts = [" ".join(text.split()) for text in next_td_tag.find_all(text=True)]
texts = [text for text in texts if text]
print(texts)

Would print:

['Wes Bentley', 'Bryce Dallas Howard', 'Robert Redford', 'Karl Urban']
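
Since Beautiful Soup 4.4.0 the text argument is also available under the name string. A short sketch of the same text-node approach with that spelling, assuming a simplified stand-in for the question's markup:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the question's markup (an assumption for illustration)
html = ('<td><font size="2">Wes Bentley<br>'
        '<a href="/a">Bryce Dallas Howard</a><br>'
        '<a href="/b">Robert   \nRedford</a></font></td>')

soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td')

# string=True matches every text node; text=True is the older spelling
raw = td.find_all(string=True)

# Drop whitespace-only nodes and collapse whitespace inside each name
names = [' '.join(s.split()) for s in raw if s.strip()]
print(names)
```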

Upvotes: 1
