Reputation: 333
I have two divs I am trying to scrape, both with the same class name (there are other divs on the page with a partial name match that I don't want). In the first, I just need the text inside each span element. In the second, I need the text inside the span element for the first row, and then the text inside the <p> tags for rows 2 and 3.
I'm not sure why I need to slice the list of divs (I think it is because the div class col returns more than the 2 relevant divs, but adding [:1] to divs seems to help).
My questions are: how do I get an exact match on the div name, how do I scrape inside the p tags, and how do I combine the results from the two? I can get the text inside the span tags, as shown below, but as I say above I also need the text inside the p tags, combined with the span results.
The data is coming from the player details section in this URL - https://www.skysports.com/football/player/141016/alisson-ramses-becker
The html looks like this
<div class="row-table details -bp30">
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p> <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p> <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p> <p>Position: Goal Keeper</p>
</div>
</div>
Relevant piece of my program
premier_soup1 = player_soup.find('div', {'class': 'row-table details -bp30'})
premier_soup_tr = premier_soup1.find_all('div', {'class': 'col'})
divs = player_soup.find_all( 'div', {'class': 'col'})
for div in divs[:1]:
    para = div.find_all('p')
    print(para)
Output -
[<p class="text-h4 title">Player Details</p>, <p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>, <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>, <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>, <p>Club: <span itemprop="affiliation">Liverpool</span></p>, <p>Squad: 13</p>, <p>Position: Goal Keeper</p>]
Also - I know I can get the span text with this
divs = player_soup.find_all( 'div', {'class': 'col'})
for div in divs[:1]:
    spans = div.find_all('span')
    for span in spans:
        print(span.text, ",", end=' ')
Output -
Alisson Ramses Becker , 02/10/1992 , Brazil , Liverpool ,
Upvotes: 2
Views: 2775
Reputation: 15558
Assuming you have the right to scrape this site and there is no API or JSON endpoint available, one slow way to do it is:
from bs4 import BeautifulSoup as bs
html = '''
<div class="row-table details -bp30">
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p> <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p> <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p> <p>Position: Goal Keeper</p>
</div>
</div>
'''
soup = bs(html,'html5lib')
data = [d.find_all('p') for d in soup.find_all('div',{'class':'col'})]
value = []
for i in data:
    for j in i:
        value.append(j.text)
print(value)
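On the question of matching the div exactly: one option, sketched here on the assumption that the live page uses the same class names as the snippet in the question, is a CSS selector passed to `select`, which restricts the search to `col` divs that are direct children of the details table. The `html.parser` backend and the extra, unrelated `col` div in the sample are choices made for this illustration only.

```python
from bs4 import BeautifulSoup

# Sample markup: the snippet from the question plus a hypothetical unrelated
# "col" div, added here to show that the selector does not match it.
html = """
<div class="col"><p class="text-h4 title">Player Details</p></div>
<div class="row-table details -bp30">
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>
<p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
<p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p> <p>Position: Goal Keeper</p>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only div.col elements that are direct children of the details table.
cols = soup.select("div.row-table.details > div.col")

# get_text() returns a paragraph's own text plus the text of any nested tags,
# so paragraphs with and without a <span> are handled the same way.
details = [p.get_text() for col in cols for p in col.find_all("p")]

print(len(cols))
print(details)
```

With a selector like this the `[:1]` slice should no longer be needed, since the unrelated `col` divs are never returned in the first place.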
Upvotes: 1
Reputation: 3097
Your main question is how to extract the text from a <p> that does not contain a <span>.
A NavigableString corresponds to a bit of text within a tag, so you can extract the text by keeping only the children that are instances of NavigableString.
from bs4 import BeautifulSoup,NavigableString
html = "your example"
soup = BeautifulSoup(html,"lxml")
for e in soup.find("p"):
    print(e,type(e))
#Name: <class 'bs4.element.NavigableString'>
#<strong><span itemprop="name">Alisson Ramses Becker</span></strong> <class 'bs4.element.Tag'>
Real code:
resultset = soup.find_all("p")
maintext = []
for result in resultset:
    for element in result:
        if isinstance(element, NavigableString):
            maintext.append(element)
print(maintext)
# ['Name: ', 'Date of birth:', 'Place of birth:', 'Club: ', 'Squad: 13', 'Position: Goal Keeper']
This is equivalent to the list comprehension:
[element for result in resultset for element in result if isinstance(element, NavigableString)]
My full test code
from bs4 import BeautifulSoup,NavigableString
html = """
<div class="row-table details -bp30">
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p> <p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p> <p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p> <p>Position: Goal Keeper</p>
</div>
</div>
"""
soup = BeautifulSoup(html,"lxml")
resultset = soup.find_all("p")
fr = [element for result in resultset for element in result if isinstance(element, NavigableString)]
spanset = [e.text for e in soup.find_all("span",{"itemprop":True})]
setA = ["".join(z) for z in zip(fr,spanset)]
final = setA + fr[len(spanset):]
print(final)
Output
['Name: Alisson Ramses Becker', 'Date of birth:02/10/1992', 'Place of birth: Brazil', 'Club: Liverpool', 'Squad: 13', 'Position: Goal Keeper']
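Note that the zip-based pairing above relies on every paragraph with a <span> coming before the ones without. A sketch of an alternative that does not depend on that ordering is to take each paragraph's full text and split it at the first colon into a label and a value. The `details` dict and the choice to split on ":" are assumptions about what a convenient combined result looks like; the question does not specify a target format.

```python
from bs4 import BeautifulSoup

html = """
<div class="row-table details -bp30">
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span></strong></p>
<p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
<p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p><p>Squad: 13</p> <p>Position: Goal Keeper</p>
</div>
</div>
"""

# html.parser is the built-in backend; lxml or html5lib should work as well.
soup = BeautifulSoup(html, "html.parser")

details = {}
for p in soup.find_all("p"):
    # partition() splits at the first colon only, so a value that itself
    # contains a colon is kept intact.
    label, sep, value = p.get_text().partition(":")
    if sep:  # skip paragraphs that are not of the form "label: value"
        details[label.strip()] = value.strip()

print(details)
```

Because each paragraph is processed on its own, the span text and the bare text end up combined without having to line up two separate lists.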
Upvotes: 1