Reputation: 918
I am developing a Python web scraper with BeautifulSoup that parses "product listings" from this website and extracts some information about each listing (price, vendor, etc.). I can extract every field except one, the product quantity, which seems to be hidden from the raw HTML. In my browser, the listing looks like this (unid = units):
product_name 1 unid $10.00
but the html for that doesn't show any integer value that I can extract. It shows this html text:
<div class="e-col5 e-col5-offmktplace ">
<div class="kWlJn zYaQqZ gQvJw"> </div>
<div class="imgnum-unid"> unid</div>
</div>
My question is: how do I get the hidden content of e-col5, which stores the product quantity?
import re
import requests
from bs4 import BeautifulSoup
page = requests.get("https://ligamagic.com.br/?view=cards%2Fsearch&card=Hapatra%2C+Vizier+of+Poisons")
soup = BeautifulSoup(page.content, 'html.parser')
vendor = soup.find_all('div', class_="estoque-linha", mp="2")
print(vendor[1].find(class_='e-col1').find('img')['title'])
print(vendor[1].find(class_='e-col2').find_all(class_='ed-simb')[1].string)
print(vendor[1].find(class_='e-col5'))
EDIT: "Hidden content" here means content that is updated dynamically by JavaScript.
Upvotes: 3
Views: 2056
Reputation: 417
@ewwink found a way to pull out unid but was unable to pull out the prices. This answer attempts to extract the prices.
Target div snippet:
<div mp="2" id="line_e3724364" class="estoque-linha primeiro"><div class="e-col1"><a href="b/?p=e3724364" target="_blank"><img title="Rayearth Games" src="//www.lmcorp.com.br/arquivos/up/ecom/comparador/155937.jpg"></a></div><div class="e-col9-mobile"><div class="e-mob-edicao"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="19"></div><div class="e-mob-edicao-lbl"><p>Amonkhet</p></div><div class="e-mob-preco e-mob-preco-desconto"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div></div><div class="e-col2"><a href="./?view=cards/search&card=ed=akh" class="ed-simb"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="21"></a><font class="nomeedicao"><a href="./?view=cards/search&card=ed=akh" class="ed-simb">Amonkhet</a></font></div><div class="e-col3"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div>
<div class="e-col4 e-col4-offmktplace">
<img src="https://www.lmcorp.com.br/arquivos/img/bandeiras/pten.gif" title="Português/Inglês"> <font class="azul" onclick="cardQualidade(3);">SP</font>
</div>
<div class="e-col5 e-col5-offmktplace "><div class="cIiVr lHfXpZ mZkHz"> </div> <div class="imgnum-unid"> unid</div></div><div class="e-col8 e-col8-offmktplace "><div><a target="_blank" href="b/?p=e3724364" class="goto" title="Visitar Loja">Ir à loja</a></div></div></div>
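If we look closely, we can pull the prices out of each row with a regular expression:
for item in soup.findAll('div', {"id": re.compile('^line')}):
    # Capture everything between "R$ " and the closing </div> of the price cell
    print(re.findall(r"R\$ (.*?)</div>", str(item), re.DOTALL))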
Output [truncated]:
['10,00</s></font><br/>R$ 8,00', '10,00</s></font><br/>R$ 8,00']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,75</s></font><br/>R$ 8,78', '9,75</s></font><br/>R$ 8,78']
[]
[]
This extracts large chunks that contain the prices, but it also misses several items (the empty lists above).
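The raw matches above still carry markup. Here is a minimal sketch of cleaning them up with the standard library only, assuming prices always appear as R$ followed by a number in Brazilian format (comma as the decimal separator). The helper name parse_prices is hypothetical, and rows whose prices are not in the HTML will still produce an empty list:

```python
import re
from decimal import Decimal

# Matches "R$ 1,00" or "R$ 1.234,56" (Brazilian number format).
PRICE_RE = re.compile(r"R\$\s*(\d{1,3}(?:\.\d{3})*,\d{2}|\d+,\d{2})")


def parse_prices(html):
    """Return every R$ amount found in an HTML fragment as a Decimal."""
    prices = []
    for raw in PRICE_RE.findall(html):
        # Drop thousands separators, then turn the decimal comma into a dot.
        prices.append(Decimal(raw.replace(".", "").replace(",", ".")))
    return prices


# Price cell copied from the target div snippet above.
sample = ('<div class="e-col3"><font color="gray" class="mob-preco-desconto">'
          '<s>R$ 1,00</s></font><br>R$ 0,85</div>')
print(parse_prices(sample))
```

For a discounted row this yields two values: the struck-through original price first, then the current price.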
To get all the data, we can combine Selenium with an OCR API. The following snippet captures an element of interest as an image:
from io import BytesIO
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By
fox = webdriver.Firefox()
fox.get('https://ligamagic.com.br/?view=cards%2Fsearch&card=Hapatra%2C+Vizier+of+Poisons')
# element = fox.find_element(By.ID, 'line_e3724364')
element = fox.find_element(By.TAG_NAME, 's')  # a single element, so .location and .size are available
location = element.location
size = element.size
png = fox.get_screenshot_as_png()  # screenshot of the current window as PNG bytes
fox.quit()
im = Image.open(BytesIO(png))  # uses the PIL library to open the image in memory
left = location['x']
top = location['y']
right = location['x'] + size['width']
bottom = location['y'] + size['height']
im = im.crop((left, top, right, bottom))  # crops to the element's bounding box
im.save('screenshot.png')  # saves the cropped image
Took help from https://stackoverflow.com/a/15870708.
We can iterate as we did above with re.findall() to save all the images. Once we have all the images, we can use OCR Space to extract the text. Here's a quick snippet:
import requests
def ocr_space_file(filename, overlay=False, api_key='api_key', language='eng'):
    payload = {'isOverlayRequired': overlay,
               'apikey': api_key,
               'language': language,
               }
    with open(filename, 'rb') as f:
        r = requests.post('https://api.ocr.space/parse/image',
                          files={filename: f},
                          data=payload,
                          )
    return r.content.decode()
e = ocr_space_file(filename='1.png')
print(e) # prints JSON
1.png :
JSON response from ocr.space :
{"ParsedResults":[{"TextOverlay":{"Lines":[],"HasOverlay":false,"Message":"Text overlay is not provided as it is not requested"},"TextOrientation":"0","FileParseExitCode":1,"ParsedText":"RS 0',85 \r\n","ErrorMessage":"","ErrorDetails":""}],"OCRExitCode":1,"IsErroredOnProcessing":false,"ProcessingTimeInMilliseconds":"1996","SearchablePDFURL":"Searchable PDF not generated as it was not requested."}
It gives us "ParsedText": "RS 0',85 \r\n", that is, the price R$ 0,85 with some OCR noise.
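The OCR text still needs cleaning before it can be used as a number. Below is a rough sketch, assuming the response has the shape shown above and that the noise is limited to a misread currency marker and stray punctuation. The helper name price_from_ocr is hypothetical, and the sample is a shortened copy of the response:

```python
import json
import re
from decimal import Decimal


def price_from_ocr(response_text):
    """Pull the first price out of an ocr.space JSON response, or None."""
    data = json.loads(response_text)
    if data.get("IsErroredOnProcessing"):
        return None
    for result in data.get("ParsedResults") or []:
        text = result.get("ParsedText", "")
        # Drop the currency marker, which OCR may misread ("R$" -> "RS").
        text = re.sub(r"^\s*R\S\s*", "", text)
        # Keep digits and the decimal comma only: "0',85 \r\n" -> "0,85".
        cleaned = re.sub(r"[^\d,]", "", text)
        match = re.search(r"\d+,\d{2}", cleaned)
        if match:
            return Decimal(match.group().replace(",", "."))
    return None


# Shortened copy of the response above (raw string keeps the JSON escapes).
sample = r'''{"ParsedResults":[{"ParsedText":"RS 0',85 \r\n"}],"IsErroredOnProcessing":false}'''
print(price_from_ocr(sample))
```

OCR can also misread digits themselves, so values recovered this way are worth sanity-checking against prices that are present in the HTML.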
Upvotes: 1
Reputation: 19184
The unid value is stored in a JavaScript array in the page source:
vetFiltro[0]=["e3724364",0,1,....];
The 1 is the unid. You can extract it with a regex:
# e-col5
unitID = vendor[1].get('id').replace('line_', '') # line_e3724364 => e3724364
regEx = r'"%s",\d,(\d+)' % unitID
unit = re.search(regEx, page.text).group(1)
print(unit + ' unids')
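The same idea can be applied to every row in one pass. This is only a sketch under assumptions: row ids are assumed to be "e" followed by digits (as in the example id), the quantity is assumed to be the third field, and the sample string is a made-up fragment whose trailing fields are placeholders. The function name quantities_by_row is hypothetical:

```python
import re


def quantities_by_row(page_text):
    """Map each row id (e.g. 'e3724364') to the quantity stored in vetFiltro."""
    # Same shape as the regex above: "<id>", one digit, then the quantity.
    pattern = re.compile(r'\["(e\d+)",\d,(\d+)')
    return {row_id: int(qty) for row_id, qty in pattern.findall(page_text)}


# Hypothetical page fragment; the trailing fields are placeholders.
sample_js = 'vetFiltro[0]=["e3724364",0,1,"x"];vetFiltro[1]=["e1111111",0,12,"y"];'
print(quantities_by_row(sample_js))
```

With the real page, page.text would be passed in place of sample_js, and the result could be looked up by the id taken from each line_... div.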
Upvotes: 2
Reputation: 1150
If you take a closer look, the unid is just an image in a div, shifted by a CSS class to show the correct number.
For example, for unid 1:
.jLsXy {
background-image: url(arquivos/up/comp/imgunid/files/img/181224lSfWip8i1lmcj2a520836c8932ewcn.jpg);
}
is the image containing numbers.
.gBpKxZ {
background-position: -424px -23px;
}
is the class for number 1
So find the CSS class that matches each number and build a lookup table. This is the easy way, but not the best one.
Edit: The position class seems to change on every reload, so matching a number to the image is harder :( The number 1 could be taken from many places.
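Because the class names change between loads, any lookup would have to be rebuilt from the CSS of the page that was just fetched. Here is a small sketch of that first step, using only the rules quoted above. The function name sprite_offsets is hypothetical, and whether a given pixel offset always corresponds to the same digit is an untested assumption:

```python
import re

# A rule of the form ".name { background-position: Xpx Ypx; }".
RULE_RE = re.compile(
    r"\.([A-Za-z0-9_-]+)\s*\{\s*background-position:\s*(-?\d+)px\s+(-?\d+)px\s*;?\s*\}"
)


def sprite_offsets(css_text):
    """Map class name -> (x, y) background-position offsets in pixels."""
    return {name: (int(x), int(y)) for name, x, y in RULE_RE.findall(css_text)}


# The two rules quoted in this answer.
css = """
.jLsXy {
background-image: url(arquivos/up/comp/imgunid/files/img/181224lSfWip8i1lmcj2a520836c8932ewcn.jpg);
}
.gBpKxZ {
background-position: -424px -23px;
}
"""
print(sprite_offsets(css))
```

The classes found on the div inside e-col5 could then be looked up in this mapping, with the offset-to-digit table filled in by hand after inspecting the sprite image.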
Edit2
I was using Chrome DevTools. If you inspect the unid, you will find the CSS for each class as well. So after checking the URL it was clear.
Upvotes: 1