delirium
delirium

Reputation: 918

BeautifulSoup: Parse JavaScript dynamic content

I am developing a python web scraper with BeautifulSoup that parses "product listings" from this website and extracts some information for each product listing (i.e., price, vendor, etc.). I am able to extract many of this information but one (i.e., the product quantity), which seems to be hidden from the raw html. Looking at the webpage through my browser what I see is (unid = units):

product_name       1 unid      $10.00 

but the html for that doesn't show any integer value that I can extract. It shows this html text:

<div class="e-col5 e-col5-offmktplace ">
  <div class="kWlJn zYaQqZ gQvJw">&nbsp;</div> 
  <div class="imgnum-unid"> unid</div>
</div>

My question is how do I get this hidden content of e-col5 which stores the product quantity?

import re
import requests
from bs4 import BeautifulSoup

page = requests.get("https://ligamagic.com.br/?view=cards%2Fsearch&card=Hapatra%2C+Vizier+of+Poisons")
soup = BeautifulSoup(page.content, 'html.parser')
vendor = soup.find_all('div', class_="estoque-linha", mp="2")
print(vendor[1].find(class_='e-col1').find('img')['title'])
print(vendor[1].find(class_='e-col2').find_all(class_='ed-simb')[1].string)
print(vendor[1].find(class_='e-col5'))

EDIT: Hidden content stands for JavasSript dynamically updated content in this case.

Upvotes: 3

Views: 2056

Answers (3)

0x48piraj
0x48piraj

Reputation: 417

@ewwink found out the way to pull out unid but was unable to pull out prices. I have tried to pull out prices in this answer.

Target div snippet:

<div mp="2" id="line_e3724364" class="estoque-linha primeiro"><div class="e-col1"><a href="b/?p=e3724364" target="_blank"><img title="Rayearth Games" src="//www.lmcorp.com.br/arquivos/up/ecom/comparador/155937.jpg"></a></div><div class="e-col9-mobile"><div class="e-mob-edicao"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="19"></div><div class="e-mob-edicao-lbl"><p>Amonkhet</p></div><div class="e-mob-preco e-mob-preco-desconto"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div></div><div class="e-col2"><a href="./?view=cards/search&amp;card=ed=akh" class="ed-simb"><img src="//www.lmcorp.com.br/arquivos/up/ed_mtg/AKH_R.gif" height="21"></a><font class="nomeedicao"><a href="./?view=cards/search&amp;card=ed=akh" class="ed-simb">Amonkhet</a></font></div><div class="e-col3"><font color="gray" class="mob-preco-desconto"><s>R$ 1,00</s></font><br>R$ 0,85</div>
                            <div class="e-col4 e-col4-offmktplace">
                                <img src="https://www.lmcorp.com.br/arquivos/img/bandeiras/pten.gif" title="Português/Inglês"> <font class="azul" onclick="cardQualidade(3);">SP</font>

                            </div>
                        <div class="e-col5 e-col5-offmktplace "><div class="cIiVr lHfXpZ mZkHz">&nbsp;</div> <div class="imgnum-unid"> unid</div></div><div class="e-col8 e-col8-offmktplace "><div><a target="_blank" href="b/?p=e3724364" class="goto" title="Visitar Loja">Ir à loja</a></div></div></div>

If we look closely, we can,

for item in soup.findAll('div', {"id": re.compile('^line')}):
 print(re.findall("R\$ (.*?)</div>", str(item), re.DOTALL))

Output [truncated]:

['10,00</s></font><br/>R$ 8,00', '10,00</s></font><br/>R$ 8,00']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,50</s></font><br/>R$ 8,55', '9,50</s></font><br/>R$ 8,55']
['9,75</s></font><br/>R$ 8,78', '9,75</s></font><br/>R$ 8,78']
[]
[]

It extracts major chunks, and we'll get the prices. But this also skips multiple items.

To get all the data, we can use OCR API and Selenium to accomplish this. We can capture elements of interest by using the following snippet :

from selenium import webdriver
from PIL import Image
from io import BytesIO

fox = webdriver.Firefox()
fox.get('https://ligamagic.com.br/?view=cards%2Fsearch&card=Hapatra%2C+Vizier+of+Poisons')
#element = fox.find_element_by_id('line_e3724364')
element = fox.find_elements_by_tag_name('s')
location = element.location
size = element.size
png = fox.get_screenshot_as_png() # saves screenshot of entire page
fox.quit()

im = Image.open(BytesIO(png)) # uses PIL library to open image in memory

left = location['x']
top = location['y']
right = location['x'] + size['width']
bottom = location['y'] + size['height']


im = im.crop((left, top, right, bottom)) # defines crop points
im.save('screenshot.png') # saves new cropped image

Took help from https://stackoverflow.com/a/15870708.

We can iterate like we did above using re.findall() to save all the images. After we have all the images, we can then use OCR Space to extract text data. Here's a quick snippet :

import requests


def ocr_space_file(filename, overlay=False, api_key='api_key', language='eng'):

    payload = {'isOverlayRequired': overlay,
               'apikey': api_key,
               'language': language,
               }
    with open(filename, 'rb') as f:
        r = requests.post('https://api.ocr.space/parse/image',
                          files={filename: f},
                          data=payload,
                          )
    return r.content.decode()

e = ocr_space_file(filename='1.png')

print(e) # prints JSON

1.png :

enter image description here

JSON response from ocr.space :

{"ParsedResults":[{"TextOverlay":{"Lines":[],"HasOverlay":false,"Message":"Text overlay is not provided as it is not requested"},"TextOrientation":"0","FileParseExitCode":1,"ParsedText":"RS 0',85 \r\n","ErrorMessage":"","ErrorDetails":""}],"OCRExitCode":1,"IsErroredOnProcessing":false,"ProcessingTimeInMilliseconds":"1996","SearchablePDFURL":"Searchable PDF not generated as it was not requested."}

It gives us, "ParsedText" : "RS 0',85 \r\n".

Upvotes: 1

ewwink
ewwink

Reputation: 19184

the unid is saved in JS array

vetFiltro[0]=["e3724364",0,1,....];

the 1 is the unid, you can get it with regex

# e-col5
unitID = vendor[1].get('id').replace('line_', '') # line_e3724364 => e3724364
regEx = r'"%s",\d,(\d+)' % unitID
unit = re.search(regEx, page.text).group(1)
print(unit + ' unids')

Upvotes: 2

Fabian
Fabian

Reputation: 1150

If you take a closer look the unid is just an image in a div moved by a class to the correct number.

For example unid 1:

.jLsXy {
    background-image: url(arquivos/up/comp/imgunid/files/img/181224lSfWip8i1lmcj2a520836c8932ewcn.jpg);
}

is the image containing numbers.

.gBpKxZ {
background-position: -424px -23px;
}

is the class for number 1

So find the matching css to the number and create your table ( easy way ) but not best way.

Edit: Seems like changing the position(class) each time reloaded so its more hard to match the number with the image :( so the number 1 could be taken from many places.

Edit2 I was using chrome devtools. If you inspect the unid you will find the css for each class aswell. So after checking the url it was clear.

Upvotes: 1

Related Questions