Mina Mino

Reputation: 33

Scrape Text After Specific Text and Before Specific Text


<script type="text/javascript">


                        'sku': 'T3246B5',
                        'Name': 'TAS BLACKY',
                        'Price': '111930',
                        'categories': 'Tas,Wanita,Sling Bags,Di bawah Rp 200.000',
                        'brand': '',
                        'visibility': '4',
                        'instock': "1",
                        'stock': "73.0000"

            </script>

I want to scrape the text between 'stock': " and .0000", so the desired result is 73.

What I know how to do is something like this:

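from urllib.request import Request, urlopen as uReq
from bs4 import BeautifulSoup as soup

# urls2 is the list of product page URLs, defined earlier
for url2 in urls2:
    req2 = Request(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
    html2 = uReq(req2).read()
    page_soup2 = soup(html2, "html.parser")

    # Grab the stock text from the first <p class="stock"> element
    stock = page_soup2.findAll("p", {"class": "stock"})
    stocks = stock[0].text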

I used something like this in my previous code. It worked before the website changed its code.

But now there is more than one ("script", {"type": "text/javascript"}) on the page I want to scrape, so I don't know how to find the right one.

I also don't know how to get the text that sits between two specific strings.

I have googled it all day but can't find a solution. Please help.

I found that the strings 'stock': " and .0000" are unique in the entire page: there is only one 'stock': and only one .0000"

So I think they could mark the location of the text I want to scrape.

Please help, thank you for your kindness.

I also apologize for my lack of English, and I am actually unfamiliar with programming. I'm just trying to learn from Google, but I can't find the answer. Thank you for your understanding.

the url = view-source:sophieparis.com/blacky-bag.html

Upvotes: 3

Views: 255

Answers (3)

QHarr

Reputation: 84465

I would write a regex that targets the JavaScript dictionary variable that houses the values of interest. You can apply this directly to response.text, with no need for bs4.


The dictionary variable is called productObject, and you want the non-empty dictionary, which is the second occurrence of productObject = {..}, i.e. not the one preceded by 'var '. You can use a negative lookbehind to specify this requirement.

Use hjson to handle property names enclosed in single quotes.


Py

import requests, re, hjson

r = requests.get('https://www.sophieparis.com/blacky-bag.html')
p = re.compile(r'(?<!var\s)productObject = ([\s\S]*?})')
data = hjson.loads(p.findall(r.text)[0])
print(data)
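
To see the negative lookbehind at work without hitting the site, here is a minimal offline sketch. The SAMPLE_HTML fragment is hypothetical, modelled on the snippet in the question, and the helper name extract_product is mine. It also swaps hjson for the standard library's ast.literal_eval, which assumes the object contains only plain string literals as in the sample.

```python
import ast
import re

# Hypothetical page fragment modelled on the question's snippet;
# the real page may be laid out differently.
SAMPLE_HTML = """
<script type="text/javascript">
    var productObject = {};
    productObject = {
        'sku': 'T3246B5',
        'Name': 'TAS BLACKY',
        'instock': "1",
        'stock': "73.0000"
    };
</script>
"""

# The negative lookbehind rejects the assignment preceded by "var ".
PATTERN = re.compile(r'(?<!var\s)productObject = ([\s\S]*?})')


def extract_product(html):
    """Return the first non-'var' productObject literal as a dict."""
    match = PATTERN.search(html)
    if match is None:
        raise ValueError("productObject assignment not found")
    # Assumption: keys and values are simple quoted strings, so the
    # JavaScript literal is also a valid Python dict literal.
    return ast.literal_eval(match.group(1))


product = extract_product(SAMPLE_HTML)
stock = int(float(product['stock']))
print(stock)
```

If the real object holds values that are not valid Python literals (true, null, unquoted keys), hjson as used above is the safer choice.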




Upvotes: 1

chitown88

Reputation: 28565

Since you are sure 'stock' only shows up in the script tag you want, you can pull out the script text that contains 'stock'. Once you have that, it's a matter of trimming off the excess and changing the single quotes to double quotes to get valid JSON, which you can then read in using json.loads().

import requests
from bs4 import BeautifulSoup
import json


url2 = 'https://www.sophieparis.com/blacky-bag.html'

req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})

page_soup2 = BeautifulSoup(req2.text, "html.parser")


scripts = page_soup2.find_all('script')

for script in scripts:
    if 'stock' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('productObject = ')[-1].strip()
        jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'

        jsonData = json.loads(jsonStr.replace("'",'"'))

print (jsonData['stock'].split('.')[0])

Output:

71

You could also do this without the loop and just grab the script that has the string stock in it using 1 line:

jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text

Full code would look something like:

import requests
from bs4 import BeautifulSoup
import json
import re


url2 = 'https://www.sophieparis.com/blacky-bag.html'

req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})

page_soup2 = BeautifulSoup(req2.text, "html.parser")

jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
jsonStr = jsonStr.split('productObject = ')[-1].strip()
jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'

jsonData = json.loads(jsonStr.replace("'",'"'))

print (jsonData['stock'].split('.')[0])
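
If only the stock number is needed, a narrower option is to match the two marker strings from the question directly. This is a rough sketch on a hypothetical SAMPLE string copied from the question's snippet; the function name stock_between_markers is mine, and the pattern assumes the value is always a quoted number.

```python
import re

# Hypothetical text copied from the snippet in the question.
SAMPLE = """
                        'instock': "1",
                        'stock': "73.0000"
"""


def stock_between_markers(text):
    """Return the integer part of the quoted 'stock' value, or None."""
    # The leading quote in 'stock' keeps 'instock' from matching.
    match = re.search(r"'stock':\s*\"(\d+)(?:\.\d+)?\"", text)
    return int(match.group(1)) if match else None


print(stock_between_markers(SAMPLE))
```

On a live page the same function could be called on req2.text instead of SAMPLE.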

Upvotes: 1

Sam

Reputation: 533

If you want to provide me with the webpage you wish to scrape the data from, I'll see if I can fix the code to pull the information.

Upvotes: 0
