user3761151

Reputation: 143

Parsing JS with Beautiful Soup

I have a page parsed with Beautiful Soup, but it contains this JavaScript code:

<script type="text/javascript">   


var utag_data = {
            customer_id   : "_PHL2883198554", 
            customer_type : "New",
            loyalty_id : "N",
            declined_loyalty_interstitial : "false",
            site_version  : "Desktop Site",
            site_currency: "de_DE_EURO",
            site_region: "uk",
            site_language: "en-GB",


            customer_address_zip : "",
            customer_email_hash :  "",
            referral_source :  "",
            page_type : "product",
            product_category_name : ["Lingerie"],
            product_category_id :[jQuery("meta[name=defaultParent]").attr("content")],
            product_id : ["5741462261401"],
            product_image_url : ["http://images.urbanoutfitters.com/is/image/UrbanOutfitters/5741462261401_001_b?$detailmain$"],
            product_brand : ["Pretty Polly"],
            product_selling_price : ["20.0"],
            promo_id : "6",
            product_referral : ["WOMENS-SHAPEWEAR-LINGERIE-SOLUTIONS-EU"],
            product_name : ["Pretty Polly Shape It Up Tummy Shaping Camisole"],
            is_online_only : true,
            is_back_in_stock : false
}
</script>

How can I get some of the values out of this script? Should I treat it as plain text, i.e. store it in a variable, split it, and then pick out the data?

Thanks

Upvotes: 4

Views: 9492

Answers (1)

Granitosaurus

Reputation: 21446

First get the text of the script, for example with

js_text = soup.find('script', type="text/javascript").text

Then you can use a regex to pull out the data. There is probably an easier way to do this, but the regex is not hard to write.

import re
# one "name : value" pair per line: the name is a bare identifier, the value runs to the end of the line (minus a trailing comma)
regex = re.compile(r'^[ \t]*(\w+)[ \t]*:[ \t]*(.*?)[ \t]*,?[ \t]*$', re.MULTILINE)
pairs = re.findall(regex, js_text)  # list of (name, value) tuples
js_text = [item.strip() for pair in pairs for item in pair]  # flatten and strip extra whitespace

This returns a flat list of names and values in name, value, name2, value2, ... order, which you can work with directly or convert to a dictionary later on.
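As a rough sketch of the "convert to dictionary" step, the name/value pairs can be collected straight into a dict. This assumes the script keeps one `name : value` pair per line, as in the question. The sample `js_text` below is a shortened copy of the question's data, and `parse_pairs` and `unquote` are hypothetical helper names rather than part of any library.

```python
import re

# Sample script text, shortened from the question's utag_data object.
js_text = '''
var utag_data = {
            customer_type : "New",
            page_type : "product",
            product_image_url : ["http://images.urbanoutfitters.com/is/image/UrbanOutfitters/5741462261401_001_b?$detailmain$"],
            product_selling_price : ["20.0"],
            is_online_only : true,
            is_back_in_stock : false
}
'''

# One "name : value" pair per line. The name is a bare identifier, so the
# first colon on the line is the separator and colons inside URLs are kept.
PAIR_RE = re.compile(r'^[ \t]*(\w+)[ \t]*:[ \t]*(.*?)[ \t]*,?[ \t]*$', re.MULTILINE)


def parse_pairs(text):
    """Return a dict mapping each name to its raw (still quoted) value string."""
    return {name: value for name, value in PAIR_RE.findall(text)}


def unquote(value):
    """Strip one layer of [ ] and one layer of double quotes from a raw value."""
    value = value.strip()
    if value.startswith('[') and value.endswith(']'):
        value = value[1:-1].strip()
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return value


data = parse_pairs(js_text)
print(unquote(data['product_selling_price']))
print(data['is_online_only'])
```

The values stay as text: nothing is evaluated, so an entry such as the `jQuery(...)` call in `product_category_id` would come back as the literal source string, and `true`/`false` come back as the strings `'true'` and `'false'`.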

Upvotes: 5
