Reputation: 143
I have a page parsed with Beautiful Soup, but it contains this JavaScript code:
<script type="text/javascript">
var utag_data = {
customer_id : "_PHL2883198554",
customer_type : "New",
loyalty_id : "N",
declined_loyalty_interstitial : "false",
site_version : "Desktop Site",
site_currency: "de_DE_EURO",
site_region: "uk",
site_language: "en-GB",
customer_address_zip : "",
customer_email_hash : "",
referral_source : "",
page_type : "product",
product_category_name : ["Lingerie"],
product_category_id :[jQuery("meta[name=defaultParent]").attr("content")],
product_id : ["5741462261401"],
product_image_url : ["http://images.urbanoutfitters.com/is/image/UrbanOutfitters/5741462261401_001_b?$detailmain$"],
product_brand : ["Pretty Polly"],
product_selling_price : ["20.0"],
promo_id : "6",
product_referral : ["WOMENS-SHAPEWEAR-LINGERIE-SOLUTIONS-EU"],
product_name : ["Pretty Polly Shape It Up Tummy Shaping Camisole"],
is_online_only : true,
is_back_in_stock : false
}
</script>
How can I get some of the values out of this script? Should I treat it as plain text, i.e. store it in a variable, split it, and then pick out the data I need?
Thanks
Upvotes: 4
Views: 9492
Reputation: 21446
First get the text of the script, for example:
js_text = soup.find('script', type="text/javascript").text
Then you can use a regex to find the data. There is probably an easier way to do this, but a regex shouldn't be hard either.
import re
# One "name : value" pair per line; the trailing comma is optional.
regex = re.compile(r'^[ \t]*(\w+)[ \t]*:[ \t]*(.*?)[ \t]*,?[ \t]*$', re.MULTILINE)
pairs = re.findall(regex, js_text)  # list of (name, value) tuples, already stripped of surrounding whitespace
data = dict(pairs)  # optional: look values up by name
This returns a list of (name, value) tuples in the order they appear in the script, which you can work with directly or convert to a dictionary as shown. The values are still raw JavaScript text (for example "New" keeps its quotes), so strip or parse them as needed.
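The values matched by the regex approach above are still raw JavaScript text. As a rough sketch of one way to turn them into Python values, the snippet below tries `json.loads` on each value and falls back to the raw string when that fails (as it does for the `jQuery(...)` expression). The helper name `parse_utag` and the shortened `js_text` sample are assumptions made for illustration, and the sketch only handles one `name : value` pair per line, as in the question.

```python
import json
import re

# Shortened copy of the script text from the question (stand-in for js_text).
js_text = '''
var utag_data = {
customer_type : "New",
site_region: "uk",
product_category_id :[jQuery("meta[name=defaultParent]").attr("content")],
product_id : ["5741462261401"],
product_selling_price : ["20.0"],
is_online_only : true,
is_back_in_stock : false
}
'''

# One "name : value" pair per line; the trailing comma is optional.
PAIR_RE = re.compile(r'^[ \t]*(\w+)[ \t]*:[ \t]*(.*?)[ \t]*,?[ \t]*$', re.MULTILINE)


def parse_utag(text):
    """Hypothetical helper: map each name to a Python value where possible."""
    result = {}
    for name, raw in PAIR_RE.findall(text):
        try:
            # Works for quoted strings, lists of strings, true/false.
            result[name] = json.loads(raw)
        except ValueError:
            # Not valid JSON (e.g. a jQuery call): keep the raw text.
            result[name] = raw
    return result


utag = parse_utag(js_text)
print(utag["product_id"][0])   # 5741462261401
print(utag["is_online_only"])  # True
```

This only works because the values in this particular script happen to look like JSON; a script with single-quoted strings, comments, or values spanning several lines would need a real JavaScript parser instead.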
Upvotes: 5