Reputation: 315
I have been learning web scraping and managed to scrape a site that returns a large number of JavaScript var declarations, such as:
var FancyboxI18nClose = 'Close';
var FancyboxI18nNext = 'Next';
var FancyboxI18nPrev = 'Previous';
var PS_CATALOG_MODE = false;
var added_to_wishlist = '.';
var ajax_allowed = true;
var ajaxsearch = true;
var attribute_anchor_separator = '-';
var attributesCombinations = [{"id_attribute":"100","id_attribute_group":"1","attribute":"38_5"},{"id_attribute":"101","id_attribute_group":"1","attribute":"39"},{"id_attribute":"103","id_attribute_group":"1","attribute":"40"},{"id_attribute":"104","id_attribute_group":"1","attribute":"40_5"},{"id_attribute":"105","id_attribute_group":"1","attribute":"41"},{"id_attribute":"107","id_attribute_group":"1","attribute":"42"},{"id_attribute":"108","id_attribute_group":"1","attribute":"42_5"},{"id_attribute":"109","id_attribute_group":"1","attribute":"43"},{"id_attribute":"111","id_attribute_group":"1","attribute":"44"},{"id_attribute":"112","id_attribute_group":"1","attribute":"44_5"},{"id_attribute":"132","id_attribute_group":"1","attribute":"45"},{"id_attribute":"113","id_attribute_group":"1","attribute":"46"}];
There are many more, and they are all plain var declarations. I only want to extract one of the values, var attributesCombinations, so that I can pass it to json.loads and work with the JSON more easily.
This is what I tried:
try:
    product_li_tags = bs4.find_all(text=re.compile('attributesCombinations'))
except Exception:
    product_li_tags = []
but that gave me the entire script text, from the first var through attributesCombinations and beyond:
['var CUSTOMIZE_TEXTFIELD = 1;\nvar FancyboxI18nClose = \'Close\';\nvar FancyboxI18nNext = \'Next\';\nvar FancyboxI18nPrev = \'Previous\';\nvar PS_CATALOG_MODE = false;\nvar added_to_wishlist = \'The product was successfully added to your wishlist.\';\nvar ajax_allowed = true;\nvar ajaxsearch = true;\nvar allowBuyWhenOutOfStock = false;\nvar attribute_anchor_separator = \'-\';\nvar attributesCombinations = [{"id_attribute":"100","id_attribute_group":"1","att...........
How do I make it print only the value of var attributesCombinations?
Upvotes: 0
Views: 126
Reputation: 19154
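Do not pass re.compile to bs4's find_all; that returns the whole matching text node. Run the regular expression directly on the HTML string instead.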
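import json
import re

# htmlString is the page source as a str; the group captures only the array literal
match = re.compile(r'var\s*attributesCombinations\s*=\s*(\[.*?\])').findall(htmlString)
attributesCombinations = json.loads(match[0])
print(attributesCombinations)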
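As a self-contained sketch of the same idea, the snippet below runs the regular expression against a small sample string and returns an empty list when the variable is missing. The sample `html_string` and the helper name `extract_combinations` are assumptions for illustration, not part of the original page or answer. The pattern is also anchored on the closing `;`, which is an assumption about how the statement ends.

```python
import json
import re

# Hypothetical sample standing in for the scraped page source.
html_string = """
<script>
var ajaxsearch = true;
var attribute_anchor_separator = '-';
var attributesCombinations = [{"id_attribute":"100","id_attribute_group":"1","attribute":"38_5"},{"id_attribute":"101","id_attribute_group":"1","attribute":"39"}];
var other = 1;
</script>
"""

# Capture the array literal; the match ends at the first "]" followed by ";".
PATTERN = re.compile(r"var\s+attributesCombinations\s*=\s*(\[.*?\])\s*;", re.DOTALL)


def extract_combinations(source):
    """Return the parsed attributesCombinations list, or [] if not found."""
    match = PATTERN.search(source)
    if match is None:
        return []
    return json.loads(match.group(1))


combos = extract_combinations(html_string)
print([c["attribute"] for c in combos])
```

Using `search` with an explicit `None` check avoids the `IndexError` that `match[0]` would raise when the page does not contain the variable.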
Upvotes: 1
Reputation: 3925
A regular expression that captures just the array assigned to attributesCombinations, up to the first closing bracket, is
var attributesCombinations = (\[.*?\])
In Python, you can compile it as a raw string (no trailing semicolon is needed):
re.compile(r'var attributesCombinations = (\[.*?\])')
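One limitation of the non-greedy `\[.*?\]` is that it stops at the first `]`, so it would cut off an array that contains nested arrays. A possible alternative, not taken from the answers above, is to use the regular expression only to find where the value starts and let the standard library's `json.JSONDecoder.raw_decode` read one complete JSON value from that position. The helper name `extract_js_array` and the sample string are hypothetical, and the sketch assumes the value is valid JSON.

```python
import json
import re


def extract_js_array(source, name="attributesCombinations"):
    """Parse the JSON value assigned to `var <name> = ...`, or return None."""
    # Locate the start of the value; tolerate flexible whitespace around "=".
    start = re.search(r"var\s+" + re.escape(name) + r"\s*=\s*", source)
    if start is None:
        return None
    # raw_decode parses one JSON value beginning at the given index and
    # ignores whatever follows it (here, the trailing ";" and other vars).
    value, _end = json.JSONDecoder().raw_decode(source, start.end())
    return value


# Hypothetical input with a nested array that "\[.*?\]" would truncate.
sample = 'var a = 1;\nvar attributesCombinations = [{"id_attribute":"100","sizes":[38,39]}];\nvar b = 2;'
print(extract_js_array(sample))
```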
Upvotes: 2