Reputation: 358
In a Google Form like this one, for example, how would you create a list of the 'field IDs'?
var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
,[1620039602,"Size",null,0,[[2011436433,null,0]
]
]
,[445859665,"First Name",null,0,[[638818998,null,0]
]
]
,[1417046530,"Last Name",null,0,[[1952962866,null,0]
]
]
,[903472958,"E-mail",null,0,[[916445513,null,0]
]
]
,[549969484,"Phone Number",null,0,[[848461347,null,0
This is the relevant section of the HTML ^
I have the code so far:
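import requests
from bs4 import BeautifulSoup as bs

# url and proxies are defined elsewhere
a = requests.get(url, proxies=proxies)
soup = bs(a.text, 'html.parser')
# Collect every <script type="text/javascript"> tag on the page
fields = soup.find_all('script', {'type': 'text/javascript'})
form_info = fields[1]
print(form_info)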
but this returns lots of irrelevant data, and unless I add lots of str.replace() and str.split() calls I can't see an easy way to do this. That would also be extremely messy.
I do not have to use BeautifulSoup although it seems the obvious way to go.
In the example above I would need a list like:
[1089277187, 742914399, 2011436433, 638818998, 1952962866, 916445513, 848461347]
Upvotes: 0
Views: 541
Reputation: 4518
Beautiful Soup is meant for querying HTML tags, not the contents of a JavaScript variable, so one way to extract the data is to use a regex. You could match the digits that directly follow [[. However, this will also return 831400739, the number in front of "Product Title", which is not one of the IDs you want. It can be excluded after the regex by skipping the first item.
import re
script = '''var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
,[1620039602,"Size",null,0,[[2011436433,null,0]
]
]
,[445859665,"First Name",null,0,[[638818998,null,0]
]
]
,[1417046530,"Last Name",null,0,[[1952962866,null,0]
]
]
,[903472958,"E-mail",null,0,[[916445513,null,0]
]
]
,[549969484,"Phone Number",null,0,[[848461347,null,0'''
match = re.findall(r'(?<=\[\[)(\d+)', script)
# The pattern is a raw string (r'...') so Python does not treat the backslashes as string escapes.
# (?<= ) is a lookbehind: the text inside must come right before the match, but is not included in the results.
# \[\[ means two literal square brackets. The backslash tells regex to treat [ as a character and not as the start of a character class.
# (\d+) matches one or more digits (and returns them in the results).
results = match[1:]  # Skip the first item, which is 831400739
print(results)
This will output:
['1089277187', '742914399', '2011436433', '638818998', '1952962866', '916445513', '848461347']
You might want to cast the results to integers. Also, to make the code more robust, you might want to remove spaces and newlines before calling the regex function, so that whitespace between the two brackets cannot prevent a match, e.g.: formatted = script.replace(" ", "").replace('\n', '').replace('\r', '')
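Putting those two suggestions together, here is a minimal sketch. The function name extract_field_ids and the shortened sample string are only for illustration, and it assumes the first match is always the unwanted number, as in the snippet from the question:

```python
import re

def extract_field_ids(script):
    """Return the numbers that follow '[[' as integers, skipping the first match."""
    # Remove all whitespace so a space or line break between the two
    # brackets cannot hide a "[[" pair. This also strips spaces inside
    # the titles, which does not matter because only the digits are kept.
    compact = re.sub(r"\s+", "", script)
    # Same lookbehind as above: digits that directly follow "[[".
    matches = re.findall(r"(?<=\[\[)\d+", compact)
    # Assumption: the first match is the number in front of the first
    # title (831400739 in the question), so it is skipped.
    return [int(m) for m in matches[1:]]

# Shortened sample taken from the question (first two fields only).
sample = '''var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]'''

print(extract_field_ids(sample))  # [1089277187, 742914399]
```

If the layout of the real page differs from the snippet in the question, the "skip the first match" step is the part most likely to need adjusting.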
Upvotes: 1