Get field ids from a google form, python BeautifulSoup

Question

In a Google form like this, for example, how would you create a list of these 'field IDs'?

var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
,[1620039602,"Size",null,0,[[2011436433,null,0]
]
]
,[445859665,"First Name",null,0,[[638818998,null,0]
]
]
,[1417046530,"Last Name",null,0,[[1952962866,null,0]
]
]
,[903472958,"E-mail",null,0,[[916445513,null,0]
]
]
,[549969484,"Phone Number",null,0,[[848461347,null,0

This is the relevant section of the HTML ^

I have the code so far:

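    import requests
    from bs4 import BeautifulSoup as bs

    # url and proxies are defined elsewhere
    a = requests.get(url, proxies=proxies)
    soup = bs(a.text, 'html.parser')
    # Collect every <script type="text/javascript"> tag on the page
    fields = soup.find_all('script', {'type': 'text/javascript'})
    form_info = fields[1]
    print(form_info)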

but this returns lots of irrelevant data, and unless I include lots of str.replace() and str.split() calls I can't see an easy way to do this. That would also be extremely messy.

I do not have to use BeautifulSoup although it seems the obvious way to go.

In the example above I would need a list like:

[1089277187, 742914399, 2011436433, 638818998, 1952962866, 916445513, 848461347]

Greg · Accepted Answer

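Beautiful Soup is used to query HTML tags; it does not parse the contents of a JavaScript variable. Therefore one way to extract the data from the JavaScript variable is to use regex. You could match the digits that directly follow [[. However, this will also return 831400739, the first number in the data, which is not one of the IDs you want. It can be excluded after the regex by skipping the first item.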

import re

script = '''var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
,[1620039602,"Size",null,0,[[2011436433,null,0]
]
]
,[445859665,"First Name",null,0,[[638818998,null,0]
]
]
,[1417046530,"Last Name",null,0,[[1952962866,null,0]
]
]
,[903472958,"E-mail",null,0,[[916445513,null,0]
]
]
,[549969484,"Phone Number",null,0,[[848461347,null,0'''

match = re.findall(r'(?<=\[\[)(\d+)', script)
# The pattern is a raw string (r'...') so the backslashes reach the regex engine unchanged.
# (?<= ) is a lookbehind: the text inside it must come directly before the match, but is not included in the results.
# \[\[ matches two literal square brackets. The backslash tells regex to treat [ as a character and not as the start of a character class.
# (\d+) matches one or more digits (and returns them in the results)

results = match[1:]  # Skip the first item, which is 831400739
print(results)

This will output:

['1089277187', '742914399', '2011436433', '638818998', '1952962866', '916445513', '848461347']

You might want to cast the results to integers. Also, to make the code more robust, you might want to remove spaces and newlines before calling the regex function, e.g.: formatted = script.replace(" ", "").replace('\n', '').replace('\r', '')
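
Putting those two suggestions together, a minimal sketch might look like the following. The function name extract_field_ids and the shortened sample string are made up for illustration, and the code assumes the page keeps the same layout as the snippet in the question (the first [[ match is the number before the first title, not a field ID).

```python
import re

def extract_field_ids(script_text):
    """Return the numbers that follow '[[' as ints, skipping the first one."""
    # Remove all whitespace so '[[' is still found if the brackets
    # end up separated by spaces or line breaks.
    compact = re.sub(r'\s+', '', script_text)
    # Digits that come directly after two literal '[' characters.
    matches = re.findall(r'(?<=\[\[)\d+', compact)
    # Assumption: the first match is the number before the first title
    # (831400739 in the question), so it is skipped.
    return [int(m) for m in matches[1:]]

# Shortened copy of the data from the question
sample = '''var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
,[1620039602,"Size",null,0,[[2011436433,null,0'''

field_ids = extract_field_ids(sample)
print(field_ids)  # [1089277187, 742914399, 2011436433]
```

If Google changes the structure of FB_PUBLIC_LOAD_DATA_, the "skip the first match" rule may no longer hold, so it is worth checking the output against a form whose IDs you already know.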
