Reputation: 69
I need to extract a variable from an HTML webpage, if anyone could assist me.
Here is what the webpage contains:
<script>
var id = "5010";
</script>
I pretty much just need to extract that value from a webpage in Python. If anyone could help, that would be nice. Sorry if this is hard to understand.
Upvotes: 1
Views: 8245
Reputation: 14906
I find it easy to use Python's string split() method to handle this sort of thing.
EDIT: big update to handle new requirements
Something simple like:
html = """
<script>
var id = \"5010\";
var id2 = \"8888\";
var idX = \"XoX\";
</script>"""
varlist = {}
vars = html.split("var ")[1:] # get each var entry
for v in vars:
    name = v.split("=")[0].strip() # first part is the var [name = "]
    value = v.split("\"")[1] # second part is the value [ = "..."]
    varlist[name] = value # store it for printing below
print("Varlist - " + str(varlist))
---------------------
OUTPUT: Varlist - {'id': '5010', 'id2': '8888', 'idX': 'XoX'}
split() returns a list of strings, broken apart around the separator you search for. An optional second parameter sets the maximum number of splits. So by splitting on a string, optionally restricting it to one split, then taking the [0] or [1] element, it's possible to pick the input apart to get the data needed.
In the above, the first split is on "var ". This gives a list, since the string is split wherever there was a "var ", so the first part of each of these entries is the var name (and we throw away the junk from the beginning).
Then the code loops over each of these splits, fetching the var name by splitting on = and taking the [0] side. Next is the var value, which is always contained in quotes, so splitting on the " character should give a 3-item list, the [1] element being the value of the var. These are added to a Python dictionary just for the purposes of the example.
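The maximum-splits parameter mentioned earlier is not actually used in the loop above, so here is a minimal sketch of it for fetching a single variable by name. The helper name get_var and the sample string are made up for illustration, and it assumes the page writes the assignment exactly as "var name =" followed by a double-quoted value.

```python
# Hypothetical sample input, same shape as the page in the question
html = '<script>\nvar id = "5010";\n</script>'

def get_var(html, name):
    """Return the quoted value assigned to `var <name>`, or None if absent."""
    marker = "var " + name + " ="
    parts = html.split(marker, 1)   # at most one split: [before, after]
    if len(parts) < 2:
        return None                 # the variable was not found
    # parts[1] starts with ' "5010";', so the value sits between the first two quotes
    return parts[1].split('"', 2)[1]

print(get_var(html, "id"))
```

Limiting the split to one means the rest of the page stays in a single piece, which keeps the second split cheap and predictable. If the value is not quoted, the last line of the helper would raise an IndexError.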
If your values aren't always in quotes, perhaps it could be split on the ; instead, etc. Any sort of guaranteed pattern can be used.
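As a rough sketch of that variation, the loop below splits each entry on ; instead of on quotes, so unquoted values are picked up too. The sample page content is invented for the example, and it assumes every statement ends in a semicolon and that no value contains one.

```python
# Hypothetical sample: one unquoted value, one quoted value
html = """
<script>
var id = 5010;
var name = "XoX";
</script>"""

values = {}
for chunk in html.split("var ")[1:]:        # one chunk per var entry
    statement = chunk.split(";", 1)[0]      # e.g. 'id = 5010'
    name, raw = statement.split("=", 1)     # split on the first = only
    values[name.strip()] = raw.strip().strip('"')  # drop spaces and any quotes

print(values)
```

Everything comes back as a string, so a numeric value such as 5010 still needs int() if you want a number.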
Upvotes: 1
Reputation: 741
You can do this using urllib and regular expression searching.
import urllib.request
import re
url = "https://stackoverflow.com/questions/53111019/python-get-data-value-from-inside-script-html-tag"
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
#print(html)
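# re.DOTALL lets "." match newlines; ".*?" stops at the first closing tag
between_script_tags = re.search(r'<script>(.*?)</script>', html, re.DOTALL)
if between_script_tags:
    print(between_script_tags.group(1))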
urllib fetches the HTML from the page, and then re.search() looks for text in the HTML between '<script>' and '</script>'.
However, this only gets you plain text. E.g. in your case, between_script_tags.group(1) would be the string 'var id = "5010";' (plus the surrounding newlines).
You could go further to split this:
output = between_script_tags.group(1).split()
This would make output a list of four things: ['var', 'id', '=', '"5010";']
From here it is quite simple to extract the data you want; for example, output[3].strip('";') gives 5010.
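If you would rather skip the intermediate split, a regular expression can target the variable directly. This is only a sketch: the helper name extract_js_var and the sample string are assumptions, and the pattern only handles a "var" declaration with a double-quoted value.

```python
import re

# Hypothetical sample input standing in for the downloaded page
sample = '<html><script>\nvar id = "5010";\n</script></html>'

def extract_js_var(html, name):
    """Return the double-quoted value of `var <name>`, or None if not found."""
    # re.escape guards against names that contain regex metacharacters
    pattern = r'\bvar\s+' + re.escape(name) + r'\s*=\s*"([^"]*)"'
    match = re.search(pattern, html)
    return match.group(1) if match else None

print(extract_js_var(sample, "id"))
```

Because the pattern never uses ".", it works across newlines without re.DOTALL, and it returns None instead of raising when the variable is missing.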
Upvotes: 1