Reputation: 20342
I am trying to pull some data off an intranet site at work. I have lots and lots of items in a List; I am trying to parse both of these.
The List looks like this:
var $input = $(".typeahead");
unique_options_search = new Set([
"phe_daily_smgm",
"ex_legacy",
"dt_legacy",
etc., etc., etc.
]);
Is it simply a matter of logging in to the site and fetching that data element?
from bs4 import BeautifulSoup as bs
import requests
REQUEST_URL = 'https://corp-intranet-internal.com/admin/?page=0'
# Fetch the page once, passing the credentials with the request
response = requests.get(REQUEST_URL, auth=('[email protected]', 'my_pass'))
response.raise_for_status()  # stop early if the login or the request failed
# Parse the authenticated response rather than re-fetching without credentials
soup = bs(response.text, "lxml")
There must be more to it than this, right? At the least, I have to identify that list and parse it, but I'm not sure how to do that.
Upvotes: 1
Views: 80
Reputation: 1548
Assuming you already have the top string captured (the whole "var $input ... ]);" thing), and your list is always going to be what's in the brackets, you could extract what's in the brackets and then break the remainder into a list:
import re
mycode = """
var $input = $(".typeahead");
unique_options_search = new Set([
"barra_phe2s_daily_smgm",
"barra_eue4dukl_monthly_legacy",
"barra_eue4duk_monthly_legacy",
"barra_ussc4s_daily_legacy",
"barra_ussinm1_daily_smgm",
]);
"""
inbracks = mycode[mycode.index('[')+1:]
mylist = re.findall(r"['\"](.*?)['\"]", inbracks)
I'm sure there's a more complex regular expression you can use that says "Get every string you find within quotation marks, AFTER the first occurrence of '['." But instead, I just chopped off mycode to everything following the first occurrence of the '[' character, then did the re.findall on it.
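For what it's worth, here is a rough sketch of a regex-only variant that also stops at the closing ']' instead of reading to the end of the string. The function name extract_set_items is made up for illustration, and it assumes none of the quoted items contain a ']' character.

```python
import re

def extract_set_items(js_source):
    """Return the quoted strings found between the first '[' and the next ']'."""
    # re.DOTALL lets '.' match newlines, since the list spans several lines
    match = re.search(r"\[(.*?)\]", js_source, re.DOTALL)
    if match is None:
        return []
    bracket_body = match.group(1)
    # Group 1 remembers the opening quote so the closing quote must match it;
    # group 2 is the text between the quotes
    return [m.group(2) for m in re.finditer(r"""(["'])(.*?)\1""", bracket_body)]

mycode = """
var $input = $(".typeahead");
unique_options_search = new Set([
"barra_phe2s_daily_smgm",
"barra_eue4dukl_monthly_legacy",
]);
"""

mylist = extract_set_items(mycode)
```

Because the search is limited to the bracketed part, the ".typeahead" string on the first line and anything quoted after the "]);" are left out.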
Note that BeautifulSoup lets you parse tag-based things like HTML and XML. But when it sees something like the code in mycode, which is the sort of thing you might find somewhere in a <script> tag perhaps, then BeautifulSoup just treats it as "some string".
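So one way to put the two together might look like the sketch below: let BeautifulSoup find the <script> tags, then run the string extraction on the script text. The page markup here is a made-up stand-in (I haven't seen the real intranet page), and it assumes the list sits in an inline script that mentions unique_options_search.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page content; the real markup will almost certainly differ
page = """
<html><body>
<input class="typeahead">
<script>
var $input = $(".typeahead");
unique_options_search = new Set([
"phe_daily_smgm",
"ex_legacy",
"dt_legacy",
]);
</script>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")

options = []
for script in soup.find_all("script"):
    # .string is the inline script body as plain text (None for empty tags)
    text = script.string or ""
    if "unique_options_search" not in text:
        continue
    # Start looking for '[' only after the variable name
    start = text.index("unique_options_search")
    after_bracket = text[text.index("[", start) + 1:]
    options = re.findall(r"['\"](.*?)['\"]", after_bracket)
    break
```

With the real site you would pass the text of your logged-in response to BeautifulSoup in place of the page string.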
Upvotes: 1