ASH

Reputation: 20342

Trying to parse items in a list, which are embedded in a web page

I am trying to pull some data off an intranet site at work. The page contains a list with lots and lots of items, and I am trying to parse all of them.

The List looks like this:

    var $input = $(".typeahead");
    unique_options_search = new Set([

    "phe_daily_smgm",

    "ex_legacy",

    "dt_legacy",

   etc., etc., etc.

    ]);

Is it simply a matter of logging in to the site and fetching that data element?

from bs4 import BeautifulSoup as bs
import requests

REQUEST_URL = 'https://corp-intranet-internal.com/admin/?page=0'
delay = 5  # seconds to wait before giving up on the request

# Fetch the page once, with credentials; a second urllib request would not send them.
response = requests.get(REQUEST_URL, auth=('[email protected]', 'my_pass'), timeout=delay)
response.raise_for_status()  # stop here if the login or the request failed
soup = bs(response.text, "lxml")

There must be more to it than this, right? At the very least I have to identify that list and parse it, but I'm not sure how to do that.

Upvotes: 1

Views: 80

Answers (1)

Bill M.

Reputation: 1548

Assuming you already have the top string captured (the whole "var $input ... ]);" thing), and your list is always going to be what's in the brackets, you could cut the string down to what follows the opening bracket, then break the remainder into a list:

import re

mycode = """
    var $input = $(".typeahead");
    unique_options_search = new Set([

    "barra_phe2s_daily_smgm",

    "barra_eue4dukl_monthly_legacy",

    "barra_eue4duk_monthly_legacy",

    "barra_ussc4s_daily_legacy",

    "barra_ussinm1_daily_smgm",

    ]);
"""

inbracks = mycode[mycode.index('[')+1:]
mylist = re.findall(r"['\"](.*?)['\"]", inbracks)

I'm sure there's a more complex regular expression you can use that says "Get every string you find within quotation marks, AFTER the first occurrence of '['." But instead, I just trimmed mycode down to everything following the first occurrence of the '[' character, then ran re.findall on it.
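If you'd rather not slice by hand, here is one possible sketch: match the bracketed body first, then pull the quoted strings out of that match. It also stops at the closing ']', so quoted strings later in the script are not picked up. The helper name extract_set_items is made up for illustration, and it assumes the items themselves never contain ']' or quotation marks.

```python
import re

def extract_set_items(js_source):
    """Return the quoted strings inside the first [...] found in js_source."""
    # Hypothetical helper for illustration, not part of any library.
    # re.DOTALL lets '.' span the newlines between the list items.
    match = re.search(r"\[(.*?)\]", js_source, re.DOTALL)
    if match is None:
        return []
    return re.findall(r"['\"](.*?)['\"]", match.group(1))

sample = '''
    var $input = $(".typeahead");
    unique_options_search = new Set([
    "phe_daily_smgm",
    "ex_legacy",
    "dt_legacy",
    ]);
    var other = "not_in_the_list";
'''

items = extract_set_items(sample)
print(items)
```

With plain slicing from the first '[' onward, the trailing "not_in_the_list" string would have ended up in the result as well; bounding the match at ']' avoids that.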

Note that BeautifulSoup lets you parse tag-based things like HTML and XML. But when it sees something like the code in mycode, which is the sort of thing you might find inside a <script> tag, BeautifulSoup just treats it as "some string".
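Since the script body is just text anyway, one option is to search the raw page source (for example response.text from requests) for the assignment by name. This is only a sketch: it assumes the variable name unique_options_search from the question appears once on the page, and the page_source markup below is invented as a stand-in for the downloaded HTML.

```python
import re

# Stand-in for the downloaded HTML (e.g. response.text); this markup is invented.
page_source = """
<html><body>
<ul><li class="x">"decoy"</li></ul>
<script>
    var $input = $(".typeahead");
    unique_options_search = new Set([
    "phe_daily_smgm",
    "ex_legacy",
    ]);
</script>
</body></html>
"""

# Anchor on the variable name so other brackets or quoted strings on the page are ignored.
pattern = re.compile(r"unique_options_search\s*=\s*new Set\(\[(.*?)\]\)", re.DOTALL)
match = pattern.search(page_source)

# Fall back to an empty list if the assignment is not on the page.
options = re.findall(r"['\"](.*?)['\"]", match.group(1)) if match else []
print(options)
```

Anchoring on the variable name means attribute values such as class="x" and other quoted text elsewhere in the HTML never reach the findall step.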

Upvotes: 1
