Anna Plym

Reputation: 83

Python, looping over list of urls to parse html content

Below is the HTML source of one of the URLs:

<h1>Queue &lt;&lt;hotspot-00:26:BB:05:BB:10&gt;&gt; Statistics </h1>
<ul>
  <li>Source-addresses: 10.10.1.130
  <li>Destination-address: ::/0
  <li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Last update: Mon Sep 23 21:41:16 2019

</ul>

and here's my code:

Note that links is a list of URLs.

for link in links:
    page = requests.get(link).text
    sp1 = BeautifulSoup(page, "html.parser").findAll('h1')
    sp2 = BeautifulSoup(page, "html.parser").findAll('li')
    print(sp1,sp2)

Current output:

[<h1>Queue &lt;&lt;hotspot-00:26:BB:05:BB:10&gt;&gt; Statistics </h1>] [<li>Source-addresses: 10.10.1.130
  <li>Destination-address: ::/0
  <li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Last update: Tue Sep 24 00:27:05 2019

I'm trying to edit my code to get the following output:

hotspot-00:26:BB:05:BB:10, Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited

Upvotes: 1

Views: 461

Answers (1)

R. Arctor

Reputation: 728

First of all, you don't need to create two BeautifulSoup objects; parsing each page once is enough. As for your question:

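import re

import requests
from bs4 import BeautifulSoup

# links is the list of URLs from the question
for link in links:
    # Parse each page once and reuse the same soup for both lookups.
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    header = soup.find('h1').text
    # Keep only the text between << and >>.
    header = re.sub(r'.*<<(.*)>>.*', r'\g<1>', header)
    # Pick the li whose text starts with 'Limit-at:' and keep its first line.
    limit = [elem.text.strip() for elem in soup.find_all('li') if re.search(r'^Limit-at:', elem.text)][0].split('\n')[0]
    print(header, limit)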

I used the HTML you provided to test the above solution.

You're getting lists because you are using find_all (findAll in your code is its older alias), which always returns a list, even when there is only one match.
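A minimal sketch of that difference, run against an inline HTML string rather than a live request (the html string and variable names here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the page from the question.
html = "<h1>Title</h1><ul><li>one</li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")  # list-like ResultSet, even for a single match
first_item = soup.find("li")     # a single Tag, or None when nothing matches

print(len(all_items))
print(first_item.text)
```

Because find can return None, it is worth checking the result before reading .text when a page might be missing the tag.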

For the header I used find, which works the same way but returns only the first match. Then I do a regex substitution to remove everything except the desired portion of the header text.
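If the substitution feels opaque, the same extraction can be sketched with re.search and a capture group. This is only an alternative illustration; queue_name is a hypothetical helper name, not part of the code above:

```python
import re

def queue_name(header_text):
    """Return the text between << and >>, or None if the markers are absent."""
    match = re.search(r"<<(.*?)>>", header_text)
    return match.group(1) if match else None

print(queue_name("Queue <<hotspot-00:26:BB:05:BB:10>> Statistics "))
```

One difference worth knowing: re.sub returns the input unchanged when the pattern does not match, while this version makes the no-match case explicit by returning None.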

For the limit, things are a little trickier because the li tags are never closed, so each li element ends up nested inside the previous one. I loop through all of the li elements and keep the one whose text begins with 'Limit-at:'. The list comprehension produces a list, so I grab element 0 and split it on the newline character, which produces a new list. Then I grab element 0 of that list to get rid of the 'Last update' portion of the text.
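The same steps can be spelled out as an explicit loop, which may be easier to follow than the one-liner. This is a sketch against a trimmed copy of the HTML from the question, held in an inline string so no request is needed:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's HTML; note the <li> tags are never closed.
html = """<ul>
  <li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
  <li>Last update: Mon Sep 23 21:41:16 2019

</ul>"""

soup = BeautifulSoup(html, "html.parser")

limit = None
for li in soup.find_all("li"):
    text = li.get_text().strip()
    if text.startswith("Limit-at:"):
        # The element's text may also contain the items that follow it,
        # so keep only the first line.
        limit = text.split("\n")[0].strip()
        break

print(limit)
```

Leaving limit as None when no item matches avoids the IndexError that indexing an empty list with [0] would raise.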

Upvotes: 1
