Python, looping over list of urls to parse html content

Question

Below is the HTML source of the URL:

Queue <<hotspot-00:26:BB:05:BB:10>> Statistics 

  Source-addresses: 10.10.1.130
  
Destination-address: ::/0
  
Max-limit: 1.02Mb/2.04Mb (Total: unlimited)
  
Limit-at: 1.02Mb/2.04Mb (Total: unlimited)
  
Last update: Mon Sep 23 21:41:16 2019

and here's my code:

Note that links is a list of URLs.

for link in links:
    page = requests.get(link).text
    sp1 = BeautifulSoup(page, "html.parser").findAll('h1')
    sp2 = BeautifulSoup(page, "html.parser").findAll('li')
    print(sp1,sp2)

Current OUTPUT

[Queue <<hotspot-00:26:BB:05:BB:10>> Statistics 
] [Source-addresses: 10.10.1.130
  
Destination-address: ::/0
  
Max-limit: 1.02Mb/2.04Mb (Total: unlimited)
  
Limit-at: 1.02Mb/2.04Mb (Total: unlimited)
  Last update: Tue Sep 24 00:27:05 2019

I'm trying to edit my code to get the following output:

hotspot-00:26:BB:05:BB:10, Limit-at: 1.02Mb/2.04Mb (Total: unlimited)

R. Arctor · Accepted Answer

First of all, you don't need to create two BeautifulSoup objects; one per page is enough. As for your question:

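import re

import requests
from bs4 import BeautifulSoup

# links is the list of URLs from the question
for link in links:
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    # find() returns only the first matching tag
    header = soup.find('h1').text
    # keep only the text between << and >>
    header = re.sub(r'.*<<(.*)>>.*', r'\g<1>', header)
    # take the li whose text starts with 'Limit-at:' and cut it at the first newline
    limit = [elem.text.strip() for elem in soup.find_all('li') if re.search(r'^Limit-at:', elem.text)][0].split('\n')[0]
    print(header, limit)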

I tested the above solution against the HTML you provided.

You're getting lists because you are using find_all, which always returns a list.
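To make the difference concrete, here is a minimal sketch. The markup is a made-up stand-in, not the asker's real page, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = "<h1>Title</h1><ul><li>one</li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")   # list-like result, even when only one tag matches
first = soup.find("li")       # a single tag: the first match
missing = soup.find("table")  # None when nothing matches

item_texts = [li.get_text() for li in items]
first_text = first.get_text()
print(item_texts, first_text, missing)
```

Because find can return None, it is worth checking the result before reading .text from it when the pages are not guaranteed to share the same layout.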

For the header I used find, which works the same way but returns only the first match. I then use a regex substitution to remove everything except the desired portion of the header text.
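As an alternative sketch of the header step, re.search with a capture group pulls out just the part between the markers. Note that re.sub leaves untouched any text the pattern does not match, such as a trailing newline, whereas a capture group returns only the captured part. The helper name extract_queue_name is hypothetical, and the header string assumes the h1 text looks like the output shown in the question.

```python
import re

# Assumed h1 text, modelled on the output shown in the question
header = "Queue <<hotspot-00:26:BB:05:BB:10>> Statistics \n"

def extract_queue_name(text):
    """Return the text between << and >>, or None if the markers are absent."""
    match = re.search(r"<<(.*?)>>", text)
    return match.group(1) if match else None

name = extract_queue_name(header)
print(name)
```

Returning None for a header without the markers lets the loop skip or log an unexpected page instead of printing the whole header unchanged.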

The limit is a little trickier because it sits in a nested li element. I loop through all of the li elements and keep those whose text begins with 'Limit-at:'. Because the result is a list, I take element 0 and split it on the newline character, which produces a new list. Taking element 0 of that list drops the 'Last update' portion of the text.
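The nesting can be reproduced in isolation. The sketch below uses hand-written markup that only approximates the page (the original tags were not preserved in the question, so the structure is an assumption), with each li opened inside the previous one.

```python
from bs4 import BeautifulSoup

# Assumed structure: every <li> is nested inside the one before it
html = (
    "<ul>"
    "<li>Source-addresses: 10.10.1.130\n"
    "<li>Limit-at: 1.02Mb/2.04Mb (Total: unlimited)\n"
    "<li>Last update: Mon Sep 23 21:41:16 2019"
    "</li></li></li></ul>"
)
soup = BeautifulSoup(html, "html.parser")

# The text of an outer li includes the text of every li nested inside it,
# so pick the li whose own text starts with the label we want.
limit_li = next(
    li for li in soup.find_all("li")
    if li.get_text().strip().startswith("Limit-at:")
)

# Everything after the first newline belongs to the nested 'Last update' li
limit = limit_li.get_text().strip().split("\n")[0]
print(limit)
```

Using next over a generator stops at the first matching li; it raises StopIteration if no li matches, so pass a default such as next(..., None) when a page might lack the field.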
