Below is the HTML source of the URL:
<h1>Queue <<hotspot-00:26:BB:05:BB:10>> Statistics </h1>
<ul>
<li>Source-addresses: 10.10.1.130
<li>Destination-address: ::/0
<li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
<li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
<li>Last update: Mon Sep 23 21:41:16 2019
</ul>
and here's my code (note that links is a list of URLs):
import requests
from bs4 import BeautifulSoup

for link in links:
    page = requests.get(link).text
    sp1 = BeautifulSoup(page, "html.parser").findAll('h1')
    sp2 = BeautifulSoup(page, "html.parser").findAll('li')
    print(sp1, sp2)
Current output:
[<h1>Queue <<hotspot-00:26:BB:05:BB:10>> Statistics </h1>] [<li>Source-addresses: 10.10.1.130
<li>Destination-address: ::/0
<li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
<li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
<li>Last update: Tue Sep 24 00:27:05 2019
I'm trying to edit my code to get the following output:
hotspot-00:26:BB:05:BB:10, Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited
First of all, you don't need to create two BeautifulSoup objects. As for your question:
import re
import requests
from bs4 import BeautifulSoup

for link in links:
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    # Pull the first <h1> and keep only the part between << and >>.
    header = soup.find('h1').text
    header = re.sub(r'.*<<(.*)>>.*', r'\g<1>', header)
    # Keep the <li> whose text starts with 'Limit-at:'; splitting on the
    # newline drops the nested 'Last update' line.
    limit = [elem.text.strip() for elem in soup.find_all('li')
             if re.search(r'^Limit-at:', elem.text)][0].split('\n')[0]
    print(header, limit)
I used the HTML you provided to test the above solution.
You're getting lists there because you're using find_all, which always returns a list. For the header I used find, which does the same thing but only returns the first match. Then I do some regex substitution to remove everything but the desired portion of the header text.
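As a quick illustration of that substitution, here is a standalone sketch using the header text from your page; the regex captures whatever sits between << and >> and discards the rest:

import re

header = "Queue <<hotspot-00:26:BB:05:BB:10>> Statistics "
# The whole string matches, and the replacement keeps only group 1.
print(re.sub(r'.*<<(.*)>>.*', r'\g<1>', header))
# hotspot-00:26:BB:05:BB:10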
The limit is a little trickier because the <li> tags in your HTML are never closed, so the parser nests each one inside the previous; the 'Limit-at' element's text therefore also contains the 'Last update' line. So loop through all of the li elements, keeping the one whose text attribute begins with 'Limit-at:'. Because that comprehension produces a list, grab element 0 and split it on the newline character, which produces a new list; then grab element 0 of that to get rid of the 'Last update' portion of the text.
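To see that nesting concretely, here's a minimal sketch feeding just the <ul> from your page to html.parser (assuming, as your output suggests, that the parser leaves the unclosed <li> tags nested rather than auto-closing them):

import re
from bs4 import BeautifulSoup

html = """<ul>
<li>Source-addresses: 10.10.1.130
<li>Destination-address: ::/0
<li>Max-limit: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
<li>Limit-at: 1.02Mb/2.04Mb (Total: <i>unlimited</i>)
<li>Last update: Mon Sep 23 21:41:16 2019
</ul>"""

soup = BeautifulSoup(html, "html.parser")
limit_li = [li for li in soup.find_all('li') if re.search(r'^Limit-at:', li.text)][0]
# The nested 'Last update' <li> contributes a second line to the text...
print(repr(limit_li.text.strip()))
# ...so splitting on the newline and taking element 0 isolates the limit.
print(limit_li.text.strip().split('\n')[0])  # Limit-at: 1.02Mb/2.04Mb (Total: unlimited)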