Reputation: 55
I'm creating a web-scraper and am trying to request multiple urls that share the same url path except for a numbered id.
My code to scrape one url is as follows:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://beta.companieshouse.gov.uk/company/00930291/officers')
soup = bs(r.content, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)
The url shares the same structure except for the company numbers. I've tried the following code to try and get it to scrape multiple pages, but without success:
import requests
from bs4 import BeautifulSoup as bs
pages = []
for i in range(11003058, 11003059, 00930291):
    url = 'https://beta.companieshouse.gov.uk/company/' + str(i) + '/officers'
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs(page.text, 'lxml')
names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
print(names)
This is only giving me the first page (/11003058/officers), why is it not looping through them? Can anyone help?
Upvotes: 3
Views: 1645
Reputation: 389
range() in loops: iteration always includes the start value and excludes the stop value, so range(11003058, 11003059, ...) yields only 11003058.
Try this:
import requests
from bs4 import BeautifulSoup as bs
# Keep the company numbers as strings so the leading zeros are preserved
company_ids = ['11003058', '11003059', '00930291']
pages = []
i = 0
while i < len(company_ids):
    url = 'https://beta.companieshouse.gov.uk/company/' + company_ids[i] + '/officers'
    pages.append(url)
    i += 1
for item in pages:
    page = requests.get(item)
    soup = bs(page.text, 'lxml')
    # Collect the director names on this page
    names = [item.text.strip() for item in soup.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')]
    print(names)
Upvotes: 0
Reputation: 4315
This should resolve your problem.
The range() function returns a sequence of numbers that starts at 0 by default, increments by 1 by default, and stops before a specified number (the stop value itself is excluded).
Syntax:
range(start, stop, step)
https://docs.python.org/3/library/functions.html#func-range
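To make the start/stop/step behaviour concrete, here is a small sketch using the numbers from the question. The step of 930291 is an assumption about what the literal 00930291 was meant to be; the variable names are only for illustration.

```python
# Numbers taken from the question; 930291 stands in for the literal 00930291.
start, stop, step = 11003058, 11003059, 930291

# stop is exclusive, so the only value produced is start itself.
single = list(range(start, stop, step))
print(single)

# With the default step of 1 and stop raised by one, both consecutive IDs appear.
consecutive = list(range(11003058, 11003060))
print(consecutive)
```

This is why the original loop only ever requested /11003058/officers: the third argument is a step size, not a third value to visit.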
Replace your loop with:
company_id = ["11003058", "11003059", "00930291"]  # strings, so the leading zeros survive
pages = []
for i in company_id:
    url = 'https://beta.companieshouse.gov.uk/company/' + i + '/officers'
    pages.append(url)
Initialize soup as a list before iterating over pages:
soup = []
Then append each parsed page to the soup list:
for item in pages:
    page = requests.get(item)
    soup.append(bs(page.text, 'lxml'))
Collect and print the names list:
names = []
for items in soup:
    h2Obj = items.select('[class^=appointment]:not(.appointments-list):has([id^="officer-role-"]:contains(Director)) h2')
    for i in h2Obj:
        tagArray = i.findChildren()
        for tag in tagArray:
            # Keep only the link text inside each heading
            if isinstance(tag, Tag) and tag.name == 'a':
                names.append(tag.text)
print(names)
Output:
['MASRAT, Suheel', 'MARSHALL, Jack', 'SUTTON, Tim', 'COOMBES, John Frederick', 'BROWN, Alistair Stuart', 'COOMBES, Kenneth', 'LAFONT, Jean-Jacques Mathieu', 'THOMAS-KEEPING, Lindsay Charles', 'WILLIAMS, Janet Elizabeth', 'WILLIAMS, Roderick', 'WRAGG, Barry']
Add this import at the top of the script:
from bs4.element import Tag
Upvotes: 1
Reputation: 506
The syntax for range is range(start, stop, step). It loops from start to stop - 1 and increases by step each time. You're doing something weird here because in your case stop equals start + 1, so it is only going to loop once, with the start value.
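I suppose you just want to get those 3 URLs. Use strings for the company numbers, since an integer literal with a leading zero such as 00930291 is a syntax error in Python 3:
for i in ('11003058', '11003059', '00930291'):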
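As a rough sketch of how that loop could build the URLs, the snippet below uses a hypothetical helper, build_officer_urls, and a BASE_URL constant that are not part of the original code. It assumes every company number is an 8-digit numeric ID like the ones in the question, and pads integers with zeros to that width.

```python
# URL pattern from the question, with a placeholder for the company number.
BASE_URL = 'https://beta.companieshouse.gov.uk/company/{}/officers'


def build_officer_urls(company_ids):
    """Return one officers-page URL per company number.

    Hypothetical helper: IDs are converted to strings and zero-padded to
    8 characters, which assumes purely numeric 8-digit company numbers.
    """
    return [BASE_URL.format(str(cid).zfill(8)) for cid in company_ids]


# Strings keep their leading zeros; the integer 930291 is padded back to 00930291.
urls = build_officer_urls(['11003058', '11003059', 930291])
for url in urls:
    print(url)
```

Each URL in urls can then be passed to requests.get and parsed with BeautifulSoup exactly as in the question.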
Upvotes: 0