Reputation: 1102
I'm developing some scraping code and it keeps returning errors that I imagine others might be able to help with.
First I run this snippet:
import pandas as pd
from urllib.parse import urljoin
from bs4 import BeautifulSoup as BShtml
import requests
base = "http://www.reed.co.uk/jobs"
url = "http://www.reed.co.uk/jobs?datecreatedoffset=Today&pagesize=100"
r = requests.get(url).content
soup = BShtml(r, "html.parser")
df = pd.DataFrame(columns=["links"], data=[urljoin(base, a["href"]) for a in soup.select("div.pages a.page")])
df
I run the above snippet on the first page of today's job postings and extract the URLs at the bottom of the page in order to find the total number of pages that exist at that point in time. The regular expressions below take this out for me:
df['partone'] = df['links'].str.extract('([a-z][a-z][a-z][a-z][a-z][a-z]=[0-9][0-9].)', expand=True)
df['maxlink'] = df['partone'].str.extract('([0-9][0-9][0-9])', expand=True)
pagenum = df['maxlink'][4]
pagenum = pd.to_numeric(pagenum, errors='ignore')
Note that on the third line above, the number of pages is always contained within the second-to-last (out of five) URLs in this list. I'm sure there's a more elegant way of doing this, but it suffices as is.
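For what it's worth, a minimal sketch of one alternative, assuming every pagination link carries a pageno query parameter (the parameter name is taken from the loop URLs below, so treat it as an assumption), would be to parse the links rather than regex them:
from urllib.parse import urlparse, parse_qs
# hypothetical helper: read the pageno value from each pagination link
# and keep the largest one as the page count
def max_page(links):
    pages = []
    for link in links:
        qs = parse_qs(urlparse(link).query)
        if "pageno" in qs:
            pages.append(int(qs["pageno"][0]))
    return max(pages) if pages else 1
# pagenum = max_page(df["links"])
Either way, I then feed the number I've taken from the URL into a loop: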
result_set = []
loopbasepref = 'http://www.reed.co.uk/jobs?cached=True&pageno='
loopbasesuf = '&datecreatedoffset=Today&pagesize=100'
for pnum in range(1, pagenum):
    url = loopbasepref + str(pnum) + loopbasesuf
    r = requests.get(url).content
    soup = BShtml(r, "html.parser")
    df2 = pd.DataFrame(columns=["links"], data=[urljoin(base, a["href"]) for a in soup.select("div", class_="results col-xs-12 col-md-10")])
    result_set.append(df2)
    print(df2)
This is where I get an error. What I'm attempting to do is loop through all the pages that list jobs, starting at page 1 and going up to page N where N = pagenum, and then extract the URL that links through to each individual job page and store it in a dataframe. I've tried various combinations of soup.select("div", class_="") but receive an error each time that reads:
TypeError: select() got an unexpected keyword argument 'class_'
If anyone has any thoughts about this, and can see a good way forward, I'd appreciate the help!
Cheers
Chris
Upvotes: 2
Views: 739
Reputation: 180441
You can just keep looping until there is no next page:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
base = "http://www.reed.co.uk"
url = "http://www.reed.co.uk/jobs?datecreatedoffset=Today&pagesize=100"
def all_urls():
    r = requests.get(url).content
    soup = BeautifulSoup(r, "html.parser")
    # get the urls from the first page
    yield [urljoin(base, a["href"]) for a in soup.select("div.details h3.title a[href^=/jobs]")]
    nxt = soup.find("a", title="Go to next page")
    # title="Go to next page" is missing when there are no more pages
    while nxt:
        # wash/repeat until no more pages
        r = requests.get(urljoin(base, nxt["href"])).content
        soup = BeautifulSoup(r, "html.parser")
        yield [urljoin(base, a["href"]) for a in soup.select("div.details h3.title a[href^=/jobs]")]
        nxt = soup.find("a", title="Go to next page")
Just loop over the generator function to get the urls from each page:
for u in all_urls():
    print(u)
I also use a[href^=/jobs] in the selector, as there are other anchor tags that would match otherwise; this makes sure we only pull the job paths.
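As an aside, and depending on your BeautifulSoup version (this is an assumption on my part, not something the code above requires): releases that use the stricter soupsieve selector engine (bs4 4.7+) expect the attribute value to be quoted, so the equivalent selector there would be:
# same selector with a quoted attribute value for stricter CSS engines
soup.select('div.details h3.title a[href^="/jobs"]')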
In your own code, the correct way to use a selector would be:
soup.select("div.results.col-xs-12.col-md-10")
Your syntax is for find or find_all, where you use class_=... for CSS classes:
soup.find_all("div", class_="results col-xs-12 col-md-10")
But that is not the selector you want here regardless, since it targets the results container rather than the individual job links.
I'm not sure why you are creating multiple dataframes, but if that is what you want:
def all_urls():
    r = requests.get(url).content
    soup = BeautifulSoup(r, "html.parser")
    yield pd.DataFrame([urljoin(base, a["href"]) for a in soup.select("div.details h3.title a[href^=/jobs]")],
                       columns=["Links"])
    nxt = soup.find("a", title="Go to next page")
    while nxt:
        r = requests.get(urljoin(base, nxt["href"])).content
        soup = BeautifulSoup(r, "html.parser")
        yield pd.DataFrame([urljoin(base, a["href"]) for a in soup.select("div.details h3.title a[href^=/jobs]")],
                           columns=["Links"])
        nxt = soup.find("a", title="Go to next page")
dfs = list(all_urls())
That would give you a list of dfs:
In [4]: dfs = list(all_urls())
   ...: dfs[0].head()
In [5]: dfs[0].head(10)
Out[5]:
Links
0 http://www.reed.co.uk/jobs/tufting-manager/308...
1 http://www.reed.co.uk/jobs/financial-services-...
2 http://www.reed.co.uk/jobs/head-of-finance-mul...
3 http://www.reed.co.uk/jobs/class-1-drivers-req...
4 http://www.reed.co.uk/jobs/freelance-middlewei...
5 http://www.reed.co.uk/jobs/sage-200-consultant...
6 http://www.reed.co.uk/jobs/bereavement-support...
7 http://www.reed.co.uk/jobs/property-letting-ma...
8 http://www.reed.co.uk/jobs/graduate-recruitmen...
9 http://www.reed.co.uk/jobs/solutions-delivery-...
But if you want just one then use the original code with itertools.chain:
from itertools import chain
df = pd.DataFrame(columns=["links"], data=list(chain.from_iterable(all_urls())))
Which will give you all the links in one df:
In [7]: from itertools import chain
   ...: df = pd.DataFrame(columns=["links"], data=list(chain.from_iterable(all_urls())))
In [8]: df.size
Out[8]: 675
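If you went with the version that yields a DataFrame per page instead, the same single dataframe could be built with pd.concat; a small sketch, assuming all_urls() is the DataFrame-yielding variant above:
import pandas as pd
# concatenate the per-page dataframes into one, renumbering the index
df = pd.concat(all_urls(), ignore_index=True)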
Upvotes: 1