C.Robin

Reputation: 1102

Correct div-class combination for soup.select()

I'm developing some scraping code, and it keeps returning errors that I imagine others might be able to help with.

First I run this snippet:

import pandas as pd
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup as BShtml

base = "http://www.reed.co.uk/jobs"

url = "http://www.reed.co.uk/jobs?datecreatedoffset=Today&pagesize=100"
r = requests.get(url).content
soup = BShtml(r, "html.parser")

df = pd.DataFrame(columns=["links"], data=[urljoin(base, a["href"]) for a in soup.select("div.pages a.page")])
df

I run the above snippet on the first page of today's job postings and extract the URLs at the bottom of the page in order to find the total number of pages that exist at that point in time. The regular expressions below pull this out for me:

df['partone'] = df['links'].str.extract('([a-z]{6}=[0-9]{2}.)', expand=True)   # matches e.g. "pageno=NN" plus one extra character
df['maxlink'] = df['partone'].str.extract('([0-9]{3})', expand=True)
pagenum = df['maxlink'][4]                                                      # see note below on why index 4
pagenum = pd.to_numeric(pagenum, errors='ignore')

Note that on the third line above, the number of pages is always contained within the second-from-last (out of five) URLs in this list. I'm sure there's a more elegant way of doing this, but it suffices as is; a tidier route might be to read the page number straight out of the pagination links' query strings, something like the rough sketch below:
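
from urllib.parse import urlparse, parse_qs

# Rough sketch, untested: assumes each pager link's href carries a pageno=N
# query parameter, and takes the largest value found as the page count.
page_numbers = [
    int(parse_qs(urlparse(link).query)["pageno"][0])
    for link in df["links"]
    if "pageno" in parse_qs(urlparse(link).query)
]
pagenum = max(page_numbers)

In any case, I then feed the number I've taken from the URL into a loop: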

result_set = []

loopbasepref = 'http://www.reed.co.uk/jobs?cached=True&pageno='
loopbasesuf = '&datecreatedoffset=Today&pagesize=100'
for pnum in range(1,pagenum):
    url = loopbasepref + str(pnum) + loopbasesuf
    r = requests.get(url).content
    soup = BShtml(r, "html.parser")
    df2 = pd.DataFrame(columns=["links"], data=[urljoin(base, a["href"]) for a in soup.select("div", class_="results col-xs-12 col-md-10")])
    result_set.append(df2)
    print(df2)

This is where I get an error. What I'm attempting to do is loop through all the pages that list jobs, starting at page 1 and going up to page N where N = pagenum, then extract the URLs that link through to each individual job page and store them in a dataframe. I've tried various combinations of soup.select("div", class_="") but receive an error each time that reads: TypeError: select() got an unexpected keyword argument 'class_'.

If anyone has any thoughts about this and can see a good way forward, I'd appreciate the help!

Cheers

Chris

Upvotes: 2

Views: 739

Answers (1)

Padraic Cunningham

Reputation: 180441

You can just keep looping until there is no next page:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "http://www.reed.co.uk"
url = "http://www.reed.co.uk/jobs?datecreatedoffset=Today&pagesize=100"

def all_urls():
    r = requests.get(url).content
    soup = BeautifulSoup(r, "html.parser")
    # get the urls from the first page
    yield [urljoin(base, a["href"]) for a in soup.select('div.details h3.title a[href^="/jobs"]')]
    nxt = soup.find("a", title="Go to next page")
    # title="Go to next page" is missing when there are no more pages
    while nxt:
        # wash/repeat until no more pages
        r = requests.get(urljoin(base, nxt["href"])).content
        soup = BeautifulSoup(r, "html.parser")
        yield [urljoin(base, a["href"]) for a in soup.select('div.details h3.title a[href^="/jobs"]')]
        nxt = soup.find("a", title="Go to next page")

Just loop over the generator function to get the urls from each page:

for u in all_urls():
    print(u)

I also use a[href^="/jobs"] in the selector because other links match div.details h3.title a, so this makes sure we only pull the job paths.
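
As a quick, self-contained illustration of what that prefix selector does (made-up HTML, not the real Reed markup):

from bs4 import BeautifulSoup

html = """
<div class="details">
  <h3 class="title"><a href="/jobs/tufting-manager/308">Tufting Manager</a></h3>
  <h3 class="title"><a href="/company/acme">Acme Ltd</a></h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# [href^="/jobs"] keeps only anchors whose href starts with /jobs,
# so the /company link above is filtered out.
print([a["href"] for a in soup.select('div.details h3.title a[href^="/jobs"]')])
# ['/jobs/tufting-manager/308']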

In your own code, the correct way to use a selector would be:

soup.select("div.results.col-xs-12.col-md-10")

Your syntax is for find or find_all, where you use class_=... for CSS classes:

soup.find_all("div", class_="results col-xs-12 col-md-10")

But that is not the correct selector regardless.

Not sure why you are creating multiple dfs but if that is what you want:

def all_urls():
    r = requests.get(url).content
    soup = BeautifulSoup(r, "html.parser")
    yield pd.DataFrame([urljoin(base, a["href"]) for a in soup.select('div.details h3.title a[href^="/jobs"]')],
                       columns=["Links"])
    nxt = soup.find("a", title="Go to next page")
    while nxt:
        r = requests.get(urljoin(base, nxt["href"])).content
        soup = BeautifulSoup(r, "html.parser")
        yield pd.DataFrame([urljoin(base, a["href"]) for a in soup.select('div.details h3.title a[href^="/jobs"]')],
                           columns=["Links"])
        nxt = soup.find("a", title="Go to next page")


dfs = list(all_urls())

That would give you a list of dfs:

In [4]: dfs = list(all_urls())

In [5]: dfs[0].head(10)
Out[5]: 
                                               Links
0  http://www.reed.co.uk/jobs/tufting-manager/308...
1  http://www.reed.co.uk/jobs/financial-services-...
2  http://www.reed.co.uk/jobs/head-of-finance-mul...
3  http://www.reed.co.uk/jobs/class-1-drivers-req...
4  http://www.reed.co.uk/jobs/freelance-middlewei...
5  http://www.reed.co.uk/jobs/sage-200-consultant...
6  http://www.reed.co.uk/jobs/bereavement-support...
7  http://www.reed.co.uk/jobs/property-letting-ma...
8  http://www.reed.co.uk/jobs/graduate-recruitmen...
9  http://www.reed.co.uk/jobs/solutions-delivery-...

But if you want just one DataFrame, use the original (list-yielding) version with itertools.chain:

from itertools import chain
df = pd.DataFrame(columns=["links"], data=list(chain.from_iterable(all_urls())))

Which will give you all the links in one df:

In [7]:  from itertools import chain
   ...:  df = pd.DataFrame(columns=["links"], data=list(chain.from_iterable(all_
   ...: urls())))
   ...: 

In [8]: df.size
Out[8]: 675
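
If you went with the DataFrame-per-page version instead, pd.concat would also collapse the list into one frame; a quick sketch along the same lines:

import pandas as pd

# Sketch: same end result as the chain approach, built from the per-page frames
dfs = list(all_urls())                   # the DataFrame-yielding version above
df = pd.concat(dfs, ignore_index=True)   # one frame, rows renumbered from 0
print(df.shape)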

Upvotes: 1
