Seamus Lam

Reputation: 155

Parsing a web page with R

This is my first time posting here. I do not have much experience (less than a week) with HTML parsing/web scraping, and I am having difficulty parsing this web page:

https://www.jobsbank.gov.sg/

What I want to do is parse the content of all the available job listings on the site.

My approach:

  1. Click Search with an empty search bar, which returns all listed records. The resulting web page is: https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do

  2. Provide the search-result web address to R and identify all the job-listing links.

  3. Supply the job-listing links to R, and have R visit each listing and extract its content.

  4. Look for the next page and repeat steps 2 and 3.

However, the problem is that the web address from step 1 does not take me to the search-result page. Instead, it redirects me back to the home page.

Is there any way to overcome this problem?

Supposing I manage to get the web address for the search results, I intend to use the following code:

library(RCurl)  # provides getURLContent

base_url  <- "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"
base_html <- getURLContent(base_url, cainfo = "cacert.pem")[[1]]
# Crude link extraction: split the raw HTML on each "a href="
links     <- strsplit(base_html, "a href=")[[1]]
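
As an alternative to splitting the raw HTML, I understand the XML package can parse the page and extract the links with XPath (a sketch only; I have not tested it on this site):

library(RCurl)
library(XML)

base_url  <- "https://www.jobsbank.gov.sg/ICMSPortal/portlets/JobBankHandler/SearchResult.do"
base_html <- getURLContent(base_url, cainfo = "cacert.pem")[[1]]

# Parse the HTML properly and collect every href attribute via XPath
doc   <- htmlParse(base_html, asText = TRUE)
links <- xpathSApply(doc, "//a/@href")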

Upvotes: 0

Views: 395

Answers (1)

Spacedman

Reputation: 94222

Learn to use the web developer tools in your web browser (hint: use Chrome or Firefox).

Learn about HTTP GET and HTTP POST requests.

Notice the search box sends a POST request.

See what the Form Data parameters are (they seem to be {actionForm.checkValidRequest}: YES and {actionForm.keyWord}: my search string).

Construct a POST request, with that form data in it, using one of the R HTTP packages.

Hope the server doesn't care about cookies; if it does, get the cookies first and send them back with your requests.
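
For example, a minimal sketch of the cookie handling with RCurl (the cookie-jar file name is arbitrary, and whether this site actually needs cookies is an assumption):

library(RCurl)

# Reuse one curl handle so cookies the server sets persist across requests
curl <- getCurlHandle(cookiefile = "cookies.txt", followlocation = TRUE)

# Hit the home page first so the server can establish a session...
invisible(getURL("https://www.jobsbank.gov.sg/", curl = curl, cainfo = "cacert.pem"))
# ...then pass the same handle to the postForm() call below via curl = curl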

Hence you end up using postForm from the RCurl package:

 # url is the SearchResult.do address; use the exact form-field names seen in the dev tools
 p = postForm(url, .params=list(checkValidRequest="YES", keyWord="finance"))

And then just extract the table from p. Getting the next page involves constructing another form request with a bunch of different form parameters.
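
For instance, a minimal sketch of that extraction step with the XML package, assuming the results come back as an ordinary HTML table:

library(XML)

# Parse the POST response and pull out every HTML table it contains
doc    <- htmlParse(p, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
str(tables)  # inspect the list to see which table holds the job listings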

Basically, a web request is more than just a URL; there's a whole conversation going on between the browser and the server, involving form parameters, cookies, and sometimes AJAX requests that update parts of the page internally.

There are a lot of "I can't scrape this site" questions on SO, and although we could spoonfeed you the precise answer to this exact problem, I feel the world would be better served if we just told you to go and learn about the HTTP protocol, forms, and cookies; then you'll understand how to use the tools better.

Note: I've never seen a job site or a financial site that likes you scraping its content. Although I can't see a warning about it on this site, that doesn't mean it's not there, and I would be careful about breaking the Terms and Conditions of Use. Otherwise you might find all your requests failing.

Upvotes: 2
