Massimo Potenzi

Reputation: 107

How to scrape data from this website?

I need to log in to this site http://bit.do for scraping purposes. The data are protected by a password, but I can't figure out how to log in and access them in R.

I tried

 library(rvest)

 url       <- "http://bit.d o/#login/admin"
 pgsession <- html_session(url)
 pgform    <- html_form(pgsession)[[1]]

 filled_form <- set_values(pgform,
                           'username' = "test0001",
                           'password' = "qwerty1234")

 submit_form(pgsession, filled_form)

 url <- 'http://bit.d o/admin/url/http%3A||2F||2Fedition.cnn.com||2F2017||2F07||2F21||2Fopinions||2Ftrump-russia-putin-lain-opinion||2Findex.html'
 data_page  <- read_html(url)
 data_link  <- html_nodes(data_page, 'td > a')
 data_click <- html_nodes(data_page, 'td span:nth-child(1)')

but I get this kind of error

 Submitting with 'NULL'
 Error in xml2::url_absolute(form$url, session$url) : 
 Not compatible with STRSXP: [type=NULL].

What should I do? These are my testing credentials: username test0001, password qwerty1234. Here's an example of the protected data I want to scrape: http://bit.d o/admin/url/http%3A||2F||2Fedition.cnn.com||2F2017||2F07||2F21||2Fopinions||2Ftrump-russia-putin-lain-opinion||2Findex.html

IMPORTANT: Note that, due to a Stack Overflow restriction, I put a space between the d and the o in the domain name.

Upvotes: 0

Views: 281

Answers (1)

Oriol Mirosa

Reputation: 2826

Since the form has no url field, calling submit_form(pgsession, filled_form) triggers a call to xml2::url_absolute(form$url, session$url), which fails because form$url is NULL. To get past this, give form$url a value, even an empty one, before it is passed to url_absolute. Try adding the following line after you populate filled_form with set_values:

filled_form$url <- ''
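
For context, here is a minimal sketch of where that line could sit in the full flow. It keeps the pre-1.0 rvest function names used in the question (in rvest 1.0 and later they were renamed to session(), html_form_set(), session_submit() and session_jump_to()). The wrapper function login_and_fetch and its argument names are hypothetical, introduced only for illustration. The sketch also requests the protected page through the session with jump_to() rather than a fresh read_html(url), on the assumption that the site relies on session cookies set at login. Whether the login actually succeeds depends on how the site handles the form, so treat this as a starting point rather than a confirmed solution.

```r
# Sketch only: login_and_fetch is a hypothetical helper, not part of rvest.
# Calls are namespaced with rvest:: so nothing is attached globally.
login_and_fetch <- function(login_url, data_url, username, password) {
  pgsession <- rvest::html_session(login_url)

  # Assumes the login form is the first form on the page, as in the question
  pgform <- rvest::html_form(pgsession)[[1]]

  filled_form <- rvest::set_values(pgform,
                                   username = username,
                                   password = password)

  # The fix from this answer: form$url is NULL, so give it an (empty) value
  filled_form$url <- ""

  logged_in <- rvest::submit_form(pgsession, filled_form)

  # Assumption: the site uses cookies, so navigate within the same session
  data_page <- rvest::jump_to(logged_in, data_url)

  list(
    links  = rvest::html_nodes(data_page, "td > a"),
    clicks = rvest::html_nodes(data_page, "td span:nth-child(1)")
  )
}

# Example call (not run here; needs network access and valid credentials):
# result <- login_and_fetch(login_url, data_url, "test0001", "qwerty1234")
```

Here login_url and data_url stand for the two addresses from the question, with the space in the domain name removed.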

Upvotes: 1
