VadymB

Reputation: 75

How to properly set cookies to get URL content using httr

I need to download information from a website that is protected by cookies. I log in manually in the browser and then pass the resulting cookies to httr.

Here is a similar topic, but it does not solve my problem: (Copying cookie for httr)

library(httr)
url <- "http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ"

cook <- "_SMIDA=9117a9eb136353bd6956651bd59acd37; __utmt=1; __utma=29983421.1729484844.1413489369.1413625619.1413627797.3; __utmb=29983421.7.10.1413627797; __utmc=29983421; __utmz=29983421.1413489369.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"

response <- GET(url, config(cookie = cook))

content(x = response, as = "text", encoding = "UTF-8")

However, when I call content(), the returned page says that I am not logged in (the same result I get without the cookie).

How can I solve this problem?

(Test credentials are login: mytest2, pass: qwerty12)

Upvotes: 6

Views: 7474

Answers (2)

ypa y yhm

Reputation: 219

You can just try this:

url <- "http://httpbin.org/get"
httr::GET(url)
httr::GET(url, httr::add_headers(a = 1, b = 2))
httr::GET(url, httr::set_cookies(a = 1, b = 2))
httr::GET(url, httr::add_headers(a = 1, b = 2), httr::set_cookies(a = 1, b = 2))
httr::GET(url, httr::add_headers(a = 1, b = 2, cookie = 'c=3;d=4'), httr::set_cookies(a = 1, b = 2))
# Code adapted from: https://httr.r-lib.org/reference/GET.html

Here is the output of each command:

httr::GET(url)
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 378 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#|     "X-Amzn-Trace-Id": "Root=1-66a99dfc-3ee62d216a517e6844e8815f"
#|   }, 
#|   "origin": "101.200.73.219", 
#| ...

httr::GET(url, httr::add_headers(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 408 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "A": "1", 
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "B": "2", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#|     "X-Amzn-Trace-Id": "Root=1-66a99dfc-2fddaa4e49a8325309990191"
#| ...

httr::GET(url, httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 404 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "Cookie": "a=1;b=2", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#|     "X-Amzn-Trace-Id": "Root=1-66a99dfc-44b9d09700c6b7f87e086e40"
#|   }, 
#| ...

httr::GET(url, httr::add_headers(a = 1, b = 2), httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 434 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "A": "1", 
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "B": "2", 
#|     "Cookie": "a=1;b=2", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#| ...

httr::GET(url, httr::add_headers(a = 1, b = 2, cookie = 'c=3;d=4'), httr::set_cookies(a = 1, b = 2))
#| Response [http://httpbin.org/get]
#|   Date: 2024-07-31 02:14
#|   Status: 200
#|   Content-Type: application/json
#|   Size: 434 B
#| {
#|   "args": {}, 
#|   "headers": {
#|     "A": "1", 
#|     "Accept": "application/json, text/xml, application/xml, */*", 
#|     "Accept-Encoding": "deflate, gzip, br, zstd", 
#|     "B": "2", 
#|     "Cookie": "c=3;d=4", 
#|     "Host": "httpbin.org", 
#|     "User-Agent": "libcurl/7.81.0 r-curl/5.2.1 httr/1.4.7", 
#| ...

So httr::set_cookies behaves like a wrapper around httr::add_headers for the Cookie header, but when both of them set cookies, the value from httr::add_headers takes priority (see the last output above, where only "c=3;d=4" is sent).

Still, httr::set_cookies(...) is easier to read than httr::add_headers(cookie = ....), so I think you can just use it.
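To see what each helper prepares without sending a request, you can inspect the objects they return. This is a minimal sketch, assuming httr 1.x, where (as far as I know) both helpers return a request object with `options` and `headers` fields; the field names are my assumption about httr's internals, not something shown in the outputs above.

```r
# Minimal sketch: inspect what each helper builds, with no network access.
# Assumption: httr 1.x request objects expose `options` and `headers`.
if (requireNamespace("httr", quietly = TRUE)) {
  via_cookies <- httr::set_cookies(a = "1", b = "2")
  via_headers <- httr::add_headers(Cookie = "c=3;d=4")

  # The cookie string that set_cookies() hands to curl as an option.
  str(via_cookies$options)

  # The explicit Cookie header that add_headers() will send.
  str(via_headers$headers)
}
```

If the two values end up in different places like this, it would fit the behaviour above: an explicit Cookie header replaces the cookie string set through the curl option.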

Upvotes: 0

hrbrmstr

Reputation: 78832

This is how to use set_cookies with GET in httr:

GET("http://smida.gov.ua/db/emitent/year/xml/showform/32153/125/templ", 
    set_cookies(`_SMIDA` = "7cf9ea4bfadb60bbd0950e2f8f4c279d",
                `__utma` = "29983421.138599299.1413649536.1413649536.1413649536.1",
                `__utmb` = "29983421.5.10.1413649536",
                `__utmc` = "29983421",
                `__utmt` = "1",
                `__utmz` = "29983421.1413649536.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"))

That worked for me, well at least I think it did as I cannot read the language. A table comes back with the same structure and no prompt to login.

Unfortunately, the captcha on login prevents the use of RSelenium (or other, similar crawling packages), so you'll have to continue to grab those cookies manually (or use a utility to extract them from the session).
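If you copy the cookies from the browser as a single `name=value; name2=value2` string, you don't have to retype them one by one. Below is a minimal sketch of a helper that splits such a string into a named character vector, which can then be passed through the `.cookies` argument of set_cookies. The helper name `parse_cookie_string` and the cookie values are hypothetical placeholders, not part of httr or of the real session.

```r
# Hypothetical helper: turn "name=value; name2=value2" into a named vector.
parse_cookie_string <- function(cookie_string) {
  # Split on ";" and drop surrounding whitespace and empty pieces.
  pairs <- trimws(strsplit(cookie_string, ";", fixed = TRUE)[[1]])
  pairs <- pairs[nzchar(pairs)]

  # Split each pair at the FIRST "=" only, since values may contain "=".
  eq <- as.integer(regexpr("=", pairs, fixed = TRUE))
  keep <- eq > 0
  pairs <- pairs[keep]
  eq <- eq[keep]

  values <- substring(pairs, eq + 1)
  names(values) <- substring(pairs, 1, eq - 1)
  values
}

# Placeholder cookie string; replace it with the one copied from the browser.
cook <- "_SMIDA=abc123; __utmc=29983421; __utmz=29983421.1.1.utmcsr=(direct)|utmccn=(direct)"
cookies <- parse_cookie_string(cook)
print(names(cookies))

# Usage with httr (not run here, as it needs network access and a valid session):
# httr::GET(url, httr::set_cookies(.cookies = cookies))
```

Splitting at the first "=" matters here because values such as the `__utmz` one contain further "=" characters.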

Finally, you probably want to seriously consider changing those credentials, now :-)


EDIT: @VadymB and I both found that the code didn't work until we rebooted RStudio. Your mileage may vary.

Upvotes: 6
