Reputation: 93
I want to scrape a website. The website I want to scrape doesn’t have an API.
What I want to do is this (in Python):
import requests

with requests.Session() as conn:
    url = "http://demo.ilias.de/login.php"
    auth = {
        "username": "benjamin",
        "password": "iliasdemo"
    }
    conn.post(url, data=auth)
    response = conn.get(url)
    do_work(response)
When I try to do the same thing with HTTPoison, the website responds with "Please enable session cookies in your browser!". Elixir code:
HTTPoison.post "http://demo.ilias.de/login.php",
  "{\"username\":\"benjamin\", \"password\":\"iliasdemo\"}"
I guess the problem is with cookies.
UPD#1. It seems that not all cookies are saved: :hackney.cookies(headers) (where headers comes from %HTTPoison.Response{headers: headers}) does not output some of the cookies (e.g. authchallenge) that I see both in my browser and in the response of the Python code above. Could it be the case that hackney doesn't actually post anything?
Upvotes: 1
Views: 2292
Reputation: 11
I had a similar problem:
I make a GET request to a server API and it responds with a 301 redirect to the same location and a "Set-Cookie" header containing a sessionId. If you follow the redirect without sending back their cookie, they respond with the same redirect and a new sessionId cookie, and this pattern continues for as long as you never send their cookie back. On the other hand, if you do send their cookie back, they respond with a 200 status code and the data you asked for.
The problem is that hackney, and consequently HTTPoison, cannot handle this scenario. hackney does have a :follow_redirect option that, when set, makes it follow redirects, but it falls short of grabbing the cookies and sending them back in between redirects.
All the browsers I tried (Firefox, Chrome, IE) were able to pass this scenario. Python and wget did the job as well.
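For comparison, the loop a client has to run can be sketched in Python (in the spirit of the question's snippet): disable automatic redirects, fold each Set-Cookie value back into the next request, and stop at the first non-redirect response. The helper and function names here are my own:

```python
import requests

def cookie_header(set_cookie_values):
    """Join raw 'name=value; attrs' Set-Cookie values into one Cookie header value."""
    return "; ".join(v.split(";", 1)[0] for v in set_cookie_values)

def get_echoing_cookies(url, max_redirects=5):
    """Follow 301/302/307 by hand, sending back every cookie the server set."""
    jar = requests.cookies.RequestsCookieJar()
    for _ in range(max_redirects):
        resp = requests.get(url, cookies=jar, allow_redirects=False)
        jar.update(resp.cookies)  # keep e.g. the sessionId cookie
        if resp.status_code in (301, 302, 307):
            url = resp.headers["Location"]  # retry, now with the cookie attached
        else:
            return resp  # first non-redirect response carries the data
    raise RuntimeError("too many redirects")
```

requests.Session does exactly this bookkeeping automatically, which is why the Python version in the question works out of the box.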
Anyway, to make it short, I wrote a workaround for my case which may give some ideas to others with similar problems:
defmodule GVHTTP do
  defmacro __using__(_) do
    quote do
      use HTTPoison.Base

      def cookies_from_resp_headers(recv_headers) when is_list(recv_headers) do
        List.foldl(recv_headers, [], fn
          {"Set-Cookie", c}, acc -> [c | acc]
          _, acc -> acc
        end)
        |> Enum.map(fn raw_cookie ->
          :hackney_cookie.parse_cookie(raw_cookie)
          |> (fn
                [{cookie_name, cookie_value} | cookie_opts] ->
                  {cookie_name, cookie_value, cookie_opts}

                _error ->
                  nil
              end).()
        end)
        |> Enum.filter(fn
          nil -> false
          _ -> true
        end)
      end

      def to_request_cookie(cookies) do
        cookies
        |> Enum.map(fn {cookie_name, cookie_value, _cookie_opts} ->
          cookie_name <> "=" <> cookie_value
        end)
        |> Enum.join("; ")
        |> (&(("" == &1 && []) || [&1])).() # "" => [], "foo1=abc" => ["foo1=abc"]
      end

      def get(url, headers \\ [], options \\ []) do
        case options[:follow_redirect] do
          true ->
            hackney_options =
              case options[:max_redirect] do
                # allow HTTPoison to handle the case of a max_redirect overflow error
                0 -> options
                _ -> Keyword.drop(options, [:follow_redirect, :max_redirect])
              end

            case request(:get, url, "", headers, hackney_options) do
              {:ok, %HTTPoison.Response{status_code: code, headers: recv_headers}}
              when code in [301, 302, 307] ->
                {_, location} = List.keyfind(recv_headers, "Location", 0)

                req_cookie =
                  cookies_from_resp_headers(recv_headers)
                  |> to_request_cookie()

                new_options =
                  options
                  |> Keyword.put(:max_redirect, (options[:max_redirect] || 5) - 1)
                  |> Keyword.put(:hackney,
                    # add any new cookies along with the previous ones to the request
                    cookie:
                      [options[:hackney][:cookie] | req_cookie]
                      |> List.delete(nil)
                      |> Enum.join("; ")
                  )

                get(location, headers, new_options)

              resp ->
                resp
            end

          _ ->
            request(:get, url, "", headers, options)
        end
      end
    end # quote end
  end # __using__ end
end
Upvotes: 1