Reputation: 484
I'm trying to screen-scrape a URL containing special characters like the Danish character 'ø'.
The URL is:
url = "http://www.zara.com/dk/da/dame/tilbehør/tilbehør/stribet-hue-c271008p2195502.html"
In order to have OpenURI recognize it as a valid URL, I do:
url = Addressable::URI.parse(url).normalize.to_s
and parse it with:
doc = Nokogiri::HTML(open(url))
which raises:
OpenURI::HTTPError: 404 Not Found
I have no clue why OpenURI returns a 404, because the normalized URL works fine in a browser.
Why is this the case, and what do I have to do to fix it?
Upvotes: 3
Views: 2618
Reputation: 484
I found out that the problem was with the server behind the URL I was trying to parse: it rejected the default User-Agent that OpenURI sends.
The OpenURI documentation says that additional header fields can be specified by an optional hash argument:
open("http://www.ruby-lang.org/en/",
  "User-Agent" => "Ruby/#{RUBY_VERSION}",
  "From" => "foo@bar.invalid",
  "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}
I just used a different User-Agent and everything worked fine.
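For anyone hitting the same thing, here is a minimal sketch of what that looks like. The helper names `request_headers` and `fetch_html` and the User-Agent string are made up for illustration; any User-Agent the server accepts should do. It uses `URI.open`, which newer Rubies expect in place of a bare `open` for URLs.

```ruby
require "open-uri"

# Build the header hash that OpenURI takes as its optional argument.
# The default User-Agent below is a hypothetical placeholder value.
def request_headers(user_agent = "Mozilla/5.0 (compatible; ExampleScraper/1.0)")
  { "User-Agent" => user_agent }
end

# Fetch the response body, sending the custom User-Agent
# instead of OpenURI's default one.
def fetch_html(url, user_agent = "Mozilla/5.0 (compatible; ExampleScraper/1.0)")
  URI.open(url, request_headers(user_agent)) { |f| f.read }
end
```

With the normalized URL from the question, the parsing step would then be something like doc = Nokogiri::HTML(fetch_html(url)).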
Upvotes: 6