Daniel Friis

Reputation: 484

Why does OpenURI return a 404 when the parsed URL works fine in a browser?

I'm trying to screen-scrape a URL containing special characters like the Danish character 'ø'.

The URL is:

url = "http://www.zara.com/dk/da/dame/tilbehør/tilbehør/stribet-hue-c271008p2195502.html"

In order to have OpenURI recognize it as a valid URL, I do:

url = Addressable::URI.parse(url).normalize.to_s

and parse it with:

doc = Nokogiri::HTML(open(url))

which returns:

OpenURI::HTTPError: 404 Not Found

I have no clue why OpenURI returns a 404, because the normalized URL works fine in a browser.

Why is this the case, and what do I have to do to fix it?

Upvotes: 3

Views: 2618

Answers (1)

Daniel Friis

Reputation: 484

I found out that the problem was on the server side, not with the URL: the site I was trying to scrape rejected the default User-Agent that OpenURI sends.

The OpenURI documentation says that additional header fields can be specified with an optional hash argument:

open("http://www.ruby-lang.org/en/",
  "User-Agent" => "Ruby/#{RUBY_VERSION}",
  "From" => "foo@bar.invalid",
  "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}

I just used a different User-Agent and everything worked fine.
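
For anyone who wants something to copy, here is a minimal sketch of that fix. The helper name fetch_with_user_agent and the User-Agent string are my own placeholders, not anything from the OpenURI docs, and I'm assuming the server accepts a browser-like value. It uses URI.open, since on newer Rubies (3.0 and later) plain open no longer handles URLs.

```ruby
require "open-uri"

# Placeholder User-Agent string (an assumption): use whatever value the
# target server accepts.
BROWSER_USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/1.0)"

# Hypothetical helper: fetches the URL with a custom User-Agent header
# and returns the response body as a String.
# String keys in the options hash are sent as HTTP request headers.
def fetch_with_user_agent(url, user_agent = BROWSER_USER_AGENT)
  URI.open(url, "User-Agent" => user_agent) { |f| f.read }
end
```

With the normalized url from the question, the body can then be handed to Nokogiri as before: doc = Nokogiri::HTML(fetch_with_user_agent(url)).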

Upvotes: 6
