Robo Robok

Reputation: 22663

Unicode characters causing 404 error in file_get_contents()

I have an app that automatically visits URLs by following links. It works well as long as the URL doesn't contain Unicode characters.

For example, I have a link:

<a href="https://example.com/catalog/kraków/list.html">Kraków</a>

The page source contains a literal ó character in the link. When I try to do:

$href = $crawler->filter('a')->attr('href');
$html = file_get_contents($href);

It returns a 404 error. If I visit that URL in the browser, it works fine, because the browser replaces ó with %C3%B3.
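
For reference, %C3%B3 is simply the two UTF-8 bytes of ó written in hex. A quick sketch of that, assuming the page source (and this snippet) is UTF-8 encoded:

```php
<?php
// Assumes this file is saved as UTF-8, so 'ó' is the two bytes 0xC3 0xB3.
$encoded = rawurlencode('ó');

echo $encoded, "\n";                // %C3%B3
echo rawurlencode('kraków'), "\n";  // krak%C3%B3w
```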

What should I do to make it possible to visit that URL via file_get_contents()?

Upvotes: 0

Views: 350

Answers (1)

MaartenDev

Reputation: 5792

urlencode() can be used to encode parts of a URL. The following snippet takes the path /catalog/kraków/list.html and encodes each segment (catalog, kraków and list.html) separately instead of the entire URL, so the / separators in the path are preserved.

Check out the following solution:

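function encodeUri($uri){
    $urlParts = parse_url($uri);

    // Encode each path segment separately so the "/" separators survive.
    // Segments that already contain "%" are assumed to be encoded and are left alone.
    $path = implode('/', array_map(function($pathPart){
        return strpos($pathPart, '%') !== false ? $pathPart : urlencode($pathPart);
    }, explode('/', $urlParts['path'] ?? ''))); // a URL without a path yields ''

    // Keep the query string, if any, unchanged.
    $query = array_key_exists('query', $urlParts) ? '?' . $urlParts['query'] : '';

    return $urlParts['scheme'] . '://' . $urlParts['host'] . $path . $query;
}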


$href = $crawler->filter('a')->attr('href');
$html = file_get_contents(encodeUri($href)); // encodeUri($href) returns https://example.com/catalog/krak%C3%B3w/list.html

parse_url docs: https://www.php.net/manual/en/function.parse-url.php
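
One caveat with the function above: urlencode() turns spaces into "+", which is the form-encoding convention rather than path encoding, and the port is dropped when the URL is rebuilt. Below is a hedged variant as a sketch only. The name encodeUriPath() is hypothetical, it assumes PHP 7.4+ and an absolute UTF-8 URL, and it deliberately drops any userinfo and fragment.

```php
<?php
// Hypothetical variant of encodeUri(); the name and the extra handling are assumptions.
function encodeUriPath(string $uri): string
{
    $parts = parse_url($uri);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $uri; // not an absolute URL: leave it untouched
    }

    // rawurldecode() first so already-encoded segments are not encoded twice,
    // then rawurlencode() so spaces become %20 rather than "+".
    $segments = explode('/', $parts['path'] ?? '');
    $path = implode('/', array_map(
        fn (string $s): string => rawurlencode(rawurldecode($s)),
        $segments
    ));

    $port  = isset($parts['port']) ? ':' . $parts['port'] : '';
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';

    return $parts['scheme'] . '://' . $parts['host'] . $port . $path . $query;
}

echo encodeUriPath('https://example.com/catalog/kraków/list.html'), "\n";
```

Decoding before encoding means a link that is already percent-encoded passes through unchanged, while a raw one gets encoded exactly once.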

Upvotes: 1
