Reputation: 22663
I have an app that automatically visits URLs by following links. It works well as long as the URL doesn't contain non-ASCII characters.
For example, I have a link:
<a href="https://example.com/catalog/kraków/list.html">Kraków</a>
The link contains a literal ó character in the source. When I try to do:
$href = $crawler->filter('a')->attr('href');
$html = file_get_contents($href);
I get a 404 error. If I visit that URL in the browser, it works, because the browser replaces ó with %C3%B3.
What should I do to make it possible to visit that URL via file_get_contents()?
Upvotes: 0
Views: 350
Reputation: 5792
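rawurlencode can be used to encode the individual parts of a URL path (urlencode would also handle ó, but it turns spaces into +, which is only correct in query strings). The following snippet extracts the path /catalog/kraków/list.html and encodes its segments catalog, kraków and list.html separately, rather than the entire URL, so that the / separators are preserved.
Check out the following solution:
function encodeUri($uri) {
    $urlParts = parse_url($uri);
    // Encode each path segment on its own so the '/' separators survive.
    $path = implode('/', array_map(function ($pathPart) {
        // A segment containing '%' is assumed to be percent-encoded already.
        return strpos($pathPart, '%') !== false ? $pathPart : rawurlencode($pathPart);
    }, explode('/', $urlParts['path'] ?? '')));
    $query = array_key_exists('query', $urlParts) ? '?' . $urlParts['query'] : '';
    return $urlParts['scheme'] . '://' . $urlParts['host'] . $path . $query;
}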
$href = $crawler->filter('a')->attr('href');
$html = file_get_contents(encodeUri($href)); // requests https://example.com/catalog/krak%C3%B3w/list.html
parse_url docs: https://www.php.net/manual/en/function.parse-url.php
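The function above rebuilds the URL from scheme, host, path and query only, so a port or fragment in the link would be lost. If that matters, a slightly fuller variant might look like the sketch below. The name encodeUrlPath is hypothetical, and the sketch assumes an absolute, UTF-8 encoded URL on PHP 7.4 or later. It also decodes each segment before encoding it, which is one way (not the only one) to avoid double-encoding links that are already percent-encoded. Credentials in the URL (user:pass@) are deliberately dropped.

```php
<?php
// Hypothetical helper (an illustrative sketch, not part of any library):
// percent-encode the path of an absolute URL, keeping port, query and fragment.
function encodeUrlPath(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    // Decode first, then encode, so already-encoded segments are not encoded twice.
    $segments = explode('/', $parts['path'] ?? '');
    $path = implode('/', array_map(
        fn (string $segment): string => rawurlencode(rawurldecode($segment)),
        $segments
    ));

    $port     = isset($parts['port']) ? ':' . $parts['port'] : '';
    $query    = isset($parts['query']) ? '?' . $parts['query'] : '';
    $fragment = isset($parts['fragment']) ? '#' . $parts['fragment'] : '';

    return $parts['scheme'] . '://' . $parts['host'] . $port . $path . $query . $fragment;
}
```

With the link from the question, encodeUrlPath('https://example.com/catalog/kraków/list.html') should return https://example.com/catalog/krak%C3%B3w/list.html, and passing that result through the function again should leave it unchanged. The query string is copied as-is here; if it can contain non-ASCII characters too, it would need its own encoding step.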
Upvotes: 1