Reputation: 3763
Can I use non-Latin characters in my robots.txt file and sitemap.xml, like this?
robots.txt
User-agent: *
Disallow: /somefolder/
Sitemap: http://www.domainwithåäö.com/sitemap.xml
sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.domainwithåäö.com/</loc></url>
<url><loc>http://www.domainwithåäö.com/subpage1</loc></url>
<url><loc>http://www.domainwithåäö.com/subpage2</loc></url>
</urlset>
Or should I do it like this?
robots.txt
User-agent: *
Disallow: /somefolder/
Sitemap: http://www.xn--domainwith-z5al6t.com/sitemap.xml
sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.xn--domainwith-z5al6t.com/</loc></url>
<url><loc>http://www.xn--domainwith-z5al6t.com/subpage1</loc></url>
<url><loc>http://www.xn--domainwith-z5al6t.com/subpage2</loc></url>
</urlset>
Upvotes: 5
Views: 611
Reputation: 99
They must be ASCII encoded as follows:
Upvotes: 0
Reputation: 11184
On https://support.google.com/webmasters/answer/183668 Google writes: "Make sure that your URLs follow the RFC-3986 standard for URIs, the RFC-3987 standard for IRIs", so I guess the correct answer is that you have to follow these two standards.
My best guess is that it doesn't matter, because Google considers the two URLs identical. That might also be what the standards say, but I'm not good at reading them, so I can neither confirm nor deny that.
Using the xn-- format works. I haven't tried using Unicode characters to see if that also works.
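To see how the Unicode host name and its xn-- form relate, here is a minimal Python sketch using the standard library's built-in idna codec. That codec implements the older IDNA 2003 rules; using it here is an assumption for illustration, not something Google prescribes. The helper names to_punycode_host and to_unicode_host are hypothetical.

```python
def to_punycode_host(host: str) -> str:
    """Convert a Unicode host name to its ASCII (xn--) form, label by label."""
    return host.encode("idna").decode("ascii")


def to_unicode_host(host: str) -> str:
    """Convert an ASCII (xn--) host name back to its Unicode form."""
    return host.encode("ascii").decode("idna")


# Host name taken from the question.
ascii_host = to_punycode_host("www.domainwithåäö.com")
print(ascii_host)                   # ASCII-only form, safe to put in robots.txt
print(to_unicode_host(ascii_host))  # round-trips to the Unicode form
```

Only the host name is converted this way; the path part of a URL is handled by percent-encoding instead.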
Upvotes: 1
Reputation: 388
Because your example contains a URI with characters that are not in the US-ASCII table, you will need to percent-encode them.
Example from Bing:
Your URL:
http://www.domain.com/папка/
To Disallow: /папка/
Without Percent encoding (Not Compatible):
Disallow: /папка/
With Percent encoding (Compatible):
Disallow: /%D0%BF%D0%B0%D0%BF%D0%BA%D0%B0/
This Bing blog post may be of help.
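If you want to generate the percent-encoded form rather than type it by hand, a minimal Python sketch with urllib.parse.quote reproduces the Bing example above. It assumes the path should be encoded as UTF-8 bytes, which is what the example uses; the helper name robots_path is hypothetical.

```python
from urllib.parse import quote, unquote


def robots_path(path: str) -> str:
    """Percent-encode a URL path as UTF-8, leaving the slashes intact."""
    return quote(path, safe="/")


rule = "Disallow: " + robots_path("/папка/")
print(rule)  # Disallow: /%D0%BF%D0%B0%D0%BF%D0%BA%D0%B0/

# Decoding gives back the original path, so nothing is lost.
print(unquote(robots_path("/папка/")))
```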
For the XML sitemap, non-ASCII characters can be used, but the file itself must be UTF-8 encoded and the URLs must be escaped in the encoding your web server expects. See this guide by Google for a more detailed explanation with examples.
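As a rough sketch of how a sitemap entry could be produced, the Python below converts the host name to its xn-- form, percent-encodes the path and query as UTF-8, and then XML-escapes the result for the loc element. The function names to_ascii_url and loc_entry are hypothetical, and the use of Python's IDNA 2003 codec and UTF-8 escaping are assumptions; check what your own server expects.

```python
from urllib.parse import urlsplit, urlunsplit, quote
from xml.sax.saxutils import escape


def to_ascii_url(url: str) -> str:
    """Return an ASCII-only version of url (punycode host, percent-encoded path)."""
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port:
        host = f"{host}:{parts.port}"
    # Keep "%" safe so that already-escaped input is not escaped twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, host, path, query, parts.fragment))


def loc_entry(url: str) -> str:
    """Build one <url> element; escape() turns & into &amp; and so on."""
    return f"<url><loc>{escape(to_ascii_url(url))}</loc></url>"


print(loc_entry("http://www.domain.com/папка/"))
print(loc_entry("http://www.domain.com/a?x=1&y=2"))
```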
Upvotes: 0