Yakuhzi

Reputation: 1249

Swift URL.path changes encoding of utf-8 characters

Why does converting a String to a URL in Swift 4.2, and then converting the URL back to a String using url.path, change the encoding of special characters such as German umlauts (ä, ö, ü), even though I use UTF-8 encoding?

I wrote some sample code to show my problem. I encoded the strings to base64 in order to show that there is a difference.

I also have a similar unsolved problem with special characters and Swift here.

Sample Code

import Foundation

let string = "/path/to/file"
let stringUmlauts = "/path/to/file/with/umlauts/testäöü"

let base64 = Data(string.utf8).base64EncodedString()
let base64Umlauts = Data(stringUmlauts.utf8).base64EncodedString()

print(base64, base64Umlauts)

let url = URL(fileURLWithPath: string)
let urlUmlauts = URL(fileURLWithPath: stringUmlauts)

let base64Url = Data(url.path.utf8).base64EncodedString()
let base64UrlUmlauts = Data(urlUmlauts.path.utf8).base64EncodedString()

print(base64Url, base64UrlUmlauts)

Output

The base64 and base64Url strings are identical, but the base64Umlauts and base64UrlUmlauts strings differ.

"L3BhdGgvdG8vZmlsZQ==" for base64

"L3BhdGgvdG8vZmlsZQ==" for base64Url

"L3BhdGgvdG8vZmlsZS93aXRoL3VtbGF1dHMvdGVzdMOkw7bDvA==" for base64Umlauts

"L3BhdGgvdG8vZmlsZS93aXRoL3VtbGF1dHMvdGVzdGHMiG/MiHXMiA==" for base64UrlUmlauts

When I put the base64Umlauts and base64UrlUmlauts strings into an online Base64 decoder, both decode to /path/to/file/with/umlauts/testäöü. The ä, ö, ü look identical in both results, but they are not the same underlying characters.

Upvotes: 3

Views: 2120

Answers (1)

rmaddy

Reputation: 318774

stringUmlauts.utf8 encodes the precomposed Unicode characters ä, ö, ü (U+00E4, U+00F6, U+00FC), one scalar each.

But urlUmlauts.path.utf8 encodes the base letters a, o, u, each followed by the combining diaeresis ¨ (U+0308).

This is why the Base64 output differs: the characters look the same but are encoded as different byte sequences.
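As a rough check of that claim, the two Base64 strings from the question can be decoded and their trailing bytes compared. This is only a sketch: trailingBytes(ofBase64:count:) is a hypothetical helper written for this illustration, and the byte counts 6 and 9 are chosen to cover just the three umlauts in each form.

```swift
import Foundation

// The two Base64 strings printed in the question.
let base64Umlauts = "L3BhdGgvdG8vZmlsZS93aXRoL3VtbGF1dHMvdGVzdMOkw7bDvA=="
let base64UrlUmlauts = "L3BhdGgvdG8vZmlsZS93aXRoL3VtbGF1dHMvdGVzdGHMiG/MiHXMiA=="

// Hypothetical helper: decode Base64 and return the last `count` bytes.
func trailingBytes(ofBase64 encoded: String, count: Int) -> [UInt8] {
    guard let data = Data(base64Encoded: encoded) else { return [] }
    return Array(data.suffix(count))
}

// Three precomposed umlauts take 2 bytes each; three decomposed ones take 3 each.
let precomposedTail = trailingBytes(ofBase64: base64Umlauts, count: 6)
let decomposedTail = trailingBytes(ofBase64: base64UrlUmlauts, count: 9)

print(precomposedTail.map { String($0, radix: 16) })
// ["c3", "a4", "c3", "b6", "c3", "bc"]
print(decomposedTail.map { String($0, radix: 16) })
// ["61", "cc", "88", "6f", "cc", "88", "75", "cc", "88"]
```

The first tail is the UTF-8 encoding of U+00E4, U+00F6, U+00FC. The second is the plain letters a, o, u (0x61, 0x6f, 0x75), each followed by 0xcc 0x88, the UTF-8 encoding of U+0308.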

What's really interesting is that Array(stringUmlauts) and Array(urlUmlauts.path) compare as equal. The difference doesn't appear until you look at the UTF-8 encoding of two String values that otherwise compare as equal.

Since the base64 encoding is irrelevant, here's a more concise test:

import Foundation

let stringUmlauts = "/path/to/file/with/umlauts/testäöü"
let urlUmlauts = URL(fileURLWithPath: stringUmlauts)

print(stringUmlauts, urlUmlauts.path) // Both print the same text

let rawStr = stringUmlauts
let urlStr = urlUmlauts.path

print(rawStr == urlStr) // true
print(Array(rawStr) == Array(urlStr)) // true
print(Array(rawStr.utf8) == Array(urlStr.utf8)) // false!!!

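So how is the UTF-8 encoding of two equal strings different? Swift compares String and Character values by Unicode canonical equivalence, so a precomposed ä and an a followed by a combining diaeresis are treated as the same character even though they are stored as different scalars and therefore produce different UTF-8 bytes.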
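A minimal sketch of two strings that Swift treats as equal while their bytes differ. It uses explicit Unicode escapes instead of URL.path, so it does not depend on how a particular platform returns file paths; the names precomposed and decomposed are chosen for this illustration only.

```swift
let precomposed = "\u{E4}"    // ä as the single scalar U+00E4
let decomposed = "a\u{308}"   // a followed by U+0308 COMBINING DIAERESIS

// == on String compares by canonical equivalence, not by raw bytes.
print(precomposed == decomposed)                      // true
// Each string is one Character (one grapheme cluster).
print(precomposed.count, decomposed.count)            // 1 1

// The underlying scalars and UTF-8 bytes are different.
print(precomposed.unicodeScalars.map { $0.value })    // [228]
print(decomposed.unicodeScalars.map { $0.value })     // [97, 776]
print(Array(precomposed.utf8))                        // [195, 164]
print(Array(decomposed.utf8))                         // [97, 204, 136]
```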

One way to get matching bytes is to use precomposedStringWithCanonicalMapping on the result of path, which normalizes the string to its precomposed form (Unicode NFC).

let urlStr = urlUmlauts.path.precomposedStringWithCanonicalMapping

Now you get true from:

print(Array(rawStr.utf8) == Array(urlStr.utf8)) // now true
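If the byte-level comparison matters in more than one place, it may be safer to normalize both sides rather than only the path, so the result does not depend on which form either string arrived in. A small sketch under that assumption: nfcBytes is a hypothetical helper, and the escaped literals stand in for a typed string and a decomposed path.

```swift
import Foundation

// Hypothetical helper: UTF-8 bytes of a string after NFC normalization.
func nfcBytes(_ string: String) -> [UInt8] {
    return Array(string.precomposedStringWithCanonicalMapping.utf8)
}

let typed = "test\u{E4}\u{F6}\u{FC}"                  // precomposed äöü
let fromFileSystem = "testa\u{308}o\u{308}u\u{308}"   // stand-in for a decomposed path

print(Array(typed.utf8) == Array(fromFileSystem.utf8))  // false
print(nfcBytes(typed) == nfcBytes(fromFileSystem))      // true
```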

Upvotes: 4
