Reputation: 444
I'm trying to make JavaScript download a Unicode string as a text file. I'm at the point where I need to convert the string into a data URL, so that the user can open the URL and download the file. Here is a simplified version of my code:
var myString = "⌀怴ꁴ㥍䯖챻巏ܛ肜怄셀겗孉贜짥孍ಽ펾曍㩜䝺捄칡⡴얳锭劽嫍ᯕ�";
var link = document.createElement('a');
link.setAttribute('href', 'data:text/plain;base64,' + myString);
I don't know what character set to use or how to encode my string. I've tried combinations of encodeURI() and btoa(), but haven't managed to get anything working. encodeURI() throws a URIError (malformed URI) for some characters, like U+DA7B.
I would prefer the final downloaded file to have the same characters as the initial string.
Upvotes: 1
Views: 3280
Reputation: 53626
This problem is called out as the "Unicode Problem" in the Base64 documentation on MDN, and the solution there uses a TextEncoder:
function base64ToBytes(base64) {
  const binString = atob(base64);
  return Uint8Array.from(binString, (m) => m.codePointAt(0));
}
function bytesToBase64(bytes) {
  const binString = String.fromCodePoint(...bytes);
  return btoa(binString);
}
const encoded = bytesToBase64(new TextEncoder().encode("a Ā 𐀀 文 🦄"));
console.log(`encoded:`, encoded);
const decoded = new TextDecoder().decode(base64ToBytes("YSDEgCDwkICAIOaWhyDwn6aE"));
console.log(`decoded:`, decoded);
Unlike solutions involving encodeURIComponent, which don't work when the data URL is used somewhere that does not apply URI decoding, such as a <script> src attribute, this approach will work for everything:
// use the above approach for string input
function base64(string) {
  const bytes = new TextEncoder().encode(string);
  const binString = String.fromCodePoint(...bytes);
  return btoa(binString);
}
// declare a unicorn:
const unicorn = `🦄`;
// dynamically generate a new javascript module that exports
// a function called test() that console logs our unicorn:
const customScript = `
export function test() {
console.log("${unicorn}");
}
`;
console.log(`script is:`, customScript);
// turn this script into a data URL using TextEncoder and btoa:
const dataURL = `data:text/javascript;base64,${base64(customScript)}`
// then import that module by turning it into a data-url,
// which should console log our unicorn emoji:
import(dataURL)
  .then((lib) => lib.test())
  .catch(e => console.log(`loading failed: ${e.message}`));
If we try to do this with encodeURIComponent, the import fails, because the base64 payload decodes to percent-encoded text rather than valid JavaScript:
// dynamically generate a new javascript module
const unicorn = `🦄`;
const customScript = `
export function test() {
console.log("${unicorn}");
}
`;
console.log(`script is:`, customScript);
// turn this script into a data URL with encodeURIComponent:
const dataURL = `data:text/javascript;base64,${btoa(encodeURIComponent(customScript))}`;
// then import that as new JS module, which will fail.
import(dataURL)
  .then((lib) => lib.test())
  .catch(e => console.log(`loading failed: ${e.message}`));
(With the universal caveat that if the server's CSP is set to disallow data: URLs, this won't do anything. But when they are permitted, the correct way to turn text containing Unicode into a base64 data URL is to use TextEncoder.)
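Applied to the text/plain case from the question, the same idea might look like the sketch below. The helper name toBase64DataUrl and its mimeType parameter are made up for illustration. One thing to be aware of: TextEncoder replaces a lone surrogate such as \uDA7B with U+FFFD instead of throwing, so the downloaded file will not contain that exact code unit.

```javascript
// Hypothetical helper: build a base64 data URL from arbitrary text.
function toBase64DataUrl(text, mimeType = "text/plain") {
  // UTF-8 encode the text; lone surrogates become U+FFFD here.
  const bytes = new TextEncoder().encode(text);
  // Build the "binary string" btoa() expects, one char per byte.
  // A loop avoids the argument-count limit of spreading large arrays.
  let binString = "";
  for (const byte of bytes) {
    binString += String.fromCharCode(byte);
  }
  return `data:${mimeType};charset=UTF-8;base64,${btoa(binString)}`;
}

// "\uDA7B" is a lone high surrogate, like the one in the question.
const url = toBase64DataUrl("中文 \uDA7B");
console.log(url);
```

In a browser, the result can be passed to link.setAttribute('href', url) as in the question's code.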
Upvotes: 0
Reputation: 597205
You don't need to use base64 when using text in a data: URL; simply percent-encoding the text will suffice, e.g.:
var link = document.createElement('a');
link.setAttribute('href', 'data:text/plain;charset=UTF-8,' + encodeURIComponent(myString));
encodeURIComponent() charset-encodes the text to UTF-8, and then url-encodes the UTF-8 bytes, hence the inclusion of charset=UTF-8 in the data: URL.
But if you still want to use base64, you don't need to url-encode the text. Just charset-encode the text to bytes, then base64-encode the bytes, and specify the charset used in the data: URL, e.g.:
var link = document.createElement('a');
link.setAttribute('href', 'data:text/plain;charset=UTF-8;base64,' + btoa(unescape(encodeURIComponent(myString))));
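To make the two steps concrete, here is a small sketch (the variable names are only illustrative) showing what encodeURIComponent() produces and why wrapping it in unescape() yields something btoa() accepts: each %XX escape turns back into a single character whose code is that UTF-8 byte. Note that unescape() is a legacy function, and the comparison with TextEncoder is included only as a cross-check.

```javascript
const text = "中文";

// Step 1: UTF-8 encode, then percent-encode each byte.
const percentEncoded = encodeURIComponent(text);
console.log(percentEncoded); // "%E4%B8%AD%E6%96%87"

// Step 2: unescape() maps each %XX back to one char with code 0xXX,
// giving a "binary string" of the 6 UTF-8 bytes.
const binString = unescape(percentEncoded);
console.log(binString.length); // 6

// btoa() only accepts chars in the range 0-255, which now holds.
const viaUnescape = btoa(binString);

// Cross-check against the TextEncoder route: same bytes, same base64.
const viaTextEncoder = btoa(
  String.fromCharCode(...new TextEncoder().encode(text))
);
console.log(viaUnescape === viaTextEncoder); // true
```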
Upvotes: 1
Reputation: 3968
This works for me:
decodeURIComponent(atob(btoa(encodeURIComponent("中文"))))
// Output: 中文
As for \uDA7B in your case, it fails because it is a high surrogate (D800-DBFF), which is meaningful only when used as part of a surrogate pair.
That's why you get a URIError when you do:
encodeURIComponent('\uDA7B') // ERROR
Pair it with a character from the low surrogates (DC00-DFFF) and it works:
encodeURIComponent('\uDA7B\uDC01')
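If the input can't be guaranteed to be well-formed, one option is to replace lone surrogates before encoding. The following is only a sketch: replaceLoneSurrogates is a made-up helper name, and substituting U+FFFD is an assumption about what is acceptable for your data (the original code unit is lost). Newer runtimes also offer String.prototype.toWellFormed(), which does the same substitution.

```javascript
// Hypothetical helper: swap any unpaired surrogate for U+FFFD.
function replaceLoneSurrogates(text) {
  return text.replace(
    // A high surrogate not followed by a low one,
    // or a low surrogate not preceded by a high one.
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    "\uFFFD"
  );
}

const risky = "ok \uDA7B end";

// Encoding the raw string throws a URIError.
let threw = false;
try {
  encodeURIComponent(risky);
} catch (e) {
  threw = e instanceof URIError;
}
console.log(threw); // true

// After sanitizing, encoding succeeds.
const safe = encodeURIComponent(replaceLoneSurrogates(risky));
console.log(safe); // "ok%20%EF%BF%BD%20end"
```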
Upvotes: -1
Reputation: 89
You could try setting the download attribute and using URL encoding with text/plain.
const myString = '⌀怴ꁴ㥍䯖챻巏ܛ肜怄셀겗孉贜짥孍ಽ펾曍㩜䝺捄칡⡴얳锭劽嫍ᯕ�';
const link = document.createElement('a');
link.setAttribute('download', 'filename');
link.append("Download!");
link.setAttribute('href', 'data:,' + encodeURI(myString));
document.body.appendChild(link);
Upvotes: -1