Alexanderbira

Reputation: 444

Making a data URI from unicode string

I'm trying to make JavaScript download a unicode string as a text file. I'm at the point where I need to convert the unicode string into a data URL, so that the user can open the URL and download the file. Here is a simplification of my code:

var myString = "⌀怴ꁴ㥍䯖챻巏ܛ肜怄셀겗孉贜짥孍ಽ펾曍㩜䝺捄칡⡴얳锭劽嫍ᯕ�";

var link = document.createElement('a');
link.setAttribute('href', 'data:text/plain;base64,' + myString);

I don't know which character set to use or how to encode my string. I've tried combinations of encodeURI() and btoa(), but haven't managed to get anything working. encodeURI() throws a URIError (malformed URI) for some characters, such as U+DA7B.
I would prefer the final downloaded file to have the same characters as the initial string.

Upvotes: 1

Views: 3280

Answers (4)

This problem is called out as the "Unicode Problem" in the Base64 documentation on MDN, and the solution given there uses a TextEncoder:

function base64ToBytes(base64) {
  const binString = atob(base64);
  return Uint8Array.from(binString, (m) => m.codePointAt(0));
}

function bytesToBase64(bytes) {
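  // Map byte by byte: spreading a large array into String.fromCodePoint(...bytes)
  // can exceed the engine's argument limit and throw a RangeError.
  const binString = Array.from(bytes, (byte) => String.fromCodePoint(byte)).join("");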
  return btoa(binString);
}

const encoded = bytesToBase64(new TextEncoder().encode("a Ā 𐀀 文 🦄"));
console.log(`encoded:`, encoded);

const decoded = new TextDecoder().decode(base64ToBytes("YSDEgCDwkICAIOaWhyDwn6aE"));
console.log(`decoded:`, decoded);

Unlike solutions involving encodeURIComponent, which don't work when the data URL is used by something that does not apply URI decoding, such as <script> src attributes, this approach works everywhere:

// use the above approach for string input
function base64(string) {
  const bytes = new TextEncoder().encode(string);
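  // Map byte by byte so large inputs don't hit the spread-argument limit.
  const binString = Array.from(bytes, (byte) => String.fromCodePoint(byte)).join("");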
  return btoa(binString);
}

// declare a unicorn:
const unicorn = `🦄`;

// dynamically generate a new javascript module that exports
// a function called test() that console logs our unicorn:
const customScript = `
  export function test() {
    console.log("${unicorn}");
  }
`;

console.log(`script is:`, customScript);

// turn this script into a data URL using TextEncoder and btoa:
const dataURL = `data:text/javascript;base64,${base64(customScript)}`

// then import that module from its data URL,
// which should console log our unicorn emoji:
import(dataURL)
  .then((lib) => lib.test())
  .catch(e => console.log(`loading failed: ${e.message}`));

If we try to do this with encodeURIComponent instead, the import fails, because the base64 payload decodes to percent-encoded text rather than valid JavaScript:

// dynamically generate a new javascript module
const unicorn = `🦄`;
const customScript = `
  export function test() {
    console.log("${unicorn}");
  }
`;

console.log(`script is:`, customScript);

// turn this script into a data URL with encodeURIComponent:
const dataURL = `data:text/javascript;base64,${btoa(encodeURIComponent(customScript))}`;

// then import that as new JS module, which will fail.
import(dataURL)
  .then((lib) => lib.test())
  .catch(e => console.log(`loading failed: ${e.message}`));

(With the universal caveat that if the page's CSP disallows data: URLs, this won't do anything. But when they are permitted, the correct way to turn text containing Unicode into a base64 data URL is to use TextEncoder.)
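
Applied to the original question, the same TextEncoder approach can be wrapped in a small helper. This is only a sketch: the name textToBase64DataUrl is made up for illustration, and adding charset=UTF-8 is an assumption so that whatever opens the file reads it back as UTF-8.

```javascript
// Hypothetical helper (not from MDN): build a base64 data URL from any string.
function textToBase64DataUrl(text, mimeType = "text/plain") {
  // UTF-8 encode the text; TextEncoder replaces lone surrogates with U+FFFD.
  const bytes = new TextEncoder().encode(text);

  // Build the "binary string" one byte at a time so that large inputs
  // do not hit the argument limit of a spread call.
  let binString = "";
  for (const byte of bytes) {
    binString += String.fromCharCode(byte);
  }

  return `data:${mimeType};charset=UTF-8;base64,${btoa(binString)}`;
}

const url = textToBase64DataUrl("a Ā 𐀀 文 🦄");
console.log(url);
```

In a browser, the result could then be assigned to the anchor from the question with link.setAttribute('href', url).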

Upvotes: 0

Remy Lebeau

Reputation: 597205

You don't need base64 to put text in a data: URL; percent-encoding the text is enough, e.g.:

var link = document.createElement('a');
link.setAttribute('href', 'data:text/plain;charset=UTF-8,' + encodeURIComponent(myString));

encodeURIComponent() charset-encodes the text to UTF-8, and then url-encodes the UTF-8 bytes, hence the inclusion of charset=UTF-8 in the data: URL.
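
To make those two steps concrete, here is a small sketch (the sample string is made up, not taken from the question) showing the UTF-8 bytes turning up as percent escapes, and decodeURIComponent() reversing them:

```javascript
// "中" is U+4E2D, which is the three UTF-8 bytes E4 B8 AD.
const encoded = encodeURIComponent("中 ok");
console.log(encoded); // "%E4%B8%AD%20ok"

// Decoding reverses both steps and returns the original text.
const roundTripped = decodeURIComponent(encoded);
console.log(roundTripped === "中 ok"); // true

// The escaped text can be appended directly to the data: URL prefix.
const href = "data:text/plain;charset=UTF-8," + encoded;
console.log(href);
```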

But if you still want to use base64, you don't need to URL-encode the text. Just charset-encode the text to bytes, base64-encode the bytes, and specify the charset used in the data: URL, e.g.:

var link = document.createElement('a');
link.setAttribute('href', 'data:text/plain;charset=UTF-8;base64,' + btoa(unescape(encodeURIComponent(myString))));
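
The unescape(encodeURIComponent(...)) combination is an older idiom for turning text into a "binary string" of UTF-8 bytes that btoa() will accept. Since unescape() is deprecated, a TextEncoder can do the same job. The sketch below compares the two, assuming an environment that provides both; the helper names are hypothetical.

```javascript
// Older idiom: percent-encode as UTF-8, then turn each %XX back into one char.
function base64Legacy(text) {
  return btoa(unescape(encodeURIComponent(text)));
}

// Same result via TextEncoder: one char per UTF-8 byte, then btoa().
function base64Modern(text) {
  const bytes = new TextEncoder().encode(text);
  let binString = "";
  for (const byte of bytes) {
    binString += String.fromCharCode(byte);
  }
  return btoa(binString);
}

const sample = "中文";
console.log(base64Legacy(sample)); // "5Lit5paH"
console.log(base64Legacy(sample) === base64Modern(sample)); // true
```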

Upvotes: 1

lastr2d2

Reputation: 3968

This works for me:

decodeURIComponent(atob(btoa(encodeURIComponent("中文"))))
// Output: 中文

As for \uDA7B in your case: it fails because it is a high surrogate (D800-DBFF), which is meaningful only as part of a surrogate pair.

That's why you get a URIError when you do

encodeURIComponent('\uDA7B') // throws URIError

Pair it with a character from the low surrogates (DC00-DFFF) and it works:

encodeURIComponent('\uDA7B\uDC01')
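
If the input may contain unpaired surrogates and they can't be fixed at the source, one option is to replace them before encoding. A minimal sketch, assuming that substituting U+FFFD (the replacement character) is acceptable, which does change the text; the function name encodeLossy is hypothetical:

```javascript
// Matches a high surrogate not followed by a low one,
// or a low surrogate not preceded by a high one.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

// Hypothetical helper: never throws URIError, but is lossy for lone surrogates.
function encodeLossy(text) {
  return encodeURIComponent(text.replace(LONE_SURROGATE, "\uFFFD"));
}

console.log(encodeLossy("\uDA7B")); // "%EF%BF%BD"

// A valid surrogate pair is left untouched and round-trips.
console.log(decodeURIComponent(encodeLossy("\uDA7B\uDC01")) === "\uDA7B\uDC01"); // true
```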

Upvotes: -1

Lino Contreras

Reputation: 89

You could try setting the download attribute and using URL encoding with text/plain.

const myString = '⌀怴ꁴ㥍䯖챻巏ܛ肜怄셀겗孉贜짥孍ಽ펾曍㩜䝺捄칡⡴얳锭劽嫍ᯕ�';

const link = document.createElement('a');
link.setAttribute('download', 'filename');
link.append("Download!");
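// encodeURIComponent also escapes characters such as '#', which encodeURI leaves as-is and which would cut the data short
link.setAttribute('href', 'data:text/plain;charset=UTF-8,' + encodeURIComponent(myString));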

document.body.appendChild(link);

Upvotes: -1
