TextDecoder.prototype.ignoreBOM not working as expected

Question

I am using the Fetch API to pull CSV data and am trying to create a CSV file from it.

However, I see that the first two characters are ÿþ, which are BOM characters.

During decoding I set ignoreBOM: true, but it's not working and I always see ÿþ at the start of the CSV.

Below is my code:

    const fetchData = await newPage.evaluate(async () => {
      let url = $('.csvLink').attr('href');
      console.log(`in here is the ${url}`);
      const stream = await fetch(url);
      let reader = stream.body.getReader();
      let receivedLength = 0;
      while(true) {
        const {done, value} = await reader.read();
        if (done) {
          break;
        }
        receivedLength += value.length;
        let v = new TextDecoder("ISO-8859-1", {fatal: false, ignoreBOM: false}).decode(value, {stream: true});
        await window.writeToStream(v, false);
      }

Any help to fix this would be really great. Thanks.

Kaiido · Accepted Answer

ignoreBOM only applies to the UTF-8, UTF-16BE and UTF-16LE encodings.

If you have a BOM at the beginning of your file, then it's probably not encoded as CP-1252 (which is what the "ISO-8859-1" label maps to in TextDecoder) but rather as some form of UTF. And if that BOM reads as ÿþ in CP-1252, then it's most likely UTF-16LE:

const littleEnd_BOM = new Uint8Array( [ 0xFF, 0xFE ] );
const as_CP1252 = new TextDecoder( 'iso-8859-1' ).decode( littleEnd_BOM );

console.log( as_CP1252 ); // "ÿþ"

So the first thing wrong in your code is that you shouldn't initialize your TextDecoder as CP-1252, but as UTF-16LE.
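If the encoding isn't known ahead of time, one option is to look at the first bytes before picking a decoder. The following is only a rough sketch: `sniffBOM` is a hypothetical helper (not part of any API), it checks only the three encodings ignoreBOM applies to, and the sample bytes are made up for illustration.

```javascript
// Hypothetical helper: guess the encoding label from a leading BOM, if any.
function sniffBOM(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return 'utf-8';
  }
  if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
    return 'utf-16le';
  }
  if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
    return 'utf-16be';
  }
  return null; // no BOM: the encoding has to be known some other way
}

// Made-up sample: "hi" encoded as UTF-16LE, preceded by the FF FE BOM
const sample = new Uint8Array([0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00]);
const label = sniffBOM(sample);
// ignoreBOM defaults to false, so the BOM is stripped from the decoded text
const text = new TextDecoder(label).decode(sample);
console.log(label, JSON.stringify(text)); // utf-16le "hi"
```

In a streamed download the check would have to run on the first chunk only, and that chunk could in principle be shorter than the BOM, so treat this as a starting point rather than a complete detector.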

Now, there is some confusion about which value you set ignoreBOM to: at one point you wrote that you set it to true, while in the code snippet it's set to false (the default).

If you want the BOM to stay in the output, set it to true. The decoder will then ignore the BOM's special meaning and treat it as a normal character sequence.

If, on the other hand, you want it removed from the output, leave it as false. The decoder will then treat the BOM specially and strip it from the output.

Note that even when the BOM is kept in the decoded string, it may not be visible when that string is printed, so check the first char code instead:

const UTF16LE_text = new Uint16Array(
      [ ..."\ufeffhello" ].map( (char) => char.charCodeAt(0) )
  );
// to check we really wrote a LE-BOM "FFFE"
const BOM = [ ...new Uint8Array( UTF16LE_text.buffer ).slice( 0, 2 ) ]
  .map( (v) => v.toString( 16 ) ).join('');
console.log( 'BOM:', BOM );

const ignoring_decoder = new TextDecoder( 'UTF-16LE', { ignoreBOM: true } );
const ignored = ignoring_decoder.decode( UTF16LE_text );
console.log( 'ignoreBOM:true  - text:', ignored );
console.log( 'ignoreBOM:true  - char at 0:', ignored.charCodeAt( 0 ) );

const removing_decoder = new TextDecoder( 'UTF-16LE' );
const removed = removing_decoder.decode( UTF16LE_text );
console.log( 'ignoreBOM:false - text:', removed );
console.log( 'ignoreBOM:false - char at 0:', removed.charCodeAt( 0 ) );

Another issue you may face is that you are decoding the fetched text chunk by chunk as it arrives, that is, in arbitrarily sized chunks of data.
Text can't be processed this way: a chunk boundary can fall in the middle of a character, so decoding has to start from a well-defined byte position to give correct results.

Luckily, the TextDecoder.decode() method has a stream option. With it, the decoder should be able to read the stream correctly, but for this option to work you need to create the TextDecoder outside of your while loop so that it can keep the leftover bytes from the previous chunk in memory.

const fetchData = await newPage.evaluate(async () => {
  let url = $('.csvLink').attr('href');
  const stream = await fetch(url);
  let reader = stream.body.getReader();
  let receivedLength = 0;
  // declare the decoder outside of the loop
  const decoder = new TextDecoder("UTF-16LE");
  while(true) {
    const {done, value} = await reader.read();
    if (done) {
      // value is undefined once the stream is done;
      // flush whatever bytes the decoder is still holding
      const rest = decoder.decode();
      if (rest) await window.writeToStream(rest, false);
      break;
    }
    receivedLength += value.length;
    // always use the same decoder
    const v = decoder.decode(value, {stream: true});
    await window.writeToStream(v, false);
  }
});
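
To see why the decoder has to be shared, here is a small self-contained sketch with made-up bytes. It assumes the text "é!" in UTF-16LE with a BOM, cut into two chunks so that the two bytes of "é" land in different chunks; the chunk contents and variable names are only for illustration.

```javascript
// "é" (U+00E9) in UTF-16LE is the byte pair E9 00; split it across two chunks
const chunks = [
  new Uint8Array([0xFF, 0xFE, 0xE9]), // BOM + first byte of "é"
  new Uint8Array([0x00, 0x21, 0x00]), // second byte of "é" + "!"
];

// A new decoder per chunk: every call starts from scratch,
// so the bytes left over from the previous chunk are lost
const perChunk = chunks
  .map((chunk) => new TextDecoder('utf-16le').decode(chunk, { stream: true }))
  .join('');

// One shared decoder: leftover bytes are carried over to the next call
const shared = new TextDecoder('utf-16le');
let streamed = '';
for (const chunk of chunks) {
  streamed += shared.decode(chunk, { stream: true });
}
streamed += shared.decode(); // flush at the end of the input

console.log(JSON.stringify(streamed)); // "é!"
console.log(perChunk === streamed);    // false
```

The per-chunk version pairs the bytes up differently after the cut, which is the kind of corruption you would otherwise only notice occasionally, depending on where the network happens to split the response.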
