James Paterson

Reputation: 2915

Fixing Facebook JSON Encoding in Node.js

I'm trying to decode the JSON that Facebook gives you when you download your data, using Node.js. The data is full of odd Unicode escapes that don't make sense at first glance. Example:

"messages": [
    {
      "sender_name": "Emily Chadwick",
      "timestamp_ms": 1480314292125,
      "content": "So sorry that was in my pocket \u00f0\u009f\u0098\u0082\u00f0\u009f\u0098\u0082\u00f0\u009f\u0098\u0082",
      "type": "Generic"
    }
]

Which should decode as So sorry that was in my pocket 😂😂😂. Using fs.readFileSync(filename, "utf8") gets me So sorry that was in my pocket ððð instead, which is mojibake.

This question mentions that it's screwed up latin1 encoding, and that you can encode to latin1 and then decode to utf8. I tried to do that with:

import iconv from 'iconv-lite';
function readFileSync_fixed(filename) {
    var content = fs.readFileSync(filename, "binary");
    return iconv.decode(iconv.encode(content, "latin1"), "utf-8")
}
console.log(JSON.parse(readFileSync_fixed(filename)))

But I still get the mojibake version. Can anyone point me in the right direction? I'm unfamiliar with how iconv works in regard to this.

Upvotes: 1

Views: 1388

Answers (2)

dyooreen

Reputation: 11

There is a very simple solution for this.

First, install the utf8 package:

npm i utf8

Your code will then look like this:

const fs = require('fs');
const utf8 = require('utf8');
let rawdata = fs.readFileSync('JSON_FILE_NAME');
let data = JSON.parse(rawdata);

for (let i = 0;i < data["messages"].length;i++) {
    if (data["messages"][i]["content"] != undefined) {
        console.log(utf8.decode(data["messages"][i]["content"]))
    }
} 

Upvotes: 1

James Paterson

Reputation: 2915

Solved... in a way. If there's a better way to do it, let me know.

So, here's the amended function

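function readFacebookJson(filename) {
    const content = fs.readFileSync(filename, "utf8");
    // Parse the file contents; the strings inside still need fixEncoding().
    return JSON.parse(content);
}

// Encode the string back to its latin1 bytes, then decode those bytes as UTF-8.
function fixEncoding(string) {
    return iconv.decode(iconv.encode(string, "latin1"), "utf8");
}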

It wasn't readFileSync() screwing things up, it was JSON.parse(): that is the step where each \u00XX escape turns into a character. So we read the file as utf8 like usual, but we then need to do the latin1 encoding/decoding on the strings that are now properties of the parsed object, not on the whole JSON file before it's parsed. I did this with a map().
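To see why the fix has to happen after parsing, here is a minimal sketch using one emoji from the question. It uses Node's built-in Buffer in place of iconv-lite (my substitution, not what the answer above uses), and the variable names are just for illustration.

```javascript
// The JSON text as it sits on disk: one \u00XX escape per UTF-8 byte of the emoji.
const raw = String.raw`"\u00f0\u009f\u0098\u0082"`;

// JSON.parse turns each escape into one character, so the result is
// four characters (U+00F0, U+009F, U+0098, U+0082), not one emoji.
const parsed = JSON.parse(raw);
console.log(parsed.length); // 4

// Encoding those four characters as latin1 recovers the original four bytes,
// and decoding the bytes as UTF-8 yields the emoji.
const fixed = Buffer.from(parsed, "latin1").toString("utf8");
console.log(fixed === "\u{1F602}"); // true
```

Before parsing, the file only contains plain ASCII backslash sequences, so a latin1 round trip on the raw text changes nothing. That is presumably why the attempt in the question still produced mojibake.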

const messages = readFacebookJson(filename).messages.map(message => {
    const toReturn = message;
    toReturn.sender_name = fixEncoding(toReturn.sender_name);
    if (typeof message.content !== "undefined") {
        toReturn.content = fixEncoding(message.content);
    }
    return toReturn;
});

The issue here is of course that some properties might be missed, so make sure you know which properties contain text that needs fixing.
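One possible way around missed properties is to let JSON.parse() apply the fix to every string value through its reviver argument. This is only a sketch: parseFacebookJson is a hypothetical helper name, fixEncoding here uses Node's Buffer rather than iconv-lite, and the sample is the message from the question.

```javascript
// Hypothetical helper: reinterpret a mis-decoded string as UTF-8
// (Buffer-based stand-in for the iconv-lite version).
function fixEncoding(value) {
    return Buffer.from(value, "latin1").toString("utf8");
}

// Hypothetical helper: parse the JSON text and fix every string value,
// however deeply nested, via the JSON.parse reviver callback.
function parseFacebookJson(text) {
    return JSON.parse(text, (key, value) =>
        typeof value === "string" ? fixEncoding(value) : value
    );
}

// Sample taken from the question; String.raw keeps the backslashes literal.
const sample = String.raw`{"messages":[{"sender_name":"Emily Chadwick","timestamp_ms":1480314292125,"content":"So sorry that was in my pocket \u00f0\u009f\u0098\u0082","type":"Generic"}]}`;

const data = parseFacebookJson(sample);
console.log(data.messages[0].content);
```

Two caveats, as far as I can tell: the reviver only touches values, so property names that contain non-ASCII text would stay garbled, and any string that was already correct and contains characters above U+00FF would be damaged by the latin1 step.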

Upvotes: 2
