How to find out the charset of a text file loaded via input[type="file"] in JavaScript

I want to read a user's file and give them back a modified version of it. I use an input of type file to get the text file, but how can I determine the charset of the loaded file, given that it can vary from case to case? The uploaded file is a .txt or something similar, not .html :)

var handler = document.getElementById('handler');
var reader = new FileReader();

handler.addEventListener('click', function() {
    reader.readAsText(firstSub.files[0], /* the correct charset should go here */);
});

reader.addEventListener("loadend", function() {
    console.dir(reader.result.split('\n'));
});

Upvotes: 5

Views: 11405

Answers (3)

gignu

Reputation: 2485

The other solutions didn't work for what I was trying to do, so I decided to create my own module that can detect the charset and language of text files.

You load it via the <script> tag and then use the languageEncoding function to retrieve the charset/encoding:

// index.html
<script src="https://unpkg.com/detect-file-encoding-and-language/umd/language-encoding.min.js"></script>

// app.js
languageEncoding(file).then(fileInfo => console.log(fileInfo));
// Possible result: { language: english, encoding: UTF-8, confidence: { language: 0.96, encoding: 1 } }

For a more complete example and instructions, see the module's documentation.
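To tie this back to the question, here is a minimal sketch of feeding the detected encoding into the actual decoding step. The helper name `readWithDetectedEncoding` and its `fallback` parameter are hypothetical. The `detect` argument is assumed to behave like `languageEncoding` above, that is, to resolve to an object with an `encoding` property. It also assumes the reported name is a label that `TextDecoder` accepts; if it is not, the sketch falls back to UTF-8.

```javascript
// Hypothetical helper: decode a File/Blob with whatever encoding `detect` reports.
// `detect` is assumed to take the file and resolve to { encoding: '...' }.
async function readWithDetectedEncoding(file, detect, fallback = 'utf-8') {
    const info = await detect(file);
    const label = (info && info.encoding) || fallback;
    const bytes = await file.arrayBuffer();
    let decoder;
    try {
        decoder = new TextDecoder(label);
    } catch (e) {
        // TextDecoder throws a RangeError for labels it does not know.
        decoder = new TextDecoder(fallback);
    }
    return decoder.decode(bytes);
}

// Possible usage in the browser, with the script from this answer loaded:
// readWithDetectedEncoding(input.files[0], languageEncoding).then(text => console.log(text));
```

Passing the detector in as an argument keeps the helper independent of the global that the script tag defines, so another detector could be swapped in.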

Upvotes: 1

Roman Karagodin

Reputation: 859

In my case (a small web app that accepts .srt subtitle files and strips the time codes and line breaks to produce printable text), it was enough to handle two encodings: UTF-8 and CP1251. In every file I tried, with both Latin and Cyrillic letters, these two were sufficient. I first try to decode the file as UTF-8. If that fails, the bytes that are not valid UTF-8 show up in the result as '�' (U+FFFD, the replacement character). So I check the result for this character and, if it is present, repeat the read with CP1251. Here is my code:

function onFileInputChange(inputDomElement, utf8 = true) {
    const file = inputDomElement.files[0];
    const reader = new FileReader();
    // Register the handler before starting the read.
    reader.onload = () => {
        const result = reader.result;
        // '�' (U+FFFD) in the result means some bytes were not valid UTF-8.
        if (utf8 && result.includes('�')) {
            console.log('The file encoding is not UTF-8! Trying CP1251...');
            onFileInputChange(inputDomElement, false);
        } else {
            // Title from the file name: drop the extension, turn underscores into spaces, uppercase.
            const title = file.name.replace(/\.(srt|txt)$/, '').replace(/_+/g, ' ').toUpperCase();
            document.querySelector('#textarea1').value = title + '\n' + result;
        }
    };
    reader.readAsText(file, utf8 ? 'UTF-8' : 'CP1251');
}
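One caveat with the '�' check is that it would also trigger on a valid UTF-8 file that happens to contain U+FFFD. A possible variant of the same two-step idea, sketched here as an alternative and not taken from the code above, is to decode the raw bytes with `TextDecoder` and its `fatal: true` option, which throws on malformed input instead of inserting replacement characters. The helper name `decodeUtf8OrCp1251` is hypothetical.

```javascript
// Hypothetical helper: try strict UTF-8 first, then fall back to windows-1251 (CP1251).
// `bytes` may be an ArrayBuffer or a typed array such as Uint8Array.
function decodeUtf8OrCp1251(bytes) {
    try {
        // fatal: true makes decode() throw a TypeError on malformed UTF-8
        // instead of silently substituting U+FFFD.
        const text = new TextDecoder('utf-8', { fatal: true }).decode(bytes);
        return { encoding: 'UTF-8', text };
    } catch (e) {
        const text = new TextDecoder('windows-1251').decode(bytes);
        return { encoding: 'CP1251', text };
    }
}

// Possible usage with a file input:
// file.arrayBuffer().then(buf => console.log(decodeUtf8OrCp1251(buf)));
```

Like the original approach, this only distinguishes UTF-8 from one assumed fallback encoding; it does not detect arbitrary charsets.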

Upvotes: 7

Chetan Jadhav CD

Reputation: 1146

You should check out the encoding.js library.

They also have a working demo. I would suggest you first try it out with the files that you'll typically work with to see if it detects the encoding correctly and then use the library in your project.

Upvotes: 4
