Reputation: 19263
What is the fastest, easiest tool or method to convert text files between character sets?
Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa.
Anything goes: one-liners in your favorite scripting language, command-line tools or other OS utilities, web sites, etc.
On Linux/UNIX/OS X/cygwin:
GNU iconv, suggested by Troels Arvin, is best used as a filter. It seems to be universally available. Example:
$ iconv -f UTF-8 -t ISO-8859-15 in.txt > out.txt
As pointed out by Ben, there is an online converter using iconv.
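For anyone who prefers a script over a shell filter, the same conversion can be sketched in Python. This is only a minimal sketch: the function name convert_file and its parameters are my own, and it assumes the whole file fits in memory.

```python
from pathlib import Path


def convert_file(src, dst, from_enc="utf-8", to_enc="iso-8859-15"):
    """Decode src using from_enc and write it to dst encoded as to_enc."""
    # Work on raw bytes so that line endings pass through unchanged.
    text = Path(src).read_bytes().decode(from_enc)
    # Raises UnicodeEncodeError if a character does not exist in to_enc.
    Path(dst).write_bytes(text.encode(to_enc))
```

Swapping the two encoding arguments gives the reverse direction, e.g. convert_file("out.txt", "back.txt", "iso-8859-15", "utf-8").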
recode (manual) suggested by Cheekysoft will convert one or several files in-place. Example:
$ recode UTF8..ISO-8859-15 in.txt
This one uses shorter aliases:
$ recode utf8..l9 in.txt
Recode also supports surfaces which can be used to convert between different line ending types and encodings:
Convert newlines from LF (Unix) to CR-LF (DOS):
$ recode ../CR-LF in.txt
Base64 encode file:
$ recode ../Base64 in.txt
You can also combine them.
Convert a Base64-encoded UTF-8 file with Unix line endings to a Base64-encoded Latin-1 file with DOS line endings:
$ recode utf8/Base64..l1/CR-LF/Base64 file.txt
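To make it clearer what stacking those surfaces does, here is a rough Python sketch of the same three steps (strip the Base64 layer, change charset and line endings, re-apply Base64). The function name recode_base64 is made up for illustration, and unlike recode this may not wrap the Base64 output into lines.

```python
import base64


def recode_base64(data, from_enc="utf-8", to_enc="iso-8859-1"):
    """Base64(from_enc text, LF endings) -> Base64(to_enc text, CR-LF endings)."""
    # 1. Remove the Base64 surface and decode the charset.
    text = base64.b64decode(data).decode(from_enc)
    # 2. Normalize existing CR-LF first so no line ends up with CR-CR-LF.
    text = text.replace("\r\n", "\n").replace("\n", "\r\n")
    # 3. Encode to the target charset and re-apply the Base64 surface.
    return base64.b64encode(text.encode(to_enc))
```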
On Windows with PowerShell (Jay Bazuzi):
PS C:\> gc -en utf8 in.txt | Out-File -en ascii out.txt
(No ISO-8859-15 support though; it says that supported charsets are unicode, utf7, utf8, utf32, ascii, bigendianunicode, default, and oem.)
Do you mean ISO-8859-1 support? Using "String" as the encoding does this, e.g. for the reverse direction:
gc -en string in.txt | Out-File -en utf8 out.txt
Note: The possible enumeration values are "Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, Ascii".
Upvotes: 623
Views: 637539
Reputation: 3247
NOTE: THIS WILL OVERWRITE YOUR ORIGINAL FILE. MAKE A BACKUP FIRST.
Upvotes: 2
Reputation: 1386
If you have vim, you can use the command below (not tested for every encoding).
The cool part about this is that you don't have to know the source encoding:
vim +"set nobomb | set fenc=utf8 | x" filename.txt
Be aware that this command modifies the file in place.
+ : Used by vim to enter a command directly when opening a file. Usually used to open a file at a specific line: vim +14 file.txt
| : Separator between multiple commands (like ; in bash)
set nobomb : No UTF-8 BOM
set fenc=utf8 : Set the new encoding to UTF-8
x : Save and close the file
filename.txt : Path to the file
" : The quotes are there because of the pipes (otherwise bash would treat them as shell pipes)
Upvotes: 115
Reputation: 6412
Stand-alone utility approach
iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt
-f ENCODING the encoding of the input
-t ENCODING the encoding of the output
You don't have to specify either of these arguments. They will default to your current locale, which is usually UTF-8.
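The "default to the current locale" behaviour can be illustrated with a small Python sketch. The helper name convert_bytes is hypothetical; locale.getpreferredencoding is the standard-library call that reports the encoding the locale prefers.

```python
import locale


def convert_bytes(data, from_enc=None, to_enc=None):
    """Re-encode data; a missing encoding falls back to the locale's default."""
    # False: only query the locale, do not change it as a side effect.
    default = locale.getpreferredencoding(False)
    text = data.decode(from_enc or default)
    return text.encode(to_enc or default)
```

Relying on the default is convenient interactively, but in scripts that run on other machines it is probably safer to name both encodings explicitly.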
Upvotes: 305
Reputation: 2029
Assuming you don't know the input encoding and still wish to automate most of the conversion, I put together this one-liner by combining previous answers.
iconv -f "$(chardetect input.text | awk '{print $2}')" -t utf-8 -o output.text input.text
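If chardetect is not installed, a cruder alternative is to try a short list of candidate encodings in order. Note that this is a different technique from statistical detection, and the names guess_encoding and to_utf8 are my own. The candidate list is an assumption you would adapt to your files.

```python
def guess_encoding(data, candidates=("utf-8", "iso-8859-15")):
    """Return the first candidate that decodes data without error, else None."""
    # Order matters: ISO-8859-15 accepts every byte value, so it must come
    # after stricter encodings such as UTF-8 or it would always win.
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


def to_utf8(data, candidates=("utf-8", "iso-8859-15")):
    """Re-encode data as UTF-8 using the first candidate encoding that fits."""
    enc = guess_encoding(data, candidates)
    if enc is None:
        raise ValueError("none of the candidate encodings fit the data")
    return data.decode(enc).encode("utf-8")
```

A successful decode only shows that the bytes are valid in that encoding, not that the guess is right, so it is worth checking the result by eye.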
Upvotes: 8
Reputation: 11471
There is also a web tool to convert file encoding: https://webtool.cloud/change-file-encoding
It supports a wide range of encodings, including some rare ones, like IBM code page 37.
Upvotes: 1
Reputation: 3698
In PowerShell:
function Recode($InCharset, $InFile, $OutCharset, $OutFile) {
    # Read input file in the source encoding
    $Encoding = [System.Text.Encoding]::GetEncoding($InCharset)
    $Text = [System.IO.File]::ReadAllText($InFile, $Encoding)
    # Write output file in the destination encoding
    $Encoding = [System.Text.Encoding]::GetEncoding($OutCharset)
    [System.IO.File]::WriteAllText($OutFile, $Text, $Encoding)
}
Recode Windows-1252 "$pwd\in.txt" utf8 "$pwd\out.txt"
For a list of supported encoding names:
https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding
Upvotes: 1
Reputation: 3698
Try EncodingChecker
File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.
File Encoding Checker requires .NET 4 or above to run.
For encoding detection, File Encoding Checker uses the UtfUnknown Charset Detector library. UTF-16 text files without byte-order-mark (BOM) can be detected by heuristics.
Upvotes: 5
Reputation: 507
If macOS GUI applications are your bread and butter, SubEthaEdit is the text editor I usually go to for encoding-wrangling — its "conversion preview" allows you to see all invalid characters in the output encoding, and fix/remove them.
And it's open-source now, so yay for them 😉.
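For those without the editor, the idea behind such a preview (finding the characters that will not survive in the target encoding) can be approximated in a few lines of Python. This is my own sketch, not how SubEthaEdit does it, and the function name is invented.

```python
def unencodable_chars(text, encoding="iso-8859-15"):
    """Return (index, character) pairs that cannot be encoded in `encoding`."""
    problems = []
    for index, ch in enumerate(text):
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            problems.append((index, ch))
    return problems
```

An empty list means the text can be converted without loss. Otherwise the indexes tell you where to fix or remove characters first.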
Upvotes: 0
Reputation: 31508
The character encoding of all matching text files is detected automatically, and all matching text files are converted to utf-8 encoding:
$ find . -type f -iname '*.txt' -exec sh -c 'iconv -f "$(file -bi "$1" | sed -e "s/.*[ ]charset=//")" -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;
To perform these steps, a subshell sh is used with -exec, running a one-liner with the -c flag, and passing the filename as the positional argument "$1" with -- {}. In between, the utf-8 output file is temporarily named converted.
Whereby file -bi means:
-b, --brief
Do not prepend filenames to output lines (brief mode).
-i, --mime
Causes the file command to output mime type strings rather than the more traditional human readable ones. Thus it may say for example text/plain; charset=us-ascii rather than ASCII text.
The sed command cuts this to only us-ascii, as is required by iconv.
The find command is very useful for such file management automation.
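The same batch job can be sketched in Python for systems without find, file, and iconv. Instead of calling file to detect the charset, this sketch simply assumes one legacy encoding (from_enc) for anything that is not already valid UTF-8. The function name and parameters are my own, and the pattern match may be case-sensitive, unlike -iname.

```python
import os
import tempfile
from pathlib import Path


def convert_tree_to_utf8(root, from_enc="iso-8859-15", pattern="*.txt"):
    """Re-encode files under root that match pattern to UTF-8, in place."""
    converted = []
    # sorted() lists the files up front, before anything is rewritten.
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        data = path.read_bytes()
        try:
            data.decode("utf-8")
            continue  # already valid UTF-8 (plain ASCII included): skip
        except UnicodeDecodeError:
            pass
        text = data.decode(from_enc)
        # Write to a temporary file in the same directory and then swap it
        # in, so an interrupted run never leaves a half-written file.
        fd, tmp = tempfile.mkstemp(dir=path.parent)
        with os.fdopen(fd, "wb") as handle:
            handle.write(text.encode("utf-8"))
        os.replace(tmp, path)
        converted.append(path)
    return converted
```

The temporary file is created with restrictive permissions, so file modes may differ from the original after the swap. As with the shell version, make a backup before running it on anything important.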
Upvotes: 17
Reputation: 91
Simply change the encoding of the loaded file in the IntelliJ IDEA IDE, on the right of the status bar (bottom), where the current charset is indicated. It prompts you to Reload or Convert; use Convert. Make sure you have backed up the original file in advance.
Upvotes: 1
Reputation: 2220
My favorite tool for this is jEdit (a Java-based text editor), which has two very convenient features:
Upvotes: 0
Reputation: 1705
Use this Python script: https://github.com/goerz/convert_encoding.py Works on any platform. Requires Python 2.7.
Upvotes: 1
Reputation: 23939
With ruby:
ruby -e "File.write('output.txt', File.read('input.txt').encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: ''))"
Source: https://robots.thoughtbot.com/fight-back-utf-8-invalid-byte-sequences
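A related clean-up in Python, for comparison. Note that this is not an exact port: as far as I can tell the Ruby one-liner treats the input as binary and so drops every non-ASCII byte, while this sketch keeps valid UTF-8 characters and only removes (or marks) invalid byte sequences. The function name scrub_utf8 is made up.

```python
def scrub_utf8(data, drop_invalid=True):
    """Return data as valid UTF-8, dropping or marking invalid sequences."""
    # "ignore" silently removes bad bytes; "replace" substitutes U+FFFD
    # (the replacement character) so the damage stays visible.
    errors = "ignore" if drop_invalid else "replace"
    return data.decode("utf-8", errors=errors).encode("utf-8")
```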
Upvotes: 1
Reputation: 452
DOS/Windows: use the code page
chcp 65001>NUL
type ascii.txt > unicode.txt
The chcp command can be used to change the code page. Code page 65001 is Microsoft's name for UTF-8. After the code page is set, the output generated by the commands that follow will use that code page.
Upvotes: 6
Reputation: 1226
To write Java properties files, I normally use this on Linux (Mint and Ubuntu distributions):
$ native2ascii filename.properties
For example:
$ cat test.properties
first=Execução número um
second=Execução número dois
$ native2ascii test.properties
first=Execu\u00e7\u00e3o n\u00famero um
second=Execu\u00e7\u00e3o n\u00famero dois
PS: I wrote "Execution number one/two" in Portuguese to force special characters.
In my case, the first time I ran it I received this message:
$ native2ascii teste.txt
The program 'native2ascii' can be found in the following packages:
* gcj-5-jdk
* openjdk-8-jdk-headless
* gcj-4.8-jdk
* gcj-4.9-jdk
Try: sudo apt install <selected package>
After I installed the first option (gcj-5-jdk), the problem was solved.
I hope this helps someone.
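If installing a JDK just for this is too much, the escaping shown in the example output can be approximated in Python. This is a sketch of the visible behaviour only (non-ASCII characters become \uXXXX escapes), not a reimplementation of the real tool, and the function name is borrowed purely for illustration.

```python
def native2ascii(text):
    """Replace every non-ASCII character with Java-style \\uXXXX escapes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Java escapes are UTF-16 code units, so characters outside the
        # Basic Multilingual Plane become two escapes (a surrogate pair).
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)
```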
Upvotes: 1
Reputation: 628
On Windows I was able to use Notepad++ to do the conversion from ISO-8859-1 to UTF-8. Click "Encoding" and then "Convert to UTF-8".
Upvotes: 19
Reputation: 19829
I've put this into .bashrc:
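utf8()
{
    # Convert into a temp file; replace the original only if iconv succeeded.
    # Quoting "$1" keeps filenames with spaces intact.
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.tmp" &&
    mv "$1.tmp" "$1"
}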
...to be able to convert files like so:
utf8 MyClass.java
Upvotes: 19
Reputation: 16423
iconv -f FROM-ENCODING -t TO-ENCODING file.txt
Also there are iconv-based tools in many languages.
Upvotes: 24
Reputation: 46506
Get-Content -Encoding UTF8 FILE-UTF8.TXT | Out-File -Encoding UTF7 FILE-UTF7.TXT
The shortest version, if you can assume that the input BOM is correct:
gc FILE.TXT | Out-File -en utf7 file-utf7.txt
Upvotes: 24
Reputation: 35580
Under Linux you can use the very powerful recode command to convert between different charsets and to fix line-ending issues. recode -l will show you all of the formats and encodings that the tool can convert between. It is likely to be a VERY long list.
Upvotes: 40