Reputation: 4202
Hi, I'm trying to create new files with the find command in bash on Ubuntu.
I can easily list the files and I know how to create a new file from each one; however, I don't want the original encoding carried over.
Right now I'm using this command:
find ./Polish\ 2\ \(copy\)/ -name '*.txt' -type f -exec sh -c 'cat <"$0" >"$0.txt"' {} \;
However, if a file is, for example, not in UTF-8 format, I'd still want the new file $0.txt to be written in UTF-8.
I came upon this idea from doing this manually: in my case, gedit saves as UTF-8 by default. However, with over 30,000 files to process, I don't want to do this by hand.
Are there any solutions that use only the default built-in tools?
The file may be edited in place instead of creating a separate file as I did in my example.
Also, what happens when iconv is used to convert a file that is already in UTF-8 format?
In the end, I'd love to have all the files without a BOM.
Upvotes: 3
Views: 4277
Reputation: 437090
There's no unambiguous method for identifying a file's character encoding by its contents alone, so the best you can do is to assume the most likely input encoding (CP1252, as you state) when you convert to UTF-8 using iconv; to avoid converting files that are already UTF-8-encoded, you can use file to detect them:
Note: For simplicity, I've changed find's target directory to . (the current directory).
find . -type f -name '*.txt' -exec bash -c '
  descr=$(file -b "$0")
  if [[ $descr != *UTF-8* ]]; then
    iconv -f CP1252 -t UTF-8 "$0" > "$0.$$" && mv "$0.$$" "$0"
  elif [[ $descr == *"with BOM"* ]]; then
    tail -c +4 "$0" > "$0.$$" && mv "$0.$$" "$0"
  fi
' {} \;
Note: If you convert this command to a single-line statement, you'll need additional ; instances, namely after the descr=... statement, the iconv ... statement, and the tail ... statement.
Note: file's -b option is not POSIX-compliant, and the standard also doesn't prescribe mentioning a file's encoding or BOM presence in the output.
In practice, however, the above should work on both Linux and macOS/BSD systems.
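As for what happens if iconv is run on a file that is already UTF-8 while being told the input is CP1252: it doesn't fail, it silently double-encodes every non-ASCII character. Here's a minimal sketch of that effect; the sample character "é" and the variable names are my own assumptions for illustration, not taken from your files.

```shell
# "é" encoded as UTF-8 is the two bytes C3 A9 (octal 303 251).
original=$(printf '\303\251')

# Misreading those bytes as CP1252 turns each byte into its own character
# (C3 -> Ã, A9 -> ©), and each of those is then written as two UTF-8 bytes.
twice=$(printf '%s' "$original" | iconv -f CP1252 -t UTF-8)

# Compare the byte counts before and after the bogus conversion.
printf '%s' "$original" | wc -c
printf '%s' "$twice" | wc -c
```

The second count is twice the first, which is why the command checks for UTF-8 before converting.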
A UTF-8 "BOM" (Unicode signature, as used primarily on Windows) is 3 bytes long, so if file detects one in the input file, tail -c +4 skips it, outputting a "BOM-less" UTF-8 file.
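To see the BOM removal in isolation, here's a small sketch that builds a throwaway file starting with the UTF-8 BOM bytes (EF BB BF) and strips them; the temporary file and its "hello" content are assumptions made up for the demo.

```shell
# Create a sample file: 3-byte UTF-8 BOM (EF BB BF) followed by "hello".
tmp=$(mktemp)
printf '\357\273\277hello\n' > "$tmp"

# tail -c +4 starts output at byte 4, i.e. right after the 3-byte BOM.
stripped=$(tail -c +4 "$tmp")
printf '%s\n' "$stripped"

# Clean up the sample file.
rm -f "$tmp"
```

Note that tail -c +4 drops the first 3 bytes unconditionally, so it should only be applied to files in which a BOM was actually detected.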
Upvotes: 6