Baklap4

Reputation: 4202

Bulk-convert non-UTF-8 and UTF-8-with-BOM files to UTF-8

Hey, I'm trying to create new files with the find command in bash on Ubuntu.

I can easily list the files and know how to create a new file from each of them; however, I don't want the original encoding to come with it.

Right now I'm using this command: find ./Polish\ 2\ \(copy\)/ -name '*.txt' -type f -exec sh -c 'cat <"$0" >"$0.txt"' {} \; However, if a file is, for example, not in UTF-8 format, I'd still want the new file $0.txt to be written in UTF-8.

I came upon this idea because whenever I do this manually:

  1. I open the non-UTF-8 file in gedit.
  2. Copy the contents.
  3. Create a new blank file.
  4. Open it with gedit.
  5. Paste the copied contents into the file and save.

The default behavior of gedit, in my case, is to save as UTF-8. However, with over 30,000 files to process, I don't want to do this manually.

Any solutions using default built-in tools?

EDIT

The file may be edited in place instead of creating a separate file, as I did in my example.

Also, what happens when trying to convert a file with iconv if the file is already in UTF-8 format?

EDIT 2.0

I'd love to end up with all the files without a BOM.

Upvotes: 3

Views: 4277

Answers (1)

mklement0

Reputation: 437090

There's no unambiguous way to identify a file's character encoding from its contents alone, so the best you can do is to assume the most likely input encoding (CP1252, as you state) when converting to UTF-8 with iconv. To avoid converting files that are already UTF-8-encoded, you can use file to detect them:
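To see what file reports for each case, you can create a few probe files (a quick sketch; the exact wording varies between file versions, the comments show typical GNU file output):

```shell
# Probe files in the three encodings of interest (octal escapes for portability):
printf 'caf\303\251\n'           > utf8.txt     # UTF-8, no BOM ("café")
printf '\357\273\277caf\303\251\n' > utf8bom.txt  # UTF-8 preceded by a BOM (EF BB BF)
printf 'caf\351\n'               > cp1252.txt   # CP1252 / Latin-1

file -b utf8.txt      # e.g. "UTF-8 Unicode text"
file -b utf8bom.txt   # e.g. "UTF-8 Unicode (with BOM) text"
file -b cp1252.txt    # e.g. "ISO-8859 text"
```

The command below keys off exactly these substrings: "UTF-8" and "with BOM".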

Note: For simplicity, I've changed find's target directory to . (the current directory).

find . -type f -name '*.txt' -exec bash -c '
  descr=$(file -b "$0")                    # brief description of the file'\''s type/encoding
  if [[ $descr != *UTF-8* ]]; then         # not UTF-8 yet: transcode from CP1252
    iconv -f CP1252 -t UTF-8 "$0" > "$0.$$" && mv "$0.$$" "$0"
  elif [[ $descr == *"with BOM"* ]]; then  # already UTF-8, but with BOM: strip the BOM
    tail -c +4 "$0" > "$0.$$" && mv "$0.$$" "$0"
  fi
' {} \;

Note: If you convert this command to a single-line statement, you'll need additional ; instances, namely after:
the descr=... assignment, the iconv ... command, and the tail ... command.
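Concretely, the single-line form looks like this (same logic as above; try it on a copy of your files first):

```shell
find . -type f -name '*.txt' -exec bash -c 'descr=$(file -b "$0"); if [[ $descr != *UTF-8* ]]; then iconv -f CP1252 -t UTF-8 "$0" > "$0.$$" && mv "$0.$$" "$0"; elif [[ $descr == *"with BOM"* ]]; then tail -c +4 "$0" > "$0.$$" && mv "$0.$$" "$0"; fi' {} \;
```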

Note:

  • file's -b option is not POSIX-compliant, and the standard also doesn't require a file's encoding or BOM presence to be mentioned in the output.
    In practice, however, the above should work on both Linux and macOS/BSD systems.

  • A UTF-8 "BOM" (Unicode signature, as used primarily on Windows) is 3 bytes long, so if it is detected in the input file via -file, tail -c +4 skips it, outputting a "BOM-less" UTF-8 file.

Upvotes: 6
