Freundchen

Reputation: 23

Linux shell: conditional conversion of character encoding for multiple text files

The situation: I have around 20000 text files (.csv, to be precise) that differ in character encoding: file -i *.csv reports charset=us-ascii for most of them, but charset=utf-16le for some.

The goal: I want them all to be encoded the same way, us-ascii in this case. I am thinking of a one-liner that checks the encoding of each file in the directory and, if it is utf-16le, converts the file to us-ascii.

I only started to learn bash programming a few days ago, so this one still escapes me. Is it possible to do something like running file -i on each file (I got that far), capturing its output, checking which encoding is reported and, if it is not us-ascii, converting the file?

Thanks for helping me understand how to do that!

Upvotes: 0

Views: 4298

Answers (3)

rzymek

Reputation: 9281

This will convert any non-us-ascii encoded *.csv files to us-ascii:

#!/bin/bash
for f in *.csv;do
    # Ask file about the current loop file and keep only the charset name
    charset=$(file -i "$f" | grep -o 'charset=.*' | cut -d= -f2)
    if [ "$charset" != "us-ascii" ];then
      echo "$f $charset -> us-ascii"
      iconv -f "$charset" -t us-ascii < "$f" > "$f.tmp" \
        && mv "$f.tmp" "$f"
    fi
done
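
A variant of the same idea, as a minimal sketch: it wraps the per-file logic in a function (convert_to_ascii is a made-up name, not part of the script above) and asks file for the encoding alone with -b --mime-encoding instead of parsing the -i output. It assumes iconv accepts the charset name that file prints, which is not guaranteed for every encoding.

```shell
#!/bin/bash
# convert_to_ascii is a hypothetical helper: convert one file to us-ascii
# in place if `file` reports any other encoding.
convert_to_ascii() {
    local f=$1 charset
    # -b drops the file name, --mime-encoding prints only the charset
    charset=$(file -b --mime-encoding "$f") || return 1
    if [ "$charset" = "us-ascii" ]; then
        return 0    # already us-ascii, nothing to do
    fi
    echo "$f: $charset -> us-ascii"
    # Write to a temporary file first; replace the original only on success
    if iconv -f "$charset" -t us-ascii < "$f" > "$f.tmp"; then
        mv -- "$f.tmp" "$f"
    else
        rm -f -- "$f.tmp"
        return 1
    fi
}

# Usage: for f in *.csv; do convert_to_ascii "$f"; done
```

Removing the temporary file on failure keeps a half-converted copy from being left behind when iconv stops at a character it cannot convert.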

Upvotes: 1

flaschenpost

Reputation: 2235

The other solutions don't deal with the mixture of encodings among the files, which calls for a per-file check along these lines:

for F in *.csv; do
    if [ "$(file -bi "$F" | awk '{print $2}')" = "charset=utf-16le" ]; then
        iconv -f UTF-16 -t US-ASCII "$F" > "u.$F"
    fi
done

The check matters: us-ascii and utf-16 agree only in the code points of the first 128 characters, not in their bytes (utf-16 uses two bytes per character), so running a file that really is us-ascii through iconv -f UTF-16 would garble it.
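
To see how the two encodings relate at the byte level, here is a small sketch. hex_bytes is a helper name made up for this example, and it assumes od, tr and iconv are available:

```shell
#!/bin/bash
# hex_bytes is a hypothetical helper: print stdin as a string of hex bytes
hex_bytes() { od -An -tx1 | tr -d ' \n'; echo; }

# The letter A as us-ascii: a single byte
printf 'A' | hex_bytes                                   # prints 41

# The same letter as utf-16le: two bytes, the second one is zero
printf 'A' | iconv -f US-ASCII -t UTF-16LE | hex_bytes   # prints 4100
```

The code point is the same in both cases (0x41), but the byte sequences differ, so a file has to be read with the encoding it was actually written in.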

Upvotes: 2

Bill

Reputation: 5764

Please try the following command:

iconv -f FROM-ENCODING -t TO-ENCODING *.csv

and replace FROM-ENCODING and TO-ENCODING with appropriate values. Note that this prints the converted contents of all files, one after another, to standard output.

To convert each file separately, you can use the following script, or something similar that fits your needs:

for file in  *.csv
do
    iconv -f FROM-ENCODING -t TO-ENCODING "$file" > "$file.new"
done
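
As a sketch of how the placeholders might be filled in, the loop can be wrapped in a function that takes the directory and both encodings as arguments. convert_all and its parameters are hypothetical names chosen for this example; the originals are left untouched and each result goes to NAME.csv.new:

```shell
#!/bin/bash
# convert_all is a hypothetical helper: convert every .csv file in a
# directory from one encoding to another, writing NAME.csv.new next to
# each original.
convert_all() {
    local dir=$1 from=$2 to=$3 file
    for file in "$dir"/*.csv; do
        [ -e "$file" ] || continue      # the glob matched nothing
        if ! iconv -f "$from" -t "$to" "$file" > "$file.new"; then
            echo "conversion failed: $file" >&2
            rm -f -- "$file.new"        # drop the partial output
        fi
    done
}

# Example call: convert_all . UTF-16 US-ASCII
```

Because every file is read with the same source encoding, this only fits a directory where all files share that encoding.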

You can also use the recode command:

recode FROM-ENCODING..TO-ENCODING file.csv

Finally, have a look at the question "Best way to convert text files between character sets?" if you are interested in learning more about iconv and/or recode.

Upvotes: 1
