James Brown

Reputation: 37404

Bash and file command sample size

I'm writing a bash script to process some files automatically, and one subtask is to use iconv to re-encode source files whose encoding is not to my liking. For that I use:

enc=$(file -b --mime-encoding "$file")                   # get the encoding

if [ "$enc" = "iso-8859-1" ] || [ "$enc" = "us-ascii" ]  # no need to encode these
then                                                     
    unset enc
fi

if [[ -n "$enc" ]]                                       # conditional re-encoding
then
    iconv -f "$enc" -t iso-8859-1 "$file"
else
    cat "$file"
fi |
    awk '{ print }  # code to process file further' > "$newfile"

The problem is that I have a file which is UTF-8, but file wrongly identifies it as ASCII. The first non-ASCII character is character #314206, on line #1028. Apparently file only examines a sample of the input: if I convert the file from fixed width to character delimited, the first non-ASCII character becomes char #80872 and file detects the encoding correctly. So I guess the sample size lies somewhere between those two values.

(TL;DR) Is there a way to instruct file to take a larger sample or read the whole source file, or some other bash friendly way of finding out the encoding?

I played around with file -P but couldn't change the outcome with it. man file didn't help me any further, and googling "file command sample size" was not very promising.

(If you wonder about the conditional approach: there are other tasks to process as well, not shown in the code sample.)

Upvotes: 4

Views: 72

Answers (1)

randomir

Reputation: 18697

By default, file analyzes only the first 1048576 bytes (1 MiB) of the file.

An option to control this limit was added in commit d04de269, and it has been available in file since version 5.26 (2016-04-16). The limit is set through the -P option, using the parameter named bytes:

-P, --parameter name=value
    Set various parameter limits.
        Name         Default    Explanation
        ...
        bytes        1048576    max number of bytes to read from file

So, you can just set the bytes limit to the size of your largest file, e.g. 100 MiB:

$ file -P bytes=104857600 file
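
Instead of hard-coding a limit, the script could pass each file's own size as the bytes value. The following is only a minimal sketch under a few assumptions: file is version 5.26 or later (so that -P bytes=N is accepted), and the helper names file_size and detect_encoding are hypothetical, introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: let file(1) sample the whole input when detecting its encoding.
# Assumes file >= 5.26, where "-P bytes=N" is available.

# file_size PATH: print the size of PATH in bytes, but never less than 1.
# (hypothetical helper; wc -c is used because it is widely portable)
file_size() {
    local n
    n=$(wc -c < "$1") || return 1
    n=$((n + 0))                 # drop padding that some wc versions emit
    [ "$n" -gt 0 ] || n=1        # avoid passing a zero limit for empty files
    printf '%s\n' "$n"
}

# detect_encoding PATH: print the MIME encoding reported by file,
# with the sample limit raised to the size of PATH itself.
detect_encoding() {
    local size
    size=$(file_size "$1") || return 1
    file -b --mime-encoding -P bytes="$size" "$1"
}
```

In the script from the question, this would replace the first line with enc=$(detect_encoding "$file"). If the result does not change, it may be worth checking the parameter table in man file on your system, since newer releases may list further limits that affect encoding detection.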

Upvotes: 5
