Reputation: 37404
I'm writing a bash script to process some files automatically, and one subtask is to use iconv
to re-encode source files if their encoding is not to my liking. For that I use:
enc=$(file -b --mime-encoding "$file") # get the encoding
if [ "$enc" = "iso-8859-1" ] || [ "$enc" = "us-ascii" ] # no need to encode these
then
unset enc
fi
cat "$file" | # conditional encoding below
( [[ "${enc}" ]] && iconv -f "$enc" -t iso-8859-1 || cat ) |
awk '{# code to process file further}' > "$newfile"
The problem is that I have a file which is UTF-8, but file falsely recognizes it as ASCII. The first non-ASCII character is character #314206, which is on line #1028. Apparently file only looks at a sample of limited size: for example, if I convert the file from fixed width to character delimited, the first non-ASCII character is char #80872 and file recognizes the encoding correctly. So I guess the sample size lies somewhere between those two values.
(TL;DR)
Is there a way to instruct file to take a larger sample or read the whole source file, or is there some other bash-friendly way of finding out the encoding?
I played around with file -P but couldn't affect the outcome with that. man file didn't help me any further, and googling "file command sample size" was not very promising.
(If you wonder about the conditional approach: there are some other tasks to process that are not shown in the code sample.)
Upvotes: 4
Views: 72
Reputation: 18697
By default, file will only analyze the first 1048576 bytes of the file.
An option to control this limit was added in commit d04de269, and it's available in file since version 5.26 (2016-04-16). The limit is set with the -P option, using the parameter named bytes:
-P, --parameter name=value
        Set various parameter limits.

        Name     Default    Explanation
        ...
        bytes    1048576    max number of bytes to read from file
So, you can just set the bytes limit to the size of your largest file, e.g. 100 MB:
$ file -P bytes=104857600 file
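Instead of hard-coding a limit, the script could size the limit per file, and it could also fall back to scanning every byte itself where file is too old for -P bytes. The following is only a sketch under assumptions: the helper names encoding_via_file and encoding_via_scan are made up for illustration, the fallback relies on tr, wc and iconv being installed, and it labels anything that is neither pure ASCII nor valid UTF-8 as unknown-8bit rather than trying to tell ISO-8859-1 from other 8-bit encodings. I believe newer file releases also list a separate encoding limit in the same table of man file, so it is worth checking whether that one needs raising too on your system.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of file): ask file to read the whole input
# by setting the bytes limit to the file's own size. Needs file >= 5.26.
encoding_via_file() {
    local f=$1 size
    # Arithmetic expansion strips the padding some wc implementations print.
    size=$(( $(wc -c < "$f") ))
    # Avoid passing a zero limit for an empty file.
    [ "$size" -gt 0 ] || size=1
    file -b --mime-encoding -P "bytes=$size" "$f"
}

# Hypothetical helper: classify by looking at every byte, without file.
# It does not detect binary data; it only separates ASCII, UTF-8 and "other".
encoding_via_scan() {
    local f=$1 non_ascii
    # Delete all bytes in 0x00-0x7F and count what is left.
    non_ascii=$(( $(LC_ALL=C tr -d '\000-\177' < "$f" | wc -c) ))
    if [ "$non_ascii" -eq 0 ]; then
        echo us-ascii
    elif iconv -f UTF-8 -t UTF-8 < "$f" > /dev/null 2>&1; then
        # iconv exits non-zero when the input is not valid UTF-8.
        echo utf-8
    else
        echo unknown-8bit
    fi
}
```

In the script from the question, enc=$(encoding_via_scan "$file") could then replace the call to file, with iconv run only when the result is utf-8. The scan reads the file twice in the non-ASCII case, which is the price for not depending on any sample size.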
Upvotes: 5