Reputation: 42910
I have a handful of UTF-8-encoded text files (containing Japanese text) and added them to a Subversion repository.
To my surprise, one of them got the auto-property svn:mime-type set to application/octet-stream, while the others did not get any specific encoding information.
The files are valid UTF-8; the file command reports "UTF-8 Unicode text, with CRLF line terminators" for all of them.
What is going on here? How does Subversion decide if a file should be treated as binary or not?
Upvotes: 4
Views: 620
Reputation: 42910
I found the explanation in the Subversion sources, in a comment in svn_io_is_binary_data:
/* Right now, this function is going to be really stupid. It's
going to examine the block of data, and make sure that 15%
of the bytes are such that their value is in the ranges 0x07-0x0D
or 0x20-0x7F, and that none of those bytes is 0x00. If those
criteria are not met, we're calling it binary.
NOTE: Originally, I intended to target 85% of the bytes being in
the specified ranges, but I flubbed the condition. At any rate,
folks aren't complaining, so I'm not sure that it's worth
adjusting this retroactively now. --cmpilato */
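With Japanese text in UTF-8, most code points take three bytes, each of which is >= 0x80. Those bytes fall outside both ranges, so a file that consists almost entirely of Japanese text ends up with fewer than 15% of its bytes counted as text and is flagged as binary.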
The reason more of my files did not trigger this behavior is that they contain a small preamble of characters in the ASCII range.
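To make the effect concrete, here is a minimal Python sketch of the heuristic as the quoted comment describes it. It is not the actual Subversion code: the names is_texty and looks_binary are hypothetical, the exact boundary comparison and the handling of empty input are assumptions, and it looks at whatever bytes it is given rather than at a specific block of the file.

```python
def is_texty(b: int) -> bool:
    """True if the byte lies in one of the ranges named in the comment."""
    return 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F


def looks_binary(data: bytes, min_text_ratio: float = 0.15) -> bool:
    """Approximate the heuristic from the quoted comment (assumption-based)."""
    if not data:
        return False  # assumption: an empty block is not treated as binary
    if 0 in data:
        return True   # any 0x00 byte means binary
    text_bytes = sum(1 for b in data if is_texty(b))
    # Binary if fewer than 15% of the bytes fall in the "text" ranges.
    return text_bytes / len(data) < min_text_ratio


# Eight Japanese characters (3 bytes each) plus CRLF: 2 of 26 bytes are "texty".
japanese_only = "日本語のテキスト\r\n".encode("utf-8") * 10

# The same text after a short ASCII preamble: 19 of 43 bytes are "texty".
with_preamble = ("# Title: sample\r\n" + "日本語のテキスト\r\n").encode("utf-8")

print(looks_binary(japanese_only))   # True
print(looks_binary(with_preamble))   # False
```

In the first sample only the CR and LF bytes count as text (about 8%), which is below the 15% mark. In the second, the ASCII preamble lifts the share to about 44%, which matches what I saw with my files.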
Upvotes: 4