Anders Lindahl

Reputation: 42910

Why does Subversion give some of my UTF-8 text files content type "application/octet-stream"?

I have a handful of UTF-8-encoded text files (with text in Japanese), and added them to a Subversion repository.

To my surprise, one of them got the auto-property svn:mime-type set to application/octet-stream, while the others did not get any specific encoding information.

The files are valid UTF-8; the file command reports "UTF-8 Unicode text, with CRLF line terminators" for all of them.

What is going on here? How does Subversion decide if a file should be treated as binary or not?

Upvotes: 4

Views: 620

Answers (1)

Anders Lindahl

Reputation: 42910

I found the explanation in the Subversion sources, in svn_io_is_binary_data:

/* Right now, this function is going to be really stupid.  It's
  going to examine the block of data, and make sure that 15%
  of the bytes are such that their value is in the ranges 0x07-0x0D
  or 0x20-0x7F, and that none of those bytes is 0x00.  If those
  criteria are not met, we're calling it binary.

  NOTE:  Originally, I intended to target 85% of the bytes being in
  the specified ranges, but I flubbed the condition.  At any rate,
  folks aren't complaining, so I'm not sure that it's worth
  adjusting this retroactively now.  --cmpilato  */

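With Japanese text in UTF-8, most code points are encoded as three bytes, each of which is >= 0x80, so they fall outside both ranges. A file consisting mostly of such text therefore fails the 15% criterion and is flagged as binary.

Most of my files did not trigger this behavior because they start with a small preamble of characters in the ASCII range, which is enough to lift them over the threshold.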
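
To make the effect concrete, here is a minimal Python sketch of the heuristic as the quoted comment describes it. It is not the Subversion code itself: the function name looks_binary, the treatment of an empty block as text, and the exact handling of the 15% boundary are my assumptions, and the sample strings are made up for illustration.

```python
def looks_binary(block: bytes, min_text_fraction: float = 0.15) -> bool:
    """Approximate the check described in the svn_io_is_binary_data comment.

    Hypothetical helper: a block is called binary if it contains a 0x00
    byte, or if fewer than min_text_fraction of its bytes fall in the
    ranges 0x07-0x0D or 0x20-0x7F.
    """
    if not block:
        return False  # assumption: an empty block is treated as text
    if 0 in block:
        return True   # any NUL byte means binary
    text_like = sum(
        1 for b in block if 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F
    )
    return text_like / len(block) < min_text_fraction


# Made-up sample: three lines of Japanese text with CRLF line endings.
# Each line is 33 bytes of multi-byte UTF-8 plus 2 bytes for CR and LF,
# so only 2 of every 35 bytes land in the "text-like" ranges.
body = "日本語のテキストです。\r\n" * 3

# Made-up ASCII preamble, standing in for the one in my other files.
preamble = "# encoding: UTF-8 sample\r\n"

print(looks_binary(body.encode("utf-8")))
print(looks_binary((preamble + body).encode("utf-8")))
```

Under these assumptions, the pure Japanese sample is classified as binary, while the same text preceded by a short ASCII preamble is not, which matches what I saw in my repository.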

Upvotes: 4
