Harm

Reputation: 777

Diacritic characters in filenames causing differences between subversion and git (MacOS)

I have filenames with diacritic characters (e.g. Exposé.pdf).

$ svn stat
!    Exposé.pdf
?    Exposé.pdf

I am using Subversion and Git side by side (not git-svn). I am migrating from Subversion to Git and want the two to coexist for a while, so I have large repositories on multiple devices. When I clone a repository with Git and then add the already existing Subversion .svn folder to the clone, svn reports differences (! item is missing, ? item is not under version control). The filenames appear to be exactly the same, but under the hood they are not! I have tried the following (see https://www.git-tower.com/help/mac/faq-and-tips/faq/unicode-filenames):

git config --global core.precomposeunicode true 

but that does not make any difference. Any clues?

Upvotes: 2

Views: 762

Answers (1)

torek

Reputation: 489608

The "multiple devices" part is likely the problem. Exactly what the fix or workaround should be is not clear. See the technical details below.

In general, you should not set core.precomposeunicode yourself, in the same way that you should not set core.ignorecase yourself.[1] These settings, along with core.symlinks, are something that Git sets by itself to record how your computer behaves, at the time you run git init or git clone.[2] If you have set this with --global, I would recommend that you remove the setting from your personal Git configuration:

git config --global --unset core.precomposeunicode

The reason to unset this globally is that setting a value with --global disables the auto-sense feature in new repositories.

When autosensing is enabled, you can always clone an existing repository to a new copy. The new clone will have the correct (local) setting for the immediate local conditions. This new clone should not be transported from one machine to another by any means other than git clone.


[1] These can be spelled with any capitalization you like. The Git documentation uses camelCase, calling them core.precomposeUnicode and core.ignoreCase. You can set them for specific testing purposes or for weird edge cases where you want to deal with a repository that was built in some sort of undesirable way. But this amounts to lying to Git, so be careful with it! Do it locally (not globally) while experimenting.

[2] There's another special case here. The OSes that have these ... "features" of doing harm to your file names, in the name of shielding you from ugly reality, often actually do this on a per-file-system basis. The case folding feature of MacOS, for instance, is changeable at the time you build a disk image. Symlink support on Windows depends on the version of Windows and several additional items. So it's possible to pick up a Git repository intact, move it to a different file system, and then need to change the settings. This is one reason it's often wiser to git clone from one file system to another, rather than using tar or rar or zip or even cp -r to move a Git repository: the clone will set the settings correctly, while the non-clone copy operation won't.


File names are byte strings, except when they're not

The fundamental problem here is that Git wants to believe that file names are nothing but byte strings with two or three constraints,[3] established by Linux, and no other constraints established by any other OS. These byte strings generally should be, but are not required to be, valid UTF-8 sequences as well. Ideally, the OS will let Git use these byte strings as-is, unmolested.

On Windows and MacOS, this ideal immediately crashes hard into reality. The most obvious and immediate problem is that on Linux, you can create a file named README and then create a second, different file called readme, and both files will coexist. On Windows and MacOS, the moment you create either of these files, you can no longer create the second file: any attempt to do so just re-uses the first one.

In other words, Linux has case-sensitive file names, while Windows and MacOS don't. This means a Linux user is free to create README.txt and readme.txt files and put both into a single repository. The Windows or MacOS user who clones this repository is unable to work with both files at the same time.
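As a rough illustration of this clash, here is a minimal Python sketch that finds names in a list of paths that would collide on a case-insensitive file system. The function name case_collisions is hypothetical, and str.casefold() is only an approximation of what any particular file system actually does.

```python
from collections import defaultdict


def case_collisions(paths):
    """Return groups of paths that differ only by letter case.

    Hypothetical helper: casefold() approximates, but is not identical to,
    the comparison a real case-insensitive file system performs.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    # Only groups with more than one spelling are a problem.
    return [names for names in groups.values() if len(names) > 1]


print(case_collisions(["README.txt", "readme.txt", "src/main.c"]))
# [['README.txt', 'readme.txt']]
```

Any group this reports is a set of files that a Linux user can commit together but that a Windows or MacOS user cannot have in the work-tree at the same time.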

Nonetheless, a Git user on Windows or MacOS can work with these files. It's just painful to do so. I show a method in my answer to "Changes not staged for commit" even after git commit -am b/c origin has a file with de-capitalize filename. This same method will apply here, with equal amounts of pain.

This same rule applies to certain Unicode file names. In particular, Unicode has multiple ways to spell some accented characters such as á, ü, and so on. For instance, if we have a file named schön (pretty), we can spell that using the letter sequence:

s c h umlaut-o n

(each of which is a single Unicode code point), or we can spell it using:

s c h o combining-umlaut n

These are different byte sequences and therefore should, according to Git at least, be different files, even though both will display as the name schön on your screen.
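The two spellings can be inspected with Python's standard unicodedata module. This is only a sketch to make the byte-level difference visible; the variable names are mine, and the strings are built from explicit code points so that the editor's own encoding cannot blur the distinction.

```python
import unicodedata

# "s c h umlaut-o n": the o-umlaut is the single code point U+00F6.
precomposed = "sch\u00f6n"
# "s c h o combining-umlaut n": plain o followed by U+0308 COMBINING DIAERESIS.
decomposed = "scho\u0308n"

print(precomposed == decomposed)      # False
print(precomposed.encode("utf-8"))    # b'sch\xc3\xb6n'
print(decomposed.encode("utf-8"))     # b'scho\xcc\x88n'

# Normalizing both to one form (NFC = composed, NFD = decomposed)
# makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```

Printing the UTF-8 bytes of the names that svn stat and git status report is a quick way to check whether two names that look identical really are.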

MacOS says, in effect: "these two names will display the same, and therefore I will not allow one of them." If you supply the "wrong" spelling to the OS, it will either correct it or simply reject it. Note that this is somewhat different from the case-folding situation: MacOS will permit you to create either readme or README, but not both. For schön, it will only ever allow one of the two spellings.

Because Git builds new commits from the index, not from the file system, and the index is an ordinary data file, you can put either desired spelling, or even both, into the index. This means you can put either or both into new commits. Any existing commits have the existing spelling(s) and cannot be changed.

Loading existing commits (via git checkout) copies the committed spelling into the index, where it remains as-is. The core.precomposeunicode setting tells Git whether and how your OS will modify file names when Git tries to copy files from the index to the work-tree. Git can then try to undo any damage, if appropriate. But not all cases can be handled, especially those where the file appears in a commit with both spellings, much like case-folding in README vs readme.
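To find the unhandleable case, where one name is present under both spellings, a sketch like the following groups names by their NFC form. The helper name normalization_collisions and the sample list are made up for illustration; in practice the names would come from something like the output of git ls-files, which is an assumption about your workflow rather than part of the code.

```python
import unicodedata
from collections import defaultdict


def normalization_collisions(names):
    """Map each NFC form to its distinct spellings, keeping only clashes.

    Hypothetical helper: it compares names only, and does not look at
    what any file system or Git itself would do with them.
    """
    groups = defaultdict(set)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].add(name)
    return {
        canonical: sorted(spellings)
        for canonical, spellings in groups.items()
        if len(spellings) > 1
    }


# Sample data: both spellings of one name, plus an unrelated file.
names = ["sch\u00f6n", "scho\u0308n", "README"]
clashes = normalization_collisions(names)
for canonical, spellings in clashes.items():
    # Show the raw bytes, since the spellings look identical on screen.
    print(canonical.encode("utf-8"), [s.encode("utf-8") for s in spellings])
```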

(See also Git's internal self-test for MacOS precompose-unicode, in t/t3910-mac-os-precompose.sh.)


[3] The constraints are:

  • no string begins or ends with a slash (the latter is sort of trivially handled by the fact that Git won't store a directory, and the former by just not using the leading slash if there is one);
  • no string has two slashes in a row; and
  • no string has an embedded NUL byte (this rule comes from the C language in which Git is written and is supported by these OSes, so it's not really a problem).

The slash rules exist because Linux treats slash as a directory/sub-directory or directory/file-name separator. MacOS of course does exactly the same, and Windows supports this with most of its interfaces, despite using backslash internally. So all three systems are happy with the slash limitation. However, some Windows file systems use UTF-16-LE internally as well, which creates an additional minefield around what are called Surrogate Escapes. I do not know how Windows deals with these. Ideally the minefield does not leak from internal to external interfaces, but then again, ideally, Windows would use forward slash and UTF-8. :-)
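The three constraints listed in this footnote can be written as a small check over byte strings. This is a sketch of those three rules only, not of Git's actual path validation, which may enforce more. The function name is hypothetical, and rejecting the empty name is my own assumption.

```python
def satisfies_listed_constraints(path: bytes) -> bool:
    """Check a path against the three constraints listed above.

    Hypothetical helper: this is not Git's real validation code.
    """
    if not path:
        # Assumption, not one of the listed rules: an empty name is useless.
        return False
    if path.startswith(b"/") or path.endswith(b"/"):
        return False    # no leading or trailing slash
    if b"//" in path:
        return False    # no two slashes in a row
    if b"\0" in path:
        return False    # no embedded NUL byte
    return True


print(satisfies_listed_constraints(b"docs/Expos\xc3\xa9.pdf"))   # True
print(satisfies_listed_constraints(b"docs//notes.txt"))          # False
```

Note that the check takes bytes rather than str: nothing here requires the name to be valid UTF-8, which matches the point that the name is just a byte string.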

Upvotes: 3
