chestozo

Reputation: 1301

git diff shows unicode symbols in angle brackets

I have a file containing Unicode characters (Russian text). When I fix a typo, I use git diff --color-words=. to see the changes I've made.

With Unicode (Cyrillic) characters I get a mess of angle brackets, like so:

$ cat p1
привет

$ cat p2
Привет

$ git diff --color-words=. --no-index p1 p2
diff --git 1/p1 2/p2
index d0f56e1..d84c480 100644
--- 1/p1
+++ 2/p2
@@ -1 +1 @@
<D0><BF><9F>ривет

It looks like git diff --color-words=. is comparing bytes, not characters as I expected.
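The byte-level reading can be confirmed outside of git. A small sketch, assuming the script itself is saved as UTF-8; the helper name utf8_hex is made up for illustration:

```shell
# Hypothetical helper: print the bytes of its argument as lowercase hex.
utf8_hex() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \t\n'
  echo
}

utf8_hex 'п'   # first letter of p1, prints d0bf
utf8_hex 'П'   # first letter of p2, prints d09f
```

Both letters start with the byte D0 and differ only in the second byte (BF versus 9F), which matches the <D0><BF><9F> fragments in the diff output above.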

Is there any way to tell git to work properly with Unicode characters?

Update about my environment: I get the same result on Mac OS and on a Linux host.

My shell vars are:

BASH=/bin/bash
HOSTTYPE=x86_64
LANG=ru_RU.UTF-8
OSTYPE=darwin10.0
PS1='\h:\W \u\$ '
SHELL=/bin/bash
SHELLOPTS=braceexpand:emacs:hashall:histexpand:history:interactive-comments:monitor
TERM=xterm-256color
TERM_PROGRAM=iTerm.app
_=-l

I have reset git config to default settings like so:

$ git config -l
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true
core.ignorecase=true

git version

$ git --version
git version 1.7.3.5

Upvotes: 25

Views: 7777

Answers (6)

klm123

Reputation: 12865

toolbear's answer didn't work for me: even with git --no-pager diff I still saw unreadable characters (not angle brackets, but garbage), so less was not the core problem.

I tried a lot of things, but the only one that helped was adding an explicit conversion from cp1251 (Cyrillic) to UTF-8 to .git\config (I'm using Windows 7):

[pager]
diff = iconv.exe -f cp1251 -t utf-8 | less  

Note that I changed pager.diff specifically, since I had encoding problems only with the diff command. For some reason log and reflog worked fine for me. If you have encoding problems with other commands too, change the pager for all commands, like this:

[core]
...
pager = iconv.exe -f cp1251 -t utf-8 | less 
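To see what the iconv stage of that pager pipeline does, here is a small sketch that feeds it cp1251 bytes directly. It assumes an iconv on PATH that recognizes the name cp1251 (on the Windows setup above the binary is iconv.exe); the function name is made up for illustration:

```shell
# Hypothetical helper: convert cp1251 text on stdin to UTF-8,
# the same conversion the pager setting above performs before less.
cp1251_to_utf8() {
  iconv -f cp1251 -t utf-8
}

# The six bytes below are "привет" encoded in cp1251 (octal escapes).
printf '\357\360\350\342\345\362\n' | cp1251_to_utf8
```

If the output is still unreadable in a UTF-8 terminal, the source encoding is probably not cp1251 and the -f argument needs to change.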

Upvotes: 1

Danny Lin

Reputation: 2300

On several platforms, setting LANG to C.UTF-8 (or en_US.UTF-8, etc.) works:

$ echo '人' >test1.txt && echo '丁' >test2.txt
$ LANG=C.UTF-8 git diff --no-index --word-diff=plain --word-diff-regex=. -- test1.txt test2.txt
diff --git a/test1.txt b/test2.txt
index 3ef0891..3773917 100644
--- a/test1.txt
+++ b/test2.txt
@@ -1 +1 @@
[-人-]{+丁+}

However, LANG doesn't seem to be honored on some platforms (such as Git for Windows):

$ echo '人' >test1.txt && echo '丁' >test2.txt
$ LANG=C.UTF-8 git diff --no-index --word-diff=plain --word-diff-regex=. -- test1.txt test2.txt
diff --git a/test1.txt b/test2.txt
index 3ef0891..3773917 100644
--- a/test1.txt
+++ b/test2.txt
@@ -1 +1 @@
<E4>[-<BA><BA>-]{+<B8><81>+}

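A workaround on these platforms is to pass git diff a regex written in raw bytes that matches one UTF-8 character, that is, one non-continuation byte followed by any number of continuation bytes (0x80-0xBF). For example, use $'[^\x80-\xBF][\x80-\xBF]*' in place of '.':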

$ echo '人' >test1.txt && echo '丁' >test2.txt
$ git diff --no-index --word-diff=plain --word-diff-regex=$'[^\x80-\xBF][\x80-\xBF]*' -- test1.txt test2.txt
diff --git a/test1.txt b/test2.txt
index 3ef0891..3773917 100644
--- a/test1.txt
+++ b/test2.txt
@@ -1 +1 @@
[-人-]{+丁+}
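The byte-regex workaround can be wrapped in a small function so it does not have to be retyped. This is only a sketch: the name chardiff is made up, the regex is built with printf octal escapes so it also works in shells without $'...' quoting, and forcing LC_ALL=C is an assumption of this sketch rather than part of the command above:

```shell
# Hypothetical helper: character-level word diff of two files.
chardiff() {
  # \200-\277 in octal is the UTF-8 continuation-byte range 0x80-0xBF.
  utf8_char=$(printf '[^\200-\277][\200-\277]*')
  # Assumption: LC_ALL=C keeps the regex engine in byte mode so the
  # raw bytes in the pattern are accepted as-is.
  LC_ALL=C git diff --no-index --word-diff=plain \
    --word-diff-regex="$utf8_char" -- "$1" "$2"
}
```

Usage is chardiff test1.txt test2.txt. Note that git diff --no-index exits with status 1 when the files differ, so a non-zero exit here is not an error.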

Upvotes: 4

Magomed Abdurakhmanov

Reputation: 1924

For me, the best solution was to set LESSCHARSET with export LESSCHARSET=utf-8.

With that set, both git log -p and git diff show Unicode without problems.

Upvotes: 21

chestozo

Reputation: 1301

The solution for me was to use git difftool.

I wrote this tool https://github.com/chestozo/dmp based on https://code.google.com/p/google-diff-match-patch/.

Sometimes it also gives a better diff than git diff --color-words=. :)

Upvotes: 3

user23987

Reputation:

For me, less (the git pager) was to blame (thanks @kostix). To check whether that is your case too, disable the pager altogether:

git --no-pager diff p1 p2

My case was commit messages containing emojis; it's fundamentally the same problem though.

$ git log --oneline
93a1866 <U+1F43C>

$ git --no-pager log --oneline
93a1866 🐼

$ export LESS='--raw-control-chars'
$ git log --oneline
93a1866 🐼

$ git config --global core.pager 'less --raw-control-chars'
$ git log --oneline
93a1866 🐼

NB: the uppercase --RAW-CONTROL-CHARS option (-R) makes less pass through only ANSI color escapes; it still munges other control chars (emoji included). The lowercase --raw-control-chars (-r) passes all control chars through. My less is globally configured with --RAW-CONTROL-CHARS and my git pager with --raw-control-chars as above.
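When it is unclear whether less is involved at all, it can help to check which pager command git resolves before changing any settings. A minimal sketch, assuming a git that supports git var GIT_PAGER; the function name show_git_pager is made up for illustration:

```shell
# Hypothetical helper: print the pager command git resolves from
# GIT_PAGER, core.pager, PAGER, or its built-in default.
show_git_pager() {
  git var GIT_PAGER
}

show_git_pager
```

If this prints a less command without -r or --raw-control-chars, and the LESS environment variable does not supply one either, the pager is a likely source of the escaped output.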

Upvotes: 39

frlan

Reputation: 7260

I have seen a lot of reports that xterm cannot print Unicode characters in some cases. That may at least be a starting point for a solution.

Upvotes: 0
