Reputation: 1301
I have a file with Unicode characters (Russian text).
When I fix a typo, I use git diff --color-words=. to see the changes I've made.
With Unicode (Cyrillic) characters I get a mess of angle brackets, like so:
$ cat p1
привет
$ cat p2
Привет
$ git diff --color-words=. --no-index p1 p2
diff --git 1/p1 2/p2
index d0f56e1..d84c480 100644
--- 1/p1
+++ 2/p2
@@ -1 +1 @@
<D0><BF><9F>ривет
It looks like git diff --color-words=. is comparing bytes, not characters as I would expect.
Is there any way to tell git to handle Unicode characters properly?
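The byte-level reading of that output can be checked directly. This is a minimal sketch, assuming a POSIX shell with od and tr available and a UTF-8 terminal; utf8_hex is a hypothetical helper name, not part of git:

```shell
# utf8_hex is a hypothetical helper: print the bytes of a string as hex.
utf8_hex() {
    printf '%s' "$1" | od -An -v -tx1 | tr -d '[:space:]'
    echo
}

utf8_hex 'п'   # d0bf
utf8_hex 'П'   # d09f
```

Both letters share the lead byte D0 and differ only in the second byte (BF vs 9F), which matches the <D0><BF><9F> fragments in the diff output above: a per-byte comparison keeps D0 and reports only the second byte as changed.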
UPD about my environment: I get the same result on both Mac OS and a Linux host.
My shell vars are:
BASH=/bin/bash
HOSTTYPE=x86_64
LANG=ru_RU.UTF-8
OSTYPE=darwin10.0
PS1='\h:\W \u\$ '
SHELL=/bin/bash
SHELLOPTS=braceexpand:emacs:hashall:histexpand:history:interactive-comments:monitor
TERM=xterm-256color
TERM_PROGRAM=iTerm.app
_=-l
I have reset git config to default settings like so:
$ git config -l
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true
core.ignorecase=true
Git version:
$ git --version
git version 1.7.3.5
Upvotes: 25
Views: 7777
Reputation: 12865
toolbear's answer didn't work for me: even with git --no-pager diff I still saw unreadable characters (not brackets, but unreadable), so less was not the core problem.
I tried a ton of things, but the only one that helped was adding an explicit conversion from Cyrillic (cp1251) to UTF-8 to .git\config (I'm using Windows 7):
[pager]
diff = iconv.exe -f cp1251 -t utf-8 | less
Note that I change pager.diff specifically here, since I had encoding problems only with the diff command. For some weird reason, log and reflog were working fine for me. But if you have encoding problems with other commands too, you should change the pager for all commands, like this:
[core]
...
pager = iconv.exe -f cp1251 -t utf-8 | less
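To see what the iconv stage of that pager does on its own, here is a minimal sketch that feeds it the cp1251 bytes for the word привет by hand. It assumes an iconv binary on PATH that recognizes the encoding name cp1251; cp1251_to_utf8 is a hypothetical wrapper name.

```shell
# cp1251_to_utf8 is a hypothetical wrapper around the same iconv call
# used in the pager setting above (iconv rather than iconv.exe outside Windows).
cp1251_to_utf8() {
    iconv -f cp1251 -t utf-8
}

# The six bytes EF F0 E8 E2 E5 F2, written as octal escapes for
# portability, spell the word in cp1251.
printf '\357\360\350\342\345\362' | cp1251_to_utf8   # prints: привет
echo
```

If the text reaching the pager is already UTF-8, this conversion would garble it instead, so it only makes sense when the diff output really is cp1251.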
Upvotes: 1
Reputation: 2300
On several platforms, setting LANG to C.UTF-8 (or en_US.UTF-8, etc.) works:
$ echo '人' >test1.txt && echo '丁' >test2.txt
$ LANG=C.UTF-8 git diff --no-index --word-diff=plain --word-diff-regex=. -- test1.txt test2.txt
diff --git a/test1.txt b/test2.txt
index 3ef0891..3773917 100644
--- a/test1.txt
+++ b/test2.txt
@@ -1 +1 @@
[-人-]{+丁+}
However, LANG doesn't seem to be honored on some platforms (such as Git for Windows):
$ echo '人' >test1.txt && echo '丁' >test2.txt
$ LANG=C.UTF-8 git diff --no-index --word-diff=plain --word-diff-regex=. -- test1.txt test2.txt
diff --git a/test1.txt b/test2.txt
index 3ef0891..3773917 100644
--- a/test1.txt
+++ b/test2.txt
@@ -1 +1 @@
<E4>[-<BA><BA>-]{+<B8><81>+}
A workaround on these platforms is to pass git diff a regex written in raw bytes that matches one UTF-8 character (e.g. $'[^\x80-\xBF][\x80-\xBF]*' in place of '.'):
$ echo '人' >test1.txt && echo '丁' >test2.txt
$ git diff --no-index --word-diff=plain --word-diff-regex=$'[^\x80-\xBF][\x80-\xBF]*' -- test1.txt test2.txt
diff --git a/test1.txt b/test2.txt
index 3ef0891..3773917 100644
--- a/test1.txt
+++ b/test2.txt
@@ -1 +1 @@
[-人-]{+丁+}
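The pattern works because of how UTF-8 is laid out: continuation bytes always fall in the range 0x80-0xBF, so any other byte starts a new character. Here is a minimal sketch of the same idea outside git, assuming od, tr, and grep are available; count_utf8_chars is a hypothetical helper, not a git feature:

```shell
# Hypothetical helper: count UTF-8 characters without any locale support.
count_utf8_chars() {
    # Put one hex byte per line, then count the bytes that are NOT
    # continuation bytes (continuation bytes start with hex 8, 9, a or b).
    printf '%s' "$1" | od -An -v -tx1 | tr -s '[:space:]' '\n' | grep -c '^[0-7c-f]'
}

count_utf8_chars '人'        # 1 (three bytes: e4 ba ba)
count_utf8_chars 'привет'    # 6 (twelve bytes)
```

Each counted byte corresponds to the [^\x80-\xBF] part of the regex, and the skipped bytes correspond to the [\x80-\xBF]* part.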
Upvotes: 4
Reputation: 1924
For me, the best solution is to set export LESSCHARSET=utf-8.
With this set, both git log -p and git diff show Unicode without problems.
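If exporting the variable for the whole session is not wanted, it can also be set for a single command. This is a minimal sketch; with_utf8_less is a hypothetical helper name:

```shell
# Hypothetical helper: run one command with LESSCHARSET=utf-8 in its
# environment, without changing the current shell's settings.
with_utf8_less() {
    LESSCHARSET=utf-8 "$@"
}

# Usage (needs a git repository):
#   with_utf8_less git log -p
```

The pager that git starts inherits the variable from the git process, so less should see it the same way as with a session-wide export.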
Upvotes: 21
Reputation: 1301
The solution for me was to use git difftool.
I wrote this tool, https://github.com/chestozo/dmp, based on https://code.google.com/p/google-diff-match-patch/.
Sometimes it also gives a better diff than git diff --color-words=. :)
Upvotes: 3
Reputation:
For me, less (the git pager) was to blame (thanks @kostix). Experiment by disabling the pager altogether:
git --no-pager diff p1 p2
My case was commit messages containing emojis; it's fundamentally the same problem though.
$ git log --oneline
93a1866 <U+1F43C>
$ git --no-pager log --oneline
93a1866 🐼
$ export LESS='--raw-control-chars'
$ git log --oneline
93a1866 🐼
$ git config --global core.pager 'less --raw-control-chars'
$ git log --oneline
93a1866 🐼
NB: the --RAW-CONTROL-CHARS option causes less to pass through ANSI color escapes, but it will still munge other control chars (emoji included). My less is globally configured with --RAW-CONTROL-CHARS and my git pager with --raw-control-chars, as above.
Upvotes: 39
Reputation: 7260
I have seen a lot of reports that xterm cannot print Unicode characters in some cases. That may at least be a starting point for a solution.
Upvotes: 0