augustin

Reputation: 14729

git word diff on non-english text

How to have a word for word diff on a human language text (in Chinese)?

I have some plain text in Chinese in a git repository. The text has been edited and I'd like to see which words have been added or removed. Each line in the file holds a whole paragraph of text, so a simple git diff is not enough: it tells us that something has changed in a certain number of paragraphs, but not which sentences or words have changed within them.

To make matters worse, as I said, the text is in Chinese. Unlike English and other Indo-European languages, Chinese does not use spaces as a word delimiter. The whole paragraph, together with its Chinese punctuation marks, forms a single block without any spaces. Thus, git diff --word-diff does not help at all.

Is there a way to have a human-readable diff between two versions of such a text in Chinese? Is there an equivalent of --word-diff for each character?

Upvotes: 1

Views: 944

Answers (3)

haolee

Reputation: 937

icdiff can meet your needs. When comparing Chinese text, this tool can show the differences word by word.

Upvotes: 1

augustin

Reputation: 14729

I am posting this as an answer to my own question. However, it contains only part of the solution, a pointer in the right direction. Something is still missing.

From the question "How can I visualize per-character differences in a unified diff file?", try either command:

git diff --word-diff-regex=. 
git diff --color-words=.  

Either of the two commands above gets me very close. However, I have two problems. If I simply type the command and look at the output in the console, I am only shown the beginning of each paragraph. The whole line does not fit in the console and git truncates the end (i.e. most of it!).

Or if I try to redirect to a file:

git diff --color-words=. > diff.patch

and then use vim to view the file, I get some scrambled mess which looks more like binary code than anything human-readable.

Update:
I finally used this solution:

wget https://git.kernel.org/cgit/git/git.git/plain/contrib/diff-highlight/diff-highlight --no-check-certificate 
chmod u+x diff-highlight
git diff --color=always | ./diff-highlight | less -R  

Upvotes: 1

Landys

Reputation: 7727

The word-by-word diff should work as described in your own answer. According to the documentation, the relationship between --word-diff-regex and --color-words is as follows.

--color-words[=<regex>]
  Equivalent to --word-diff=color plus (if a regex was specified) --word-diff-regex=<regex>.

Actually you can set the word-diff mode to porcelain to have a better view of the diff output in your console.

git diff --word-diff-regex=. --word-diff=porcelain

To redirect the output to a file, do not use --color-words (which sets --word-diff to color): git diff encodes the colors as escape sequences, which a plain text viewer does not interpret, hence the scrambled mess you got. Use --word-diff-regex=. on its own; the --word-diff mode then defaults to plain.

git diff --word-diff-regex=. > diff.patch
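
If a colored diff has already been saved to a file, the color codes can also be stripped afterwards. The following is only a minimal sketch, not something from the git documentation: strip_ansi is a hypothetical helper name, and it assumes the colors are ordinary ANSI sequences of the form ESC [ ... m, which is what git normally emits.

```shell
#!/bin/sh
# strip_ansi: hypothetical helper (not a git command).
# Reads text on standard input and removes ANSI color sequences
# of the form ESC [ <digits and semicolons> m.
strip_ansi() {
    esc=$(printf '\033')              # the literal ESC character
    sed "s/${esc}\[[0-9;]*m//g"
}

# Example use on a patch saved with --color-words
# (diff-plain.patch is an assumed output name):
# strip_ansi < diff.patch > diff-plain.patch
```

Alternatively, less -R diff.patch should display the saved colors directly instead of showing the raw escape sequences.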

Upvotes: 0
