Reputation: 14729
How to have a word for word diff on a human language text (in Chinese)?
I have some plain text in Chinese in a git repository. The text has been edited and I'd like to see which words have been added or removed. One line in the file represents a whole paragraph of text, so a simple git diff is not enough: we know that something has changed in a certain number of paragraphs, but we cannot see which sentences or words have changed in them.
To make matters worse, as I said, the text is in Chinese. Unlike English and other Indo-European languages, Chinese does not use spaces as a word delimiter. The whole paragraph, together with its Chinese punctuation marks, forms a single block without any spaces. Thus, git diff --word-diff does not help at all.
Is there a way to have a human-readable diff between two versions of such a text in Chinese? Is there an equivalent of --word-diff for each character?
Upvotes: 1
Views: 944
Reputation: 937
icdiff can meet your needs. When comparing Chinese text, this tool can show differences word by word.
Upvotes: 1
Reputation: 14729
I am posting this as an answer to my own question; however, it contains only part of the solution, a pointer in the right direction. Something is still missing.
From "How can I visualize per-character differences in a unified diff file?", try either command:
git diff --word-diff-regex=.
git diff --color-words=.
Either of the two commands above gets me very close. However, I have two problems. If I simply type the command and look at the output in the console, I am only shown the beginning of each paragraph. The whole line does not fit in the console and git truncates the end (i.e. most of it!).
Or if I try to redirect to a file:
git diff --color-words=. > diff.patch
and then use vim to view the file, I get some scrambled mess which looks more like binary code than anything human-readable.
Update:
I finally used this solution:
wget https://git.kernel.org/cgit/git/git.git/plain/contrib/diff-highlight/diff-highlight --no-check-certificate
chmod u+x diff-highlight
git diff --color=always | ./diff-highlight | less -R
Upvotes: 1
Reputation: 7727
The word-by-word diff should work as in your own answer. From the docs, the relationship between --word-diff-regex and --color-words is as follows.
--color-words[=<regex>]
Equivalent to --word-diff=color plus (if a regex was specified) --word-diff-regex=<regex>.
Actually, you can set the word-diff mode to porcelain to get a better view of the diff output in your console.
git diff --word-diff-regex=. --word-diff=porcelain
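To get a feel for what the porcelain mode prints, here is a minimal sketch you can run outside any repository. The file names and the two sample sentences are made up for illustration, and --no-index is used only so that no commit history is needed. It assumes a UTF-8 locale, so that the regex . matches a whole Chinese character rather than a single byte.

```shell
#!/bin/sh
# Sketch: per-character diff of two one-line Chinese "paragraphs".
# old.txt / new.txt and their contents are hypothetical sample data.
tmpdir=$(mktemp -d)
printf '我喜欢吃苹果。\n' > "$tmpdir/old.txt"
printf '我喜欢吃香蕉。\n' > "$tmpdir/new.txt"

# --no-index compares two plain files, so no repository is required.
# git diff exits with status 1 when the files differ, hence "|| true".
out=$(git diff --no-index --word-diff-regex=. --word-diff=porcelain \
      "$tmpdir/old.txt" "$tmpdir/new.txt" || true)

# In porcelain mode, unchanged runs start with a space, removed runs
# with "-", added runs with "+", and "~" on its own line marks a newline
# in the input.
printf '%s\n' "$out"
rm -rf "$tmpdir"
```

Because every run of characters sits on its own short line, this output is not cut off at the console width the way a whole paragraph is.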
To redirect the output to a file, you should not use --color-words (which sets --word-diff to color), because the color information that git diff encodes as escape sequences is not interpreted by a text editor; that is the scrambled mess you got. You can just use --word-diff-regex=. on its own, in which case the --word-diff mode defaults to plain.
git diff --word-diff-regex=. > diff.patch
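To make the "scrambled mess" concrete, the sketch below simulates a colored fragment and removes its escape sequences. strip_ansi is a hypothetical helper name, not something git provides, and the sample string is hand-written rather than real git output; it only assumes that the colors are ordinary ANSI sequences of the form ESC [ ... m.

```shell
#!/bin/sh
# strip_ansi: hypothetical helper that deletes ANSI color sequences
# (ESC, "[", digits and semicolons, "m") from standard input.
esc=$(printf '\033')
strip_ansi() {
    sed "s/${esc}\[[0-9;]*m//g"
}

# Hand-made stand-in for colored output: red "旧", then green "新".
sample=$(printf '\033[31m旧\033[m\033[32m新\033[m')

# After stripping, only the text itself remains.
cleaned=$(printf '%s\n' "$sample" | strip_ansi)
printf '%s\n' "$cleaned"
```

Note that once the colors are stripped, nothing distinguishes removed characters from added ones any more, which is why the plain mode, with its [-...-] and {+...+} markers, is the better choice for a file. Alternatively, a file that still contains the color codes can be viewed with less -R.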
Upvotes: 0