Reputation: 7533
I have two questions:
Is there any way to view a .docx file on GitHub? We have uploaded all of our assignments to GitHub, but there is no way to view them in the browser. It would be nice if we could view those .docx files in the browser without downloading them.
How can I use git diff on the .docx file format? I tried to use catdoc, but it didn't work for me. I think I have used git diff on Windows for the .doc format before, but it's not working for me on Mac.
Thanks a lot.
Upvotes: 21
Views: 32736
Reputation: 1324737
Is there any way to view a .docx file on Github?
Not yet (Q4 2016) unless the Word document is pure text.
How can I use git diff on the .docx file format?
Since Git for Windows 1.9.5 and Git for Windows 2.5.3 (Sept. 2015, see issue 355), you don't need any custom settings:
git diff -- myWord.docx
That will work. (It does for .doc and .pdf too.)
And since Git for Windows 2.10.1, you can diff .docm and .dotm too (see PR 128).
jifb adds in the comments:
The docx etc. support is based on the file conversion executables odt2txt, antiword, docx2txt (.pl) and pdftotext, which are invoked as configured in the system-wide gitattributes and gitconfig.
rtf files are not converted (a simple "cat" in Git for Windows 2.28.0), but unconverted rtf compares well if produced by "old" programs like Wordpad/Ted.
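To see where such system-wide settings come from on a given machine, one option is to ask git itself. This is a minimal sketch, assuming git 2.8 or later (for --show-origin); the function name list_textconv_drivers is made up for illustration.

```shell
#!/usr/bin/env bash
# Print every configured textconv diff driver together with the
# config file that defines it (system, global or repository level).
# Exits non-zero if no diff.<name>.textconv key is set anywhere.
list_textconv_drivers() {
    git config --show-origin --get-regexp '^diff\..*\.textconv$'
}
```

Calling list_textconv_drivers prints one line per driver, prefixed with the file that defines it.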
Upvotes: 7
Reputation: 5221
This is problematic and, to the best of my knowledge, not possible on GitHub or any other git host. While git can version anything, git diff compares two versions as plain text, so for a binary format like .docx the output is illegible.
I think there is a reason for this. There are countless file formats, many of them proprietary, so instead of supporting every format the way VLC does, git treats everything as text.
Also, even if git did somehow support docx, it couldn't display formatting changes in a terminal, let alone cmd. If the content is just text, it is better to store it as a text file, or to manually check out a previous version to compare the changes.
Upvotes: 1
Reputation: 2219
A .docx file is actually a zip archive (you can change the file extension and poke around inside). If the .docx is treated as a directory, the main document inside is stored as an XML file, which is text, not binary.
Sadly, that file has no line breaks; otherwise a text diff on the 'document.xml' file inside the archive would be really useful. Since it is XML, line breaks would not affect the content, so they could be added.
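Adding those line breaks could look roughly like the sketch below. It assumes the main part lives at word/document.xml and that paragraphs end with a closing w:p tag; the function names are invented for illustration, and unzip and awk are assumed to be installed.

```shell
#!/usr/bin/env bash
# Read WordprocessingML from stdin and start a new line after every
# closing paragraph tag, so a line-based diff has something to work with.
break_paragraphs() {
    awk '{ gsub(/<\/w:p>/, "&\n"); print }'
}

# Print word/document.xml from a .docx file, one paragraph per line.
# Assumes the unzip command is installed.
docx_xml_lines() {
    unzip -p "$1" word/document.xml | break_paragraphs
}
```

The markup is kept intact, so formatting changes still show up in the diff, just mixed in with the XML tags.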
Upvotes: 1
Reputation: 1317
The accepted solution (using strings / unzip) didn't work very well for me on Linux Mint 19.3. The following seems to work pretty well for most doc/docx/rtf/xls files as well as their LibreOffice counterparts. Some of these might work on Windows via Cygwin/Git Bash, but I have not tested that; if the packages I mention are not available there, I would look for Python/Perl scripts that do the same conversion and substitute those instead.
Install the packages:
sudo apt install git pandoc catdoc odt2txt
Create the global attributes file:
mkdir -p ~/.config/git/ && touch ~/.config/git/attributes
(on Windows this should be mkdir "%USERPROFILE%\.config\git" && echo "" > "%USERPROFILE%\.config\git\attributes")
Add the following to that attributes file (or to the project-scoped ${projectDir}/.git/info/attributes, as desired):
# handle windows *.reg files (utf-16 which git doesn't normally like)
*.reg diff=utf16
# handle misc common document formats
*.pdf diff=pdf2txt
*.rtf diff=catdoc
# handle libre/open document formats
*.ods diff=ods2txt
*.odp diff=odp2txt
*.odt diff=odt2txt
# handle older common ms document formats
# note: ppt did not work for me
*.doc diff=catdoc
*.ppt diff=catppt
*.xls diff=xls2csv
# handle newer zipped ms document formats
# note: pptx and xlsx did not work for me
*.docx diff=pandoc
*.pptx diff=pandoc
*.xlsx diff=pandoc
Add the following to your git config (globally in ~/.gitconfig or in the project-scoped ${projectDir}/.git/config). Much of this is based on this article but altered based on my own testing.
[core]
autocrlf = false
[diff]
guitool = kdiff3
[diff "odp2txt"]
textconv = odp2txt
binary = true
[diff "odt2txt"]
textconv = odt2txt
binary = true
[diff "ods2txt"]
textconv = ods2txt
binary = true
[diff "catdoc"]
textconv = catdoc
binary = true
# note catppt did not work for me
[diff "catppt"]
textconv = catppt
binary = true
[diff "xls2csv"]
textconv = xls2csv
binary = true
[diff "xlsx2csv"]
textconv = xlsx2csv
binary = true
[diff "pandoc"]
textconv=pandoc --to=markdown
prompt = false
[diff "pdf2txt"]
textconv=pdf2txt
binary = true
[diff "utf16"]
textconv = iconv -c -f UTF-16LE -t ASCII
I was never able to successfully get diffs working for xlsx, ppt, or pptx even after downloading the latest version of pandoc from their github page. The docx conversion worked fine even with the super old version that is in the Mint/Ubuntu/Debian repos (v1.19.2.4 from 2016). For the xlsx/pptx samples I was using, I always got either "Invalid UTF-8 stream fatal" (old version) or "UTF-8 decoding error" (new version).
This could have been due to the sample files I was using (some samples from the web and some samples I created by converting LibreOffice documents), my system setup, the versions I was using or something else.
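One cheap thing to rule out when a driver misbehaves is a converter that is simply not installed. The helper below is a hypothetical sketch (the name missing_converters is not from any package) that lists which of the given commands are not on PATH.

```shell
#!/usr/bin/env bash
# Print the name of every given command that is not on PATH.
missing_converters() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: the converters referenced by the config above.
# missing_converters pandoc catdoc catppt xls2csv xlsx2csv odt2txt odp2txt ods2txt pdf2txt
```

An empty result only shows that the commands exist; it says nothing about whether they can handle a particular sample file.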
For completeness, after installing the newer pandoc, I was using:
$ uname -vipor
5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:59 UTC 2020 x86_64 x86_64 GNU/Linux
$ dpkg -l catdoc odt2txt pandoc git xlsx2csv|grep '^ii'
ii catdoc 1:0.95-4.1 amd64 text extractor for MS-Office files
ii git 1:2.17.1-1ubuntu0.5 amd64 fast, scalable, distributed revision control system
ii odt2txt 0.5-1build2 amd64 simple converter from OpenDocument Text to plain text
ii pandoc 2.9.2-1 amd64 general markup converter
ii xlsx2csv 0.20+20161027+git5785081-1 all convert xslx files to csv format
EDIT: I also tried using the package xlsx2csv for xlsx conversion instead of pandoc, and I had issues with that as well. It could be something to do with my samples, but since I am not doing anything special to create them, I would consider that a coverage gap / limitation of xlsx2csv/pandoc if so.
Upvotes: 10
Reputation: 4887
After half-heartedly circling around Stackoverflow and Google for years, I just found out today that the official git book has a walkthrough.
Install docx2txt. On Ubuntu 16.04, I just used the official repositories:
sudo apt-get install docx2txt
Write a wrapper script (docx2txt requires some arguments) as follows:
#! /usr/bin/env bash
docx2txt "$1" -
I called the script d2t and put it in a folder on my $PATH. Remember to make it executable so that git can run it.
chmod +x d2t
mv d2t /somewhere/in/your/PATH
Now make your repository aware of this by adding this block to .git/config:
[diff "word"]
textconv = d2t
Note: the book suggests a command instead, which I assume you can also use with the --global flag to apply this filter to all repos, should you so wish:
git config --global diff.word.textconv d2t
For the repository where you want this to work, edit .gitattributes:
*.docx diff=word
Now you should be able to git diff your docx documents.
diff --git a/goodpoint.docx b/goodpoint.docx
index 0d6e78c..4476023 100644
--- a/goodpoint.docx
+++ b/goodpoint.docx
@@ -1,7 +1,7 @@
 Making many good points
 1. Overview
-- 2l3k23lk
+- this is a test
 - 23lkjl2k3j
 2. Remarks
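If git diff still reports the files as binary, it can help to check whether the .gitattributes line is being picked up at all. A small sketch using git check-attr; the function name diff_driver_for is hypothetical.

```shell
#!/usr/bin/env bash
# Show which diff driver git assigns to a path, based on .gitattributes.
# With the setup above this should print "<path>: diff: word"; a value of
# "unspecified" means no attribute line matched the path.
diff_driver_for() {
    git check-attr diff -- "$1"
}
```

Run it from inside the repository, for example diff_driver_for goodpoint.docx.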
Edit: I tried this on git 2.7.4. You can't checkout and add in patches without doing more work.
Upvotes: 3
Reputation: 181
Answering your second question -
Usually when you try
git diff filename.docx
you will get output of the form -
Binary files a/filename.docx and b/filename.docx differ
Not very helpful. A perfect way around that is to use Pandoc.
Create or edit the file ~/.gitconfig (Linux, Mac) or "C:\Documents and Settings\user\.gitconfig" (Windows), or use git config --global --edit, to add:
[diff "pandoc"]
textconv=pandoc --to=markdown
prompt = false
[alias]
wdiff = diff --word-diff=color --unified=1
In your git-controlled directory with .docx files, create or edit the file .gitattributes (Linux, Windows and Mac) to add
*.docx diff=pandoc
You can commit .gitattributes so that it carries over to other computers, but you'll need to edit ~/.gitconfig on every new computer you want to use.
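To avoid hand-editing ~/.gitconfig on each machine, the same entries can be written with git config. This is a sketch that mirrors the settings shown above; the function name setup_pandoc_diff is made up, and it assumes pandoc is already installed.

```shell
#!/usr/bin/env bash
# Write the pandoc diff driver and the wdiff alias into the current
# user's global git config (usually ~/.gitconfig).
setup_pandoc_diff() {
    git config --global diff.pandoc.textconv "pandoc --to=markdown"
    git config --global diff.pandoc.prompt false
    git config --global alias.wdiff "diff --word-diff=color --unified=1"
}
```

Running setup_pandoc_diff once per machine should be enough, since the .gitattributes part travels with the repository.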
Now you can see a pretty coloured diff with the changes you have made to your .docx file since the last commit
git wdiff file.docx
More details can be found here.
Upvotes: 18
Reputation: 692
In .gitattributes use:
*.docx diff=zip
In .git/config use:
[diff "zip"]
textconv = unzip -c -a
As a bonus my settings for old word/excel and new word/excel:
In .gitattributes use:
*.doc diff=word
*.xls diff=excel
*.xlsx diff=zip
*.docx diff=zip
In .git/config use:
[diff "word"]
textconv = strings
[diff "excel"]
textconv = strings
[diff "zip"]
textconv = unzip -c -a
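Since unzip -c prints every member of the archive, the diff includes styles, settings and other parts along with the text. A narrower alternative, sketched here under the assumption that the text lives in word/document.xml, is to extract only that part and strip the tags; the function names are invented for illustration.

```shell
#!/usr/bin/env bash
# Reduce WordprocessingML on stdin to plain text: each closing
# paragraph tag becomes a line break and all other tags are dropped.
# XML entities such as &amp; are left undecoded.
xml_to_text() {
    awk '{ gsub(/<\/w:p>/, "\n"); gsub(/<[^>]*>/, ""); print }'
}

# Print the text of the main document part of a .docx file.
# Assumes unzip is installed.
docx_to_text() {
    unzip -p "$1" word/document.xml | xml_to_text
}
```

Saved as an executable script on PATH that ends by calling docx_to_text "$1", this could be named in textconv in place of unzip -c -a. Formatting-only changes would then no longer appear in the diff.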
Upvotes: 20