r3b00t

Reputation: 7533

View .docx file on Github and use git diff on .docx file format

I have two questions:

  1. Is there any way to view a .docx file on Github? We have uploaded all of our assignments onto Github, but there is no way we can view it within the browser. It would be nice if we could view those .docx files in the browser without downloading the file.

  2. How can I use git diff on the .docx file format? I tried to use catdoc but it didn't work for me. I think I have used git diff on Windows for the .doc format before, but it's not working for me on Mac.

Thanks a lot.

Upvotes: 21

Views: 32736

Answers (7)

VonC

Reputation: 1324737

Is there any way to view a .docx file on Github?

Not yet (Q4 2016) unless the Word document is pure text.

How can I use git diff on the .docx file format?

Since Git for Windows 1.9.5 and Git for Windows 2.5.3 (Sept. 2015, see issue 355), you don't need any custom settings:

git diff -- myWord.docx

That will work. (It does for .doc and .pdf too.)

And since Git for Windows 2.10.1, you can diff .docm and .dotm too (see PR 128).


jifb adds in the comments:

The support for docx and similar formats relies on the conversion executables odt2txt, antiword, docx2txt(.pl) and pdftotext, which Git invokes as configured in the system-wide gitattributes and gitconfig.

rtf files are not converted (Git for Windows 2.28.0 simply runs "cat" on them), but unconverted rtf compares well if it was produced by "old" programs like WordPad or Ted.
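To see which converters a given installation actually has wired up, something like the following may help. It is only a sketch: list_textconv is a hypothetical helper name, and it assumes the drivers are registered as diff.<name>.textconv entries, as the comment above describes.

```shell
# List every textconv driver registered in a given config scope.
# Arguments are passed straight through to git config, e.g.
# --system, --global, or --file <path>.
list_textconv() {
  git config "$@" --get-regexp '^diff\..*\.textconv$' \
    || echo "no textconv drivers found"
}

# Look at the system-wide configuration mentioned in the comment above.
list_textconv --system
```

Each output line is the config key followed by the command bound to it, so a missing converter shows up as a missing line.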

Upvotes: 7

Akash

Reputation: 5221

This is problematic and, to the best of my knowledge, not possible on GitHub or any other Git host. While Git can version anything, git diff reports the differences between two versions as plain text, which is illegible for a format like .docx.

I feel that this is not without reason, though. There are countless file formats in the world and many of them are proprietary. So, rather than supporting every single format the way VLC does, Git treats everything as text.

Also, even if Git did somehow support .docx, it couldn't display formatting changes inside a terminal, let alone cmd. If the content is just text, it is better to store it as a text file, or to manually check out a previous version and compare the changes.

Upvotes: 1

innov8

Reputation: 2219

A .docx file is actually a zip archive (you can change the file extension and poke around inside). If you treat the .docx as a directory, the main document is stored inside it as an XML file, which is text, not binary.

The sad thing is that this XML contains no line breaks; otherwise a text diff on the 'document.xml' file inside the archive would be really useful. Since it is XML, line breaks between elements would not affect the content, so they could be added.
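A rough sketch of that idea, assuming unzip and awk are available and that the main part lives at word/document.xml (the usual location for files written by Word). Both function names are made up for illustration.

```shell
# Insert a newline after every closing paragraph tag so that a
# line-based diff can compare paragraph by paragraph.
# Reads XML on stdin and writes it to stdout.
break_paragraphs() {
  awk '{ gsub(/<\/w:p>/, "&\n"); print }'
}

# Hypothetical textconv wrapper: print the main document part of a
# .docx file with one paragraph per line.
docx_xml_lines() {
  unzip -p "$1" word/document.xml | break_paragraphs
}
```

Saved as a script on the PATH, docx_xml_lines could be used as the textconv command of a diff driver. The output is still raw XML, so it is noisier than a real text converter.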

Upvotes: 1

zpangwin

Reputation: 1317

The accepted solution (using strings / unzip ) didn't work very well for me on Linux Mint 19.3. The following seems to work pretty well for most doc/docx/rtf/xls files as well as their LibreOffice counterparts. Some of these might work on Windows via cygwin/git bash but I have not tested; if the packages I mention are not available in cygwin/git bash, then I would look for python/perl scripts that do the same conversion and substitute with those instead.

  1. Install prerequisites: sudo apt install git pandoc catdoc odt2txt.
  2. Note that catdoc and odt2txt include multiple tools for handling doc/xls/ppt/odt/ods/odp formats not just the ones in the package name. Likewise, pandoc handles all of the newer zipped 'x' formats.
  3. I wanted my attributes to apply globally (i.e. user-scoped) rather than per-project as done in the other answers. To create a user-scoped git attributes file, use mkdir -p ~/.config/git/ && touch ~/.config/git/attributes (on Windows this should be mkdir "%USERPROFILE%\.config\git" && echo "" > "%USERPROFILE%\.config\git\attributes")
  4. Setup git attributes file (either the user-scoped file mentioned in the previous step or the project-scoped file ${projectDir}/.git/info/attributes as desired):
    # handle windows *.reg files (utf-16 which git doesn't normally like)
    *.reg diff=utf16

    # handle misc common document formats
    *.pdf diff=pdf2txt
    *.rtf diff=catdoc

    # handle libre/open document formats
    *.ods diff=ods2txt
    *.odp diff=odp2txt
    *.odt diff=odt2txt

    # handle older common ms document formats
    # note: ppt did not work for me
    *.doc diff=catdoc
    *.ppt diff=catppt
    *.xls diff=xls2csv

    # handle newer zipped ms document formats
    # note: pptx and xlsx did not work for me
    *.docx diff=pandoc
    *.pptx diff=pandoc
    *.xlsx diff=pandoc
  5. Create .gitconfig definitions (either in the user-scoped ~/.gitconfig or in the project-scoped ${projectDir}/.git/config). Much of this is based on this article but altered based on my own testing.
    [core]
        autocrlf = false
    [diff]
        guitool = kdiff3
    [diff "odp2txt"]
        textconv = odp2txt
        binary = true
    [diff "odt2txt"]
        textconv = odt2txt
        binary = true
    [diff "ods2txt"]
        textconv = ods2txt
        binary = true
    [diff "catdoc"]
        textconv = catdoc
        binary = true
    # note catppt did not work for me
    [diff "catppt"]
        textconv = catppt
        binary = true
    [diff "xls2csv"]
        textconv = xls2csv
        binary = true
    [diff "xlsx2csv"]
        textconv = xlsx2csv
        binary = true
    [diff "pandoc"]
        textconv=pandoc --to=markdown
        prompt = false
    [diff "pdf2txt"]
        textconv=pdf2txt
        binary = true
    [diff "utf16"]
        textconv = iconv -c -f UTF-16LE -t ASCII

I was never able to successfully get diffs working for xlsx, ppt, or pptx even after downloading the latest version of pandoc from their github page. The docx conversion worked fine even with the super old version that is in the Mint/Ubuntu/Debian repos (v1.19.2.4 from 2016). For the xlsx/pptx samples I was using, I always got either "Invalid UTF-8 stream fatal" (old version) or "UTF-8 decoding error" (new version).

This could have been due to the sample files I was using (some samples from the web and some samples I created by converting LibreOffice documents), my system setup, the versions I was using or something else.

For completeness, after installing the newer pandoc, I was using:

$ uname -vipor
5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:59 UTC 2020 x86_64 x86_64 GNU/Linux

$ dpkg -l catdoc odt2txt pandoc git xlsx2csv|grep '^ii'
ii  catdoc         1:0.95-4.1          amd64        text extractor for MS-Office files
ii  git            1:2.17.1-1ubuntu0.5 amd64        fast, scalable, distributed revision control system
ii  odt2txt        0.5-1build2         amd64        simple converter from OpenDocument Text to plain text
ii  pandoc         2.9.2-1             amd64        general markup converter
ii  xlsx2csv       0.20+20161027+git5785081-1 all          convert xslx files to csv format

EDIT: Also tried using the package xlsx2csv for xlsx conversion instead of pandoc and I had issues with that as well. Could be something to do with my samples but since I am not really doing anything special to create them I would consider that a coverage-gap / limitation of xlsx2csv/pandoc if so.
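Since every driver above silently depends on its converter being on the PATH, a quick check like this can rule out the simplest failure mode before digging into sample files. It is a sketch only; check_converters is a hypothetical helper, and the tool list is just the one used in this answer.

```shell
# Report whether each named converter can be found on the PATH.
check_converters() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%-10s found\n' "$tool"
    else
      printf '%-10s MISSING\n' "$tool"
    fi
  done
}

check_converters pandoc catdoc catppt xls2csv xlsx2csv odt2txt ods2txt odp2txt
```

A tool reported as found can still fail on a particular file, as happened with my xlsx/pptx samples, so this only narrows things down.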

Upvotes: 10

icedwater

Reputation: 4887

After half-heartedly circling around Stack Overflow and Google for years, I just found out today that the official Git book has a walkthrough.

  1. Install docx2txt. On Ubuntu 16.04, I just used the official repositories:

    sudo apt-get install docx2txt
    
  2. Write a wrapper script as follows (docx2txt requires some arguments):

    #! /usr/bin/env bash
    docx2txt "$1" -
    
  3. I called the script d2t, so I added that to a folder somewhere in my $PATH. Remember to make it executable so that git can run it.

    chmod +x d2t
    mv d2t /somewhere/in/your/PATH
    
  4. Now make your repository aware of this by adding this block to .git/config:

    [diff "word"]
        textconv = d2t
    

    Note: the book suggests a command instead, which I assume you can also use with the --global flag to apply this filter to all repos if you wish:

    git config --global diff.word.textconv d2t
    
  5. For the repository where you want this to work, edit .gitattributes:

    *.docx diff=word
    
  6. Now you should be able to git diff your docx documents.

    diff --git a/goodpoint.docx b/goodpoint.docx
    index 0d6e78c..4476023 100644
    --- a/goodpoint.docx
    +++ b/goodpoint.docx
    @@ -1,7 +1,7 @@
     Making many good points
    
      1. Overview
    -- 2l3k23lk
    +- this is a test
     - 23lkjl2k3j
    
      2. Remarks
    

Edit: tried this on git 2.7.4. You can't check out and add in patches without doing more work.
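If the diff still comes out as "Binary files differ", it may help to confirm that Git sees both halves of the setup. The sketch below uses a throwaway repository created with mktemp, and it only checks the wiring, so d2t does not need to exist for it to run.

```shell
# Build a scratch repository with the same two settings as steps 4 and 5.
repo=$(mktemp -d)
git -C "$repo" init -q
printf '*.docx diff=word\n' > "$repo/.gitattributes"
git -C "$repo" config diff.word.textconv d2t

# Which diff driver does Git pick for a .docx path? (The file need not exist.)
git -C "$repo" check-attr diff -- goodpoint.docx
# prints "goodpoint.docx: diff: word"

# Which command is bound to that driver?
git -C "$repo" config --get diff.word.textconv
# prints "d2t"
```

Running the same two queries inside the real repository shows whether the .gitattributes entry or the config entry is the one that is missing.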

Upvotes: 3

Crygnus

Reputation: 181

Answering your second question -

Usually when you try

git diff filename.docx

you will get output of the form -

Binary files a/filename.docx and b/filename.docx differ

Not very helpful. A perfect way around that is to use Pandoc.

  • Install Pandoc on your system.
  • Create or edit the file ~/.gitconfig (Linux, Mac) or "C:\Documents and Settings\user\.gitconfig" (Windows) to add the following (or use git config --global --edit)

    [diff "pandoc"]
         textconv=pandoc --to=markdown
         prompt = false
    [alias]
         wdiff = diff --word-diff=color --unified=1
    
  • In your git controlled directory with .docx files, create or edit file .gitattributes (linux, Windows and Mac) to add

    *.docx diff=pandoc
    
  • You can commit .gitattributes so that it is available on other computers, but you'll need to edit ~/.gitconfig on every new computer you want to use.

  • Now you can see a pretty coloured diff with the changes you have made to your .docx file since the last commit

     git wdiff file.docx
    

More details can be found here.
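For anyone who would rather not edit the config file by hand, the same entries can probably be written with git config. This sketch writes to a scratch file (cfg is just an illustrative variable) so it is safe to try; swapping --file "$cfg" for --global should apply it to the real ~/.gitconfig.

```shell
# Scratch config file so the example does not touch the real ~/.gitconfig.
cfg=$(mktemp)

# Same settings as the [diff "pandoc"] and [alias] sections above.
git config --file "$cfg" diff.pandoc.textconv "pandoc --to=markdown"
git config --file "$cfg" diff.pandoc.prompt false
git config --file "$cfg" alias.wdiff "diff --word-diff=color --unified=1"

# Show what was written.
git config --file "$cfg" --list
```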

Upvotes: 18

Axe

Reputation: 692

Answer for the second part of the question. This is already an old post, but it keeps popping up in the top 10 without an answer. With the following settings you get a poor man's diff on .docx files.

In .gitattributes use:

*.docx diff=zip

In .git/config use:

[diff "zip"]
      textconv = unzip -c -a

As a bonus my settings for old word/excel and new word/excel:

In .gitattributes use:

*.doc diff=word
*.xls diff=excel
*.xlsx diff=zip
*.docx diff=zip

In .git/config use:

[diff "word"] 
    textconv = strings
[diff "excel"]
    textconv = strings
[diff "zip"]
    textconv = unzip -c -a
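To preview what one of these drivers feeds into the diff, the configured command can be run by hand. A minimal sketch, relying on the fact that Git passes the file path to the textconv command as its single argument; preview_textconv is a hypothetical helper, and it has to be run where the driver is configured (inside the repository, or anywhere for a global setting).

```shell
# Run the textconv command of a diff driver on a file, the way Git would.
# usage: preview_textconv <driver> <file>, e.g. preview_textconv zip report.docx
preview_textconv() {
  driver=$1
  file=$2
  cmd=$(git config --get "diff.$driver.textconv") || {
    echo "no textconv configured for driver '$driver'" >&2
    return 1
  }
  # Append the file path as the final argument, as Git does.
  sh -c "$cmd \"\$1\"" preview "$file"
}
```

Piping the result through a pager is an easy way to judge how readable the unzip or strings output will be before committing to it.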

Upvotes: 20
