Reputation: 8972
I'm trying to review the changes between commits for a large number of HTML documents, but I quickly noticed that most changes are unimportant: they usually come from logging, version strings changed to prevent caching, or external scripts. For example:
<a class="support-ga" target="_blank" href="#">0fb63cacd50e / 0fb63cacd50e @
-app-151</a>
+app-107</a>
<input type='hidden' name='csrfmiddlewaretoken'
-value='82NB5DdySoICu1mqcl0RZVk5dMCOVEQd'
+value='a0zBgxBevaBugotGpNKI6kMPsIsBbH44'
/>
The previous example shows that looking at those changes is probably not very interesting or useful.
I would like to know whether there is a git diff option to ignore this kind of change. An alternative would be to rank the differences by similarity. So far I have been using
git diff --word-diff=porcelain --unified=0 HEAD~1 HEAD
and then processing its output to extract changes, calculate the Levenshtein distance, and remove duplicates. That helps, but it is not a great solution, considering that git already knows which lines are supposed to be compared and provides a configurable number of context lines.
Upvotes: 1
Views: 917
Reputation: 1323115
You could try writing a diff driver or a content filter that ignores specific patterns.
See this discussion as an example.
echo '*.html filter=ignore_value' >> .gitattributes
git config filter.ignore_value.clean "sed -e '/^value=.*$/d'"
Note that a clean filter strips the matching lines from the content that gets staged, not just from the diff. This is also only a first draft, as the value attribute might not be at the start of the line: you need to adjust the regex so that it matches every line whose changes you wish to skip.
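If the stored content should stay untouched, a textconv diff driver is one possible alternative, since it only changes what git diff compares. The following is a minimal sketch in a throwaway repository, assuming git and sed are installed; the driver name ignore_value, the file page.html, and the sample tokens are made up for illustration.

```shell
#!/bin/sh
# Self-contained demo in a scratch repository (assumes git and sed are available).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email demo@example.com
git config user.name demo

# "ignore_value" is a hypothetical driver name chosen for this sketch.
echo '*.html diff=ignore_value' > .gitattributes
# git appends the path of the file to convert as the last argument of this command.
git config diff.ignore_value.textconv "sed -e '/value=/d'"

# Two commits that differ in a token line (noise) and in a <p> line (real change).
printf '%s\n' "<input type='hidden'" "value='82NB5Ddy'" "/>" "<p>old</p>" > page.html
git add . && git commit -q -m first
printf '%s\n' "<input type='hidden'" "value='a0zBgxBe'" "/>" "<p>new</p>" > page.html
git commit -q -am second

# The token lines are removed before comparison, so only the <p> change is reported.
diff_output=$(git diff --unified=0 HEAD~1 HEAD)
printf '%s\n' "$diff_output"
```

Because textconv output is meant for human viewing only, the resulting diff should not be applied as a patch. The sed expression would still need tuning for real pages.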
The OP, Robert Smith, points (in the comments) to a more complete command:
git diff --unified=0 HEAD~1 HEAD | grep -v -E -f PATTERNS.txt
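As a rough sketch of what PATTERNS.txt might contain, assuming the noise looks like the CSRF token and app-NNN lines from the question. The patterns and the filter_noise function name are illustrative and do not come from the linked comment.

```shell
# Hypothetical pattern file: one extended regex per line, matched against diff output lines.
cat > PATTERNS.txt <<'EOF'
^[-+]value='[A-Za-z0-9]+'$
^[-+]app-[0-9]+</a>$
EOF

# Drop every line of a diff (read from stdin) that matches any of the patterns.
filter_noise() {
    grep -v -E -f PATTERNS.txt
}

# Usage inside a repository:
#   git diff --unified=0 HEAD~1 HEAD | filter_noise
```

This filters the text of the diff after git has produced it, so hunk headers (the @@ lines) for fully ignored hunks will still show up unless a pattern removes them as well.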
Upvotes: 1