Andereoo

Reputation: 958

Highlight differences between two html strings

I have 2 HTML strings with multiple subtle differences:

<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goal4s</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">1</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">9</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span class="redFont">01448795</span> Advanced 
Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusivey Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">7</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">1</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members 
needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmsasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td 
class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>

and

<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goals</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">4</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">8</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span class="redFont">01448795</span> Advanced 
Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusive Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members 
needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td 
class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>

I'm trying to search for differences between the two strings. I need to return the second string, where any differences are highlighted using <mark> tags.

This is a bit hard to explain, so here are some examples:

If one string has the text <span>This is a string</span> and the second has <span>Thiss is a string</span>, I want to return <span><mark>Thiss is a string</mark></span>. If another string has the text <p>36</p> and the second has <p>3</p>, I want to return <p><mark>3</mark></p>.

Note that the <mark> tag is inserted after the nearest > to the left of the difference, while the </mark> is inserted before the nearest < to the right of the difference.

I'm sure this is possible, but I can't seem to find a way to achieve this that works. This is what I have so far:

skew=0
prev_i = []
highlighted_area_info = my_second_html_string
diff = difflib.ndiff(my_first_html_string, my_second_html_string)
for i,s in enumerate(diff, start=0):
    if s[0]==' ':
        continue
    else:
        if i in prev_i:
             continue
        count_right = my_second_html_string[i].find('<')
        
        count_left = 0
        for a, b in reversed(list(enumerate(my_second_html_string))):
            if a < i:
                if b == ">":
                    break
                else:
                    count_left += 1
                
        highlighted_area_info2 = highlighted_area_info[:i-count_left+skew]
        highlighted_area_info2 += highlight_beginning
        highlighted_area_info2 += highlighted_area_info[i-count_left+skew:i+count_right+skew]
        highlighted_area_info2 += highlight_end
        highlighted_area_info2 += highlighted_area_info[i+count_right+skew:]
        skew += len(highlight_beginning)+len(highlight_end)
        highlighted_area_info = highlighted_area_info2
        prev_i = list(range(i-count_left+skew, i+count_right+skew))
print(highlighted_area_info)

Unfortunately, the <mark> and </mark> tags are inserted in the wrong positions, which produces output like this: <td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder"><mark>0</</ma<mark>rk>s</mark>pan></td> instead of the <td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder"><mark>0</mark></span></td> that I'm expecting.

I've spent days on this, but I'm still not sure what I am doing wrong, although something is obviously not right. My code is likely not utilizing the most efficient way to achieve my goal, either.

I need to have working code in a few days, so any help is hugely appreciated.

Upvotes: 2

Views: 273

Answers (1)

furas

Reputation: 142681

I used print() to inspect the variables in your code and found that you call ndiff(string1, string2), but it expects ndiff(list_of_lines1, list_of_lines2). As a result it treats your strings as lists of characters and compares every character separately. This way you get a <mark> for every changed character instead of one <mark> for the whole word.
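To make the difference concrete, here is a small sketch using the <p>36</p> and <p>3</p> example from the question (the variable names are only for illustration). It also shows that the diff output contains an entry for the removed character, so a position in the diff does not have to match an index in the second string:

```python
import difflib

old = "<p>36</p>"
new = "<p>3</p>"

# Plain strings are iterated character by character,
# so every entry in the result describes a single character.
char_diff = list(difflib.ndiff(old, new))

# Lists of lines are compared element by element,
# so every entry describes a whole line.
line_diff = list(difflib.ndiff([old], [new]))

for entry in char_diff:
    print(repr(entry))
print(line_diff)

# The removed "6" appears in the diff but not in `new`,
# which is why the diff is longer than the second string.
print(len(char_diff), len(new))
```

Because of that extra entry, using the index from enumerate(diff) to slice the second string drifts as soon as a character has been removed.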

I tried to change this by passing single-line lists, ndiff([string1], [string2]), along with other changes, but in the end I gave up because that approach makes no sense here. It is better to use lxml or BeautifulSoup to parse the HTML into a tree with tags as nodes and then compare the text in the nodes.
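As a rough illustration of the "parse to a tree and compare text in nodes" idea, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml or BeautifulSoup, the function name highlight_text_changes is made up for this example, and it assumes both inputs are well-formed and have exactly the same tag structure. It only looks at text directly inside a tag, not text that follows a closing tag:

```python
import xml.etree.ElementTree as ET


def highlight_text_changes(old_markup, new_markup):
    """Return new_markup with changed element text wrapped in <mark>.

    Assumption: both inputs are well-formed and share the same tag structure,
    so walking both trees in document order pairs up matching nodes.
    """
    old_root = ET.fromstring(old_markup)
    new_root = ET.fromstring(new_markup)

    # Build the list of pairs first, so that the <mark> elements
    # inserted below are never visited by the loop.
    pairs = list(zip(old_root.iter(), new_root.iter()))

    for old_node, new_node in pairs:
        if new_node.text and new_node.text != old_node.text:
            mark = ET.Element("mark")
            mark.text = new_node.text
            new_node.text = None
            new_node.insert(0, mark)

    return ET.tostring(new_root, encoding="unicode")


result = highlight_text_changes(
    "<div><p>36</p><span>same</span></div>",
    "<div><p>3</p><span>same</span></div>",
)
print(result)
```

This pairing breaks down as soon as tags are added or removed between the two versions, which is the part a dedicated tree-diff library takes care of.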


I found the module xmldiff, which uses lxml and generates a list of changes between two XML or HTML documents.

import xmldiff.main

all_changes = xmldiff.main.diff_texts(my_first_html_string, my_second_html_string)

Every change carries an XPath, so I use lxml to find the node and replace its text with <mark>text</mark>.

It can report different kinds of changes, but I needed only UpdateTextIn (when the text is inside a tag, e.g. <a>new text</a>) and UpdateTextAfter (when the text is after a tag, e.g. <a>...</a>new text).

highlighted_tree = lxml.etree.fromstring(my_second_html_string)

for item in all_changes:

    highlighted_node = highlighted_tree.xpath(item.node)[0]

    if isinstance(item, xmldiff.actions.UpdateTextIn):
        highlighted_node.text = '' # remove
        highlighted_node.insert(0, lxml.etree.fromstring('<mark>' + item.text + '</mark>'))

    if isinstance(item, xmldiff.actions.UpdateTextAfter):
        highlighted_node.tail = '' # remove # has to be before addnext
        highlighted_node.addnext(lxml.etree.fromstring('<mark>' + item.text + '</mark>'))
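The two cases correspond to the .text and .tail attributes of an element. lxml follows the same model as the standard library's ElementTree, so a small sketch with the standard library (the sample markup is shortened from the question's data) shows where each piece of text is stored:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<td><span>02194262</span> Inclusive Toastmasters</td>")
span = root.find("span")

inside = span.text  # text inside the tag
after = span.tail   # text that follows the closing tag

print(repr(inside))
print(repr(after))
```

The text after </span> belongs to the span element as its tail, not to the parent <td>, which is why the UpdateTextAfter branch has to clear .tail on the node itself.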

After that I convert the tree back to HTML:

html = lxml.etree.tostring(highlighted_tree)

print(html.decode())

Minimal working example with data

import xmldiff.main     # diff_texts
import xmldiff.actions  # UpdateTextIn, UpdateTextAfter
import lxml.etree

my_first_html_string = '''<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goal4s</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">1</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">9</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span 
class="redFont">01448795</span> Advanced Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusivey Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">7</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">1</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th 
class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmsasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th 
class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>'''
my_second_html_string = '''<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goals</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">4</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">8</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span 
class="redFont">01448795</span> Advanced Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusive Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th 
class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th 
class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>'''

#my_first_html_string =  '''<html>test1 <p>325</p><div>This</div> testA</html>'''
#my_second_html_string = '''<html>test2 <p>3</p><div>Thiss</div> testB</html>'''

all_changes = xmldiff.main.diff_texts(my_first_html_string, my_second_html_string)

#old_tree = lxml.etree.fromstring(my_first_html_string)
#new_tree = lxml.etree.fromstring(my_second_html_string)
highlighted_tree = lxml.etree.fromstring(my_second_html_string)

for item in all_changes:
    #print('item:', item)
    #print('item.xpath:', item.node)
    #print('item.text:', item.text)
    #old_node = old_tree.xpath(item.node)[0]
    #new_node = new_tree.xpath(item.node)[0]
    #print('old node:', lxml.etree.tostring(old_node))
    #print('new node:', lxml.etree.tostring(new_node))
    #print('old text and tail:', [old_node.text, old_node.tail])
    #print('new text and tail:', [new_node.text, new_node.tail])
    
    highlighted_node = highlighted_tree.xpath(item.node)[0]
    
    if isinstance(item, xmldiff.actions.UpdateTextIn):
        print('changed text:', item.text)
        highlighted_node.text = ''
        # build the element directly so that characters like `&` or `<` in the text are escaped
        mark = lxml.etree.Element('mark', style='background:red')
        mark.text = item.text
        highlighted_node.insert(0, mark)

    if isinstance(item, xmldiff.actions.UpdateTextAfter):
        print('changed tail:', item.text)
        highlighted_node.tail = '' # has to be removed before `addnext`
        mark = lxml.etree.Element('mark', style='background:red')
        mark.text = item.text
        highlighted_node.addnext(mark)
    
    print('---')

html = lxml.etree.tostring(highlighted_tree)
html = html.decode()
print(html)

with open('output.html', 'w') as f:
    f.write(html)

Result:

[image omitted]


The only problem is that the old text and the new text may sometimes be the same except for a different number of spaces, tabs, or newlines, and this is also treated as a change. Such differences should rather be skipped, but that needs additional code.
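One possible shape for that additional code, as a sketch only: collapse every run of whitespace before comparing. The helper names below are made up, and to use them in the loop you would also need the old text, for example by looking up the same XPath in a tree parsed from the first string, as in the commented-out old_tree lines.

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with a single space."""
    return " ".join((text or "").split())


def same_ignoring_whitespace(old_text, new_text):
    """Return True when the two strings differ only in whitespace."""
    return collapse_whitespace(old_text) == collapse_whitespace(new_text)


print(same_ignoring_whitespace("Advanced \nSpeakers on the Hill",
                               "Advanced Speakers on the Hill"))
print(same_ignoring_whitespace("36", "3"))
```

A change would then be highlighted only when same_ignoring_whitespace(old_text, item.text) is false.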

Upvotes: 3
