abn

Reputation: 1373

Differences in lists of lists using Python

I have two lists of lists like the ones below. I know I can use set(list1)-set(list2), or vice versa, to print the list that differs from its counterpart. However, I do not want the complete list printed out; I only want the part of the list that has been modified.

For example, list1:

[['Code', 'sID', 'dID', 'cID', 'ssID'], ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999'], ['ABCD-01', 'ABCD-00-UNK', 160, '993', '587']]

list2:

[['Code', 'sID', 'dID', 'cID', 'ssID', 'AddedColumn'], ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999', 'AddedValue1'], ['ABCD-01', 'ABCD-00-UNK', 160, '993', 'ChangedValue', 'AddedValue2']]

If I do a set difference, it prints out the entire list. I want the output to show only the columns that are different, added, or removed when 'Code' and 'sID' are the same.

The first sub-list of each list of lists is the header, so I want to compare the rows whose values in the 'Code' and 'sID' columns match.

Desired output:

Added - ['AddedColumn', 'AddedValue1', 'AddedValue2']
Deleted - []
Changed - ['Code', 'ABCD-01', 'ssID', 'ChangedValue']

Something like this, or anything simpler, is fine too.

The code I've tried:

from difflib import SequenceMatcher

matcher = SequenceMatcher()
for a, b in zip(list1, list2):
    matcher.set_seqs(a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal': continue
        print('{:>7s} {} {}'.format(tag, a[i1:i2], b[j1:j2]))

It works well when comparing corresponding sub-lists, i.e., sub-list 1 in list1 with sub-list 1 in list2. But I want it to compare across the entire list, because if a particular sub-list is missing, it prints that everything is different. By sub-list I mean, for example, ['Code', 'sID', 'dID', 'cID', 'ssID'], which is sub-list 1 in list1.

Upvotes: 1

Views: 102

Answers (2)

user3467349

Reputation: 3191

Here is my rudimentary interpretation. The OP isn't quite clear about what they want in the changed list, so they should state their requirements more specifically. As jsbueno suggests, a dict may be better; it really depends, and lists are cheaper if that's the format the data came in.

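# l1 and l2 are the two lists of lists (list1 and list2 in the question)
added = []
deleted = []
changed = []
for sub_l1, sub_l2 in zip(l1, l2):
    # compare the columns that both sub-lists share, position by position
    for i in range(min(len(sub_l1), len(sub_l2))):
        if sub_l1[i] != sub_l2[i]:
            changed.append(sub_l2[i])
    # any extra trailing columns were either added or deleted
    if len(sub_l2) > len(sub_l1):
        added.append(sub_l2[len(sub_l1):])
    elif len(sub_l1) > len(sub_l2):
        deleted.append(sub_l1[len(sub_l2):])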

sample output:

In [66]: added
Out[66]: [['AddedColumn'], ['AddedValue1'], ['AddedValue2']]
In [67]: deleted
Out[67]: []
In [68]: changed
Out[68]: ['ChangedValue'] 

Note that changed doesn't tell you which value changed; generally you might want a tuple holding the index of the sub-list (the CSV row) and the column number.
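One way to record where each change happened, as a minimal sketch: the helper name diff_rows and the (row, column, old, new) tuple layout are assumptions made for illustration, not something the question specifies. Like the loop above, it compares sub-lists by position only, so it does not match rows on 'Code' and 'sID'.

```python
def diff_rows(l1, l2):
    """Positional comparison of two lists of lists (hypothetical helper)."""
    added, deleted, changed = [], [], []
    for row, (sub_l1, sub_l2) in enumerate(zip(l1, l2)):
        # Compare the columns that both sub-lists share.
        for col, (old, new) in enumerate(zip(sub_l1, sub_l2)):
            if old != new:
                changed.append((row, col, old, new))
        # Anything past the shared length was added or deleted.
        shared = min(len(sub_l1), len(sub_l2))
        if sub_l2[shared:]:
            added.append((row, sub_l2[shared:]))
        if sub_l1[shared:]:
            deleted.append((row, sub_l1[shared:]))
    return added, deleted, changed


l1 = [['Code', 'sID', 'dID', 'cID', 'ssID'],
      ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999'],
      ['ABCD-01', 'ABCD-00-UNK', 160, '993', '587']]
l2 = [['Code', 'sID', 'dID', 'cID', 'ssID', 'AddedColumn'],
      ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999', 'AddedValue1'],
      ['ABCD-01', 'ABCD-00-UNK', 160, '993', 'ChangedValue', 'AddedValue2']]

added, deleted, changed = diff_rows(l1, l2)
print(changed)  # [(2, 4, '587', 'ChangedValue')]
```

Each entry in changed now says that row 2, column 4 went from '587' to 'ChangedValue', which is enough to look up the header name with l1[0][4].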

Upvotes: 0

jsbueno

Reputation: 110301

So, as people are saying in the comments, what you really should do here is read each set of data you are calling a "sublist" into a proper object, and then compare the properties of those objects.

For example, to stick with native types, if "Code" and "sID" make up your key, each line could be a dictionary, stored under a key that is a tuple of its Code and sID values.
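A rough sketch of that native-types idea, assuming the first sub-list is the header as the question states. The function name index_rows and its key_columns parameter are made up for this example.

```python
def index_rows(table, key_columns=('Code', 'sID')):
    """Map (Code, sID) -> {column name: value} for each data row."""
    header, rows = table[0], table[1:]
    index = {}
    for row in rows:
        record = dict(zip(header, row))  # one dictionary per line
        key = tuple(record[name] for name in key_columns)
        index[key] = record
    return index


list1 = [['Code', 'sID', 'dID', 'cID', 'ssID'],
         ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999'],
         ['ABCD-01', 'ABCD-00-UNK', 160, '993', '587']]

by_key = index_rows(list1)
print(by_key[('ABCD-01', 'ABCD-00-UNK')]['ssID'])  # 587
```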

But this problem seems to call for a custom class.

Given one of the lists above, you could pretty much start with something along these lines:

class MyThing(object):
    def __init__(self, *args):
        # assign each positional value to the matching column name
        for attrname, arg in zip(['Code', 'sID', 'dID', 'cID', 'ssID'], args):
            setattr(self, attrname, arg)

    def __hash__(self):
        # This is not needed for the OrderedDict below, but allows you
        # to use sets with the objects if you want
        # (sets would also need a matching __eq__ to compare by key)
        return hash(self.Code + self.sID)

from collections import OrderedDict
myobjs = OrderedDict()
for line in list1[1:]:
    obj = MyThing(*line)  # unpack the line into positional arguments
    key = obj.Code + obj.sID
    if key in myobjs:
        # do your comparison - logging - printing stuff here
        pass
    else:
        myobjs[key] = obj

It can actually be done without the class and object creation part, by just storing the "line" in the dictionary, but the class enables you to do a lot of things in a cleaner way. The complicated __init__ is just a shorthand to avoid duplicating a lot of self.sID = sID lines.
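For comparison, here is a hedged sketch of the class-free variant applied to both lists from the question. The name diff_tables and the key_len parameter are assumptions for illustration. It assumes the key columns ('Code', 'sID') come first, that every row is as long as its header, and it silently skips rows present in only one list.

```python
def diff_tables(old, new, key_len=2):
    """Compare two header-plus-rows tables, matching rows on their
    first key_len columns (assumed here to be Code and sID)."""
    old_header, new_header = old[0], new[0]
    added_cols = [c for c in new_header if c not in old_header]
    deleted_cols = [c for c in old_header if c not in new_header]
    shared_cols = [c for c in old_header if c in new_header]

    # Store each raw line in a dictionary keyed by its key columns.
    old_rows = {tuple(line[:key_len]): line for line in old[1:]}
    new_rows = {tuple(line[:key_len]): line for line in new[1:]}

    changed = []
    for key, new_line in new_rows.items():
        old_line = old_rows.get(key)
        if old_line is None:
            continue  # row exists only in `new`; not reported in this sketch
        for col in shared_cols:
            before = old_line[old_header.index(col)]
            after = new_line[new_header.index(col)]
            if before != after:
                changed.append((key, col, before, after))
    return added_cols, deleted_cols, changed


list1 = [['Code', 'sID', 'dID', 'cID', 'ssID'],
         ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999'],
         ['ABCD-01', 'ABCD-00-UNK', 160, '993', '587']]
list2 = [['Code', 'sID', 'dID', 'cID', 'ssID', 'AddedColumn'],
         ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999', 'AddedValue1'],
         ['ABCD-01', 'ABCD-00-UNK', 160, '993', 'ChangedValue', 'AddedValue2']]

added_cols, deleted_cols, changed = diff_tables(list1, list2)
print(added_cols)    # ['AddedColumn']
print(deleted_cols)  # []
print(changed)       # [(('ABCD-01', 'ABCD-00-UNK'), 'ssID', '587', 'ChangedValue')]
```

Because rows are looked up by key rather than by position, a missing or reordered sub-list does not make every following row look different.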

Upvotes: 1
