Reputation: 1373
I have two lists of lists like the ones below. I know I can use set(list1)-set(list2), or vice versa, to print the list that differs from its counterpart. However, I do not want the complete list printed out; I only want the part of the list that has been modified.
For example, list1:
[['Code', 'sID', 'dID', 'cID', 'ssID'], ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999'], ['ABCD-01', 'ABCD-00-UNK', 160, '993', '587']]
list2:
[['Code', 'sID', 'dID', 'cID', 'ssID', 'AddedColumn'], ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999', 'AddedValue1'], ['ABCD-01', 'ABCD-00-UNK', 160, '993', 'ChangedValue', 'AddedValue2']]
If I do set difference, it prints out the entire list. I want the output to show only the columns that are different/added/taken away when 'Code', 'sID' are the same.
The first list of each list of lists is the header. So I want to compare the lists when values from 'Code', 'sID' columns are matching.
Desired output:
Added - ['AddedColumn', 'AddedValue1', 'AddedValue2']
Deleted - []
Changed - ['Code', 'ABCD-01', 'ssID', 'ChangedValue']
Something like this, or anything simpler, is fine too.
The code I've tried:
from difflib import SequenceMatcher

matcher = SequenceMatcher()
for a, b in zip(list1, list2):
    matcher.set_seqs(a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            continue
        print('{:>7s} {} {}'.format(tag, a[i1:i2], b[j1:j2]))
It works well for comparing corresponding lists, i.e., sub-list1 in list1 with sub-list1 in list2. But I want it to compare across the entire list, because if a particular sub-list is missing, it prints that everything is different. By sub-list I mean, for example, that ['Code', 'sID', 'dID', 'cID', 'ssID'] in list1 is sub-list1.
Upvotes: 1
Views: 102
Reputation: 3191
Here is my rudimentary interpretation. The OP isn't quite clear on what they want in the changed list, so they should state their requirements more specifically. As jsbueno suggests, a dict may be better. It really depends: lists are cheaper if that's the format the data came in.
added = []
deleted = []
changed = []

for sub_l1, sub_l2 in zip(l1, l2):
    for i in range(min(len(sub_l1), len(sub_l2))):
        if sub_l1[i] != sub_l2[i]:
            changed.append(sub_l2[i])
    if len(sub_l2) > len(sub_l1):
        added.append(sub_l2[len(sub_l1):len(sub_l2)])
    elif len(sub_l1) > len(sub_l2):
        deleted.append(sub_l1[len(sub_l2):len(sub_l1)])
sample output:
In [66]: added
Out[66]: [['AddedColumn'], ['AddedValue1'], ['AddedValue2']]
In [67]: deleted
Out[67]: []
In [68]: changed
Out[68]: ['ChangedValue']
Note that changed doesn't tell you which value changed. Generally you might want a tuple with the CSV sublist and column number.
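A minimal sketch of that last idea, assuming the same positional comparison as above: each entry records the row index, the column index, and the old and new values. The function name diff_rows is made up for illustration, and rows are still paired by position, not by 'Code'/'sID'.

```python
def diff_rows(l1, l2):
    """Return (row, column, old, new) tuples for cells that differ."""
    changed = []
    for row, (sub_l1, sub_l2) in enumerate(zip(l1, l2)):
        # zip stops at the shorter sublist, so only shared columns are compared
        for col, (old, new) in enumerate(zip(sub_l1, sub_l2)):
            if old != new:
                changed.append((row, col, old, new))
    return changed


list1 = [['Code', 'sID', 'dID', 'cID', 'ssID'],
         ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999'],
         ['ABCD-01', 'ABCD-00-UNK', 160, '993', '587']]
list2 = [['Code', 'sID', 'dID', 'cID', 'ssID', 'AddedColumn'],
         ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999', 'AddedValue1'],
         ['ABCD-01', 'ABCD-00-UNK', 160, '993', 'ChangedValue', 'AddedValue2']]

changed = diff_rows(list1, list2)
print(changed)  # [(2, 4, '587', 'ChangedValue')]
```

The column name can then be looked up from the header row, e.g. list1[0][4] gives 'ssID'.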
Upvotes: 0
Reputation: 110301
So, as people are saying in the comments, what you really should do is read each set of data you are calling "sublists" into proper objects, and then compare the properties of those objects.
For example, to stick with native types, if "Code" and "sID" make up your key, each line could be stored in a dictionary keyed by a tuple of its Code and sID values.
But this problem seems to call for a custom class.
Given one of the lists above, you could pretty much start with something along these lines:
class MyThing(object):
    def __init__(self, *args):
        # Assign each positional value to the attribute named after its column
        for attrname, arg in zip(['Code', 'sID', 'dID', 'cID', 'ssID'], args):
            setattr(self, attrname, arg)

    def __hash__(self):
        # This is not needed for the OrderedDict below, but allows you
        # to use sets with the objects if you want
        return hash(self.Code + self.sID)

from collections import OrderedDict

myobjs = OrderedDict()
for line in list1[1:]:
    obj = MyThing(*line)  # unpack the row so each value is its own argument
    key = obj.Code + obj.sID
    if key in myobjs:
        # do your comparison, logging, or printing here
        pass
    else:
        myobjs[key] = obj
It can actually be done without the class and object creation part: just store the "line" in the dictionary. But the class enables you to do a lot of things in a cleaner way. The complicated __init__ is just a shorthand to avoid duplicating a lot of self.sID = sID lines.
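Here is a rough sketch of the class-free variant, assuming the first sublist is the header and that ('Code', 'sID') identifies a row. The names index_rows and diff_tables are hypothetical, and rows that exist in only one table are skipped instead of being reported.

```python
def index_rows(table, key_cols=('Code', 'sID')):
    """Return (header, {key tuple: {column name: value}}) for a table."""
    header = table[0]
    rows = {}
    for line in table[1:]:
        record = dict(zip(header, line))
        rows[tuple(record[k] for k in key_cols)] = record
    return header, rows


def diff_tables(table1, table2, key_cols=('Code', 'sID')):
    header1, rows1 = index_rows(table1, key_cols)
    header2, rows2 = index_rows(table2, key_cols)
    added = [c for c in header2 if c not in header1]    # columns only in table2
    deleted = [c for c in header1 if c not in header2]  # columns only in table1
    changed = []
    for key, old in rows1.items():
        new = rows2.get(key)
        if new is None:
            continue  # row missing from table2; not reported in this sketch
        for col in header1:
            if col in new and old[col] != new[col]:
                changed.append((key, col, old[col], new[col]))
    return added, deleted, changed


list1 = [['Code', 'sID', 'dID', 'cID', 'ssID'],
         ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999'],
         ['ABCD-01', 'ABCD-00-UNK', 160, '993', '587']]
list2 = [['Code', 'sID', 'dID', 'cID', 'ssID', 'AddedColumn'],
         ['ABCD-00', 'ABCD-00-UNK', '358', '1234', '9999', 'AddedValue1'],
         ['ABCD-01', 'ABCD-00-UNK', 160, '993', 'ChangedValue', 'AddedValue2']]

added, deleted, changed = diff_tables(list1, list2)
print(added)    # ['AddedColumn']
print(deleted)  # []
print(changed)  # [(('ABCD-01', 'ABCD-00-UNK'), 'ssID', '587', 'ChangedValue')]
```

Because rows are looked up by key and not by position, a missing or reordered sublist no longer makes every following row look different.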
Upvotes: 1