Reputation: 2179
I am new to Python and need help with the following requirement. I have a data set with the attribute names in the first row and the records in the remaining rows.
My requirement is to compare each record with every other record and report the names of the attributes whose values differ. At the end, I should have a set of sets as output.
For example, if I have 3 records with 3 columns like this:
        Col1  Col2  Col3
tuple1  H     C     G
tuple2  H     M     G
tuple3  L     M     S
It should give me this:
tuple1,tuple2 = {Col2}
tuple1,tuple3 = {Col1,Col2,Col3}
tuple2,tuple3 = {Col1,Col3}
And the final output should be {{Col2},{Col1,Col2,Col3},{Col1,Col3}}
Here is what I have tried.
I read each row into a list, so all the attribute names are in one list (named list_attr) and the rows are a list of lists (named rows). Then, for each record, I loop over the other records, compare them element by element, and use the index of each differing element to look up the attribute name. Finally, I convert the results to sets. The code is below. The problem is that I have 50k records and 15 attributes to process, so this looping takes a long time to execute. Is there another way to get this done faster or to improve the performance?
dis_sets = []
for l in rows:
    for l1 in rows:
        if l != l1:
            i = 0
            in_sets = []
            while(i < length):
                if l[i] != l1[i]:
                    in_sets.append(list_attr[i])
                i = i+1
            if in_sets != []:
                dis_sets.append(in_sets)
skt = set(frozenset(temp) for temp in dis_sets)
Upvotes: 2
Views: 615
Reputation: 103744
Consider:
>>> tuple1=('H', 'C', 'G')
>>> tuple2=('H', 'M', 'G')
>>> tuple3=('L', 'M', 'S')
OK, you state 'My requirement is to compare each record with other record and give the attribute name of the elements which are different.'
Put that into code:
>>> [i for i, t in enumerate(zip(tuple1, tuple2), 1) if t[0]!=t[1]]
[2]
>>> [i for i, t in enumerate(zip(tuple1, tuple3), 1) if t[0]!=t[1]]
[1, 2, 3]
>>> [i for i, t in enumerate(zip(tuple2, tuple3), 1) if t[0]!=t[1]]
[1, 3]
Then you state 'And the final output should be {{Col2},{Col1,Col2,Col3},{Col1,Col3}}'.
Since a set of sets will lose order, this does not make sense. It should be:
>>> [[i for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]] for pair in
... [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]]
[[2], [1, 2, 3], [1, 3]]
If you really want sets, you can have them as the sub-elements; with a true set of sets you lose the information of which pair is which.
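To make that loss concrete, here is a minimal sketch. The fourth record and the helper name `diff_columns` are made up for illustration and are not part of the question. Several pairs differ in exactly the same columns, so a set of sets keeps only one copy of each difference and can no longer say which pairs produced it.

```python
from itertools import combinations

# Hypothetical data: the three records from the question plus one extra row.
records = [('H', 'C', 'G'), ('H', 'M', 'G'), ('L', 'M', 'S'), ('H', 'X', 'G')]

def diff_columns(a, b):
    """Return the 1-based column numbers where records a and b differ."""
    return frozenset(i for i, (x, y) in enumerate(zip(a, b), 1) if x != y)

# One entry per pair, in a predictable order.
per_pair = [diff_columns(a, b) for a, b in combinations(records, 2)]

# Collapsing into a set of sets merges pairs that differ in the same columns.
as_set = set(per_pair)

print(len(per_pair))  # 6 pairs were compared
print(len(as_set))    # only 3 distinct differences remain
```

With a list you can still map position 0 back to (tuple1, tuple2); with the set you cannot.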
List of sets:
>>> [{i for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]} for pair in
... [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]]
[set([2]), set([1, 2, 3]), set([1, 3])]
And almost your desired output:
>>> [{'Col{}'.format(i) for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]} for pair in
... [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]]
[set(['Col2']), set(['Col2', 'Col3', 'Col1']), set(['Col3', 'Col1'])]
(Note that since sets are unordered, the order of the strings changes. If the top level order changes, what do you have?)
Notice that if you have a list of lists, you are closer to your desired output:
>>> [['Col{}'.format(i) for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]] for pair
... in [(tuple1, tuple2), (tuple1, tuple3), (tuple2, tuple3)]]
[['Col2'], ['Col1', 'Col2', 'Col3'], ['Col1', 'Col3']]
Edit based on comment
You could do something similar to:
def pairs(LoT):
    # for production code, consider using a deque of tuples...
    LoT = list(LoT)  # work on a copy so the caller's list is not emptied
    seen = set()     # hold the pair combinations seen
    while LoT:
        f = LoT.pop(0)
        for e in LoT:
            se = frozenset([f, e])
            if se not in seen:
                seen.add(se)
                yield se
>>> list(pairs([('H', 'C', 'G'), ('H', 'M', 'G'), ('L', 'M', 'S')]))
[frozenset([('H', 'M', 'G'), ('H', 'C', 'G')]), frozenset([('L', 'M', 'S'), ('H', 'C', 'G')]), frozenset([('H', 'M', 'G'), ('L', 'M', 'S')])]
Which then can be used thus:
>>> LoT=[('H', 'C', 'G'), ('H', 'M', 'G'), ('L', 'M', 'S')]
>>> [['Col{}'.format(i) for i, t in enumerate(zip(*pair), 1) if t[0]!=t[1]] for pair
... in pairs(LoT)]
[['Col2'], ['Col1', 'Col2', 'Col3'], ['Col1', 'Col3']]
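As an alternative sketch, the standard library's `itertools.combinations` yields each unordered pair exactly once and in input order, so no `seen` set or popping is needed. The function name `diff_names` is my own, and this is written for Python 3:

```python
from itertools import combinations

def diff_names(records, names):
    """For each pair of records, list the names of the columns that differ."""
    return [
        [name for name, x, y in zip(names, a, b) if x != y]
        for a, b in combinations(records, 2)
    ]

LoT = [('H', 'C', 'G'), ('H', 'M', 'G'), ('L', 'M', 'S')]
result = diff_names(LoT, ['Col1', 'Col2', 'Col3'])
print(result)
# [['Col2'], ['Col1', 'Col2', 'Col3'], ['Col1', 'Col3']]
```

Because the pairs come out as ordinary tuples rather than frozensets, the left and right record of each pair stay in a known order.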
Edit #2
If you want a header vs a calculated value:
>>> theader=['tuple col 1', 'col 2', 'the third' ]
>>> [[theader[i] for i, t in enumerate(zip(*pair)) if t[0]!=t[1]] for pair
... in pairs(LoT)]
[['col 2'], ['tuple col 1', 'col 2', 'the third'], ['tuple col 1', 'the third']]
If you want (what I suspect is the right answer) a List of Dicts of Lists:
>>> di=[]
>>> for pair in pairs(LoT):
...     di.append({repr(list(pair)): [theader[i] for i, t in enumerate(zip(*pair)) if t[0]!=t[1]]})
>>> di
[{"[('H', 'M', 'G'), ('H', 'C', 'G')]": ['col 2']}, {"[('L', 'M', 'S'), ('H', 'C', 'G')]": ['tuple col 1', 'col 2', 'the third']}, {"[('H', 'M', 'G'), ('L', 'M', 'S')]": ['tuple col 1', 'the third']}]
Or, just a straight Dict of Lists:
>>> di={}
>>> for pair in pairs(LoT):
...     di[repr(list(pair))]=[theader[i] for i, t in enumerate(zip(*pair)) if t[0]!=t[1]]
>>> di
{"[('H', 'M', 'G'), ('L', 'M', 'S')]": ['tuple col 1', 'the third'], "[('L', 'M', 'S'), ('H', 'C', 'G')]": ['tuple col 1', 'col 2', 'the third'], "[('H', 'M', 'G'), ('H', 'C', 'G')]": ['col 2']}
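String keys built with `repr` are awkward to look up, so here is a hedged sketch that keys each result by the pair of row indices instead. It also includes a variant aimed at the set-of-sets output from the question, which first drops duplicate rows. The names `diff_by_index` and `distinct_diffs` are my own, the code assumes Python 3.7+ (for ordered `dict.fromkeys`), and it is still quadratic in the number of distinct rows, so I have not verified how it behaves on 50k records.

```python
from itertools import combinations

def diff_by_index(records, header):
    """Map (i, j) row-index pairs to the header names where rows i and j differ."""
    result = {}
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        result[(i, j)] = [h for h, x, y in zip(header, a, b) if x != y]
    return result

def distinct_diffs(records, header):
    """Set of frozensets of differing header names, computed over unique rows only."""
    # Drop duplicate rows (keeping first-seen order); identical rows never differ.
    unique_rows = list(dict.fromkeys(map(tuple, records)))
    return {
        frozenset(h for h, x, y in zip(header, a, b) if x != y)
        for a, b in combinations(unique_rows, 2)
    }

LoT = [('H', 'C', 'G'), ('H', 'M', 'G'), ('L', 'M', 'S')]
theader = ['tuple col 1', 'col 2', 'the third']

by_index = diff_by_index(LoT, theader)
print(by_index[(0, 1)])  # ['col 2']
print(by_index[(1, 2)])  # ['tuple col 1', 'the third']

diffs = distinct_diffs(LoT + [('H', 'C', 'G')], theader)  # the repeated row is ignored
print(len(diffs))  # 3
```

With 15 attributes there can be at most 2**15 - 1 distinct non-empty difference sets, so the final set stays small even when the number of pairs is large.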
Upvotes: 3