Reputation:
I have lines of data, each comprising 4 fields:
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd
Please bear with me.
The first and third fields are always the same, and I don't need them. The 4th field can be the same or different. I only want the 2nd and 4th fields from lines whose 4th field value is not shared with any other line. For example, from the above data I want:
bbb3 eeee
bbb4 ffff
bbb5 gggg
Now I don't mean deduplication as that would leave one of the entries in. If the 4th field shares a value with another line, I don't want any line which ever had that value.
Humblest apologies once again for asking what is probably a simple question.
Upvotes: 1
Views: 576
Reputation: 82934
For your amplified requirement, you can avoid reading the file twice or saving it in a list:
LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')
import collections
adict = collections.defaultdict(list)
for line in LINES:  # or file ...
    a, b, c, d = line.split()
    adict[d].append(b)
map_b_to_d = dict((blist[0], d) for d, blist in adict.items() if len(blist) == 1)
print(map_b_to_d)
# alternative; saves some memory
xdict = {}
duplicated = object()
for line in LINES:  # or file ...
    a, b, c, d = line.split()
    xdict[d] = duplicated if d in xdict else b
map_b_to_d2 = dict((b, d) for d, b in xdict.items() if b is not duplicated)
print(map_b_to_d2)
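As a rough sketch of how the second variant might look when fed an open file (or any other iterable of lines) instead of a list, assuming Python 3: the function name `unique_pairs_single_pass`, the `_DUPLICATED` sentinel name, and the decision to skip lines that don't have exactly 4 fields are my own assumptions, not part of the question. `io.StringIO` stands in for a real file here.

```python
import io

# Sentinel marking a 4th-field value that has been seen on more than one line.
_DUPLICATED = object()


def unique_pairs_single_pass(lines):
    """Return (2nd, 4th) field pairs whose 4th field occurs on exactly one line."""
    seen = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: silently skip blank or malformed lines
        _, b, _, d = fields
        # First sighting stores the 2nd field; any later sighting overwrites
        # it with the sentinel, so the value is excluded from the result.
        seen[d] = _DUPLICATED if d in seen else b
    return [(b, d) for d, b in seen.items() if b is not _DUPLICATED]


# io.StringIO behaves like an open text file for iteration purposes.
sample = io.StringIO(
    "aaaa bbb1 cccc dddd\n"
    "aaaa bbb2 cccc dddd\n"
    "aaaa bbb3 cccc eeee\n"
    "aaaa bbb4 cccc ffff\n"
    "aaaa bbb5 cccc gggg\n"
    "aaaa bbb6 cccc dddd\n"
)
result = unique_pairs_single_pass(sample)
for b, d in result:
    print(b, d)
```

Only one dictionary entry per distinct 4th-field value is kept, so memory use depends on the number of distinct values rather than the number of lines.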
Upvotes: 0
Reputation: 281505
Here you go:
from collections import defaultdict
LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')
# Count how many lines each unique value of the fourth field appears in.
d_counts = defaultdict(int)
for line in LINES:
    a, b, c, d = line.split()
    d_counts[d] += 1
# Print only those lines with a unique value for the fourth field.
for line in LINES:
    a, b, c, d = line.split()
    if d_counts[d] == 1:
        print(b, d)  # Python 3 print function
# Prints
# bbb3 eeee
# bbb4 ffff
# bbb5 gggg
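The same count-then-filter idea could also be sketched with `collections.Counter`, assuming Python 3. The function name `unique_fourth_field` is hypothetical, and unlike the loop above this version keeps the split rows in a list rather than iterating over the input twice.

```python
from collections import Counter


def unique_fourth_field(lines):
    """Return (2nd, 4th) field pairs for lines whose 4th field is unique."""
    # Split once and keep the rows; blank lines are skipped (an assumption).
    rows = [line.split() for line in lines if line.strip()]
    # Count how many rows carry each value of the 4th field.
    counts = Counter(row[3] for row in rows)
    # Keep only rows whose 4th field appears exactly once, in input order.
    return [(row[1], row[3]) for row in rows if counts[row[3]] == 1]


sample_lines = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split("\n")

pairs = unique_fourth_field(sample_lines)
for b, d in pairs:
    print(b, d)
```

Because the result is built from the rows in their original order, the output order matches the input regardless of how the counts are stored.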
Upvotes: 6