Element

Python: remove all lines that share a common value in a field

I have lines of data, each consisting of 4 fields:

aaaa bbb1 cccc dddd  
aaaa bbb2 cccc dddd  
aaaa bbb3 cccc eeee  
aaaa bbb4 cccc ffff  
aaaa bbb5 cccc gggg  
aaaa bbb6 cccc dddd    

Please bear with me.

The first and third fields are always the same, but I don't need them. The 4th field can be the same or different. I only want the 2nd and 4th fields from lines whose 4th field value is not shared with any other line. For example, from the above data:

bbb3 eeee  
bbb4 ffff    
bbb5 gggg    

I don't mean deduplication, as that would leave one of the entries in. If the 4th field shares a value with another line, I don't want any line that has that value.

Humblest apologies once again for asking what is probably simple.

Upvotes: 1

Views: 576

Answers (2)

John Machin

Reputation: 82934

For your amplified requirement, you can avoid reading the file twice or saving it in a list:

LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')

import collections
adict = collections.defaultdict(list)
for line in LINES: # or file ...
    a, b, c, d = line.split()
    adict[d].append(b)

map_b_to_d = dict((blist[0], d) for d, blist in adict.items() if len(blist) == 1)
print(map_b_to_d)

# alternative; saves some memory

xdict = {}
duplicated = object()
for line in LINES: # or file ...
    a, b, c, d = line.split()
    xdict[d] = duplicated if d in xdict else b

map_b_to_d2 = dict((b, d) for d, b in xdict.items() if b is not duplicated)
print(map_b_to_d2)
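
The "or file ..." comments above suggest that the same loop works on any iterable of lines. As a rough sketch of that idea, the sentinel variant can be wrapped in a function and fed a file-like object. The function name `unique_pairs_single_pass` and the choice to skip lines that don't have exactly four fields are assumptions made for illustration, not part of the answer; `io.StringIO` just stands in for an open file.

```python
import io

# Sentinel marking a fourth-field value that has been seen more than once.
_DUPLICATED = object()


def unique_pairs_single_pass(lines):
    """Return {second_field: fourth_field} for fourth-field values seen exactly once.

    `lines` can be any iterable of strings, e.g. an open file; it is read once.
    """
    seen = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: silently skip blank or malformed lines
        _, b, _, d = fields
        seen[d] = _DUPLICATED if d in seen else b
    return {b: d for d, b in seen.items() if b is not _DUPLICATED}


# io.StringIO stands in for a real file opened with open(...).
sample = io.StringIO(
    "aaaa bbb1 cccc dddd\n"
    "aaaa bbb2 cccc dddd\n"
    "aaaa bbb3 cccc eeee\n"
    "aaaa bbb4 cccc ffff\n"
    "aaaa bbb5 cccc gggg\n"
    "aaaa bbb6 cccc dddd\n"
)
result = unique_pairs_single_pass(sample)
print(result)
```

Only one dictionary entry is kept per distinct fourth-field value, so memory grows with the number of distinct values rather than with the number of lines.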

Upvotes: 0

RichieHindle

Reputation: 281505

Here you go:

from collections import defaultdict

LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')

# Count how many lines each unique value of the fourth field appears in.
d_counts = defaultdict(int)
for line in LINES:
    a, b, c, d = line.split()
    d_counts[d] += 1

# Print only those lines with a unique value for the fourth field.
for line in LINES:
    a, b, c, d = line.split()
    if d_counts[d] == 1:
        print(b, d)

# Prints
# bbb3 eeee
# bbb4 ffff
# bbb5 gggg
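
The same count-then-filter idea can also be sketched with `collections.Counter`, which does the tallying in one call. The function name `unique_fourth_field` is made up for this example, and the sketch assumes every non-blank line has exactly four whitespace-separated fields. It keeps the split rows in a list, so it suits data that fits in memory.

```python
from collections import Counter


def unique_fourth_field(lines):
    """Return (second, fourth) pairs for lines whose fourth field occurs only once."""
    # Split every non-blank line once and keep the rows for the second pass.
    rows = [line.split() for line in lines if line.strip()]
    # Count how many rows carry each fourth-field value.
    counts = Counter(row[3] for row in rows)
    # Keep input order; drop every row whose fourth field is shared.
    return [(row[1], row[3]) for row in rows if counts[row[3]] == 1]


sample = [
    "aaaa bbb1 cccc dddd",
    "aaaa bbb2 cccc dddd",
    "aaaa bbb3 cccc eeee",
    "aaaa bbb4 cccc ffff",
    "aaaa bbb5 cccc gggg",
    "aaaa bbb6 cccc dddd",
]
pairs = unique_fourth_field(sample)
for b, d in pairs:
    print(b, d)
```

Unlike a dictionary keyed on the second field, the list of pairs preserves the original line order and tolerates repeated second-field values.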

Upvotes: 6
