JohnPal

Reputation: 77

Finding duplicates in each row and column

The function needs to be able to check a file for duplicates in each row and column.

Example of file with duplicates:

A B C
A A B
B C A

As you can see, there is a duplicate in row 2 (two A's), and also in column 1 (two A's). My code:

def duplication_char(dc):
    with open(dc, "r") as duplicatechars:
        linecheck = duplicatechars.readlines()
    linecheck = [line.split() for line in linecheck]

    # rows: a set drops duplicates, so a shorter set means the row had one
    for row in linecheck:
        if len(set(row)) != len(row):
            print("duplicates", " ".join(row))

    # columns: zip(*linecheck) transposes the grid
    for column in zip(*linecheck):
        if len(set(column)) != len(column):
            print("duplicates", " ".join(column))

Upvotes: 0

Views: 190

Answers (2)

dawg

Reputation: 103744

You can read the file into a list of lists and use zip to transpose it.
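For instance, zip(*rows) pairs up the first element of every row, then the second, and so on, turning rows into columns; a quick illustration on the question's grid:

```python
rows = [["A", "B", "C"], ["A", "A", "B"], ["B", "C", "A"]]

# zip(*rows) transposes the grid: each tuple is one column
cols = list(zip(*rows))
# cols[0] is ('A', 'A', 'B') -- the first column
```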

Given your example, try:

from collections import Counter

with open(fn) as fin:
    data = [line.split() for line in fin]

rowdups = {}
coldups = {}
# first pass counts duplicates per row, second per (transposed) column
for d, m in ((rowdups, data), (coldups, zip(*data))):
    for i, sl in enumerate(m):
        count = Counter(sl)
        for c in count.most_common():
            if c[1] > 1:
                d.setdefault(i, []).append(c)

>>> rowdups 
{1: [('A', 2)]}
>>> coldups 
{0: [('A', 2)]} 

Upvotes: 1

Julien Spronck

Reputation: 15423

Well, here is how I would do it.

First, read your files and create a 2d numpy array with the content:

import numpy
with open('test.txt', 'r') as fil:
    lines = fil.readlines()
lines = [line.strip().split() for line in lines]
arr = numpy.array(lines)

Then, check if each row has duplicates using sets (a set has no duplicates, so if the length of the set is different than the length of the array, the array has duplicates):

for row in arr:
    if len(set(row)) != len(row):
        print('Duplicates in row: ', row)

Then, check if each column has duplicates using sets, by transposing your numpy array:

for col in arr.T:
    if len(set(col)) != len(col):
        print('Duplicates in column: ', col)

If you wrap all of this in a function:

def check_for_duplicates(filename):
    import numpy
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]
    arr = numpy.array(lines)

    for row in arr:
        if len(set(row)) != len(row):
            print('Duplicates in row: ', row)

    for col in arr.T:
        if len(set(col)) != len(col):
            print('Duplicates in column: ', col)

As suggested by Apero, you can also do this without numpy using zip (https://docs.python.org/3/library/functions.html#zip):

def check_for_duplicates(filename):
    with open(filename, 'r') as fil:
        lines = fil.readlines()
    lines = [line.strip().split() for line in lines]

    for row in lines:
        if len(set(row)) != len(row):
            print('Duplicates in row: ', row)

    for col in zip(*lines):
        if len(set(col)) != len(col):
            print('Duplicates in column: ', col)

In your example, the numpy version prints (the zip version prints the same content, but the row as a list and the column as a tuple):

# Duplicates in row:  ['A' 'A' 'B']
# Duplicates in column:  ['A' 'A' 'B']
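If you need the locations rather than printed output, the same set-based checks can return the indices of the offending rows and columns (a sketch, not part of the original answer; find_duplicates is a hypothetical name):

```python
def find_duplicates(lines):
    """Return (rows, cols): indices of rows/columns that contain duplicates."""
    rows = [i for i, row in enumerate(lines) if len(set(row)) != len(row)]
    cols = [j for j, col in enumerate(zip(*lines)) if len(set(col)) != len(col)]
    return rows, cols

grid = [["A", "B", "C"], ["A", "A", "B"], ["B", "C", "A"]]
# find_duplicates(grid) returns ([1], [0]): row 1 and column 0 have duplicates
```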

Upvotes: 4
