Reputation: 23
I have a large CSV file (1.8 GB) with three columns. Each row contains two strings and a numerical value. The problem is that some rows are duplicates with the first two columns swapped. Example:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
The desired output would look like this:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
because the third row contains the same information as the first row.
EDIT
The data basically looks like this (strings in the first two columns and a numerical value in the third; 40 million lines):
Upvotes: 2
Views: 8241
Reputation: 1141
If you want to use the csv library itself, you can use a DictReader and a DictWriter:

import csv

def main():
    """Read a csv file, drop duplicate rows, and write the rest."""
    with open('test.csv', 'r', newline='') as inputfile:
        with open('testout.csv', 'w', newline='') as outputfile:
            duplicatereader = csv.DictReader(inputfile, delimiter=',')
            # Adjust fieldnames and the key below to match your own columns.
            uniquewrite = csv.DictWriter(outputfile, fieldnames=['address', 'floor', 'date', 'price'], delimiter=',')
            uniquewrite.writeheader()
            keysread = set()  # a set keeps membership tests O(1) on 40M rows
            for row in duplicatereader:
                key = (row['date'], row['price'])
                if key not in keysread:
                    keysread.add(key)
                    uniquewrite.writerow(row)

if __name__ == '__main__':
    main()
Upvotes: 0
Reputation: 2401
Note: This answer was written before the OP replaced the python tag with the awk tag.
If you don't mind the order of the elements you might do:

with open("in.csv", "r") as file:
    lines = set()
    for line in file:
        lines.add(frozenset(line.strip("\n").split(",")))

with open("out.csv", "w") as file:
    for line in lines:
        file.write(",".join(line) + "\n")
Output:
Col2,Col1,Col3
EFG,454,ABC
DEF,123,ABC
Note that you might want to treat the first line (the titles) in a special way so as not to lose its order.
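For instance, a minimal sketch (my variant, not the answer's original code) that passes the header through untouched and keeps each surviving row's original field order by writing the raw line:

```python
def dedupe_keep_header(src, dst):
    """Copy src to dst, dropping rows whose set of fields was already seen.
    The header (first line) is passed through untouched."""
    seen = set()
    with open(src, "r") as infile, open(dst, "w") as outfile:
        outfile.write(next(infile))  # header keeps its column order
        for line in infile:
            key = frozenset(line.strip("\n").split(","))
            if key not in seen:
                seen.add(key)
                outfile.write(line)  # raw line: field order preserved
```

Writing the raw line instead of `",".join(frozenset)` also sidesteps the scrambled column order visible in the output above.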
But if the order matters you could use the code from Maintaining the order of the elements in a frozen set:
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

with open("in.csv", "r") as file:
    lines = []
    for line in file:
        lines.append(line.strip("\n").split(","))

with open("out.csv", "w") as file:
    for line in unique_everseen(lines, key=frozenset):
        file.write(",".join(line) + "\n")
Output:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
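A quick in-memory check of what `key=frozenset` does (with a condensed re-definition of the recipe above so the snippet runs standalone):

```python
def unique_everseen(iterable, key=None):
    # Condensed form of the itertools recipe shown above.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

rows = [["ABC", "DEF", "123"],
        ["ABC", "EFG", "454"],
        ["DEF", "ABC", "123"]]  # same fields as the first row, swapped

print(list(unique_everseen(rows, key=frozenset)))
# The third row is dropped: its frozenset equals the first row's.
```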
The OP said that both codes seem not to work on large files (1.8 GB). I think that is because both codes keep the whole file in memory, and a 1.8 GB file may exhaust the available RAM.
In order to solve that I made a few more attempts. Sadly, all of them are extremely slow compared to the first attempt. The first codes sacrifice RAM for speed, but the following ones sacrifice speed, CPU and hard drive for lower RAM consumption (instead of holding the whole file in RAM they take less than 50 MB).
Since all of these examples need heavier hard drive usage, it is advisable to have the input and output files on different hard drives.
My first attempt at using less RAM is with the shelve module:

import shelve, os

with shelve.open("tmp") as db:
    with open("in.csv", "r") as file:
        for line in file:
            l = line.strip("\n").split(",")
            l.sort()  # canonical order, so ABC,DEF matches DEF,ABC
            db[",".join(l)] = l

with open("out.csv", "w") as file:
    for v in db.values():
        file.write(",".join(v) + "\n")

# Remove the shelve's backing files; their names vary with the dbm backend.
for name in ("tmp.bak", "tmp.dat", "tmp.dir"):
    if os.path.exists(name):
        os.remove(name)
Sadly, this code takes hundreds of times longer than the first two codes, which use RAM.
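A middle ground (my sketch, not part of the original answer) is to keep only a fixed-size digest of each canonical key in RAM instead of the rows themselves:

```python
import hashlib

def dedupe_hashed(src, dst):
    """Stream src to dst, keeping the first of each swapped-duplicate pair.
    Only a 16-byte MD5 digest per distinct row is held in memory."""
    seen = set()
    with open(src, "r") as infile, open(dst, "w") as outfile:
        for line in infile:
            fields = sorted(line.strip("\n").split(","))  # ABC,DEF == DEF,ABC
            digest = hashlib.md5(",".join(fields).encode()).digest()
            if digest not in seen:
                seen.add(digest)
                outfile.write(line)  # raw line, original field order
```

Python's per-object overhead means a set of 40 million digests still costs a few GB, but far less than keeping the full rows, and it avoids the shelve round-trips to disk.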
Another attempt is:

with open("in.csv", "r") as fileRead:
    # total = sum(1 for _ in fileRead)
    # fileRead.seek(0)
    # i = 0
    with open("out.csv", "w") as _:
        pass  # truncate the output file
    with open("out.csv", "r+") as fileWrite:
        for lineRead in fileRead:
            # i += 1
            line = lineRead.strip("\n").split(",")
            lineSet = set(line)
            write = True
            fileWrite.seek(0)
            for lineWrite in fileWrite:
                if lineSet == set(lineWrite.strip("\n").split(",")):
                    write = False
            if write:
                fileWrite.write(",".join(line) + "\n")
            # if i / total * 100 % 1 == 0: print(f"{i / total * 100}% ({i} / {total})")
This is slightly faster, but not much: rescanning the whole output file for every input line makes it quadratic in the number of unique rows.
If your computer has several cores, you could try to use multiprocessing:

from multiprocessing import Process, Queue, cpu_count
from os import remove

def slave(number, qIn, qOut):
    name = f"slave-{number}.csv"
    with open(name, "w") as file:
        pass  # truncate this worker's partial output
    with open(name, "r+") as file:
        while True:
            get = qIn.get()  # blocking get instead of busy-waiting on qIn.empty()
            if get is False:
                qOut.put(name)
                break
            write = True
            file.seek(0)
            for line in file:
                if set(line.strip("\n").split(",")) == get[1]:
                    write = False
                    break
            if write:
                file.write(get[0])

def master():
    qIn = Queue(1)
    qOut = Queue()
    slaves = cpu_count()
    slavesList = []
    for n in range(slaves):
        slavesList.append(Process(target=slave, daemon=True, args=(n, qIn, qOut)))
    for s in slavesList:
        s.start()
    with open("in.csv", "r") as file:
        for line in file:
            lineSet = set(line.strip("\n").split(","))
            qIn.put((line, lineSet))
    for _ in range(slaves):
        qIn.put(False)  # sentinel: tell each slave to stop
    for s in slavesList:
        s.join()
    slavesList = []
    # Merge the per-slave files, dropping rows already present in the master file.
    with open(qOut.get(), "r+") as fileMaster:
        for x in range(slaves - 1):
            file = qOut.get()
            with open(file, "r") as fileSlave:
                for lineSlave in fileSlave:
                    lineSet = set(lineSlave.strip("\n").split(","))
                    write = True
                    fileMaster.seek(0)
                    for lineMaster in fileMaster:
                        if set(lineMaster.strip("\n").split(",")) == lineSet:
                            write = False
                            break
                    if write:
                        fileMaster.write(lineSlave)
            slavesList.append(Process(target=remove, daemon=True, args=(file,)))
            slavesList[-1].start()
    for s in slavesList:
        s.join()

if __name__ == "__main__":  # required on platforms that spawn processes
    master()
As you can see, I have the disappointing task of telling you that both of my attempts are really slow. I hope you find a better approach; otherwise it will take hours, if not days, to run on 1.8 GB of data (the actual time depends mainly on the number of repeated values, which reduces it).
A new attempt: instead of keeping everything in memory, this one stores only the active chunk in RAM, then writes it down to a file in order to process chunks faster. The chunks must then be deduplicated again with one of the methods above:
lines = set()
maxLines = 1000  # lines kept in RAM at once; higher is faster but requires more RAM
perfect = True

with open("in.csv", "r") as fileRead:
    total = sum(1 for _ in fileRead)
    fileRead.seek(0)
    i = 0
    with open("tmp.csv", "w") as fileWrite:
        for line in fileRead:
            if len(lines) < maxLines:
                lines.add(frozenset(line.strip("\n").split(",")))
                i += 1
                if i / total * 100 % 1 == 0: print(f"Reading {i / total * 100}% ({i} / {total})")
            else:
                perfect = False
                for chunkLine in lines:
                    fileWrite.write(",".join(chunkLine) + "\n")
                if i / total * 100 % 1 == 0: print(f"Storing {i / total * 100}% ({i} / {total})")
                # Start the next chunk with the line that triggered the flush,
                # so it is not silently dropped.
                lines = {frozenset(line.strip("\n").split(","))}
                i += 1
        # Flush the last, partially filled chunk.
        for chunkLine in lines:
            fileWrite.write(",".join(chunkLine) + "\n")

if not perfect:
    use_one_of_the_above_methods()  # remember to read tmp.csv, not in.csv
This might boost the speed. You can set maxLines to any number you like; remember that the higher the number, the greater the speed (I am not sure whether really big numbers do the opposite) but also the higher the RAM consumption.
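If RAM is the hard limit, another option I would try (my sketch, not from the original answer) is a disk-backed key set using Python's stdlib sqlite3; the PRIMARY KEY index performs the duplicate check on disk, so memory use stays flat regardless of file size:

```python
import sqlite3

def dedupe_sqlite(src, dst, db_path="keys.sqlite"):
    """Stream src to dst, remembering canonical keys in an on-disk
    SQLite table so RAM usage stays flat regardless of file size."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS seen (k TEXT PRIMARY KEY)")
    with open(src) as infile, open(dst, "w") as outfile:
        for line in infile:
            key = ",".join(sorted(line.strip("\n").split(",")))
            try:
                con.execute("INSERT INTO seen (k) VALUES (?)", (key,))
                outfile.write(line)  # first time this key is seen: keep the row
            except sqlite3.IntegrityError:
                pass  # key already seen: swapped duplicate, skip it
    con.commit()
    con.close()
```

All inserts run inside one implicit transaction and are committed once at the end, which keeps the per-row overhead reasonable; for very large files you might commit in batches instead.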
Upvotes: 2
Reputation: 21965
From the problem description, a line should be kept only when the concatenation of its first and second fields, in either order, has not been seen before. If so, the below awk would help:
awk -F, '{seen[$1,$2]++;seen[$2,$1]++}seen[$1,$2]==1 && seen[$2,$1]==1' filename
Sample Input
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
GHI,ABC,123
DEF,ABC,123
ABC,GHI,123
DEF,GHI,123
Sample Output
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
GHI,ABC,123
DEF,GHI,123
Upvotes: 2
Reputation: 37404
Can you handle awk:
$ awk -F, '++seen[$3]==1' file
Output:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
Explained:
$ awk -F, ' # set comma as field delimiter
++seen[$3]==1 # count instances of the third field to hash, printing only first
' file
Update: to key on the first two fields regardless of their order, instead of on the value:
$ awk -F, '++seen[($1<$2?$1 FS $2:$2 FS $1)]==1' file
Output:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
It hashes every encountered combination of the first and second fields so that "ABC,DEF" == "DEF,ABC", counts them, and prints only the first occurrence. ($1<$2?$1 FS $2:$2 FS $1): if the first field sorts before the second, hash 1st,2nd; else hash 2nd,1st.
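The same canonical-key idea, written out in Python for readers who don't use awk (a rough equivalent I am adding for illustration, not part of the original answer):

```python
def keep_first_of_pair(lines):
    """Yield lines whose (field1, field2) pair, taken orderless, is new."""
    seen = set()
    for line in lines:
        f = line.rstrip("\n").split(",")
        # Mirrors the awk ternary ($1<$2 ? $1 FS $2 : $2 FS $1):
        key = (f[0], f[1]) if f[0] < f[1] else (f[1], f[0])
        if key not in seen:
            seen.add(key)
            yield line
```

Like the awk one-liner, the header row simply hashes to its own unique pair and passes through.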
Upvotes: 7