Reputation: 23
I have a large CSV file (1.8 GB) with three columns. Each row contains two strings and a numerical value. The problem is that some rows are duplicates with the first two columns swapped. Example:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
The desired output would look like this:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
because the third row contains the same information as the first row.
EDIT
The data basically looks like this (strings in the first two columns and a numerical value in the third; 40 million lines):
Upvotes: 2
Views: 8241
Reputation: 1141
If you want to use the csv library itself, you can use a DictReader and a DictWriter:

import csv

def main():
    """Read a csv file, drop duplicate rows, and write the rest."""
    with open('test.csv', 'r', newline='') as inputfile:
        with open('testout.csv', 'w', newline='') as outputfile:
            duplicatereader = csv.DictReader(inputfile, delimiter=',')
            # Adjust fieldnames and the key below to match your own columns.
            uniquewrite = csv.DictWriter(outputfile, fieldnames=['address', 'floor', 'date', 'price'], delimiter=',')
            uniquewrite.writeheader()
            keysread = set()  # a set keeps membership tests O(1) on 40M rows
            for row in duplicatereader:
                key = (row['date'], row['price'])
                if key not in keysread:
                    keysread.add(key)
                    uniquewrite.writerow(row)

if __name__ == '__main__':
    main()
Upvotes: 0
Reputation: 2401
Note: This answer was written before the OP replaced the python tag with the awk tag.
If you don't mind the order of the elements you might do:

with open("in.csv", "r") as file:
    lines = set()
    for line in file:
        lines.add(frozenset(line.strip("\n").split(",")))

with open("out.csv", "w") as file:
    for line in lines:
        file.write(",".join(line) + "\n")
Output:
Col2,Col1,Col3
EFG,454,ABC
DEF,123,ABC
Note that you might want to treat the first line (the titles) in a special way so as not to lose its order.
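For instance, a minimal sketch (my variant, not the answer's original code) that passes the header through untouched and keeps each surviving row's original field order by writing the raw line:

```python
def dedupe_keep_header(src, dst):
    """Copy src to dst, dropping rows whose set of fields was already seen.
    The header (first line) is passed through untouched."""
    seen = set()
    with open(src, "r") as infile, open(dst, "w") as outfile:
        outfile.write(next(infile))  # header keeps its column order
        for line in infile:
            key = frozenset(line.strip("\n").split(","))
            if key not in seen:
                seen.add(key)
                outfile.write(line)  # raw line: field order preserved
```

Writing the raw line instead of `",".join(frozenset)` also sidesteps the scrambled column order visible in the output above.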
But if the order matters you could use the code from Maintaining the order of the elements in a frozen set:
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

with open("in.csv", "r") as file:
    lines = []
    for line in file:
        lines.append(line.strip("\n").split(","))

with open("out.csv", "w") as file:
    for line in unique_everseen(lines, key=frozenset):
        file.write(",".join(line) + "\n")
Output:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
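A quick in-memory check of what `key=frozenset` does (with a condensed re-definition of the recipe above so the snippet runs standalone):

```python
def unique_everseen(iterable, key=None):
    # Condensed form of the itertools recipe shown above.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

rows = [["ABC", "DEF", "123"],
        ["ABC", "EFG", "454"],
        ["DEF", "ABC", "123"]]  # same fields as the first row, swapped

print(list(unique_everseen(rows, key=frozenset)))
# The third row is dropped: its frozenset equals the first row's.
```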
The OP said that both codes seem not to work on large files (1.8 GB). I think that is because both codes keep the whole file in memory, and a 1.8 GB file may exhaust the available RAM.
In order to solve that I made a few more attempts. Sadly, all of them are extremely slow compared to the first attempt. The first codes sacrifice RAM for speed, but the following ones sacrifice speed, CPU and hard drive for lower RAM consumption (instead of holding the whole file in RAM they take less than 50 MB).
Since all of these examples need heavier hard drive usage, it is advisable to have the input and output files on different hard drives.
My first attempt at using less RAM is with the shelve module:

import shelve, os

with shelve.open("tmp") as db:
    with open("in.csv", "r") as file:
        for line in file:
            l = line.strip("\n").split(",")
            l.sort()  # canonical order, so ABC,DEF matches DEF,ABC
            db[",".join(l)] = l

with open("out.csv", "w") as file:
    for v in db.values():
        file.write(",".join(v) + "\n")

# Remove the shelve's backing files; their names vary with the dbm backend.
for name in ("tmp.bak", "tmp.dat", "tmp.dir"):
    if os.path.exists(name):
        os.remove(name)
Sadly, this code takes hundreds of times longer than the first two codes, which use RAM.
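A middle ground (my sketch, not part of the original answer) is to keep only a fixed-size digest of each canonical key in RAM instead of the rows themselves:

```python
import hashlib

def dedupe_hashed(src, dst):
    """Stream src to dst, keeping the first of each swapped-duplicate pair.
    Only a 16-byte MD5 digest per distinct row is held in memory."""
    seen = set()
    with open(src, "r") as infile, open(dst, "w") as outfile:
        for line in infile:
            fields = sorted(line.strip("\n").split(","))  # ABC,DEF == DEF,ABC
            digest = hashlib.md5(",".join(fields).encode()).digest()
            if digest not in seen:
                seen.add(digest)
                outfile.write(line)  # raw line, original field order
```

Python's per-object overhead means a set of 40 million digests still costs a few GB, but far less than keeping the full rows, and it avoids the shelve round-trips to disk.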
Another attempt is:

with open("in.csv", "r") as fileRead:
    # total = sum(1 for _ in fileRead)
    # fileRead.seek(0)
    # i = 0
    with open("out.csv", "w") as _:
        pass  # truncate the output file
    with open("out.csv", "r+") as fileWrite:
        for lineRead in fileRead:
            # i += 1
            line = lineRead.strip("\n").split(",")
            lineSet = set(line)
            write = True
            fileWrite.seek(0)
            for lineWrite in fileWrite:
                if lineSet == set(lineWrite.strip("\n").split(",")):
                    write = False
            if write:
                fileWrite.write(",".join(line) + "\n")
            # if i / total * 100 % 1 == 0: print(f"{i / total * 100}% ({i} / {total})")
This is slightly faster, but not much: rescanning the whole output file for every input line makes it quadratic in the number of unique rows.
If your computer has several cores, you could try to use multiprocessing:

from multiprocessing import Process, Queue, cpu_count
from os import remove

def slave(number, qIn, qOut):
    name = f"slave-{number}.csv"
    with open(name, "w") as file:
        pass  # truncate this worker's partial output
    with open(name, "r+") as file:
        while True:
            get = qIn.get()  # blocking get instead of busy-waiting on qIn.empty()
            if get is False:
                qOut.put(name)
                break
            write = True
            file.seek(0)
            for line in file:
                if set(line.strip("\n").split(",")) == get[1]:
                    write = False
                    break
            if write:
                file.write(get[0])

def master():
    qIn = Queue(1)
    qOut = Queue()
    slaves = cpu_count()
    slavesList = []
    for n in range(slaves):
        slavesList.append(Process(target=slave, daemon=True, args=(n, qIn, qOut)))
    for s in slavesList:
        s.start()
    with open("in.csv", "r") as file:
        for line in file:
            lineSet = set(line.strip("\n").split(","))
            qIn.put((line, lineSet))
    for _ in range(slaves):
        qIn.put(False)  # sentinel: tell each slave to stop
    for s in slavesList:
        s.join()
    slavesList = []
    # Merge the per-slave files, dropping rows already present in the master file.
    with open(qOut.get(), "r+") as fileMaster:
        for x in range(slaves - 1):
            file = qOut.get()
            with open(file, "r") as fileSlave:
                for lineSlave in fileSlave:
                    lineSet = set(lineSlave.strip("\n").split(","))
                    write = True
                    fileMaster.seek(0)
                    for lineMaster in fileMaster:
                        if set(lineMaster.strip("\n").split(",")) == lineSet:
                            write = False
                            break
                    if write:
                        fileMaster.write(lineSlave)
            slavesList.append(Process(target=remove, daemon=True, args=(file,)))
            slavesList[-1].start()
    for s in slavesList:
        s.join()

if __name__ == "__main__":  # required on platforms that spawn processes
    master()
As you can see, I have the disappointing task of telling you that both of my attempts are really slow. I hope you find a better approach; otherwise it will take hours, if not days, to run on 1.8 GB of data (the actual time depends mainly on the number of repeated values, which reduces it).
A new attempt: instead of keeping everything in memory, this one stores only the active chunk in RAM, then writes it down to a file in order to process chunks faster. The chunks must then be deduplicated again with one of the methods above:
lines = set()
maxLines = 1000  # lines kept in RAM at once; higher is faster but requires more RAM
perfect = True

with open("in.csv", "r") as fileRead:
    total = sum(1 for _ in fileRead)
    fileRead.seek(0)
    i = 0
    with open("tmp.csv", "w") as fileWrite:
        for line in fileRead:
            if len(lines) < maxLines:
                lines.add(frozenset(line.strip("\n").split(",")))
                i += 1
                if i / total * 100 % 1 == 0: print(f"Reading {i / total * 100}% ({i} / {total})")
            else:
                perfect = False
                for chunkLine in lines:
                    fileWrite.write(",".join(chunkLine) + "\n")
                if i / total * 100 % 1 == 0: print(f"Storing {i / total * 100}% ({i} / {total})")
                # Start the next chunk with the line that triggered the flush,
                # so it is not silently dropped.
                lines = {frozenset(line.strip("\n").split(","))}
                i += 1
        # Flush the last, partially filled chunk.
        for chunkLine in lines:
            fileWrite.write(",".join(chunkLine) + "\n")

if not perfect:
    use_one_of_the_above_methods()  # remember to read tmp.csv, not in.csv
This might boost the speed. You can set maxLines to any number you like; remember that the higher the number, the greater the speed (I am not sure whether really big numbers do the opposite) but also the higher the RAM consumption.
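If RAM is the hard limit, another option I would try (my sketch, not from the original answer) is a disk-backed key set using Python's stdlib sqlite3; the PRIMARY KEY index performs the duplicate check on disk, so memory use stays flat regardless of file size:

```python
import sqlite3

def dedupe_sqlite(src, dst, db_path="keys.sqlite"):
    """Stream src to dst, remembering canonical keys in an on-disk
    SQLite table so RAM usage stays flat regardless of file size."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS seen (k TEXT PRIMARY KEY)")
    with open(src) as infile, open(dst, "w") as outfile:
        for line in infile:
            key = ",".join(sorted(line.strip("\n").split(",")))
            try:
                con.execute("INSERT INTO seen (k) VALUES (?)", (key,))
                outfile.write(line)  # first time this key is seen: keep the row
            except sqlite3.IntegrityError:
                pass  # key already seen: swapped duplicate, skip it
    con.commit()
    con.close()
```

All inserts run inside one implicit transaction and are committed once at the end, which keeps the per-row overhead reasonable; for very large files you might commit in batches instead.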
Upvotes: 2
Reputation: 21965
From the problem description, a line should be kept only when the concatenation of its first and second fields, in either order, has not been seen before. If so, the below awk would help:
awk -F, '{seen[$1,$2]++;seen[$2,$1]++}seen[$1,$2]==1 && seen[$2,$1]==1' filename
Sample Input
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
DEF,ABC,123
GHI,ABC,123
DEF,ABC,123
ABC,GHI,123
DEF,GHI,123
Sample Output
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
GHI,ABC,123
DEF,GHI,123
Upvotes: 2
Reputation: 37404
Can you handle awk:
$ awk -F, '++seen[$3]==1' file
Output:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
Explained:
$ awk -F, ' # set comma as field delimiter
++seen[$3]==1 # count instances of the third field to hash, printing only first
' file
Update: to key on the first two fields regardless of their order, instead of on the value:
$ awk -F, '++seen[($1<$2?$1 FS $2:$2 FS $1)]==1' file
Output:
Col1,Col2,Col3
ABC,DEF,123
ABC,EFG,454
It hashes every encountered combination of the first and second fields so that "ABC,DEF" == "DEF,ABC", counts them, and prints only the first occurrence. ($1<$2?$1 FS $2:$2 FS $1): if the first field sorts before the second, hash 1st,2nd; else hash 2nd,1st.
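The same canonical-key idea, written out in Python for readers who don't use awk (a rough equivalent I am adding for illustration, not part of the original answer):

```python
def keep_first_of_pair(lines):
    """Yield lines whose (field1, field2) pair, taken orderless, is new."""
    seen = set()
    for line in lines:
        f = line.rstrip("\n").split(",")
        # Mirrors the awk ternary ($1<$2 ? $1 FS $2 : $2 FS $1):
        key = (f[0], f[1]) if f[0] < f[1] else (f[1], f[0])
        if key not in seen:
            seen.add(key)
            yield line
```

Like the awk one-liner, the header row simply hashes to its own unique pair and passes through.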
Upvotes: 7