StuckOnDiffyQ

Reputation: 129

Removing duplicate rows across millions of compressed CSV files while keeping one piece of information from duplicated rows

I have a collection of ~10 million gzipped CSV files, each with anywhere from 100-1000 rows and >2000 columns. Each file also contains a header.

In each CSV file there are two important columns, "ID" and "target".

I'm trying to remove rows with duplicate "target" values, but retain the ID of each removed row by appending it to the row that is kept.

E.g.

Input:

CSV1
|   ID  |  Target                      |
|-------|------------------------------|
| IX213 | C1=CC(=CC=C1CC(=O)C(=O)O)O   |
| IX412 | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |

CSV2
|   ID  |  Target                      |
|-------|------------------------------|
| BC144 | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
| BC155 | C(CC(=O)O)C(C(=O)O)N         |

Output:

CSV1*
|   ID         |  Target                      |
|--------------|------------------------------|
| IX213        | C1=CC(=CC=C1CC(=O)C(=O)O)O   |
| IX412; BC144 | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |

CSV2*
|   ID  |  Target                      |
|-------|------------------------------|
| BC155 | C(CC(=O)O)C(C(=O)O)N         |

This would be straightforward for a small number of files with Pandas (Python) or the like, but I'm hoping someone has a much better way of doing it across millions of files with billions of entries.

Upvotes: 0

Views: 525

Answers (2)

josoler

Reputation: 1423

Even though the amount of data may look overwhelming, I think you could iterate over all the files sequentially if you keep only the data you actually need. For instance, you could keep a mapping from each unique target to the first ID seen with that target, plus a mapping of ID aliases (e.g. ID BC144 is an alias of IX412). This way your solution may look something like this:

import csv
import gzip

filenames = [...]
target_ids = {}  # target -> first ID seen with that target
aliases = {}     # removed ID -> the ID that was kept

for filename in filenames:
    with gzip.open(filename, 'rt', newline='') as file_in:  # the files are gzipped
        reader = csv.DictReader(file_in)
        for row in reader:
            if row['Target'] in target_ids:
                aliases[row['ID']] = target_ids[row['Target']]
                remove_row(row)  # Do whatever you may require
            else:
                target_ids[row['Target']] = row['ID']

Note that a dict with ~10M key-value pairs is perfectly tractable in memory.

If this still doesn't fit in memory, you could use shelve instead of dicts so that the corresponding data is stored on disk. You could do something like:

import csv
import gzip
import shelve

filenames = [...]

with shelve.open('target_ids') as target_ids, shelve.open('aliases') as aliases:
    for filename in filenames:
        with gzip.open(filename, 'rt', newline='') as file_in:  # the files are gzipped
            reader = csv.DictReader(file_in)
            for row in reader:
                if row['Target'] in target_ids:
                    aliases[row['ID']] = target_ids[row['Target']]
                    remove_row(row)  # Do whatever you may require
                else:
                    target_ids[row['Target']] = row['ID']

The downside of shelve compared to a regular dict is speed.

Upvotes: 1

btilly

Reputation: 46497

I would write a program that goes through a gzipped CSV file and writes out the following 4 columns: target id file row.

Run this on each file, and you will get ~10 million small files.
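
A minimal sketch of that first pass, assuming the ID and Target column names from the question; the tab-separated layout, the dump_rows name and the output path are just illustration:

import csv
import gzip
import os
import sys

def dump_rows(csv_gz_path, out_path):
    # Write one tab-separated line per data row: target, id, source file, row number.
    with gzip.open(csv_gz_path, 'rt', newline='') as f_in, \
         open(out_path, 'w', newline='') as f_out:
        writer = csv.writer(f_out, delimiter='\t')
        for row_num, row in enumerate(csv.DictReader(f_in)):
            writer.writerow([row['Target'], row['ID'],
                             os.path.basename(csv_gz_path), row_num])

if __name__ == '__main__':
    dump_rows(sys.argv[1], sys.argv[2])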

Assuming that you are NOT doing this in a distributed fashion, I would next combine the files into a single big file and sort it with the unix sort utility. (Warning: you want to do LC_ALL=C sort foo.txt because the C locale is faster and produces more sensible results. See "sort not sorting as expected (space and locale)" for more.)
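
If you prefer to drive that step from Python rather than the shell, a minimal sketch (combined.tsv and combined_sorted.tsv are placeholder names for the concatenated dump and its sorted version):

import os
import subprocess

# Sort in the C locale, as recommended above, so plain byte order is used.
subprocess.run(['sort', '-o', 'combined_sorted.tsv', 'combined.tsv'],
               env=dict(os.environ, LC_ALL='C'), check=True)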

Now it is easy to process that file and decide which row to keep for each target. You can write out a file with the columns file row target id is_keep removed_ids. Be sure to write the row number with leading zeros, so instead of 42 you would write 000042. The removed_ids are the IDs you removed from other files if you kept this one. (The number of leading zeros should be enough for your largest file; that is what makes asciibetical order match numerical order.)
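
A sketch of that decision pass, assuming the sorted dump is tab-separated with the columns target, id, file, row from the first step (the file names and the ';' separator inside removed_ids are illustrative):

import csv
from itertools import groupby
from operator import itemgetter

with open('combined_sorted.tsv', newline='') as f_in, \
     open('decisions.tsv', 'w', newline='') as f_out:
    reader = csv.reader(f_in, delimiter='\t')
    writer = csv.writer(f_out, delimiter='\t')
    # The input is sorted by target, so duplicate targets are adjacent.
    for target, rows in groupby(reader, key=itemgetter(0)):
        rows = list(rows)
        removed_ids = ';'.join(r[1] for r in rows[1:])  # IDs of the dropped duplicates
        for i, (tgt, id_, fname, row_num) in enumerate(rows):
            is_keep = 1 if i == 0 else 0  # keep the first occurrence in sorted order
            writer.writerow([fname, row_num.zfill(6), tgt, id_,
                             is_keep, removed_ids if is_keep else ''])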

Sort this file again and then break it up into one small decision file per original file.
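
Once decisions.tsv has been re-sorted with the same LC_ALL=C sort (into, say, decisions_sorted.tsv), splitting it into one small decision file per original CSV could look like this sketch:

import csv
from itertools import groupby
from operator import itemgetter

with open('decisions_sorted.tsv', newline='') as f_in:
    reader = csv.reader(f_in, delimiter='\t')
    # After the second sort the rows are grouped by source file (column 0).
    for fname, rows in groupby(reader, key=itemgetter(0)):
        with open(fname + '.decisions', 'w', newline='') as f_out:
            csv.writer(f_out, delimiter='\t').writerows(rows)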

Given an original gzipped file and its decision file saying which rows to keep, and which IDs to attach to a row you do keep, it is easy to process your original files to remove/keep rows and record what you removed. I strongly suggest a sanity check verifying that target/id/row all match, and not deleting the originals unless that sanity check passes.
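
A sketch of that last step, pairing an original gzipped CSV with its decision file; the assert is the sanity check mentioned above, and the '; ' separator follows the question's example output (all paths and names are assumptions):

import csv
import gzip

def apply_decisions(csv_gz_path, decisions_path, out_path):
    # Decision columns: file, row (zero-padded), target, id, is_keep, removed_ids.
    with open(decisions_path, newline='') as f:
        decisions = {int(row): (target, id_, is_keep, removed)
                     for _, row, target, id_, is_keep, removed
                     in csv.reader(f, delimiter='\t')}

    with gzip.open(csv_gz_path, 'rt', newline='') as f_in, \
         gzip.open(out_path, 'wt', newline='') as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row_num, row in enumerate(reader):
            target, id_, is_keep, removed = decisions[row_num]
            # Sanity check: the decision row must describe exactly this CSV row.
            # Do not delete the original unless every row passes.
            assert (target, id_) == (row['Target'], row['ID'])
            if is_keep == '1':
                if removed:
                    row['ID'] = '; '.join([row['ID']] + removed.split(';'))
                writer.writerow(row)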


If you're a big fan of distributed processing, the translation from sorting to map-reduce is straightforward. If you have an infrastructure for that set up, you might as well use it. If not, I would suggest this sorted-file approach, using parallelization only to process the individual files in the first and last steps.

Upvotes: 2
