GigaByte
GigaByte

Reputation: 700

how to remove duplication from a CSVfile by making some checks?

I have a CSV like:

col-1(ID)       col-2(val-List)

1               [1]
1               [1,2,3]
2               [1,2]
2               [1]
3               [10]
3               [10]

and I want to remove duplicate from this file and at the end, I just need a single row which has greater list length, like:

edited:

I want to keep a single row if there are those rows which have the same ID and same length of the inner list.

col-1(ID)       col-2(Val-List)

1               [1,2,3]
2               [1,2]
3               [10]

I tried a lot but no luck: I'm giving try by using CSV module but do not have an idea that how should I maintain the length of previous Val-List and compare with next matching ID.

import csv 
list_1 = []
with open('test123.csv', 'r', encoding='latin-1') as file:
    csvReader = csv.reader(file, delimiter=',')

    for row in csvReader:
        key = (row[0])
        # but how should I use this id to get my desired results?

Upvotes: 4

Views: 67

Answers (1)

rvd
rvd

Reputation: 568

Why not let pandas do the work?

import pandas

# Read in the CSV
df = pandas.read_csv('test123.csv', encoding='latin-1')

# Compute the list lengths
df['lst_len'] = df['col-2(val-List)'].map(lambda x: len(list(x)))

# Sort in reverse order by list lengths
df = df.sort_values('lst_len', ascending=False)

# Drop duplicates, preserving first (longest) list by ID
df = df.drop_duplicates(subset='col-1(ID)')

# Remove extra column that we introduced, write to file
df = df.drop('lst_len', axis=1)
df.to_csv('clean_test123.csv', index=False)

Upvotes: 3

Related Questions