Reputation: 700
I have a CSV like:
col-1(ID) col-2(val-List)
1 [1]
1 [1,2,3]
2 [1,2]
2 [1]
3 [10]
3 [10]
and I want to remove duplicate from this file and at the end, I just need a single row which has greater list length, like:
edited:
I want to keep a single row if there are those rows which have the same ID and same length of the inner list.
col-1(ID) col-2(Val-List)
1 [1,2,3]
2 [1,2]
3 [10]
I tried a lot but no luck: I'm giving try by using CSV module but do not have an idea that how should I maintain the length of previous Val-List and compare with next matching ID.
import csv
list_1 = []
with open('test123.csv', 'r', encoding='latin-1') as file:
csvReader = csv.reader(file, delimiter=',')
for row in csvReader:
key = (row[0])
# but how should I use this id to get my desired results?
Upvotes: 4
Views: 67
Reputation: 568
Why not let pandas
do the work?
import pandas
# Read in the CSV
df = pandas.read_csv('test123.csv', encoding='latin-1')
# Compute the list lengths
df['lst_len'] = df['col-2(val-List)'].map(lambda x: len(list(x)))
# Sort in reverse order by list lengths
df = df.sort_values('lst_len', ascending=False)
# Drop duplicates, preserving first (longest) list by ID
df = df.drop_duplicates(subset='col-1(ID)')
# Remove extra column that we introduced, write to file
df = df.drop('lst_len', axis=1)
df.to_csv('clean_test123.csv', index=False)
Upvotes: 3