Mark

Reputation: 141

python - comparing list to csv

I have this .csv:

col1,col2,col3,col4,col5
247,19,1.0,2016-01-01 14:11:21,MP
247,3,1.0,2016-01-01 14:23:43,MP
247,12,1.0,2016-01-01 15:32:16,MP
402,3,1.0,2016-01-01 12:11:15,?
583,12,1.0,2016-01-01 02:33:57,?
769,16,1.0,2016-01-01 03:12:24,?
769,4,1.0,2016-01-01 03:22:29,?
.....

I need to take the col2 values for each unique col1 element and make a new .csv like this:

expected output:
19,3,12
3
12
16,4
...

That is, I want to output numbers until a non-unique value is seen, at which point I will start a new line and continue to output numbers.

I read the .csv this way and removed the duplicates from the list:

import pandas as pd
colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)
list1 = df.col1.tolist()
list2 = list(set(list1))

Now things are getting hard for me; I'm a newbie in Python. My idea was to compare each element in list2 with each row in df, writing the col2 elements to a new .csv. Could you help me, please?
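Roughly, I imagine the idea looking something like this untested sketch ('out.csv' is just a placeholder name):

import pandas as pd

colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)

# For each unique col1 value, collect that group's col2 values and
# write them as one comma-separated line.
with open('out.csv', 'w') as f:
    for key in df.col1.unique():
        vals = df.loc[df.col1 == key, 'col2']
        f.write(','.join(str(v) for v in vals) + '\n')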

Upvotes: 0

Views: 112

Answers (3)

Mukul Anand

Reputation: 626

Example in Python 3:

import pandas as pd
import csv

x = pd.read_csv('input.txt')
y = x[['col1', 'col2']]
with open("output.csv", "w") as f:
    writer = csv.writer(f)
    # For each col1 group, write its col2 values as one row; the row is
    # written as a side effect of the aggregation, whose result is discarded.
    y.groupby(['col1']).agg(lambda x: writer.writerow(list(x.values)))

Maybe you can try this. Don't store the whole output in a list or any other data structure (memory issue); write to the file as you read and aggregate. (The reading should also be optimized to use an iterator, if possible, rather than loading the whole input file at once.)
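For illustration, the iterator part could look something like this sketch, using read_csv's chunksize parameter. This is a sketch only: it assumes the input is sorted by col1 so each group is contiguous, and the chunk size of 10000 is arbitrary.

import pandas as pd
import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    current_key, current_vals = None, []
    # Read the input in chunks instead of loading it all at once.
    for chunk in pd.read_csv('input.txt', chunksize=10000):
        for key, val in zip(chunk['col1'], chunk['col2']):
            if key != current_key:
                if current_vals:
                    writer.writerow(current_vals)  # flush the finished group
                current_key, current_vals = key, []
            current_vals.append(val)
    if current_vals:
        writer.writerow(current_vals)  # flush the last group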

Upvotes: 3

Mohammad Athar

Reputation: 1980

You need to track your duplicates. The simplest way (as in easiest to understand, though sacrificing some efficiency) is as follows:

import pandas as pd

def file_out_helper(m_str):
    # Append the given string to the target file.
    tgtfile = 'my_target_file.csv'
    with open(tgtfile, 'a') as f:
        f.write(m_str)

colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)

list1 = df.col2.tolist()
dup_tracker = []  # col2 values seen so far
for x in list1:
    if x in dup_tracker:  # duplicate seen: start a new line
        file_out_helper('\n')
    file_out_helper(str(x) + ', ')
    dup_tracker.append(x)

Upvotes: 0

vielkind

Reputation: 2980

You can do this by grouping your data and then applying set as the aggregation.

df.groupby('col1')['col2'].apply(set).apply(list)

The apply(set) function creates a set of all distinct col2 elements for each col1 value, and the apply(list) function then converts each set into a list.
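If you also need to write the result to a new .csv, a minimal sketch might look like this (note that set does not preserve the original order of the col2 values, and 'output.csv' is a placeholder name):

import csv
import pandas as pd

colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)

grouped = df.groupby('col1')['col2'].apply(set).apply(list)

# Each grouped entry is a list of distinct col2 values for one col1 key.
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for values in grouped:
        writer.writerow(values)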

Upvotes: 1
