Reputation: 141
I have this .csv:
col1,col2,col3,col4,col5
247,19,1.0,2016-01-01 14:11:21,MP
247,3,1.0,2016-01-01 14:23:43,MP
247,12,1.0,2016-01-01 15:32:16,MP
402,3,1.0,2016-01-01 12:11:15,?
583,12,1.0,2016-01-01 02:33:57,?
769,16,1.0,2016-01-01 03:12:24,?
769,4,1.0,2016-01-01 03:22:29,?
.....
I need to take col2 values for each col1 unique element and make a new .csv like this:
expected output:
19,3,12
3
12
16,4
...
That is, I want to output numbers until a non-unique value is seen, at which point I will start a new line and continue to output numbers.
I read the .csv this way and removed the duplicates from the list:
import pandas as pd
colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)
list1 = df.col1.tolist()
list2 = list(set(list1))
Now things are getting hard for me; I'm a newbie in Python. My idea was to compare each element in list2 with each row in df, writing the col2 elements to a new .csv. Could you help me, please?
Upvotes: 0
Views: 112
Reputation: 626
An example in Python 3:
import pandas as pd
import csv
x = pd.read_csv('input.txt')
y = x[['col1','col2']]
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # Each col1 group's col2 values are written as one CSV row,
    # as a side effect of the aggregation.
    y.groupby(['col1']).agg(lambda g: writer.writerow(list(g.values)))
You can try this. Don't store the whole output in a list or any other data structure (memory issues); write to the file as you read and aggregate. (The reading should also be optimized to get an iterator, if possible, rather than loading the whole input file at once.)
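The iterator idea above can be sketched with read_csv's chunksize parameter, which yields DataFrames piece by piece instead of loading everything at once. This is a minimal sketch with made-up inline data standing in for the real file; note that a col1 group split across a chunk boundary would produce two partial rows, so on real data you'd merge boundary groups (the input being sorted by col1 makes that feasible).

```python
import io

import pandas as pd

# Hypothetical small input standing in for sorted.csv.
data = io.StringIO(
    "col1,col2,col3,col4,col5\n"
    "247,19,1.0,2016-01-01 14:11:21,MP\n"
    "247,3,1.0,2016-01-01 14:23:43,MP\n"
    "402,3,1.0,2016-01-01 12:11:15,?\n"
)

rows = []
# chunksize turns read_csv into an iterator of DataFrames,
# so the whole file never has to fit in memory at once.
for chunk in pd.read_csv(data, chunksize=2):
    for col1, group in chunk.groupby("col1"):
        rows.append((col1, list(group["col2"])))

print(rows)  # [(247, [19, 3]), (402, [3])]
```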
Upvotes: 3
Reputation: 1980
You need to track your duplicates. The simplest way (as in easy to understand, but sacrificing some efficiency) is as follows:
import pandas as pd

def file_out_helper(m_str):
    # Append a string to the target file.
    tgtfile = 'my_target_file.csv'
    with open(tgtfile, 'a') as f:
        f.write(m_str)

colnames = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_csv('sorted.csv', names=colnames)
list1 = df.col2.tolist()

dup_tracker = []
for x in list1:
    if x in dup_tracker:
        file_out_helper('\n')  # duplicate seen: start a new line
    file_out_helper(str(x) + ', ')
    dup_tracker.append(x)
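The same duplicate-tracking rule can be sketched in plain Python, collecting the output rows in memory and using a set instead of a list so membership tests are O(1) rather than O(n). The values below are made up to mirror the question's col2 column.

```python
# col2 values in file order (assumed sample data).
values = [19, 3, 12, 3, 12, 16, 4]

seen = set()   # all values seen so far; a set gives O(1) lookups
rows = [[]]    # output rows, starting with one empty row
for v in values:
    if v in seen:
        rows.append([])  # duplicate seen: start a new output row
    rows[-1].append(v)
    seen.add(v)

print(rows)  # [[19, 3, 12], [3], [12, 16, 4]]
```

Note that this rule starts a new row only on a duplicate, so it is not quite the same as grouping by col1 (here 16 stays on the same row as 12 because 16 has not been seen before).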
Upvotes: 0
Reputation: 2980
You can do this by grouping your data then applying a set function as the aggregation.
df.groupby('col1')['col2'].apply(set).apply(list)
The apply(set) call creates a set of all distinct col2 elements for each col1 value, then apply(list) converts each set into a list.
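To get from that grouped result to the expected .csv, you can write each group's list as one CSV row. A minimal sketch with inline data mirroring the question's file (note that a set does not preserve file order, so the values are sorted here for a stable row):

```python
import csv
import io

import pandas as pd

# Hypothetical data mirroring the question's sorted.csv.
df = pd.DataFrame({
    "col1": [247, 247, 247, 402, 583, 769, 769],
    "col2": [19, 3, 12, 3, 12, 16, 4],
})

# One list of distinct col2 values per col1 group.
groups = df.groupby("col1")["col2"].apply(set).apply(list)

# Write each group as its own CSV row.
buf = io.StringIO()
writer = csv.writer(buf)
for values in groups:
    writer.writerow(sorted(values))  # sets are unordered; sort for a stable row

print(buf.getvalue())
```

If the original file order matters (to match the expected output 19,3,12 exactly), df.groupby('col1')['col2'].unique() keeps the values in order of appearance while still dropping duplicates within each group.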
Upvotes: 1