Reputation: 43
I'm new to Python (and to coding in general), so I'm sorry if this gets a bit confusing.
I have a CSV file like the following:
A B C D E F G H
14 BP1 BP1-19119308 OR1A1 19119308 chip-chip Hs578T human 11/23/09
15 BP1 BP1-19119308 PTPRE 19119308 chip-chip Hs578T human 11/23/09
16 BP1 BP1-19119308 SELE 19119308 chip-chip Hs578T human 11/23/09
17 BP1 BP1-19119308 TAC3 19119308 chip-chip Hs578T human 11/23/09
18 BP1 BP1-19119308 VEGFA 19119308 chip-chip Hs578T human 11/23/09
19 CHD7 CHD7-19251738 APOA1 19251738 chip-chip MESC mouse 11/23/09
20 CHD7 CHD7-19251738 ARHGAP26 19251738 chip-chip MESC mouse 11/23/09
I need to make it look like this:
BP1-19119308-chip-chip-Hs578T-human OR1A1 PTPRE SELE TAC3 VEGFA
CHD7-19251738-chip-chip-MESC-mouse APOA1 ARHGAP26
I did manage to get the C-F-G-H combination into the first column with this:
import csv
out = open ('test.csv','rt', encoding='utf8')
data = csv.reader(out)
output = csv.writer(out)
data = [row for row in data]
new_data = [[row[2]+'-'+row[5]+'-'+row[6] +'-'+ row[7], row[3]] for row in data]
print (new_data)
out = open('new_data.csv','wt')
output = csv.writer(out)
for row in new_data:
    output.writerow(row)
out.close()
A B
BP1-19119308-chip-chip-Hs578T-human OR1A1
BP1-19119308-chip-chip-Hs578T-human PTPRE
BP1-19119308-chip-chip-Hs578T-human SELE
BP1-19119308-chip-chip-Hs578T-human TAC3
BP1-19119308-chip-chip-Hs578T-human VEGFA
CHD7-19251738-chip-chip-MESC-mouse APOA1
CHD7-19251738-chip-chip-MESC-mouse ARHGAP26
CHD7-19251738-chip-chip-MESC-mouse ATP11A
But now I have these duplicates in A and I have no idea how to delete them and transpose all the values in B that were assigned to these duplicates.
I tried looping again to compare the current value to the previous value and I just messed the whole thing up.
Any suggestions?
Upvotes: 2
Views: 55
Reputation: 20890
Use itertools.groupby and operator.itemgetter. Add this to your code after initializing new_data and output:
for k, g in itertools.groupby(new_data, operator.itemgetter(0)):
    row = [k]
    # map() takes the function first, then the iterable
    row.extend(map(operator.itemgetter(1), g))
    output.writerow(row)
The complete improved (refactored) code could look like this:
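import csv
import itertools
import operator

# newline='' is what the csv module recommends for files it reads or writes
with open('test.csv', 'rt', encoding='utf8', newline='') as f_in:
    inp = csv.reader(f_in)
    # Lazy generator: it must be consumed while f_in is still open
    new_data = (('-'.join(operator.itemgetter(2, 5, 6, 7)(row)), row[3])
                for row in inp)
    with open('new_data.csv', 'wt', newline='') as f_out:
        output = csv.writer(f_out)
        # groupby only merges consecutive rows that share a key
        for k, g in itertools.groupby(new_data, operator.itemgetter(0)):
            row = [k]
            row.extend(map(operator.itemgetter(1), g))
            output.writerow(row)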
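One caveat worth illustrating: itertools.groupby only merges consecutive items with equal keys, so the approach above relies on the rows for each identifier sitting next to each other in the file. The sketch below uses a few hypothetical sample pairs (the helper name group_pairs is made up for illustration) to show what happens when a key reappears later, and how sorting first avoids that.

```python
import itertools
import operator

# Hypothetical sample pairs; the last row repeats an earlier key out of order.
pairs = [
    ("BP1-19119308-chip-chip-Hs578T-human", "OR1A1"),
    ("BP1-19119308-chip-chip-Hs578T-human", "PTPRE"),
    ("CHD7-19251738-chip-chip-MESC-mouse", "APOA1"),
    ("BP1-19119308-chip-chip-Hs578T-human", "SELE"),
]


def group_pairs(rows):
    """Return one [key, value, value, ...] list per run of equal keys."""
    get_key = operator.itemgetter(0)
    get_value = operator.itemgetter(1)
    return [[key, *map(get_value, group)]
            for key, group in itertools.groupby(rows, get_key)]


# Without sorting, the BP1 key shows up in two separate groups.
unsorted_result = group_pairs(pairs)

# sorted() is stable, so values keep their original order within each key.
sorted_result = group_pairs(sorted(pairs, key=operator.itemgetter(0)))
```

If the input is already ordered by identifier, as in the sample data from the question, the sort is unnecessary.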
Upvotes: 0
Reputation: 2840
Start from the point where you have the data as shown here:
test.txt
A B
BP1-19119308-chip-chip-Hs578T-human OR1A1
BP1-19119308-chip-chip-Hs578T-human PTPRE
BP1-19119308-chip-chip-Hs578T-human SELE
BP1-19119308-chip-chip-Hs578T-human TAC3
BP1-19119308-chip-chip-Hs578T-human VEGFA
CHD7-19251738-chip-chip-MESC-mouse APOA1
CHD7-19251738-chip-chip-MESC-mouse ARHGAP26
CHD7-19251738-chip-chip-MESC-mouse ATP11A
Now you can use the following code to bring it into the shape you need:
with open("test.txt") as f:
    data = f.readlines()[1:]  # skip the "A B" header line
mydata = [x.strip() for x in data]
final = {}
for x in mydata:
    k, v = x.split()
    if k in final:  # dict.has_key() was removed in Python 3
        final[k].append(v)
    else:
        final[k] = [v]
for d in final:
    print(d, " ".join(final[d]))
Output (keys appear in first-seen order on Python 3.7+):
BP1-19119308-chip-chip-Hs578T-human OR1A1 PTPRE SELE TAC3 VEGFA
CHD7-19251738-chip-chip-MESC-mouse APOA1 ARHGAP26 ATP11A
From here you can write it into a file if you need to.
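As a rough sketch of that last step, the grouped dictionary can be handed to csv.writer, one row per key. The helper name write_groups is hypothetical, and the dictionary below simply repeats the grouped values shown above; an in-memory buffer stands in for the real file so the example can be run anywhere.

```python
import csv
import io

# Grouped data in the same shape as the `final` dictionary above.
final = {
    "BP1-19119308-chip-chip-Hs578T-human": ["OR1A1", "PTPRE", "SELE", "TAC3", "VEGFA"],
    "CHD7-19251738-chip-chip-MESC-mouse": ["APOA1", "ARHGAP26", "ATP11A"],
}


def write_groups(groups, fileobj):
    """Write each key followed by its values as one CSV row."""
    writer = csv.writer(fileobj)
    for key, values in groups.items():
        writer.writerow([key, *values])


# An in-memory buffer is used here; for a real file, something like
# open("new_data.csv", "w", newline="") could be passed instead.
buffer = io.StringIO()
write_groups(final, buffer)
csv_text = buffer.getvalue()
```

Passing the file object in, rather than a path, keeps the helper usable with both real files and buffers.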
Upvotes: 0
Reputation: 5373
You want to use a dictionary. If you're doing further analysis, save the aggregated values in a list for each identifier. Your identifier string is a key, and under each key, you have a list of values.
new_keys = [row[2] + '-' + row[5] + '-' + row[6] + '-' + row[7] for row in data]
new_values = [row[3] for row in data]
aggregate_values = {} # An empty dictionary
# Iterate across the paired lists together
for key, value in zip(new_keys, new_values):
    if key not in aggregate_values:
        aggregate_values[key] = [value]
    else:
        aggregate_values[key].append(value)
# printed output (print is a function in Python 3)
for key in aggregate_values:
    print(key + " " + " ".join(aggregate_values[key]))
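The same key-to-list idea can be written a little more compactly with collections.defaultdict, which creates the empty list the first time a key is seen. This is only a sketch of an alternative, not part of the answer above; the rows in data are a hypothetical stand-in copied from the sample file in the question.

```python
from collections import defaultdict

# Hypothetical stand-in for the rows read from the CSV file in the question.
data = [
    ["14", "BP1", "BP1-19119308", "OR1A1", "19119308", "chip-chip", "Hs578T", "human", "11/23/09"],
    ["15", "BP1", "BP1-19119308", "PTPRE", "19119308", "chip-chip", "Hs578T", "human", "11/23/09"],
    ["19", "CHD7", "CHD7-19251738", "APOA1", "19251738", "chip-chip", "MESC", "mouse", "11/23/09"],
]

# Missing keys start out as an empty list, so no membership test is needed.
aggregate_values = defaultdict(list)
for row in data:
    key = "-".join([row[2], row[5], row[6], row[7]])
    aggregate_values[key].append(row[3])

# One output line per identifier, values separated by spaces.
lines = [key + " " + " ".join(values)
         for key, values in aggregate_values.items()]
```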
Upvotes: 1
Reputation: 146
One solution is to make use of a dictionary while grouping your data:
import csv
with open('test.csv', 'rt', encoding='utf8') as f_in:
    data = [row for row in csv.reader(f_in)]
new_data = [[row[2]+'-'+row[5]+'-'+row[6]+'-'+row[7], row[3]] for row in data]
my_dict = {}
for row in new_data:
    if row[0] in my_dict:
        # Append to the space-separated string already stored for this key
        my_dict[row[0]] += " " + row[1]
    else:
        my_dict[row[0]] = row[1]
new_data = [[my_key, my_dict[my_key]] for my_key in my_dict]
print(new_data)
with open('new_data.csv', 'wt', newline='') as f_out:
    output = csv.writer(f_out)
    for row in new_data:
        output.writerow(row)
Upvotes: 0