Reputation: 43

Iterate and discard sequential duplicates

I'm new to Python (and to coding in general), so I'm sorry if this gets a bit confusing.

I have a CSV file like the following:

A    B     C              D        E            F         G      H
14  BP1 BP1-19119308    OR1A1   19119308    chip-chip   Hs578T  human   11/23/09 
15  BP1 BP1-19119308    PTPRE   19119308    chip-chip   Hs578T  human   11/23/09 
16  BP1 BP1-19119308    SELE    19119308    chip-chip   Hs578T  human   11/23/09 
17  BP1 BP1-19119308    TAC3    19119308    chip-chip   Hs578T  human   11/23/09 
18  BP1 BP1-19119308    VEGFA   19119308    chip-chip   Hs578T  human   11/23/09 
19  CHD7 CHD7-19251738  APOA1   19251738    chip-chip   MESC    mouse   11/23/09 
20  CHD7 CHD7-19251738  ARHGAP26 19251738   chip-chip   MESC    mouse  11/23/09

And I need to make it look like this

BP1-19119308-chip-chip-Hs578T-human OR1A1 PTPRE SELE TAC3 VEGFA 
CHD7-19251738-chip-chip-MESC-mouse  APOA1 ARHGAP26

I did manage to get C-F-G-H into the first column with this:

import csv

out = open ('test.csv','rt', encoding='utf8') 
data =  csv.reader(out)
output = csv.writer(out) 

data = [row for row in data]
new_data = [[row[2]+'-'+row[5]+'-'+row[6] +'-'+ row[7], row[3]] for row in data] 

print (new_data)

out = open('new_data.csv','wt') 
output = csv.writer(out)  

for row in new_data:
   output.writerow(row)    

out.close()





A                                  B
BP1-19119308-chip-chip-Hs578T-human OR1A1
BP1-19119308-chip-chip-Hs578T-human PTPRE
BP1-19119308-chip-chip-Hs578T-human SELE
BP1-19119308-chip-chip-Hs578T-human TAC3
BP1-19119308-chip-chip-Hs578T-human VEGFA
CHD7-19251738-chip-chip-MESC-mouse  APOA1
CHD7-19251738-chip-chip-MESC-mouse  ARHGAP26
CHD7-19251738-chip-chip-MESC-mouse  ATP11A

But now I have these duplicates in A and I have no idea how to delete them and transpose all the values in B that were assigned to these duplicates.

I tried looping again to compare the current value to the previous value and I just messed the whole thing up.

Any suggestions?

Upvotes: 2

Answers (4)

Cristian Ciupitu

Reputation: 20890

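Use itertools.groupby and operator.itemgetter. groupby merges only consecutive rows that share a key, which matches your data. Import itertools and operator, then add this to your code after initializing new_data and output:

for k, g in itertools.groupby(new_data, operator.itemgetter(0)):
    row = [k]
    # map takes the function first, then the iterable
    row.extend(map(operator.itemgetter(1), g))
    output.writerow(row)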

The complete improved (refactored) code could look like this:

import csv
import itertools
import operator

with open('test.csv', 'rt', encoding='utf8') as f_in:
    inp = csv.reader(f_in)
    # Build (key, value) pairs lazily; columns 2, 5, 6 and 7 form the key
    new_data = (('-'.join(operator.itemgetter(2, 5, 6, 7)(row)), row[3])
                for row in inp)
    # newline='' is recommended by the csv docs for files given to csv.writer
    with open('new_data.csv', 'wt', newline='') as f_out:
        output = csv.writer(f_out)
        for k, g in itertools.groupby(new_data, operator.itemgetter(0)):
            row = [k]
            row.extend(map(operator.itemgetter(1), g))
            output.writerow(row)
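
To see what groupby does here without touching any files, a minimal in-memory sketch may help. The sample rows are copied from the question, and the function name group_consecutive is hypothetical. Since groupby only merges adjacent rows with equal keys, this assumes the input is already grouped by key, as it is in the question.

```python
import itertools
import operator


def group_consecutive(pairs):
    """Collapse consecutive (key, value) pairs sharing a key into [key, v1, v2, ...]."""
    rows = []
    for key, group in itertools.groupby(pairs, key=operator.itemgetter(0)):
        rows.append([key] + [value for _, value in group])
    return rows


# Sample (key, value) pairs taken from the question
sample = [
    ("BP1-19119308-chip-chip-Hs578T-human", "OR1A1"),
    ("BP1-19119308-chip-chip-Hs578T-human", "PTPRE"),
    ("CHD7-19251738-chip-chip-MESC-mouse", "APOA1"),
    ("CHD7-19251738-chip-chip-MESC-mouse", "ARHGAP26"),
]

for row in group_consecutive(sample):
    print(" ".join(row))
```

If the same key can reappear later in the file, sorting the pairs by key first (or using a dictionary) would be needed to merge those rows as well.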

Upvotes: 0

flamenco

Reputation: 2840

Start from the point you have already reached, saved here as test.txt:

A                                   B
BP1-19119308-chip-chip-Hs578T-human OR1A1
BP1-19119308-chip-chip-Hs578T-human PTPRE
BP1-19119308-chip-chip-Hs578T-human SELE
BP1-19119308-chip-chip-Hs578T-human TAC3
BP1-19119308-chip-chip-Hs578T-human VEGFA
CHD7-19251738-chip-chip-MESC-mouse  APOA1
CHD7-19251738-chip-chip-MESC-mouse  ARHGAP26
CHD7-19251738-chip-chip-MESC-mouse  ATP11A

Now you can use the following code to bring it into the shape you need:

with open("test.txt") as f:
    data = f.readlines()[1:]  # skip the header line
mydata = [x.strip() for x in data]

final = {}

for x in mydata:
    k, v = x.split()
    if k in final:  # dict.has_key() was removed in Python 3
        final[k].append(v)
    else:
        final[k] = [v]

for d in final:
    print(d, " ".join(final[d]))

Output (on Python 3.7+, where dictionaries keep insertion order):

BP1-19119308-chip-chip-Hs578T-human OR1A1 PTPRE SELE TAC3 VEGFA
CHD7-19251738-chip-chip-MESC-mouse APOA1 ARHGAP26 ATP11A

From here you can write it into a file if you need to.
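
Writing the result out could look like the sketch below. It assumes a dictionary shaped like final above (a small literal stands in for it here), uses a hypothetical helper named write_grouped, and borrows the output filename new_data.csv from the question. Each value ends up in its own CSV column.

```python
import csv


def write_grouped(grouped, path):
    """Write one CSV row per key: the key first, then each of its values."""
    # newline='' is what the csv module docs recommend for csv.writer files
    with open(path, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        for key, values in grouped.items():
            writer.writerow([key] + values)


# Stand-in for the dictionary built by the grouping loop
final = {
    "BP1-19119308-chip-chip-Hs578T-human": ["OR1A1", "PTPRE", "SELE", "TAC3", "VEGFA"],
    "CHD7-19251738-chip-chip-MESC-mouse": ["APOA1", "ARHGAP26", "ATP11A"],
}
write_grouped(final, "new_data.csv")
```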

Upvotes: 0

Bennett Brown

Reputation: 5373

You want to use a dictionary. If you're doing further analysis, save the aggregated values in a list for each identifier. Your identifier string is a key, and under each key, you have a list of values.

new_keys = [row[2] + '-' + row[5] + '-' + row[6] + '-' + row[7] for row in data] 
new_values = [row[3] for row in data]

aggregate_values = {} # An empty dictionary
# Iterate across the paired lists together
for key, value in zip(new_keys, new_values): 
    if key not in aggregate_values:
        aggregate_values[key] = [value]
    else: 
        aggregate_values[key].append(value)

# printed output (print is a function in Python 3)
for key in aggregate_values:
    print(key + " " + " ".join(aggregate_values[key]))
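
The if/else in the loop can also be expressed with collections.defaultdict from the standard library, which creates the empty list the first time a key is seen. A minimal sketch, using sample pairs from the question (deliberately interleaved) in place of new_keys and new_values; the function name aggregate is hypothetical.

```python
from collections import defaultdict


def aggregate(pairs):
    """Group values by key; equal keys need not be adjacent in the input."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


pairs = [
    ("BP1-19119308-chip-chip-Hs578T-human", "OR1A1"),
    ("CHD7-19251738-chip-chip-MESC-mouse", "APOA1"),
    ("BP1-19119308-chip-chip-Hs578T-human", "PTPRE"),
]

for key, values in aggregate(pairs).items():
    print(key, " ".join(values))
```

Unlike a groupby-based approach, a dictionary merges equal keys even when the rows are not next to each other.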

Upvotes: 1

Lau-Rent

Reputation: 146

One solution is to make use of a dictionary while grouping your data:

import csv

# Read all rows up front; the input file is closed when the block ends
with open('test.csv', 'rt', encoding='utf8') as f_in:
    data = list(csv.reader(f_in))
new_data = [[row[2]+'-'+row[5]+'-'+row[6] +'-'+ row[7], row[3]] for row in data] 

my_dict = {}
for row in new_data:
   if row[0] in my_dict:
      my_dict[row[0]] += " " + row[1]
   else:
      my_dict[row[0]] = row[1]

new_data = [[my_key,my_dict[my_key]] for my_key in my_dict]

print (new_data)

out = open('new_data.csv','wt') 
output = csv.writer(out)  

for row in new_data:
   output.writerow(row)    

out.close()

Upvotes: 0
