Reputation: 3383
I have this type of string:
sheet = """
magenta
turquoise,PF00575
tan,PF00154,PF06745,PF08423,PF13481,PF14520
turquoise, PF00011
NULL
"""
Every line starts with an identifier (e.g. tan, magenta, ...). What I want is to count the number of occurrences of each PF-number per identifier.
So, the final structure would be something like this:
        magenta turquoise tan NULL
PF00575 0       1         0   0
PF00154 0       0         1   0
PF06745 0       0         1   0
PF08423 0       0         1   0
PF13481 0       0         1   0
PF14520 0       0         1   0
PF00011 0       1         0   0
I started by making a dictionary where the first word on each line is a key, and I want the PF-numbers that follow it as the values.
When I use this code, each key's value is a list of lists rather than a flat list of PF-numbers:
from collections import defaultdict

lines = []
lines.append(sheet.split("\n"))
flattened = []
flattened = [val for sublist in lines for val in sublist]
pfams = []
for i in flattened:
    pfams.append(i.split(","))
d = defaultdict(list)
for i in pfams:
    pfam = i[0]
    d[pfam].append(i[1:])
So, the result is this:
defaultdict(<type 'list'>, {'': [[], []], 'magenta': [[]], 'NULL': [[]], 'turquoise': [['PF00575']], 'tan': [['PF00154', 'PF06745', 'PF08423', 'PF13481', 'PF14520']]})
How can I split up the PF-numbers so that they are separate values in the dictionary, and then count the number of occurrences of each unique PF-number per key?
Upvotes: 0
Views: 100
Reputation: 3383
With thanks to dwblas on devshed, this is the most efficient way I've found to tackle the task:
I build a dictionary keyed by PF-number whose value is a list of counts, one per color, plus a list of colors in the order I want them printed.
colors_list= ['cyan','darkorange','greenyellow','yellow','magenta','blue','green','midnightblue','brown','darkred','lightcyan','lightgreen','darkgreen','royalblue','orange','purple','tan','grey60','darkturquoise','red','lightyellow','darkgrey','turquoise','salmon','black','pink','grey','null']
lines = sheet.splitlines()
counts = {}
for line in lines:
    parts = line.split(",")
    if len(parts) > 1:
        ## normalize the color once per line, not once per PF-number
        color = parts[0].strip().lower()
        for key in parts[1:]:  ## skip color
            key = key.strip()
            if key not in counts:
                ## new key and list of zeroes-print it if you want to verify
                counts[key] = [0 for ctr in range(len(colors_list))]
            if color in colors_list:  ## color found
                ## offset number/location of this color in list
                el_number = colors_list.index(color)
                counts[key][el_number] += 1
            else:
                ## list.index() raises ValueError for a missing color, so test membership first
                print("some error message")
import csv
## "wb" is the Python 2 mode; on Python 3 use open("out.csv", "w", newline="")
with open("out.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow(["PFAM"] + colors_list)
    for pfam in counts:
        writer.writerow([pfam] + counts[pfam])
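A possible variation, not part of dwblas's code: a dictionary from color to column position avoids rescanning colors_list for every PF-number. This is only a minimal sketch; the helper name count_pfams and the shortened color list are assumptions made for illustration.

```python
def count_pfams(sheet, colors_list):
    """Return {pf_number: [count per color, in colors_list order]}."""
    # Map each color to its column once instead of calling list.index repeatedly.
    column = {name: i for i, name in enumerate(colors_list)}
    counts = {}
    for line in sheet.splitlines():
        parts = [p.strip() for p in line.split(",")]
        color = parts[0].lower()
        if len(parts) < 2 or color not in column:
            continue  # no PF-numbers on this line, or an unknown color
        for key in parts[1:]:
            row = counts.setdefault(key, [0] * len(colors_list))
            row[column[color]] += 1
    return counts


# Shortened sample data and color list (assumed for this example only).
sheet = """
magenta
turquoise,PF00575
tan,PF00154,PF06745
turquoise, PF00011
NULL
"""
counts = count_pfams(sheet, ["magenta", "turquoise", "tan", "null"])
print(counts["PF00575"])
```

The returned rows have the same shape as counts above, so the csv-writing loop should work on them unchanged.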
Upvotes: 0
Reputation: 4449
Use collections.Counter
(https://docs.python.org/2/library/collections.html#collections.Counter)
import collections
sheet = """
magenta
turquoise,PF00575
tan,PF00154,PF06745,PF08423,PF13481,PF14520
NULL
"""
acc = {}
for line in sheet.split('\n'):
    if line == "NULL":
        continue
    parts = line.split(',')
    # parts[1:] is the list of PF values; parts[1] alone would count characters
    acc[parts[0]] = collections.Counter(parts[1:])
EDIT: Now with accumulating all PF values for each key
acc = collections.defaultdict(list)
for line in sheet.split('\n'):
    if line == "NULL":
        continue
    parts = line.split(',')
    acc[parts[0]] += parts[1:]
# .items() works on both Python 2 and Python 3
acc = {k: collections.Counter(v) for k, v in acc.items()}
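The nested lists in the question come from calling append with a whole slice; += extends the list element by element instead. A small sketch of the difference, using two made-up input lines:

```python
import collections

nested = collections.defaultdict(list)
flat = collections.defaultdict(list)

for line in ["turquoise,PF00575", "turquoise,PF00011"]:
    parts = line.split(",")
    nested[parts[0]].append(parts[1:])  # adds the whole slice as one element
    flat[parts[0]] += parts[1:]         # extends the list element by element

print(nested["turquoise"])  # [['PF00575'], ['PF00011']]
print(flat["turquoise"])    # ['PF00575', 'PF00011']
```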
Final edit: count the occurrences of each colour per PF value, which is what we were after all along:
acc = collections.defaultdict(list)
for line in sheet.split('\n'):
    if line == "NULL":
        continue
    parts = line.split(',')
    for pfval in parts[1:]:
        acc[pfval] += [parts[0]]
acc = {k: collections.Counter(v) for k, v in acc.items()}
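To get from a dictionary of Counters to the table layout in the question, one option is to read each Counter per colour, relying on the fact that a Counter returns 0 for a missing key. A minimal sketch, assuming a hypothetical helper table_rows and a small hand-built acc:

```python
import collections


def table_rows(acc, colours):
    """Return a header row followed by one row of counts per PF value."""
    rows = [["PFAM"] + list(colours)]
    for pfval in sorted(acc):
        # A Counter yields 0 for colours that never occurred with this PF value.
        rows.append([pfval] + [acc[pfval][colour] for colour in colours])
    return rows


# Hand-built example data (assumed for illustration).
acc = {
    "PF00575": collections.Counter(["turquoise"]),
    "PF00154": collections.Counter(["tan"]),
}
for row in table_rows(acc, ["magenta", "turquoise", "tan"]):
    print("\t".join(str(cell) for cell in row))
```

The column order is passed in explicitly because a plain dict of Counters does not record the order in which the colours should be printed.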
Upvotes: 1