Reputation: 153
I am able to count the total number of occurrences of each character in the entire document.
My document:
ATATCCCCGGGAT
ATCGATCGATAT
Calculating the total count of each character in the document:
from operator import add
data=sc.textFile("data.txt")
counts=data.flatMap(lambda x:[(c,1) for c in x]).reduceByKey(add)
Result:
[(u'A', 7), (u'C', 6), (u'T', 7), (u'G', 5)]
My implementation for counting per line:
counts=data.map(lambda x:[(c,1)for c in x])
for row in counts.collect():
    print(sc.parallelize(row).reduceByKey(lambda x,y:x+y).collect())
Is there a better way to do it?
Upvotes: 0
Views: 8977
Reputation: 838
If what you want is to "Count the number of characters for each line with pyspark", and not the count of each character on each line, this will do the trick:
data.map(lambda x:len(x)).collect()
>>> [13, 12]
If you also want the index of each line alongside its character count:
data.map(lambda x:len(x)).zipWithIndex().collect()
>>> [(13, 0), (12, 1)]
Now, to count the occurrences of each character on each line, this may help:
def count_occur(line):
    # Count how often each distinct character appears in the line
    occurrences = {}
    for key in set(line):
        occurrences[key] = line.count(key)
    return occurrences
data.map(lambda x: count_occur(x)).collect()
>>> [{'C': 4, 'T': 3, 'A': 3, 'G': 3}, {'C': 2, 'T': 4, 'A': 4, 'G': 2}]
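As a possible alternative to the hand-written loop, the standard library's collections.Counter can do the per-line tally. This is only a sketch: count_chars and sample are names made up for illustration, and the two strings are the lines from the question, used here as a local stand-in for the RDD.

```python
from collections import Counter

def count_chars(line):
    # Counter tallies every character of the line in a single pass;
    # dict() converts the result to a plain dictionary.
    return dict(Counter(line))

# Local stand-in for the two lines of data.txt from the question
sample = ["ATATCCCCGGGAT", "ATCGATCGATAT"]
per_line = [count_chars(line) for line in sample]
```

With the RDD from the question, data.map(count_chars).collect() should give the same per-line dictionaries.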
Again, if you want the index of the line, zipWithIndex does the trick:
data.map(lambda x: count_occur(x)).zipWithIndex().collect()
>>> [({'C': 4, 'T': 3, 'A': 3, 'G': 3}, 0), ({'C': 2, 'T': 4, 'A': 4, 'G': 2}, 1)]
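If you would rather keep the counting inside reduceByKey, as in the question, one option is to make the line index part of the key. The sketch below shows the idea under some assumptions: keyed_pairs is a hypothetical helper, and a local Counter loop stands in for reduceByKey so the example runs without a Spark context.

```python
from collections import Counter

def keyed_pairs(indexed_line):
    # zipWithIndex yields (element, index), so unpack in that order
    line, idx = indexed_line
    # Emit one ((line_index, character), 1) pair per character
    return [((idx, c), 1) for c in line]

# Assumed Spark usage (not executed here):
#   data.zipWithIndex().flatMap(keyed_pairs).reduceByKey(lambda a, b: a + b)

# Local stand-in: the two lines from the question, paired with their index
lines = ["ATATCCCCGGGAT", "ATCGATCGATAT"]
totals = Counter()
for indexed in zip(lines, range(len(lines))):
    for key, one in keyed_pairs(indexed):
        totals[key] += one  # same effect as summing values per key
```

Each key is a (line index, character) tuple, so the counts stay distributed and there is no need to collect rows and call sc.parallelize on each one.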
Hope it helps.
Upvotes: 2
Reputation:
Try:
>>> counts.values().sum()
25
or
>>> sum(counts.collectAsMap().values())
25
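Note that both calls return the grand total over the whole document, not a per-line figure. A plain-Python sketch of the same arithmetic, using the two lines from the question as a local stand-in for the RDD (lines, totals and total_chars are illustrative names):

```python
from collections import Counter

# Local stand-in for the two lines of data.txt from the question
lines = ["ATATCCCCGGGAT", "ATCGATCGATAT"]

# Per-character totals over the whole document, like the reduceByKey result
totals = Counter("".join(lines))

# Summing the values gives the number of characters in the document
total_chars = sum(totals.values())
```

The sum equals the combined length of the two lines (13 + 12).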
Upvotes: 2