Reputation: 153

Count number of characters for each line pyspark

I am able to count the total number of occurrences of each character in the entire document.

My document:

ATATCCCCGGGAT
ATCGATCGATAT

Calculating the total number of each character in the document:

from operator import add
data = sc.textFile("data.txt")
counts = data.flatMap(lambda x: [(c, 1) for c in x]).reduceByKey(add)

Result:

[(u'A', 7), (u'C', 6), (u'T', 7), (u'G', 5)]

My implementation, which counts the characters of each line separately:

counts = data.map(lambda x: [(c, 1) for c in x])
for row in counts.collect():
    print(sc.parallelize(row).reduceByKey(lambda x, y: x + y).collect())

Is there a better way to do it?

Upvotes: 0

Answers (2)

Josemy

Reputation: 838

If what you want is to count the number of characters in each line, and not the number of occurrences of each character per line, this will do the trick:

data.map(lambda x:len(x)).collect()
>>> [13, 12]

If you also want the index of the line alongside the number of characters:

data.map(lambda x:len(x)).zipWithIndex().collect()
>>> [(13, 0), (12, 1)]

Now, to count the number of occurrences of each character in each line, this may help:

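def count_occur(line):
    # Map each distinct character in the line to its number of occurrences.
    # (Names chosen so the built-ins str and dict are not shadowed.)
    occurrences = {}
    for ch in set(line):
        occurrences[ch] = line.count(ch)
    return occurrences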

data.map(lambda x: count_occur(x)).collect()
>>> [{'C': 4, 'T': 3, 'A': 3, 'G': 3}, {'C': 2, 'T': 4, 'A': 4, 'G': 2}]
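As a possible alternative to the hand-written loop above, the standard library's collections.Counter does the same tally. This is only a sketch: char_counts is a name made up for illustration, and a plain list stands in for the RDD so the snippet runs without Spark.

```python
from collections import Counter

def char_counts(line):
    # Counter tallies every character of the line; dict() turns the result
    # into a plain dictionary like the one count_occur returns.
    return dict(Counter(line))

# Local stand-in for the two lines of data.txt (assumption: no Spark here).
lines = ["ATATCCCCGGGAT", "ATCGATCGATAT"]
per_line = [char_counts(line) for line in lines]
print(per_line)
```

With the RDD from the question, this should plug in as data.map(char_counts).collect().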

Again, if you want the index of the line, zipWithIndex does the trick:

data.map(lambda x: count_occur(x)).zipWithIndex().collect()
>>> [({'C': 4, 'T': 3, 'A': 3, 'G': 3}, 0), ({'C': 2, 'T': 4, 'A': 4, 'G': 2}, 1)]
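If the per-line counts should stay distributed instead of being built as one dictionary per line, one option is to key each character by its line index and reduce, much like the reduceByKey in the question. A minimal sketch, where keyed_pairs and reduce_by_key are hypothetical helper names, and reduce_by_key merely imitates Spark's reduceByKey on a local list so the snippet runs without a cluster:

```python
from operator import add

def keyed_pairs(indexed_line):
    # indexed_line is a (line, index) tuple, the shape zipWithIndex produces.
    line, idx = indexed_line
    return [((idx, ch), 1) for ch in line]

def reduce_by_key(pairs, func):
    # Local imitation of reduceByKey: combine the values that share a key.
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out

# Local stand-in for data.zipWithIndex() on the two lines of data.txt.
lines = ["ATATCCCCGGGAT", "ATCGATCGATAT"]
indexed = [(line, i) for i, line in enumerate(lines)]

pairs = [pair for item in indexed for pair in keyed_pairs(item)]
counts_by_line = reduce_by_key(pairs, add)
```

On the real RDD the equivalent would be data.zipWithIndex().flatMap(keyed_pairs).reduceByKey(add), which yields one ((line_index, character), count) pair per line and character.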

Hope it helps.

Upvotes: 2

user6022341

Reputation:

To get the total number of characters from the counts RDD built with reduceByKey in the question, try:

>>> counts.values().sum()
25

>>> sum(counts.collectAsMap().values())
25

Upvotes: 2
