Reputation: 25
Problem: Suppose I have a text file containing data like this:
TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT
I want to count the occurrences of 'A', 'T', 'AAA', etc. in it.
My Approach
val source = scala.io.Source.fromFile(filePath)
val lines = source.getLines().filter(char => char != '\n')
for (line <- lines) {
val aList = line.filter(ele => ele == 'A')
println(aList)
}
This will give me output like
AAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAA
My Question: How can I find the total count of occurrences of 'A', 'T', 'AAA', etc. here? Can I use map/reduce functions for that, and if so, how?
Upvotes: 2
Views: 435
Reputation: 12804
In general, regular expressions are a very good tool for finding sequences of characters in a string.
You can use the r method, available on strings through an implicit conversion, to turn a string into a pattern, e.g.
val pattern = "AAA".r
Using it is then fairly easy. Assuming your sample input
val input =
"""TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT"""
Counting the number of occurrences of a pattern is straightforward and very readable:
pattern.findAllIn(input).size // returns 4
The iterator returned by regular expression operations can also be used for more complex tasks via the matchData method, e.g. printing the index of each match:
pattern. // this code would print the following lines
findAllIn(input). // 98
matchData. // 125
map(_.start). // 131
foreach(println) // 165
You can read more about Regex in Scala in the API docs (here for version 2.13.1).
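One thing to keep in mind: findAllIn reports non-overlapping matches, so a run such as AAAA counts as a single AAA. If overlapping occurrences should count as well, a zero-width lookahead is one option. The following is only a sketch; the object and method names (OccurrenceCount, countNonOverlapping, countOverlapping) are made up for illustration.

```scala
import java.util.regex.Pattern

object OccurrenceCount {
  // Non-overlapping matches, as with "AAA".r above; quote() makes the target a literal.
  def countNonOverlapping(text: String, target: String): Int =
    Pattern.quote(target).r.findAllIn(text).size

  // Overlapping matches: the lookahead consumes no characters,
  // so every start position in the text is tried.
  def countOverlapping(text: String, target: String): Int =
    s"(?=${Pattern.quote(target)})".r.findAllIn(text).size
}
```

Which of the two is the right count depends on what "occurrence" means for your data.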
Upvotes: 0
Reputation: 14803
There is even a shorter way:
lines.map(_.count(_ == 'A')).sum
This counts the A characters in each line and sums the results.
By the way, there is no filter needed here, because getLines() already strips the line separators:
val lines = source.getLines()
And as Leo C mentioned in his comment, if you start with Source.fromFile(filePath), it can be as simple as this:
source.count(_ == 'A')
As SoleQuantum mentions in his comment, he wants to call count more than once. The problem is that source is a BufferedSource, which is not a Collection but just an Iterator, and an Iterator can only be traversed once.
So if you want to use the source more than once, you have to convert it to a Collection first.
Your example:
val stream = Source.fromResource("yourdata").mkString
stream.count(_ == 'A') // 48
stream.count(_ == 'T') // 65
Remark: String is a Collection of Chars.
For more information check: iterators
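To make the single-use behaviour concrete, here is a small sketch that counts twice from an iterator and twice from a String. It uses a plain String iterator as a stand-in for the BufferedSource, and the names (IteratorOnce, countTwiceFromIterator, countTwiceFromString) are hypothetical.

```scala
object IteratorOnce {
  // Counts 'A' and then 'T' from the same iterator.
  // The first count consumes the iterator, so the second one finds nothing left.
  def countTwiceFromIterator(text: String): (Int, Int) = {
    val it = text.iterator
    val as = it.count(_ == 'A')
    val ts = it.count(_ == 'T') // iterator is already exhausted here
    (as, ts)
  }

  // A String is a collection, so it can be traversed as often as needed.
  def countTwiceFromString(text: String): (Int, Int) =
    (text.count(_ == 'A'), text.count(_ == 'T'))
}
```

The collections documentation treats reusing an iterator after such a call as unsupported, so the second count should not be relied on for anything.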
And here is the solution to get the count for all Chars:
stream.toSeq
  .filterNot(_ == '\n') // filter new lines
  .groupBy(identity) // group by each char > HashMap(T -> TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT, A -> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA, G -> GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG, C -> CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
  .view.mapValues(_.length) // count each group
  .toMap // Map(T -> 65, A -> 48, G -> 36, C -> 61)
Or as suggested by jwvh:
stream
  .filterNot(_ == '\n')
  .groupMapReduce(identity)(_ => 1)(_ + _)
This is Scala 2.13, let me know if you have problems with your Scala version.
Ok after the last update of the question:
stream.toSeq
.filterNot(_ == '\n') // filter new lines
.foldLeft(("", Map.empty[String, Int])){case ((a, m), c ) =>
if(a.contains(c))
(a + c, m)
else
(s"$c",
m.updated(a, m.get(a).map(_ + 1).getOrElse(1)))
}._2 // you only want the Map -> HashMap( -> 1, CCCC -> 1, A -> 25, GGG -> 1, AA -> 4, GG -> 3, GGGGG -> 1, AAA -> 5, CCC -> 1, TTTT -> 1, T -> 34, CC -> 9, TTT -> 4, G -> 22, CCCCC -> 1, C -> 31, TT -> 7)
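Short explanation: the foldLeft carries a pair made of the current run of identical characters and a Map of run counts. While the next character matches the run, the run is extended; when it differs, the finished run is counted in the Map and a new run starts.
It is quite complex, so let me know if you need more help.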
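As written, the fold above also records the empty starting accumulator (the " -> 1" entry in the output) and, as far as I can tell, never records the final run. Below is a sketch of a variant that handles both cases. The names RunCounts, runCounts and bump are made up for illustration, and like the original it removes newlines first, so a run may span a line break.

```scala
object RunCounts {
  // Counts maximal runs of identical characters,
  // e.g. "AAATAA" gives Map("AAA" -> 1, "T" -> 1, "AA" -> 1).
  def runCounts(text: String): Map[String, Int] = {
    val (lastRun, counts) =
      text.filterNot(_ == '\n').foldLeft(("", Map.empty[String, Int])) {
        case ((run, m), c) =>
          if (run.isEmpty || run.head == c) (run + c, m) // extend the current run
          else (c.toString, bump(m, run))                // close the finished run, start a new one
      }
    // The last run is still pending when the fold ends, so record it here.
    if (lastRun.isEmpty) counts else bump(counts, lastRun)
  }

  // Increments the count stored under `key`, starting from 0 if absent.
  private def bump(m: Map[String, Int], key: String): Map[String, Int] =
    m.updated(key, m.getOrElse(key, 0) + 1)
}
```

Calling RunCounts.runCounts(stream) would then give the run counts without the empty key.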
Upvotes: 3
Reputation: 3191
You can get the count by doing the following:
lines.flatten.filter(_ == 'A').size
Upvotes: 0
Reputation: 2686
You can use the partition method and then take the length of the first part (assuming x holds the text as a String):
val y = x.partition(_ == 'A')._1.length
Upvotes: 0
Reputation: 465
Since scala.io.Source.fromFile(filePath) produces a stream of chars, you can use the count(Char => Boolean) function directly on your source object.
val source = scala.io.Source.fromFile(filePath)
val result = source.count(_ == 'A')
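If you go this route, it is probably worth closing the file once the count is done. A minimal sketch using scala.util.Using (available from Scala 2.13), where FileCharCount and countChar are hypothetical names:

```scala
import scala.io.Source
import scala.util.Using

object FileCharCount {
  // Counts occurrences of `target` in the file at `path`.
  // Using.resource closes the source afterwards, even if counting throws.
  def countChar(path: String, target: Char): Int =
    Using.resource(Source.fromFile(path))(_.count(_ == target))
}
```

Each call reopens the file, so counting several different characters this way reads the file once per character.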
Upvotes: 1