SoleQuantum

Reputation: 25

Scala: find character occurrences from a file

Problem: suppose I have a text file containing data like

TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT

And I want to find occurrences of characters or sequences such as 'A', 'T', 'AAA', etc. in it.

My Approach

  val source = scala.io.Source.fromFile(filePath)
  val lines = source.getLines().filter(char => char != '\n')

  for (line <- lines) {
    val aList = line.filter(ele => ele == 'A')
    println(aList)

  }

This will give me output like

AAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAA

My Question: How can I find the total count of occurrences of 'A', 'T', 'AAA', etc. here? Can I use map/reduce functions for that? How?

Upvotes: 2

Views: 435

Answers (5)

stefanobaghino

Reputation: 12804

In general regular expressions are a very good tool to find sequences of characters in a string.

You can use the r method, defined with an implicit conversion over strings, to turn a string into a pattern, e.g.

val pattern = "AAA".r

Using it is then fairly easy. Assuming your sample input

val input =
  """TATTGCTTTGTGCTCTCACCTCTGATTTTACTGGGGGCTGTCCCCCACCACCGTCTCGCTCTCTCTGTCA
  AAGAGTTAACTTACAGCTCCAATTCATAAAGTTCCTGGGCAATTAGGAGTGTTTAAATCCAAACCCCTCA
  GATGGCTCTCTAACTCGCCTGACAAATTTACCCGGACTCCTACAGCTATGCATATGATTGTTTACAGCCT"""

Counting the number of occurrences of a pattern is straightforward and very readable:

pattern.findAllIn(input).size // returns 4
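One detail worth knowing when counting sequences like 'AAA': findAllIn reports non-overlapping matches, so "AAAA" contains one match of "AAA", not two. If overlapping occurrences should be counted as well, a zero-width lookahead is one possible approach. The following is only a sketch; the object and method names are made up for illustration and are not part of the answer above.

```scala
import java.util.regex.Pattern

// Hypothetical helper object, for illustration only.
object OverlapCount {
  // Non-overlapping count: after a match, the search resumes past the matched text.
  def countNonOverlapping(input: String, needle: String): Int =
    Pattern.quote(needle).r.findAllIn(input).size

  // Overlapping count: the lookahead (?=...) matches at every position where
  // the needle starts, without consuming any characters.
  def countOverlapping(input: String, needle: String): Int =
    s"(?=${Pattern.quote(needle)})".r.findAllIn(input).size
}
```

For example, OverlapCount.countNonOverlapping("AAAA", "AAA") gives 1, while OverlapCount.countOverlapping("AAAA", "AAA") gives 2. Pattern.quote is used so that the search text is treated literally rather than as a regular expression.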

The iterator returned by regular expression operations can also be used for more complex operations using the matchData method, e.g. printing the index of each match:

pattern.                      // this code would print the following lines
  findAllIn(input).           // 98
  matchData.                  // 125
  map(_.start).               // 131
  foreach(println)            // 165

You can read more on Regex in Scala on the API docs (here for version 2.13.1)

Upvotes: 0

pme

Reputation: 14803

There is even a shorter way:

lines.map(_.count(_ == 'A')).sum

This counts the occurrences of A in each line and sums the results.

By the way, no filter is needed here, because getLines() already strips the line breaks:

val lines = source.getLines()

And as Leo C mentioned in his comment, if you start with Source.fromFile(filePath) it can be just like this:

 source.count(_ == 'A')

As SoleQuantum mentions in his comment, he wants to call count more than once. The problem here is that source is a BufferedSource, which is not a Collection but just an Iterator, and an Iterator can only be used (iterated) once.

So if you want to use the source more than once, you have to convert it to a Collection first.

Your example:

  import scala.io.Source

  // Read the whole resource into a String, which can be traversed repeatedly
  val stream = Source.fromResource("yourdata").mkString
  stream.count(_ == 'A') // 48
  stream.count(_ == 'T') // 65

Remark: String is a Collection of Chars.

For more information check: iterators
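To make the difference concrete, here is a small sketch that counts twice on a Source and twice on a String built from the same data. It uses Source.fromString with a short made-up input instead of a file; the object and method names are hypothetical.

```scala
import scala.io.Source

// Hypothetical demo object, for illustration only.
object SourceOnce {
  // Counting directly on the Source: the first count consumes the iterator,
  // so the second count sees no characters at all.
  def countOnSource(data: String): (Int, Int) = {
    val source = Source.fromString(data)
    val as = source.count(_ == 'A')
    val ts = source.count(_ == 'T')
    (as, ts)
  }

  // Counting on a String: mkString materializes the characters once,
  // and the String can then be traversed as often as needed.
  def countOnString(data: String): (Int, Int) = {
    val text = Source.fromString(data).mkString
    (text.count(_ == 'A'), text.count(_ == 'T'))
  }
}
```

With the input "TATA", SourceOnce.countOnSource returns (2, 0), whereas SourceOnce.countOnString returns (2, 2).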

And here is the solution to get the count for all Chars:

stream.toSeq
    .filterNot(_ == '\n')       // filter new lines
    .groupBy(identity)          // group by each char > HashMap(T -> TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT, A -> AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA, G -> GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG, C -> CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
    .view.mapValues(_.length)   // count each group
    .toMap                      // Map(T -> 65, A -> 48, G -> 36, C -> 61)

Or as suggested by jwvh:

stream
    .filterNot(_ == '\n')
    .groupMapReduce(identity)(_ => 1)(_ + _)

This is Scala 2.13, let me know if you have problems with your Scala version.

Ok after the last update of the question:

stream.toSeq
    .filterNot(_ == '\n')       // filter new lines
    .foldLeft(("", Map.empty[String, Int])){case ((a, m), c ) =>
        if(a.contains(c))
          (a + c, m)
        else
          (s"$c", 
           m.updated(a, m.get(a).map(_ + 1).getOrElse(1)))
      }._2 // you only want the Map -> HashMap( -> 1, CCCC -> 1, A -> 25, GGG -> 1, AA -> 4, GG -> 3, GGGGG -> 1, AAA -> 5, CCC -> 1, TTTT -> 1, T -> 34, CC -> 9, TTT -> 4, G -> 22, CCCCC -> 1, C -> 31, TT -> 7)

Short explanation:

  • The solution uses a foldLeft.
  • The initial value is a pair:
    • a String that holds the current run of characters (empty at the start)
    • a Map with the Strings and their counts (empty at the start)
  • We have 2 main cases:
    • the character is the same as the ones in the current String: just append the character to the current String.
    • the character is different: update the Map with the current String; the new character is now the current String.

Quite complex, let me know if you need more help.
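Note that the fold as written records the empty start String once (the " -> 1" entry in the output) and never adds the final run to the Map, since a run is only stored when the next different character arrives. A possible variant that handles both ends is sketched below; the names countRuns, run and lastRun are made up for illustration, and this is one way to do it rather than the author's method.

```scala
// Hypothetical helper object, for illustration only.
object RunCounter {
  // Counts maximal runs of identical characters, ignoring line breaks.
  def countRuns(text: String): Map[String, Int] = {
    val (lastRun, counts) =
      text.filterNot(_ == '\n').foldLeft(("", Map.empty[String, Int])) {
        case ((run, m), c) =>
          if (run.isEmpty || run.head == c)
            (run + c, m) // start or extend the current run
          else
            (c.toString, m.updated(run, m.getOrElse(run, 0) + 1)) // store the finished run
      }
    // Store the final run, which the fold itself never closes.
    if (lastRun.isEmpty) counts
    else counts.updated(lastRun, counts.getOrElse(lastRun, 0) + 1)
  }
}
```

For example, RunCounter.countRuns("AATAAA") yields a Map with AA -> 1, T -> 1 and AAA -> 1. Like the fold above, this counts whole runs, so an "AAAA" run is counted under AAAA and not as an occurrence of AAA.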

Upvotes: 3

senjin.hajrulahovic

Reputation: 3191

You can get the count by doing the following:

lines.flatten.filter(_ == 'A').size

Upvotes: 0

Raman Mishra

Reputation: 2686

You can use the partition method and then take the length of the first part (the elements that match the predicate).

val y = x.partition(_ == 'A')._1.length

Upvotes: 0

jker

Reputation: 465

Since scala.io.Source.fromFile(filePath) produces a stream of characters, you can call the count(Char => Boolean) function directly on your source object.

val source = scala.io.Source.fromFile(filePath)
val result = source.count(_ == 'A')

Upvotes: 1
