Joe

Reputation: 817

Function to return List of Map while iterating over String, kmer count

I am working on creating a k-mer frequency counter (similar to word count in Hadoop) written in Scala. I'm fairly new to Scala, but I have some programming experience.

The input is a text file containing a gene sequence and my task is to get the frequency of each k-mer where k is some specified length of the sequence.

For example, the sequence AGCTTTC has three 5-mers (AGCTT, GCTTT, CTTTC).

I've parsed the input and joined it into one huge string holding the entire sequence. The newlines have to go because they throw off the k-mer counting: the end of one line's sequence should still form a k-mer with the beginning of the next line's sequence.

Now I am trying to write a function that will generate a list of maps, List[Map[String, Int]], with which it should be easy to use Scala's groupBy function to get the count of the common k-mers.

import scala.io.Source

object Main {
  def main(args: Array[String]) {

    // Get all of the lines from the input file
    val input = Source.fromFile("input.txt").getLines.toArray

    // Create one huge string which contains all the lines but the first
    val lines = input.tail.mkString.replace("\n","")

    val mappedKmers: List[Map[String,Int]] = getMappedKmers(5, lines)

  }

  def getMappedKmers(k: Int, seq: String): List[Map[String, Int]] = {
    for (i <- 0 until seq.length - k) {
      Map(seq.substring(i, i+k), 1) // Map the k-mer to a count of 1
    }
  }
}

Couple of questions:

Any help and/or advice is definitely appreciated!

Upvotes: 4

Views: 436

Answers (1)

Travis Brown

Reputation: 139028

You're pretty close—there are three fairly minor problems with your code.

The first is that for (i <- whatever) foo(i) is syntactic sugar for whatever.foreach(i => foo(i)), which means you're not actually doing anything with the contents of whatever. What you want is for (i <- whatever) yield foo(i), which is sugar for whatever.map(i => foo(i)) and returns the transformed collection.
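
To make the difference concrete, here is a small sketch comparing the two forms on a tiny range. The object, the square helper, and the value names are made up for illustration and are not part of the question's code.

```scala
object ForYieldDemo {
  // Hypothetical helper standing in for foo(i) above.
  def square(i: Int): Int = i * i

  // Without yield: sugar for foreach, so the squares are computed and thrown away.
  val discarded: Unit = for (i <- 0 until 3) square(i)

  // With yield: sugar for map, so the squares are collected and returned.
  val squares: IndexedSeq[Int] = for (i <- 0 until 3) yield square(i)

  // The explicit form that the yield version desugars to.
  val mapped: IndexedSeq[Int] = (0 until 3).map(i => square(i))
}
```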

The second issue is that 0 until seq.length - k is a Range, not a List, so even once you've added the yield, the result still won't line up with the declared return type.

The third issue is that Map(k, v) tries to create a map with two key-value pairs, k and v. You want Map(k -> v) or Map((k, v)), either of which is explicit about the fact that you have a single argument pair.
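
As a quick illustration of the two working spellings (the object and value names here are hypothetical), both of the following build the same single-entry map:

```scala
object SingleEntryMap {
  // Both spellings pass exactly one (key, value) pair to Map.
  val withArrow: Map[String, Int] = Map("AGCTT" -> 1)
  val withTuple: Map[String, Int] = Map(("AGCTT", 1))

  // Map("AGCTT", 1) would not compile here, because each argument
  // must itself be a key-value pair.
}
```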

So the following should work:

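def getMappedKmers(k: Int, seq: String): IndexedSeq[Map[String, Int]] = {
  // `to` is inclusive, so the final k-mer (starting at seq.length - k) is kept;
  // `until` would stop one position short.
  for (i <- 0 to seq.length - k) yield {
    Map(seq.substring(i, i + k) -> 1) // Map the k-mer to a count of 1
  }
}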

You could also convert either the range or the entire result to a list with .toList if you'd prefer a list at the end.

It's worth noting, by the way, that the sliding method, which is available on strings as well as on other Scala collections, does exactly what you want:

scala> "AGCTTTC".sliding(5).foreach(println)
AGCTT
GCTTT
CTTTC

I'd definitely suggest something like "AGCTTTC".sliding(5).toList.groupBy(identity) for real code.
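
Note that groupBy(identity) on its own gives a map from each k-mer to the list of its occurrences, so one more step is needed to turn those lists into counts. A minimal sketch of the whole thing might look like the following; the object name KmerCounts and the method name countKmers are just placeholders, and it assumes the sequence has already been joined into a single string with no line breaks.

```scala
object KmerCounts {
  // Count how often each k-length window occurs in seq.
  // Assumes k is positive and seq contains no line breaks.
  def countKmers(seq: String, k: Int): Map[String, Int] =
    seq.sliding(k)
      .filter(_.length == k)   // drop the partial window produced when seq is shorter than k
      .toList
      .groupBy(identity)       // Map[String, List[String]]
      .map { case (kmer, occurrences) => kmer -> occurrences.size }

  def main(args: Array[String]): Unit =
    countKmers("AGCTTTC", 5).foreach { case (kmer, n) => println(s"$kmer\t$n") }
}
```

The explicit map over the grouped result is used here instead of mapValues only so that the sketch behaves the same across Scala versions.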

Upvotes: 4
