Jamie Dixon
Jamie Dixon

Reputation: 4292

Splitting a Seq of Strings Of Variable Length in F#

I am using a .fasta file in F#. When I read it from disk, it is a sequence of strings. Each observation is usually 4-5 strings in length: 1st string is the title, then 2-4 strings of amino acids, and then 1 string of space. For example:

let filePath = @"/Users/XXX/sample_database.fasta"
let fileContents = File.ReadLines(filePath)
fileContents |> Seq.iter(fun x -> printfn  "%s" x)

yields:

enter image description here

I am looking for a way to split each observation into its own collection using the OOB high order functions in F#. I do not want to use any mutable variables or for..each syntax. I thought Seq.chunkBySize would work -> but the size varies. Is there a Seq.chunkByCharacter?

Upvotes: 0

Views: 277

Answers (4)

kaefer
kaefer

Reputation: 5741

To illustrate Fyodor's point about contained mutability, here's an example that is mutable as can be while still somewhat reasonable. The outer functional layer is a sequence expression, a common pattern demonstrated by Seq.scan in the F# source.

let chooseFoldSplit
        folding (state : 'State)
            (source : seq<'T>) : seq<'U[]> = seq {
    let sref, zs = ref state, ResizeArray()
    use ie = source.GetEnumerator()
    while ie.MoveNext() do
        let newState, uopt = folding !sref ie.Current
        if newState <> !sref then
            yield zs.ToArray()
            zs.Clear()
        sref := newState
        match uopt with
        | None -> ()
        | Some u -> zs.Add u
    if zs.Count > 0 then
        yield zs.ToArray() }
// val chooseFoldSplit :
//   folding:('State -> 'T -> 'State * 'U option) ->
//     state:'State -> source:seq<'T> -> seq<'U []> when 'State : equality

There is mutability of a ref cell (equivalent to a mutable variable) and there is a mutable data structure; an alias for System.Collection.Generic.List<'T>, which allows appending at O(1) cost.

The folding function's signature 'State -> 'T -> 'State * 'U option is reminiscent of the folder of fold, except that it causes the result sequence to be split when its state changes. And it also spawns an option that denotes the next member for the current group (or not).

It would work fine without the conversion to a persistent array, as long as you iterate the resulting sequence lazily and only exactly once. Therefore we need to isolate the contents of the ResizeArrayfrom the outside world.

The simplest folding for your use case is negation of a boolean, but you could leverage it for more complex tasks like numbering your records:

[| "foo"; "1"; "2"; ""; "bar"; "4"; "5"; "6"; "7"; ""; "baz"; "8"; "" |]
|> chooseFoldSplit (fun b t ->
    if t = "" then not b, None else b, Some t ) false
|> Seq.map (fun a ->
    if a.Length > 1 then
        { Description = a.[0]; Sequence = String.concat "" a.[1..] }
    else failwith "Format error" )
// val it : seq<FastaEntry> =
//   seq [{Description = "foo";
//         Sequence = "12";}; {Description = "bar";
//                             Sequence = "4567";}; {Description = "baz";
//                                                   Sequence = "8";}]

Upvotes: 1

David Raab
David Raab

Reputation: 4488

You also can use Regular Expression for these kind of stuff. Here is a solution that uses a regular expression to extract a whole Fasta Block at once.

type FastaEntry = {
    Description: string
    Sequence:    string
}

let fastaRegexStr = 
    @"
    ^>                  # Line Starting with >
      (.*)                 # Capture into $1
    \r?\n               # End-of-Line
    (                   # Capturing in $2
        (?:                   
            ^           # A Line ...
              [A-Z]+       # .. containing A-Z
            \*? \r?\n   # Optional(*) followed by End-of-Line
        )+              # ^ Multiple of those lines
    )
    (?:
        (?: ^ [ \t\v\f]* \r?\n )  # Match an empty (whitespace) line ..
        |                         # or
        \z                        # End-of-String
    )
    "

(* Regex for matching one Fasta Block *)
let fasta = Regex(fastaRegexStr, RegexOptions.IgnorePatternWhitespace ||| RegexOptions.Multiline)

(* Whole file as a string *)
let content = System.IO.File.ReadAllText "fasta.fasta"

let entries = [
    for m in fasta.Matches(content) do
        let desc = m.Groups.[1].Value
        (* Remove *, \r and \n from string *)
        let sequ = Regex.Replace(m.Groups.[2].Value, @"\*|\r|\n", "")
        {Description=desc; Sequence=sequ}
]

Upvotes: 0

Jamie Dixon
Jamie Dixon

Reputation: 4292

I went with recursion:

type FastaEntry = {Description:String; Sequence:String}

let generateFastaEntry (chunk:String seq) =
    match chunk |> Seq.length with
    | 0 -> None
    | _ ->
        let description = chunk |> Seq.head
        let sequence = chunk |> Seq.tail |> Seq.reduce (fun acc x -> acc + x)
        Some {Description=description; Sequence=sequence}

let rec chunk acc contents =
    let index = contents |> Seq.tryFindIndex(fun x -> String.IsNullOrEmpty(x))
    match index with
    | None -> 
        let fastaEntry = generateFastaEntry contents
        match fastaEntry with
        | Some x -> Seq.append acc [x]
        | None -> acc
    | Some x ->
        let currentChunk = contents |> Seq.take x
        let fastaEntry = generateFastaEntry currentChunk
        match fastaEntry with
        | None -> acc
        | Some y ->
            let updatedAcc = 
                match Seq.isEmpty acc with
                | true -> seq {y}
                | false -> Seq.append acc (seq {y})
            let remaining = contents |> Seq.skip (x+1)
            chunk updatedAcc remaining

Upvotes: 0

Fyodor Soikin
Fyodor Soikin

Reputation: 80744

Mutable variables are totally fine for this, provided their mutability doesn't leak into a wider context. Why exactly do you not want to use them?

But if you really want to go hardcore "functional", then the usual functional way of doing something like that is via fold.

  • Your folding state would be a pair of "blocks accumulated so far" and "current block".
  • At each step, if you get a non-empty string, you attach it to the "current block".
  • And if you get an empty string, that means the current block is over, so you attach the current block to the list of "blocks so far" and make the current block empty.
  • This way, at the end of folding you'll end up with a pair of "all blocks accumulated except the last one" and "last block", which you can glue together.
  • Plus, an optimization detail: since I'm going to do a lot of "attach a thing to a list", I'd like to use a linked list for that, because it has constant-time attaching. But then the problem is that it's only constant time for prepending, not appending, which means I'll end up with all the lists reversed. But no matter: I'll just reverse them again at the very end. List reversal is a linear operation, which means my whole thing would still be linear.
let splitEm lines =
  let step (blocks, currentBlock) s =
        match s with
        | "" -> (List.rev currentBlock :: blocks), []
        | _ -> blocks, s :: currentBlock

  let (blocks, lastBlock) = Array.fold step ([], []) lines

  List.rev (lastBlock :: blocks)

Usage:

> splitEm [| "foo"; "bar"; "baz"; ""; "1"; "2"; ""; "4"; "5"; "6"; "7"; ""; "8" |]

[["foo"; "bar"; "baz"]; ["1"; "2"]; ["4"; "5"; "6"; "7"]; ["8"]]

Note 1: You may have to address some edge cases depending on your data and what you want the behavior to be. For example, if there is an empty line at the very end, you'll end up with an empty block at the end.

Note 2: You may notice that this is very similar to imperative algorithm with mutating variables: I'm even talking about things like "attach to list of blocks" and "make current block empty". This is not a coincidence. In this purely functional version the "mutating" is accomplished by calling the same function again with different parameters, while in an equivalent imperative version you would just have those parameters turned into mutable memory cells. Same thing, different view. In general, any imperative iteration can be turned into a fold this way.

For comparison, here's a mechanical translation of the above to imperative mutation-based style:

let splitEm lines =
  let mutable blocks = []
  let mutable currentBlock = []
  
  for s in lines do
    match s with
    | "" -> blocks <- List.rev currentBlock :: blocks; currentBlock <- []
    | _ -> currentBlock <- s :: currentBlock

  List.rev (currentBlock :: blocks)

Upvotes: 2

Related Questions