Kacper

Reputation: 5

String.Split() removes delimiter characters

I'm trying to create a method that will split a protein sequence based on two characters: R and K.
My code splits the protein sequence correctly, but then removes either R or K. I need the program to be able to preserve the delimiters used for splitting the string.

Example:

Let's say I have a protein sequence = GLSDEWQKFEGREGKFWER

My program should then cut the sequence after each R or K.

It should end up like this:

GLSDEWQK

FEGR

EGK

FWER

My code:

Dim protein As String = "GLSDEWQKFEGREGKFWER"

Dim words As String() = protein.Split(New Char() {"R"c, "K"c})

For Each word As String In words
    Console.WriteLine(word)
Next

I am writing this in Visual Basic on .NET Framework 4.7.2, and I want to display the results in the console.

Upvotes: 0

Views: 416

Answers (2)

Andrew Morton

Reputation: 25013

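You can use Regex.Split with a capturing group, so that the result also includes the delimiters it split on, then join the resulting array in pairs:

Imports System.Linq
Imports System.Text.RegularExpressions

Dim protein As String = "GLSDEWQKFEGREGKFWER"
Dim splitter = New Regex("([KR])")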
Dim wordParts = splitter.Split(protein)
' wordParts is now ("GLSDEWQ", "K", "FEG", "R", "EG", "K", "FWE", "R", "")

' join the wordParts in pairs
Dim words As New List(Of String)
For i = 0 To wordParts.Length - 2 Step 2
    words.Add(wordParts(i) & wordParts(i + 1))
Next

' if there is an odd number of parts, the last one has no delimiter after it;
' add it on its own unless it is empty
If wordParts.Length Mod 2 = 1 AndAlso Not String.IsNullOrEmpty(wordParts.Last) Then
    words.Add(wordParts.Last)
End If

Console.WriteLine(String.Join(vbCrLf, words))

Outputs:

GLSDEWQK
FEGR
EGK
FWER

The [KR] is a character class: it matches any one of the characters listed in it. The parentheses ( ) surrounding it form a capturing group, which makes Regex.Split include the text it matched on in the returned array.
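
As a minimal sketch of the difference the parentheses make, the snippet below runs Regex.Split on the same sequence with and without the capturing group. The module name CaptureDemo is only for illustration, and the "|" separator is there just to make the empty trailing element visible.

```vbnet
Imports System
Imports System.Text.RegularExpressions

' CaptureDemo is a hypothetical name used only for this illustration.
Module CaptureDemo
    Sub Main()
        Dim protein As String = "GLSDEWQKFEGREGKFWER"

        ' No capturing group: the delimiters are dropped, just like String.Split.
        Console.WriteLine(String.Join("|", Regex.Split(protein, "[KR]")))
        ' Prints: GLSDEWQ|FEG|EG|FWE|

        ' Capturing group: each delimiter comes back as its own array element.
        Console.WriteLine(String.Join("|", Regex.Split(protein, "([KR])")))
        ' Prints: GLSDEWQ|K|FEG|R|EG|K|FWE|R|
    End Sub
End Module
```

The trailing "|" in both lines comes from the empty string that Regex.Split returns when a match sits at the very end of the input, which is why the pairing code has to cope with an empty last element.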

Upvotes: 1

Jimi

Reputation: 32223

String.Split() removes the splitter(s) from the resulting array of strings, but you of course want to preserve the full content.

You could loop over the chars in the protein string (a string is a collection of chars) and test whether the current char belongs to the array of chars, {"R"c, "K"c}, that cause the string to split.

Append every char, splitter or not, to a StringBuilder.
If the char is a splitter, add the accumulated chars to a List(Of String) and clear the StringBuilder. The list will contain the results when the loop terminates.

You should already have all the required Imports statements available in your project. In case you don't, add:

Imports System.Collections.Generic
Imports System.Linq
Imports System.Text

Dim protein As String = "GLSDEWQKFEGREGKFWER"
Dim splitChars = {"R"c, "K"c}

Dim sb As New StringBuilder()
Dim splitResult As New List(Of String)

For Each c As Char In protein
    sb.Append(c)
    ' If the current char is one of the splitters, add the buffer to the 
    ' results and clear the buffer
    If splitChars.Contains(c) Then
        splitResult.Add(sb.ToString())
        sb.Clear()
    End If
Next
' Take the remainder, if any
If sb.Length > 0 Then splitResult.Add(sb.ToString())

Then print the list of parts:

For Each section As String In splitResult 
    Console.WriteLine(section)
Next
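
Since the question asks for a method, the same loop could be wrapped in a reusable function. This is only a sketch: the names ProteinSplitter and SplitAfter are made up for the example, and it uses Array.IndexOf instead of the LINQ Contains call so that it does not need System.Linq.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

' ProteinSplitter and SplitAfter are hypothetical names for this example.
Module ProteinSplitter
    ' Splits input after each occurrence of any delimiter char,
    ' keeping the delimiter at the end of its piece.
    Function SplitAfter(input As String, ParamArray delimiters As Char()) As List(Of String)
        Dim result As New List(Of String)
        Dim sb As New StringBuilder()

        For Each c As Char In input
            sb.Append(c)
            ' A delimiter closes the current piece
            If Array.IndexOf(delimiters, c) >= 0 Then
                result.Add(sb.ToString())
                sb.Clear()
            End If
        Next

        ' Keep whatever follows the last delimiter, if anything
        If sb.Length > 0 Then result.Add(sb.ToString())
        Return result
    End Function

    Sub Main()
        For Each piece As String In SplitAfter("GLSDEWQKFEGREGKFWER", "R"c, "K"c)
            Console.WriteLine(piece)
        Next
    End Sub
End Module
```

Called with the sequence from the question, this prints GLSDEWQK, FEGR, EGK and FWER on separate lines. A sequence that does not end in R or K keeps its tail as a final piece.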

Upvotes: 1
