Somebody

Reputation: 733

Split utf16 string with special characters using delimiter

I want to split this UTF-16 string in Swift 5:

ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа

Delimiter: "¾"

I've tried the following code:

let Arr =  "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа".split{$0 == "¾"}.map(String.init)

let Arr = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа".components(separatedBy: "¾")

But both failed.

Upvotes: 4

Views: 214

Answers (2)

aheze

Reputation: 30564

I made an extension! This doesn't have the side effect of changing Ѝ into И.

let delimiter: Character = "¾" /// the delim
let string = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа"
        
let arr = string.components(separatedBySpecialCharacter: delimiter)
print(arr) /// ["ddd", "ͰͿΔδόϡϫЍа"]

The extension (it needs import Foundation for folding(options:locale:), range(of:range:), and components(separatedBy:)):

extension String {
    func components(separatedBySpecialCharacter delimiter: Character) -> [String] {

        let cleanedString = self.folding(options: .diacriticInsensitive, locale: .current) /// remove all accents and diacritics
        
        let indicesOfDelimiter = cleanedString.indicesOf(string: String(delimiter)) /// get the indices of the full String where the delimiter is
        
        var stringCharacters = Array(self) /// split the full String into an array
        for index in indicesOfDelimiter {
            stringCharacters[index] = delimiter /// replace each occurrence of the accented delimiter with a clean delimiter
        }
        
        let delimiterCleanedString = String(stringCharacters) /// make the array of the full String, with cleaned delimiters, back into a String
        let separatedComponents = delimiterCleanedString.components(separatedBy: String(delimiter)) /// finally get the components, using the delimiter that was passed in
        
        return separatedComponents
    }
    
    /// get indices of a String inside a String
    /// from https://stackoverflow.com/a/40413665/14351818
    func indicesOf(string: String) -> [Int] {
        var indices = [Int]()
        var searchStartIndex = self.startIndex
        
        while searchStartIndex < self.endIndex,
            let range = self.range(of: string, range: searchStartIndex..<self.endIndex),
            !range.isEmpty
        {
            let index = distance(from: self.startIndex, to: range.lowerBound)
            indices.append(index)
            searchStartIndex = range.upperBound
        }
        
        return indices
    }
}

Old answer:

The "¾̷̱̲͈́͌͠" inside "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа" has a lot of diacritics/zalgo text on it. You can first clean it up like this:

let string = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа"
let cleanedString = string.folding(options: .diacriticInsensitive, locale: .current)
print(cleanedString)

Result:

ddd¾ͰͿΔδοϡϫИа

Now, you can use components(separatedBy: "¾") on the cleaned string.

let arr = cleanedString.components(separatedBy: "¾")
print(arr)

Result:

["ddd", "ͰͿΔδοϡϫИа"]

Note that this also changes Ѝ to И. I will see if there is a better solution.

Upvotes: 1

Rob Napier

Reputation: 299663

The Element of String is Character. A Character is an extended grapheme cluster, which means it composes all combining characters. The Character in this String is ¾̷̱̲͈́͌͠, so when you try to split on ¾, it's not found.
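
To see the grapheme-cluster behaviour concretely, here is a minimal sketch. The sample string is a shortened stand-in built from explicit escapes (¾ plus two combining marks), not the exact string from the question:

```swift
// Illustrative sample: "ddd", then U+00BE (¾) carrying two combining marks, then "abc".
let sample = "ddd\u{00BE}\u{0337}\u{0301}abc"
let bare: Character = "\u{00BE}"

// Characters are grapheme clusters, so ¾ and its marks count as one element.
print(sample.count)                 // 7
print(sample.unicodeScalars.count)  // 9

// The bare ¾ is not equal to the decorated cluster, so it is not found as a Character.
print(sample.contains(bare))                        // false

// At the scalar level the base code point is visible.
print(sample.unicodeScalars.contains("\u{00BE}"))   // true
```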

I believe what you're trying to operate on is UnicodeScalars, which are individual code points. To do that, you need to first call .unicodeScalars:

let arr = "ddd¾̷̱̲͈́͌͠ͰͿΔδόϡϫЍа".unicodeScalars.split(separator: "¾").map(String.init)
// ["ddd", "̷̱̲͈́͌͠ͰͿΔδόϡϫЍа"]
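
If you would rather not keep the leftover combining marks at the start of the second piece, one possible variation (not part of the answer above) is to stay at the Character level and treat a Character as a separator whenever its first scalar is the delimiter. The helper name splitOnBaseScalar and the escaped sample string are hypothetical and only for illustration:

```swift
// Hypothetical helper: splits on any Character whose base (first) scalar
// equals the delimiter, whatever combining marks are stacked on it.
func splitOnBaseScalar(_ text: String, delimiter: Unicode.Scalar) -> [String] {
    return text
        .split { character in character.unicodeScalars.first == delimiter }
        .map { String($0) }
}

// Illustrative sample: ¾ (U+00BE) followed by two combining marks.
let sample = "ddd\u{00BE}\u{0337}\u{0301}abc"
let parts = splitOnBaseScalar(sample, delimiter: "\u{00BE}")
print(parts) // ["ddd", "abc"]
```

Like split in general, this drops empty pieces by default, and the marks attached to the delimiter are discarded along with it.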

Note that the string you've posted here is UTF-8, not UTF-16. Swift can't operate directly on UTF-16 literals (you typically store them as Data or [UInt16] and then convert them to String). I don't believe this changes your question, however.
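
As a rough sketch of that conversion, assuming the UTF-16 data arrives as an array of code units (the values here are made up for illustration), the standard library's String(decoding:as:) turns them into a String:

```swift
// Illustrative UTF-16 code units: "d", ¾ (U+00BE), a combining mark (U+0337), "a".
let utf16Units: [UInt16] = [0x0064, 0x00BE, 0x0337, 0x0061]

// Decode the code units into a regular Swift String.
let decoded = String(decoding: utf16Units, as: UTF16.self)

print(decoded.unicodeScalars.count)        // 4
print(decoded.count)                       // 3, since ¾ and its mark form one Character
print(Array(decoded.utf16) == utf16Units)  // true
```

Once decoded, the string can be split the same way as any other String.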

Upvotes: 4
