jxwho

Reputation: 385

How can I create a String from UTF8 in Swift?

We know we can print each character of a string as UTF-8 code units. If we have the code units of those characters, how can we create a String from them?

Upvotes: 25

Views: 38606

Answers (10)

Imanou Petit

Reputation: 92599

With Swift 5, you can choose one of the following ways in order to convert a collection of UTF-8 code units into a string.


#1. Using String's init(_:) initializer

If you have a String.UTF8View instance (i.e. a collection of UTF-8 code units) and want to convert it to a string, you can use the init(_:) initializer. init(_:) has the following declaration:

init(_ utf8: String.UTF8View)

Creates a string corresponding to the given sequence of UTF-8 code units.

The Playground sample code below shows how to use init(_:):

let string = "Café 🇫🇷"
let utf8View: String.UTF8View = string.utf8

let newString = String(utf8View)
print(newString) // prints: Café 🇫🇷

#2. Using String's init(decoding:as:) initializer

init(decoding:as:) creates a string from the given Unicode code units collection in the specified encoding:

let string = "Café 🇫🇷"
let codeUnits: [Unicode.UTF8.CodeUnit] = Array(string.utf8)

let newString = String(decoding: codeUnits, as: UTF8.self)
print(newString) // prints: Café 🇫🇷

Note that init(decoding:as:) also works with a String.UTF8View parameter:

let string = "Café 🇫🇷"
let utf8View: String.UTF8View = string.utf8

let newString = String(decoding: utf8View, as: UTF8.self)
print(newString) // prints: Café 🇫🇷
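
One behaviour of init(decoding:as:) worth keeping in mind: it is not failable. Ill-formed input is repaired by substituting U+FFFD (the replacement character) instead of returning nil. A minimal sketch, using a made-up byte array that contains a deliberately invalid 0xFF byte:

```swift
// Hypothetical input: "C", an invalid byte (0xFF never occurs in UTF-8), "D".
let invalidBytes: [UInt8] = [0x43, 0xFF, 0x44]

// The invalid byte is replaced by U+FFFD rather than causing a failure.
let repaired = String(decoding: invalidBytes, as: UTF8.self)
print(repaired == "C\u{FFFD}D") // prints: true
```

If silently repaired input is not acceptable, a failable initializer such as the Foundation one in #6 is probably the better fit.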

#3. Using transcode(_:from:to:stoppingOnError:into:) function

The following example transcodes the UTF-8 representation of an initial string into Unicode scalar values (UTF-32 code units) that can be used to build a new string:

let string = "Café 🇫🇷"
let bytes = Array(string.utf8)

var newString = ""
_ = transcode(bytes.makeIterator(), from: UTF8.self, to: UTF32.self, stoppingOnError: true, into: {
    newString.append(String(Unicode.Scalar($0)!))
})
print(newString) // prints: Café 🇫🇷

#4. Using Array's withUnsafeBufferPointer(_:) method and String's init(cString:) initializer

init(cString:) has the following declaration:

init(cString: UnsafePointer<CChar>)

Creates a new string by copying the null-terminated UTF-8 data referenced by the given pointer.

The following example shows how to use init(cString:) with a pointer to the content of a CChar array (i.e. a well-formed UTF-8 code unit sequence) in order to create a string from it:

let bytes: [CChar] = [67, 97, 102, -61, -87, 32, -16, -97, -121, -85, -16, -97, -121, -73, 0]

let newString = bytes.withUnsafeBufferPointer({ (bufferPointer: UnsafeBufferPointer<CChar>) in
    return String(cString: bufferPointer.baseAddress!)
})
print(newString) // prints: Café 🇫🇷

#5. Using Unicode.UTF8's decode(_:) method

To decode a code unit sequence, call decode(_:) repeatedly until it returns UnicodeDecodingResult.emptyInput:

let string = "Café 🇫🇷"
let codeUnits = Array(string.utf8)

var codeUnitIterator = codeUnits.makeIterator()
var utf8Decoder = Unicode.UTF8()
var newString = ""

Decode: while true {
    switch utf8Decoder.decode(&codeUnitIterator) {
    case .scalarValue(let value):
        newString.append(Character(Unicode.Scalar(value)))
    case .emptyInput:
        break Decode
    case .error:
        print("Decoding error")
        break Decode
    }
}

print(newString) // prints: Café 🇫🇷

#6. Using String's init(bytes:encoding:) initializer

Foundation gives String an init(bytes:encoding:) initializer that you can use as indicated in the Playground sample code below:

import Foundation

let string = "Café 🇫🇷"
let bytes: [Unicode.UTF8.CodeUnit] = Array(string.utf8)

let newString = String(bytes: bytes, encoding: String.Encoding.utf8)
print(String(describing: newString)) // prints: Optional("Café 🇫🇷")
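
The result is optional because init(bytes:encoding:) fails when the bytes are not valid in the requested encoding. A short sketch of both outcomes, with sample byte arrays chosen only for illustration:

```swift
import Foundation

let validBytes: [UInt8] = [0x43, 0x61, 0x66, 0xC3, 0xA9] // "Café" in UTF-8
let invalidBytes: [UInt8] = [0x43, 0xFF]                  // 0xFF is never valid UTF-8

let valid = String(bytes: validBytes, encoding: .utf8)
let invalid = String(bytes: invalidBytes, encoding: .utf8)

print(valid ?? "nil") // prints: Café
print(invalid == nil) // prints: true
```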

Upvotes: 20

Qinghua

Reputation: 371

// Swift4
var units = [UTF8.CodeUnit]()
//
// update units
//
let str = String(decoding: units, as: UTF8.self)

Upvotes: 2

johnkzin

Reputation: 31

If you're starting with a raw buffer, such as from the Data object returned from a file handle (in this case, taken from a Pipe object):

let data = pipe.fileHandleForReading.readDataToEndOfFile()
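// Allocate one extra byte: String(cString:) reads until it finds a terminating NUL.
let unsafePointer = UnsafeMutablePointer<UInt8>.allocate(capacity: data.count + 1)
defer { unsafePointer.deallocate() }

data.copyBytes(to: unsafePointer, count: data.count)
unsafePointer[data.count] = 0 // NUL terminator

let output = String(cString: unsafePointer)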
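
If the pointer is not needed for anything else, the manual copy and the NUL-termination requirement of String(cString:) can be avoided by decoding the Data value directly. The sketch below uses a hard-coded Data value as a stand-in for the bytes read from the pipe; that sample value is an assumption for illustration only.

```swift
import Foundation

// Stand-in for the Data read from the pipe (hypothetical sample: "Hi €" in UTF-8).
let data = Data([0x48, 0x69, 0x20, 0xE2, 0x82, 0xAC])

// Lossy: ill-formed sequences are replaced with U+FFFD, so this never fails.
let lossy = String(decoding: data, as: UTF8.self)

// Strict: returns nil if the data is not valid UTF-8.
let strict = String(data: data, encoding: .utf8)

print(lossy)           // prints: Hi €
print(strict == lossy) // prints: true
```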

Upvotes: 1

Alex Shubin

Reputation: 3617

Swift 3

let s = String(bytes: arr, encoding: .utf8)

Upvotes: 4

Alex Shoshiashvili

Reputation: 489

Here is a Swift 3.0 version of Martin R's answer:

public class UTF8Encoding {
  public static func encode(bytes: Array<UInt8>) -> String {
    var encodedString = ""
    var decoder = UTF8()
    var generator = bytes.makeIterator()
    var finished: Bool = false
    repeat {
      let decodingResult = decoder.decode(&generator)
      switch decodingResult {
      case .scalarValue(let char):
        encodedString += "\(char)"
      case .emptyInput:
        finished = true
      case .error:
        finished = true
      }
    } while (!finished)
    return encodedString
  }
  public static func decode(str: String) -> Array<UInt8> {
    var decodedBytes = Array<UInt8>()
    for b in str.utf8 {
      decodedBytes.append(b)
    }
    return decodedBytes
  }
}

If you want to show emoji from a string of code points, just use the convertEmojiCodesToString method below. It works for strings like "U+1F52B" (emoji) or "U+1F1E6 U+1F1F1" (country flag emoji).

import Foundation

class EmojiConverter {
  static func convertEmojiCodesToString(_ emojiCodesString: String) -> String {
    let emojies = emojiCodesString.components(separatedBy: " ")
    var resultString = ""
    for emoji in emojies {
      // Drop the leading "U+"; UInt32(_:radix:) accepts upper- and lowercase hex digits.
      let formattedCode = String(emoji.dropFirst(2))
      if let charCode = UInt32(formattedCode, radix: 16),
        let unicode = UnicodeScalar(charCode) {
        let str = String(unicode)
        resultString += "\(str)"
      }
    }
    return resultString
  }
}
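
The same idea can be sketched without Foundation by building the result through String.UnicodeScalarView. The function name string(fromCodePoints:) is hypothetical, and the sketch assumes tokens are separated by single spaces and prefixed with "U+"; tokens that do not parse are skipped.

```swift
// Hypothetical helper: turns space-separated "U+XXXX" tokens into a String.
func string(fromCodePoints text: String) -> String {
    var result = String.UnicodeScalarView()
    for token in text.split(separator: " ") {
        // Strip the "U+" prefix and parse the rest as a hexadecimal code point.
        guard token.hasPrefix("U+"),
              let value = UInt32(token.dropFirst(2), radix: 16),
              let scalar = Unicode.Scalar(value) else { continue }
        result.append(scalar)
    }
    return String(result)
}

print(string(fromCodePoints: "U+1F52B"))         // prints: 🔫
print(string(fromCodePoints: "U+1F1E6 U+1F1F1")) // prints: 🇦🇱
```

Appending scalars rather than characters leaves grapheme clustering to String, which is why the two regional indicators end up as a single flag.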

Upvotes: 0

dbart

Reputation: 5566

I've been looking for a comprehensive answer regarding string manipulation in Swift myself. Relying on casts to and from NSString and other unsafe pointer magic just wasn't doing it for me. Here's a safe alternative:

First, we'll want to extend UInt8, which is the underlying type of UTF8.CodeUnit. Note that this maps each byte to the code point with the same value, so it is only correct for ASCII input.

extension UInt8 {
    var character: Character {
        return Character(UnicodeScalar(self))
    }
}

This will allow us to do something like this:

let codeUnits: [UInt8] = [
    72, 69, 76, 76, 79
]

let characters = codeUnits.map { $0.character }
let string     = String(characters)

// string prints "HELLO"

Equipped with this extension, we can now begin modifying strings.

let string = "ABCDEFGHIJKLMONP"

var modifiedCharacters = [Character]()
for (index, utf8unit) in string.utf8.enumerated() {

    // Insert a "-" every 4 characters
    if index > 0 && index % 4 == 0 {
        let separator: UInt8 = 45 // "-" in ASCII
        modifiedCharacters.append(separator.character)
    }
    modifiedCharacters.append(utf8unit.character)
}

let modifiedString = String(modifiedCharacters)

// modified string == "ABCD-EFGH-IJKL-MONP"
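
A caveat for the byte-to-Character mapping used here: it treats every code unit as a code point of its own, so it only round-trips ASCII. The sketch below, which uses "é" as an arbitrary non-ASCII sample, contrasts it with decoding the whole sequence as UTF-8:

```swift
let bytes = Array("\u{E9}".utf8) // "é" is two UTF-8 code units: [0xC3, 0xA9]

// Byte-by-byte mapping turns each code unit into a separate character.
let perByte = String(bytes.map { Character(UnicodeScalar($0)) })

// Decoding the sequence as UTF-8 recombines the two code units.
let decoded = String(decoding: bytes, as: UTF8.self)

print(perByte) // prints: Ã©
print(decoded) // prints: é
```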

Upvotes: 2

Martin R

Reputation: 540115

This is a possible solution (now updated for Swift 2):

let utf8 : [CChar] = [65, 66, 67, 0]
if let str = utf8.withUnsafeBufferPointer( { String.fromCString($0.baseAddress) }) {
    print(str) // Output: ABC
} else {
    print("Not a valid UTF-8 string") 
}

Within the closure, $0 is an UnsafeBufferPointer<CChar> pointing to the array's contiguous storage. From that, a Swift String can be created.

Alternatively, if you prefer the input as unsigned bytes:

let utf8 : [UInt8] = [0xE2, 0x82, 0xAC, 0]
if let str = utf8.withUnsafeBufferPointer( { String.fromCString(UnsafePointer($0.baseAddress)) }) {
    print(str) // Output: €
} else {
    print("Not a valid UTF-8 string")
}

Upvotes: 1

T2345

Reputation: 199

It's possible to convert UTF-8 code units to a Swift String idiomatically using Swift's UTF8 type, although it's much easier to convert from String to UTF-8!

import Foundation

public class UTF8Encoding {
  public static func encode(bytes: Array<UInt8>) -> String {
    var encodedString = ""
    var decoder = UTF8()
    var generator = bytes.generate()
    var finished: Bool = false
    do {
      let decodingResult = decoder.decode(&generator)
      switch decodingResult {
      case .Result(let char):
        encodedString.append(char)
      case .EmptyInput:
        finished = true
      /* ignore errors and unexpected values */
      case .Error:
        finished = true
      default:
        finished = true
      }
    } while (!finished)
    return encodedString
  }

  public static func decode(str: String) -> Array<UInt8> {
    var decodedBytes = Array<UInt8>()
    for b in str.utf8 {
      decodedBytes.append(b)
    }
    return decodedBytes
  }
}

func testUTF8Encoding() {
  let testString = "A UTF8 String With Special Characters: 😀🍎"
  let decodedArray = UTF8Encoding.decode(testString)
  let encodedString = UTF8Encoding.encode(decodedArray)
  XCTAssert(encodedString == testString, "UTF8Encoding is lossless: \(encodedString) != \(testString)")
}

Of the other alternatives suggested:

  • Using NSString invokes the Objective-C bridge;

  • Using UnicodeScalar is error-prone because it converts UnicodeScalars directly to Characters, ignoring complex grapheme clusters; and

  • Using String.fromCString is potentially unsafe as it uses pointers.

Upvotes: 15

Bryan Chen

Reputation: 46618

An improvement on Martin R's answer:

import AppKit

let utf8 : CChar[] = [65, 66, 67, 0]
let str = NSString(bytes: utf8, length: utf8.count, encoding: NSUTF8StringEncoding)
println(str) // Output: ABC

import AppKit

let utf8 : UInt8[] = [0xE2, 0x82, 0xAC, 0]
let str = NSString(bytes: utf8, length: utf8.count, encoding: NSUTF8StringEncoding)
println(str) // Output: €

What happens here is that an Array can be automatically converted to CConstVoidPointer, which can be used to create a string with NSString(bytes: CConstVoidPointer, length len: Int, encoding: UInt).

Upvotes: 5

holex

Reputation: 24041

I would do something like this. It may not be as elegant as working with 'pointers', but it does the job well. It boils down to a bunch of new += operators for String, like:

@infix func += (inout lhs: String, rhs: (unit1: UInt8)) {
    lhs += Character(UnicodeScalar(UInt32(rhs.unit1)))
}

@infix func += (inout lhs: String, rhs: (unit1: UInt8, unit2: UInt8)) {
    lhs += Character(UnicodeScalar(UInt32(rhs.unit1) << 8 | UInt32(rhs.unit2)))
}

@infix func += (inout lhs: String, rhs: (unit1: UInt8, unit2: UInt8, unit3: UInt8, unit4: UInt8)) {
    lhs += Character(UnicodeScalar(UInt32(rhs.unit1) << 24 | UInt32(rhs.unit2) << 16 | UInt32(rhs.unit3) << 8 | UInt32(rhs.unit4)))
}

NOTE: you can extend the list of supported operators by overloading the + operator as well, giving String a fully commutative set of operators.


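Now you can append a Unicode character to a String by giving the bytes of its code point, most significant byte first (these are the bytes of the scalar value itself, not a UTF-8 code unit sequence), e.g.: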

var string: String = "signs of the Zodiac: "
string += (0x0, 0x0, 0x26, 0x4b)
string += (38)
string += (0x26, 76)

Upvotes: 1
