Maury Markowitz

Reputation: 9279

What characters are in whitespaceAndNewlineCharacterSet()?

I'm parsing some nasty files: you know, a mix of comma, space and tab delimiters in a single line, then run through a text editor that word-wraps at column 65 with CRLF. Ugh.

As part of my efforts to parse this in Cocoa, I use Apple's whitespaceAndNewlineCharacterSet. But what exactly is in that set? The documentation says "Unicode General Category Z*, U000A ~ U000D, and U0085". I was able to find the last three (85 is interesting), but what does the ~ mean, and what is General Category Z*?

Any Unicode gurus out there?

Upvotes: 0

Views: 524

Answers (2)

Alain T.

Reputation: 42143

NSCharacterSet is an opaque class that does not expose its contents easily. Think of it as a membership test rather than a list of characters.

This may be a somewhat brutal approach, but you can get the list of members in an NSCharacterSet by going through all 16-bit values (UTF-16 code units, so only U+0000 through U+FFFF) and checking each one for membership in the set:

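 import Foundation

 // Swift 2 spelling, matching the API named in the question.
 let charSet = NSCharacterSet.whitespaceAndNewlineCharacterSet()
 // characterIsMember takes a unichar (UInt16), so only U+0000 through U+FFFF are tested.
 for i in 0..<65536
 {
    let u:UInt16 = UInt16(i)
    // Print the code unit (in decimal) followed by the character itself.
    if charSet.characterIsMember(u)
    { print("\(u): \(Character(UnicodeScalar(u)))") }
 }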

The printed output looks odd for sets of non-displayable characters such as this one, since the characters themselves are invisible, but the numeric values should answer your question.
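
In newer Swift, where the type is spelled CharacterSet, a rough equivalent that also looks beyond U+FFFF might look like the following. This is only a sketch: the helper name `members(of:)` is made up for illustration, and it prints hexadecimal code points rather than the (invisible) characters.

```swift
import Foundation

// Hypothetical helper: collect every Unicode scalar the set reports as a member.
func members(of set: CharacterSet) -> [Unicode.Scalar] {
    var result: [Unicode.Scalar] = []
    for value in UInt32(0)...UInt32(0x10FFFF) {
        // Unicode.Scalar(_: UInt32) returns nil for surrogate code points; skip those.
        guard let scalar = Unicode.Scalar(value) else { continue }
        if set.contains(scalar) { result.append(scalar) }
    }
    return result
}

// Print each member as a hexadecimal code point.
for scalar in members(of: .whitespacesAndNewlines) {
    print("U+" + String(scalar.value, radix: 16, uppercase: true))
}
```

It is still brute force (about 1.1 million membership checks), but that is cheap enough for a one-off inspection.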

Upvotes: 0

matt

Reputation: 535229

The ~ means "through"; thus U+000A, U+000B, U+000C, and U+000D.

The phrase "General Category Z*" is shorthand for "any character whose General Category property is one of the three categories that start with Z" (Zs, Zl, and Zp). Thus, the various forms of space (U+0020, U+00A0, U+1680, U+2000 through U+200A, U+202F, U+205F, U+3000), plus the line separator (U+2028) and the paragraph separator (U+2029).
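
If it helps to see that rule as code, here is a minimal sketch that tests a scalar against the wording quoted from the documentation, using the Swift standard library's Unicode properties (Swift 5 or later). The function name `matchesDocumentedRule` is invented for this example, and it encodes only the quoted rule, not whatever the real set does internally.

```swift
// Hypothetical helper: does `scalar` match the rule quoted from the documentation?
func matchesDocumentedRule(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.properties.generalCategory {
    case .spaceSeparator, .lineSeparator, .paragraphSeparator:
        return true   // General Category Zs, Zl, Zp
    default:
        break
    }
    // U+000A through U+000D, plus U+0085
    return (0x0A...0x0D).contains(scalar.value) || scalar.value == 0x85
}

// Try a few scalars: space, no-break space, line separator, CR, letter, tab.
let samples: [Unicode.Scalar] = [" ", "\u{00A0}", "\u{2028}", "\r", "a", "\t"]
for s in samples {
    print("U+" + String(s.value, radix: 16, uppercase: true), matchesDocumentedRule(s))
}
```

Tab (U+0009) is not covered by the rule as quoted, so this helper returns false for it; that says nothing about what the actual set reports.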

Upvotes: 2
