Jauzee

Reputation: 118

Find a word preceding a symbol set

How can I find the word that precedes one of the superscript characters [¹²³⁴⁵⁶⁷⁸⁹⁰]? For example:

let myString = "Regular expressions¹ consist of constants, ² and operator symbols...³"

Please provide a pattern that selects the characters from the start of the target word through the superscript:

"expressions¹", "constants, ²", "symbols...³"

and a pattern that selects only the target word:

"expressions", "constants", "symbols"

Upvotes: 0

Views: 104

Answers (2)

user557597


This will match your examples.

Codepoints:

\b\w+\W*[\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+

From Wikipedia:

The most common superscript digits (1, 2, and 3) were in ISO-8859-1 and were therefore carried over into those positions in the Latin-1 range of Unicode. The rest were placed in a dedicated section of Unicode at U+2070 to U+209F.
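Since the question is written in Swift, here is a minimal sketch of how the pattern above might be applied with `NSRegularExpression`. The capture group around `\w+` is an addition (not part of the pattern as posted) so that the word alone can be read from group 1, and the names `superscriptClass` and `wordsBeforeSuperscripts` are hypothetical.

```swift
import Foundation

// The superscript class from the answer, escaped for a Swift string literal.
let superscriptClass = "[\\x{B9}\\x{B2}\\x{B3}\\x{2074}\\x{2075}\\x{2076}\\x{2077}\\x{2078}\\x{2079}\\x{2070}]"

// Assumption: a capture group is added around \w+ so group 1 holds only the word.
let pattern = "\\b(\\w+)\\W*" + superscriptClass + "+"

// Hypothetical helper: returns each full match together with the captured word.
func wordsBeforeSuperscripts(in text: String) -> [(full: String, word: String)] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: range).compactMap { match -> (full: String, word: String)? in
        guard let fullRange = Range(match.range, in: text),
              let wordRange = Range(match.range(at: 1), in: text) else { return nil }
        return (full: String(text[fullRange]), word: String(text[wordRange]))
    }
}

let myString = "Regular expressions¹ consist of constants, ² and operator symbols...³"
for result in wordsBeforeSuperscripts(in: myString) {
    print(result.full, "|", result.word)
}
// expressions¹ | expressions
// constants, ² | constants
// symbols...³ | symbols
```

The full match covers the first request (word through superscript) and group 1 covers the second (word only), so one pattern serves both.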

Update:

To get separate blocks that start with words or non-words, you can just
exclude the superscript range from the non-word class.
The regex is longer and more redundant, but it works.

(?:\b\w+[^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]*|[^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+)[\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+

Formatted

 (?:
      \b 
      # Required - Words
      \w+ 
      # Optional - Not words, nor superscript
      [^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]* 

   |  # or,

      # Required - Not words, nor superscript
      [^\w\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+ 
 )
 # Required - Superscript
 [\x{B9}\x{B2}\x{B3}\x{2074}\x{2075}\x{2076}\x{2077}\x{2078}\x{2079}\x{2070}]+ 
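The formatted version can be kept almost as-is in Swift by passing the `.allowCommentsAndWhitespace` option, which makes the engine ignore pattern whitespace and `#` comments. The following is only a sketch under that assumption; the names `sup`, `blockPattern`, and `superscriptBlocks` are hypothetical, and the sample string is made up to show a superscript that follows non-word characters only.

```swift
import Foundation

// Superscript code points from the answer, without the surrounding brackets.
let sup = "\\x{B9}\\x{B2}\\x{B3}\\x{2074}\\x{2075}\\x{2076}\\x{2077}\\x{2078}\\x{2079}\\x{2070}"

// Same structure as the formatted regex; comments are allowed by the option below.
let blockPattern = """
(?:
    \\b \\w+          # required: a word
    [^\\w\(sup)]*     # optional: neither word characters nor superscripts
  |                   # or
    [^\\w\(sup)]+     # required: neither word characters nor superscripts
)
[\(sup)]+             # required: one or more superscripts
"""

// Hypothetical helper: returns every block that ends in a superscript run.
func superscriptBlocks(in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: blockPattern,
                                               options: [.allowCommentsAndWhitespace]) else { return [] }
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: range).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

print(superscriptBlocks(in: "constants¹, ² and operator symbols...³"))
```

With this input the first pattern would return "constants¹, ²" as a single match, while this one ends the first block at ¹ and starts a new, punctuation-only block for ", ²".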

Upvotes: 1

user3441734

Reputation: 17544

Based on sin's or Caleb Kleveter's information:

    let myString = " expressions¹ consist of 元機經中有關文字排版² and operator symbols³"
    // Superscript digits ⁰¹²³⁴⁵⁶⁷⁸⁹ that mark a note
    let noteIdx = "\u{2070}\u{00b9}\u{00b2}\u{00b3}\u{2074}\u{2075}\u{2076}\u{2077}\u{2078}\u{2079}"

    // Split the scalar view at every superscript; each piece is the text preceding one
    let strs = myString.unicodeScalars.split { noteIdx.unicodeScalars.contains($0) }
    strs.forEach {
        print(String($0))
    }
    /* prints

     expressions
     consist of 元機經中有關文字排版
     and operator symbols

    */

This is just a skeleton; you can extend it if you want.
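One possible way to continue, sketched under a few assumptions: a "word" is taken to be a run of letters or numbers, the text after the last superscript is ignored, and the names `superscriptDigits` and `lastWordsBeforeSuperscripts` are hypothetical.

```swift
// The ten superscript digits used as note markers.
let superscriptDigits = Set("⁰¹²³⁴⁵⁶⁷⁸⁹".unicodeScalars)

// Hypothetical helper: returns the last word of every chunk that precedes a superscript.
func lastWordsBeforeSuperscripts(in text: String) -> [String] {
    // Keep empty pieces so the final piece is always "whatever follows the last superscript".
    let chunks = text.unicodeScalars.split(omittingEmptySubsequences: false) {
        superscriptDigits.contains($0)
    }
    // Drop that trailing piece, then take the last letter/number run of each remaining chunk.
    return chunks.dropLast().compactMap { chunk -> String? in
        let words = String(chunk).split(whereSeparator: { !($0.isLetter || $0.isNumber) })
        return words.last.map { String($0) }
    }
}

print(lastWordsBeforeSuperscripts(in: " expressions¹ consist of 元機經中有關文字排版² and operator symbols³"))
// ["expressions", "元機經中有關文字排版", "symbols"]
```

Chunks that contain no word at all (for example between two adjacent superscripts) are skipped by `compactMap`.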

Upvotes: 1
