MacAttack

Reputation: 25

Regex to capture words, including hyphenated words that are split across lines

I am trying to write a regular expression, in Java, that matches words and hyphenated words. So far I have:

Pattern p1 = Pattern.compile("\\w+(?:-\\w+)",Pattern.CASE_INSENSITIVE);
Pattern p2 = Pattern.compile("[a-zA-Z0-9]+",Pattern.CASE_INSENSITIVE);
Pattern p3 = Pattern.compile("(?<=\\s)[\\w]+-$",Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

This is my test case:

    Programs
    Dsfasdf. Programs Programs Dsfasdf. Dsfasdf. as is wow woah! woah. woah? okay.
    he said, "hi." aasdfa. wsdfalsdjf. go-to go-
    to
    asdfasdf.. , : ; " ' ( ) ? ! - / \ @ # $ % & ^ ~ `  * [ ] { } + _ 123

Any help would be awesome

My expected result is to match all the words, i.e.:

Programs Dsfasdf Programs Programs Dsfasdf Dsfasdf
as is wow woah woah woah okay he said hi aasdfa
wsdfalsdjf go-to go-to asdfasdf 

The part I'm struggling with is matching a word that is split across two lines as one word.

i.e.:

go-
to

Upvotes: 0

Views: 1940

Answers (2)

ohaal

Reputation: 5268

\p{L}+(?:-\n?\p{L}+)*
\   /^\ /^\ /\   /^^^
 \ / | | | |  \ / |||
  |  | | | |   |  ||`- Previous can repeat 0 or more times (group of literal '-', optional new-line and one or more of any letter (upper/lower case))
  |  | | | |   |  |`-- End first non-capture group
  |  | | | |   |  `--- Match one or more of previous (any letter, upper/lower case)
  |  | | | |   `------ Match any letter (upper/lower case)
  |  | | | `---------- Match a single new-line (optional because of `?`)
  |  | | `------------ Literal '-'
  |  | `-------------- Start first non-capture group
  |  `---------------- Match one or more of previous (any letter, upper/lower case)
  `------------------- Match any letter (upper/lower case)

Is this OK?
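
One thing to keep in mind: for a word split across lines, the matched text still contains the line break, so it has to be removed afterwards to get go-to. Here is a minimal Java sketch of that, assuming the input is already in a String. The class name HyphenatedWords and the method words are only illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HyphenatedWords {
    // The pattern from this answer, escaped for a Java string literal.
    private static final Pattern WORD =
            Pattern.compile("\\p{L}+(?:-\\n?\\p{L}+)*");

    // Returns every word in the text; a word split as "go-\nto" comes back as "go-to".
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // The match may span a line break; drop it so both halves join up.
            result.add(m.group().replace("\n", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "wsdfalsdjf. go-to go-\nto\nasdfasdf.. 123";
        System.out.println(words(sample));
    }
}
```

For that sample this prints [wsdfalsdjf, go-to, go-to, asdfasdf]. The digits are skipped because \p{L} only matches letters. If the input can have Windows line endings, \n? would probably need to become (?:\r?\n)? and the \r would need stripping as well.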

Upvotes: 3

Ωmega

Reputation: 43673

I would go with this regex:

\p{L}+(?:\-\p{L}+)*

This regex also matches words such as "fiancé" and "À-la-carte", and any other word containing characters from the Unicode "letter" category. \p{L} matches a single code point in the category "letter".
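
To illustrate the difference, here is a small Java sketch comparing \p{L} with \w, which in Java is ASCII-only unless Pattern.UNICODE_CHARACTER_CLASS is set. The class UnicodeLetters and the helper firstMatch are made-up names for this example, and the accented characters are written as Unicode escapes so the source file encoding does not matter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeLetters {
    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String unicodeWord = "\\p{L}+(?:\\-\\p{L}+)*"; // the regex from this answer
        String asciiWord = "\\w+(?:-\\w+)*";           // \w-based variant for comparison

        // \u00e9 is "é" and \u00c0 is "À".
        System.out.println(firstMatch(unicodeWord, "fianc\u00e9"));      // whole word
        System.out.println(firstMatch(unicodeWord, "\u00c0-la-carte")); // whole word
        System.out.println(firstMatch(asciiWord, "fianc\u00e9"));        // stops before the accent
    }
}
```

With the default flags, the \w variant returns only "fianc" for the first word, while the \p{L} version returns the full word in both cases.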

Upvotes: 1
