Reputation: 25
I am trying to write a regular expression in Java that matches words and hyphenated words. So far I have:
Pattern p1 = Pattern.compile("\\w+(?:-\\w+)",Pattern.CASE_INSENSITIVE);
Pattern p2 = Pattern.compile("[a-zA-Z0-9]+",Pattern.CASE_INSENSITIVE);
Pattern p3 = Pattern.compile("(?<=\\s)[\\w]+-$",Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
This is my test case:
Programs Dsfasdf. Programs Programs Dsfasdf. Dsfasdf. as is wow woah! woah. woah? okay. he said, "hi." aasdfa. wsdfalsdjf. go-to go- to asdfasdf.. , : ; " ' ( ) ? ! - / \ @ # $ % & ^ ~ ` * [ ] { } + _ 123
Any help would be awesome.
My expected result is to match all the words, i.e.:
Programs Dsfasdf Programs Programs Dsfasdf Dsfasdf as is wow woah woah woah okay he said hi aasdfa wsdfalsdjf go-to go-to asdfasdf
The part I'm struggling with is matching words that are split across lines as a single word, i.e.:
go- to
Upvotes: 0
Views: 1940
Reputation: 5268
\p{L}+(?:-\n?\p{L}+)*

Reading the pattern from left to right:
\p{L}   Match any letter (upper/lower case)
+       Match one or more of the previous (any letter)
(?:     Start of the non-capturing group
-       Literal '-'
\n?     Match a single newline (optional because of `?`)
\p{L}   Match any letter (upper/lower case)
+       Match one or more of the previous (any letter)
)       End of the non-capturing group
*       The previous group (a literal '-', an optional newline, and one or more letters) can repeat 0 or more times
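For what it's worth, here is a rough sketch of how this pattern might be used from Java. The class name HyphenatedWords and the method extractWords are made up for illustration, and stripping the newline out of each match is my own assumption about how to get "go-to" back from a word split across lines; it is not part of the regex itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not from the original posts.
class HyphenatedWords {
    // One or more letters, then zero or more groups of
    // '-', an optional newline, and one or more letters.
    private static final Pattern WORD =
            Pattern.compile("\\p{L}+(?:-\\n?\\p{L}+)*");

    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // Assumption: a newline inside a match is a line wrap, so it is
            // removed and "go-\nto" is reported as "go-to".
            words.add(m.group().replace("\n", ""));
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "woah! he said, \"hi.\" go-to go-\nto asdfasdf.. _ 123";
        // prints [woah, he, said, hi, go-to, go-to, asdfasdf]
        System.out.println(extractWords(text));
    }
}
```

Note that this only joins the halves when the break after the hyphen is a "\n". If the input has a space there instead (as in "go- to"), the pattern matches "go" and "to" as two separate words.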
Upvotes: 3
Reputation: 43673
I would go with this regex:
\p{L}+(?:\-\p{L}+)*
This regex also matches words such as "fiancé" and "À-la-carte", as well as other words containing non-ASCII letters. \p{L} matches a single code point in the Unicode category "letter".
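To illustrate the difference, here is a small sketch comparing \p{L} with an ASCII-only character class on the same input. The class name UnicodeLetterWords and the helper findAll are hypothetical, and the ASCII pattern is only there for comparison. The accented letters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original posts.
class UnicodeLetterWords {
    // Collect every match of the given regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "fiancé À-la-carte go-to" with the accented letters escaped.
        String text = "fianc\u00e9 \u00c0-la-carte go-to";

        // \p{L} covers any Unicode letter:
        // prints [fiancé, À-la-carte, go-to]
        System.out.println(findAll("\\p{L}+(?:\\-\\p{L}+)*", text));

        // An ASCII-only class drops the accented letters:
        // prints [fianc, la-carte, go-to]
        System.out.println(findAll("[a-zA-Z]+(?:-[a-zA-Z]+)*", text));
    }
}
```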
Upvotes: 1