Reputation: 95
I have text like this:
The C language is%y% widely used today in application, operating system, and embedded system development, and its influence is seen in most modern programming languages. UNIX has also been influential, establishing %y% concepts and principles that are now precepts of computing.%p%
The text contains some unwanted indicators: %y% and %p%
I extract the words using this regex:
Pattern p = Pattern.compile("[a-zA-Z]+");
This finds all the words, but it also returns the letters "y" and "p" from the indicators. How can I ignore these indicators?
Upvotes: 1
Views: 1297
Reputation: 36329
Or you may treat the indicators as separate words, and sort them out later:
Pattern p = Pattern.compile("[a-zA-Z]+|%[a-z]%");
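A minimal sketch of how the "sort them out later" step might look, assuming the tokens are collected with Matcher.find() (the class name IndicatorFilter and the method words are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
class IndicatorFilter {
    // Same pattern as above: a run of ASCII letters, or an indicator such as %y%.
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z]+|%[a-z]%");

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String token = m.group();
            // Indicators are matched as whole tokens, so they can be dropped here.
            if (!token.startsWith("%")) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [The, C, language, is, widely, used]
        System.out.println(words("The C language is%y% widely used.%p%"));
    }
}
```

Because the indicator is consumed as a single token, its inner letter never shows up as a separate word.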
BTW, you should not use [a-zA-Z] for natural-language text: even English text can contain words like café and names like Björn. For this, java.util.regex.Pattern supports the predefined character class for letters, \p{L}, along with \p{Ll} (only lowercase letters) and \p{Lu} (only uppercase letters), which match such words just fine.
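As a rough illustration of the \p{L} class, here is a small sketch. The class name UnicodeWords and the sample sentence are invented for this example, and it assumes the accented letters are stored as single precomposed characters. Note that the backslash has to be doubled inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class used only for illustration.
class UnicodeWords {
    // \p{L} matches any Unicode letter; written as "\\p{L}" in Java source.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source ASCII-only; the text is "Björn likes café".
        System.out.println(words("Bj\u00f6rn likes caf\u00e9"));
    }
}
```

With [a-zA-Z]+ the same input would be cut at the accented letters, so "café" would come back as "caf".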
Upvotes: 1
Reputation: 24334
If the only indicators are "%y%" and "%p%", you could keep it simple and just remove them before applying the regex,
e.g.
myString = myString.replaceAll("%y%|%p%", "");
Upvotes: 0
Reputation: 2929
You could use some pre-processing to remove all of the unnecessary indicators before you do your main processing. Something like this should work (strings are immutable, so keep the returned value):
string = string.replaceAll("%y%|%p%", "");
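Put together with the regex from the question, the pre-processing approach might look like the sketch below. The class name StripThenSplit and the method words are placeholders, not anything from the original posts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class used only for illustration.
class StripThenSplit {
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    static List<String> words(String text) {
        // Step 1: remove the indicators; replaceAll returns a new string.
        String cleaned = text.replaceAll("%y%|%p%", "");
        // Step 2: collect the words from the cleaned text.
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(cleaned);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [UNIX, establishing, concepts]
        System.out.println(words("UNIX establishing %y% concepts.%p%"));
    }
}
```

One thing to watch: if an indicator ever sits directly between two words with no space, removing it would join them into one word. Replacing with a single space instead of the empty string would avoid that.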
Upvotes: 2