Reputation: 1705
I have read many of the regex questions on Stack Overflow, but they didn't help me develop my own code.
What I need is the following. I am parsing texts which have already been parsed using the Stanford Tagger. Now I am trying to remove the time durations in some parts of the texts, in two cases: 1) when the phrase starts with a date (e.g. 1999_CARD Tom_NN was_VP); 2) when the time duration follows this format: 2/1999_CARD -_- 01/01/2000_CARD (or similar ones).
I have written some code, but it wrongly removes other parts of the text and I don't know why. My regex is the following:
String regex = "(\\s|\\b.*?_(CARD|CD)\\s([^A-Za-z0-9])+_([^A-Za-z0-9])+(.*?)+_(CARD|CD))|(\\b.*?_(CARD|CD))";
Pattern pattern2 = Pattern.compile(regex);
Matcher m2 = pattern2.matcher(chunkPhrase);
if (m2.find()) {
chunkPhrase = chunkPhrase.replace(m2.group(0), "");
}
For example, in the following phrase it finds a match, but it shouldn't:
·_NNP Research_NNP of_IN Symbian_NNP OS_NNP 7.0_CD s_NNS
After removing the time duration in the above phrase, I'm left with · s_NNS, which is not what I want.
To make it clearer what I expect from the code, here are some examples:
1/1/2002_CD -_- 1/2/2003_CD Test_NN Company_NN
after applying the code, I expect:
Test_NN Company_NN
For this one:
1/1/2002_CARD -_- 1/2/2003_CARD Test_NN Company_NN
after applying the code, I expect:
Test_NN Company_NN
For this one:
2000_CARD I_NN was_VP working_NP here_ADV
after applying the code, I expect:
I_NN was_VP working_NP here_ADV
For this one:
I_NN have_VP worked_VP in_PP 3_CARD companies_NP
after applying the code, I expect:
I_NN have_VP worked_VP in_PP 3_CARD companies_NP
I am using Java.
Update: To clarify: if a number occurs AT THE BEGINNING of the phrase, it must be removed. Otherwise, it must remain. If it follows the second format (e.g. 1999_CD -_- 2000_CARD), it must be removed regardless of whether it occurs at the beginning, middle or end of the phrase.
Can anyone help me find what is wrong with my code?
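The unwanted match on the Symbian phrase can be reproduced in isolation. The following is a minimal sketch, assuming the trailing alternative of the regex, \b.*?_(CARD|CD), is the one that fires on that phrase; the class and method names are made up for illustration. Because .*? may consume any characters, the match starts at the first word boundary and runs on to the first _CD tag, whatever lies in between:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverMatchDemo {
    // Trailing alternative of the regex from the question, isolated (assumption:
    // this is the branch that matches the Symbian phrase).
    static final Pattern LAZY = Pattern.compile("\\b.*?_(CARD|CD)");

    // Returns the first substring the pattern finds, or null if there is none.
    static String firstMatch(String phrase) {
        Matcher m = LAZY.matcher(phrase);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // \u00B7 is the middle dot that starts the phrase in the question.
        String phrase = "\u00B7_NNP Research_NNP of_IN Symbian_NNP OS_NNP 7.0_CD s_NNS";
        System.out.println(firstMatch(phrase));
    }
}
```

This prints _NNP Research_NNP of_IN Symbian_NNP OS_NNP 7.0_CD, and removing that substring leaves exactly the · s_NNS reported above. Restricting the token part to digits and slashes instead of .*? would be one way to avoid this.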
Upvotes: 1
Views: 142
Reputation: 785058
You can use this regex:
final String regex = "\\b(?:\\d{1,2}/*\\d{1,2}/)?\\d{4}_(?:CARD|CD)(?:\\h*[-_]+)?\\h*";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(input);
// The substituted value will be contained in the result variable
final String result = matcher.replaceAll("");
System.out.println("Substitution result: " + result);
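As a quick check, the pattern can be run against the sample phrases from the question. This is only a sketch: the class name YearTagStripper and the method strip are hypothetical, and the regex itself is copied unchanged from the snippet above (\h needs Java 8 or later).

```java
import java.util.regex.Pattern;

final class YearTagStripper {
    // Regex from the answer, unchanged.
    static final Pattern DURATION = Pattern.compile(
            "\\b(?:\\d{1,2}/*\\d{1,2}/)?\\d{4}_(?:CARD|CD)(?:\\h*[-_]+)?\\h*");

    // Removes every match of the pattern from the phrase.
    static String strip(String phrase) {
        return DURATION.matcher(phrase).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample phrases taken from the question.
        String[] samples = {
            "1/1/2002_CD -_- 1/2/2003_CD Test_NN Company_NN",
            "2000_CARD I_NN was_VP working_NP here_ADV",
            "I_NN have_VP worked_VP in_PP 3_CARD companies_NP",
            "\u00B7_NNP Research_NNP of_IN Symbian_NNP OS_NNP 7.0_CD s_NNS"
        };
        for (String s : samples) {
            System.out.println(strip(s));
        }
    }
}
```

The first two phrases lose their dates, and the last two come back unchanged because 3_CARD and 7.0_CD do not contain a four-digit year. Note that the pattern is not anchored to the start of the phrase, so a tagged four-digit year in the middle of a phrase would be removed as well.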
RegEx Breakup:
\b - Word boundary
(?: - Start non-capturing group
\d{1,2}/*\d{1,2}/ - Match mm/dd part of a date
)? - End non-capturing group (optional)
\d{4} - Match 4 digits of year
_ - Match a literal _
(?:CARD|CD) - Match CARD or CD
(?: - Start non-capturing group
\h*[-_]+ - Match horizontal whitespace followed by 1 or more - or _ characters
)? - End non-capturing group (optional)
\h* - Match 0 or more horizontal whitespaces
Upvotes: 1
Reputation: 2852
Based on the examples you have provided, the following regex will capture the required time durations:
((?:\d{2,}|\d{1,2}\/\d{1,2}\/\d{2,4})_(?:CARD|CD) (?:-_- )?)
Details
(?:\d{2,}|\d{1,2}\/\d{1,2}\/\d{2,4}) // match minimum of 2 digits or a date in xx/xx/xx[xx] format
_(?:CARD|CD) // match _CARD or _CD
(?:-_- )? // match -_- , if it exists
The ?: at the beginning means these are non-capturing groups. The parentheses around the whole expression form the capturing group.
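Since the question is about Java, here is a rough sketch of how this regex might be applied there. The class name DurationRemover and the method remove are made up for illustration; the backslashes are doubled for the Java string literal, and the slashes are left unescaped because Java does not require escaping them.

```java
import java.util.regex.Pattern;

final class DurationRemover {
    // Same regex as above, written as a Java string literal.
    static final Pattern DURATION = Pattern.compile(
            "((?:\\d{2,}|\\d{1,2}/\\d{1,2}/\\d{2,4})_(?:CARD|CD) (?:-_- )?)");

    // Deletes every captured duration from the phrase.
    static String remove(String phrase) {
        return DURATION.matcher(phrase).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample phrases taken from the question.
        System.out.println(remove("1/1/2002_CARD -_- 1/2/2003_CARD Test_NN Company_NN"));
        System.out.println(remove("2000_CARD I_NN was_VP working_NP here_ADV"));
        System.out.println(remove("I_NN have_VP worked_VP in_PP 3_CARD companies_NP"));
    }
}
```

On these three phrases the dates are removed and 3_CARD is kept, since a single digit satisfies neither alternative. One thing to keep in mind is that the pattern expects a space after the tag, so a tagged number at the very end of a phrase would not be matched.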
Upvotes: 1