user1419243

Reputation: 1705

Regex to find time durations

I have read many of the regex questions on Stack Overflow, but they didn't help me write my own pattern.

Here is what I need. I am processing texts that have already been tagged with the Stanford Tagger. Now I am trying to remove time durations from parts of the texts in two cases: 1) the phrase starts with a date (e.g. 1999_CARD Tom_NN was_VP); 2) the time duration has this format: 2/1999_CARD -_- 01/01/2000_CARD (or similar).

I have written some code, but it wrongly removes other parts of the text, and I don't know why. My regex is the following:

    String regex = "(\\s|\\b.*?_(CARD|CD)\\s([^A-Za-z0-9])+_([^A-Za-z0-9])+(.*?)+_(CARD|CD))|(\\b.*?_(CARD|CD))";
    Pattern pattern2 = Pattern.compile(regex);
    Matcher m2 = pattern2.matcher(chunkPhrase);
    if (m2.find()) {
        chunkPhrase = chunkPhrase.replace(m2.group(0), "");
    }

For example, in the following phrase it finds a match, but it shouldn't:

·_NNP Research_NNP of_IN Symbian_NNP OS_NNP 7.0_CD s_NNS

After the removal, I'm left with · s_NNS, which is not what I want.

To make clearer what I expect from the code, here are some examples:

1/1/2002_CD -_- 1/2/2003_CD Test_NN Company_NN

after applying the code, I expect:

Test_NN Company_NN

For this one:

1/1/2002_CARD -_- 1/2/2003_CARD Test_NN Company_NN

after applying the code, I expect:

Test_NN Company_NN

For this one:

2000_CARD I_NN was_VP working_NP here_ADV

after applying the code, I expect:

I_NN was_VP working_NP here_ADV

For this one:

I_NN have_VP worked_VP in_PP 3_CARD companies_NP

after applying the code, I expect:

I_NN have_VP worked_VP in_PP 3_CARD companies_NP

I am using Java.

Update: To clarify: if a number occurs AT THE BEGINNING of the phrase, it must be removed; otherwise, it must remain. If it follows the second format (e.g. 1999_CD -_- 2000_CARD), it must be removed regardless of whether it occurs at the beginning, middle, or end of the phrase.

Can anyone tell me what is wrong with my code?

Upvotes: 1

Views: 142

Answers (2)

anubhava

Reputation: 785058

You can use this regex:

    final String regex = "\\b(?:\\d{1,2}/*\\d{1,2}/)?\\d{4}_(?:CARD|CD)(?:\\h*[-_]+)?\\h*";

    final Pattern pattern = Pattern.compile(regex);
    final Matcher matcher = pattern.matcher(input);

    // The substituted value will be contained in the result variable
    final String result = matcher.replaceAll("");

    System.out.println("Substitution result: " + result);


RegEx Breakup:

  • \b - Word boundary
  • (?: - Start non-capturing group
    • \d{1,2}/*\d{1,2}/ - Match the mm/dd/ part of a date (the /* allows zero or more slashes between the two numbers)
  • )? - End non-capturing group (optional)
  • \d{4} - Match 4 digits of year
  • _ - Match a literal _
  • (?:CARD|CD) - Match CARD or CD
  • (?: - Start non-capturing group
    • \h*[-_]+ - Match 0 or more horizontal whitespace characters followed by 1 or more - or _ (e.g. -_-)
  • )? - End non-capturing group (optional)
  • \h* - Match 0 or more horizontal whitespace characters
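
To try the pattern against the phrases from the question, something like the following sketch could be used. The class name YearTagStripper and the helper strip are made up for illustration and are not part of the original snippet; only the regex itself comes from this answer.

```java
import java.util.regex.Pattern;

class YearTagStripper {
    // The regex from this answer, compiled once and reused.
    private static final Pattern YEAR_TAG = Pattern.compile(
            "\\b(?:\\d{1,2}/*\\d{1,2}/)?\\d{4}_(?:CARD|CD)(?:\\h*[-_]+)?\\h*");

    // Hypothetical helper: removes every match of the pattern from the phrase.
    static String strip(String phrase) {
        return YEAR_TAG.matcher(phrase).replaceAll("");
    }

    public static void main(String[] args) {
        String[] samples = {
            "1/1/2002_CD -_- 1/2/2003_CD Test_NN Company_NN",
            "2000_CARD I_NN was_VP working_NP here_ADV",
            "I_NN have_VP worked_VP in_PP 3_CARD companies_NP",
            "Research_NNP of_IN Symbian_NNP OS_NNP 7.0_CD s_NNS"
        };
        // The first two lose their date tokens; the last two are printed unchanged
        // because neither contains a 4-digit number directly before _CARD or _CD.
        for (String sample : samples) {
            System.out.println(strip(sample));
        }
    }
}
```

Two things to keep in mind when reusing it: the pattern is not anchored, so a tagged 4-digit number in the middle of a phrase (in_PP 1999_CD Tom_NN) is removed as well, and a month/year form such as 2/1999_CARD only loses the year part, leaving the 2/ behind.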

Upvotes: 1

garyh

Reputation: 2852

Based on the examples you have provided, the following regex will capture the required time durations:

((?:\d{2,}|\d{1,2}\/\d{1,2}\/\d{2,4})_(?:CARD|CD) (?:-_- )?)

Details

(?:\d{2,}|\d{1,2}\/\d{1,2}\/\d{2,4})  // match minimum of 2 digits or a date in xx/xx/xx[xx] format

_(?:CARD|CD)  // match _CARD or _CD

(?:-_- )?  // match -_- , if it exists 

The ?: at the beginning of a group means it is non-capturing. The parentheses around the whole expression form the capturing group.
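
Since the question is about Java, here is a rough sketch of how the pattern might be used there. Backslashes have to be doubled in a Java string literal, and the \/ can be written as a plain /. The class and method names are made up for illustration. The second method, stripPerUpdate, is not the regex above but a hypothetical stricter variant, added on the assumption that the update in the question should be followed literally: ranges are removed anywhere, a single tagged number only at the start of the phrase.

```java
import java.util.regex.Pattern;

class TaggedDurationSketch {
    // Direct port of the pattern above as a Java string literal.
    private static final Pattern ANYWHERE = Pattern.compile(
            "((?:\\d{2,}|\\d{1,2}/\\d{1,2}/\\d{2,4})_(?:CARD|CD) (?:-_- )?)");

    // Hypothetical variant (an assumption, not from the answer):
    // a tagged number or date is up to two "n/" prefixes, 1-4 digits, then _CARD or _CD.
    private static final String TAGGED = "(?:\\d{1,2}/){0,2}\\d{1,4}_(?:CARD|CD)";
    // A range "X -_- Y" is removed wherever it occurs.
    private static final Pattern RANGE = Pattern.compile(
            "\\b" + TAGGED + "\\h+-_-\\h+" + TAGGED + "\\h*");
    // A single tagged number is removed only at the very start of the phrase.
    private static final Pattern LEADING = Pattern.compile("^\\h*" + TAGGED + "\\h*");

    // Applies the answer's pattern everywhere in the phrase.
    static String stripAnywhere(String phrase) {
        return ANYWHERE.matcher(phrase).replaceAll("");
    }

    // Removes ranges first, then one leading tagged number, then trims leftover spaces.
    static String stripPerUpdate(String phrase) {
        String withoutRanges = RANGE.matcher(phrase).replaceAll("");
        return LEADING.matcher(withoutRanges).replaceFirst("").trim();
    }

    public static void main(String[] args) {
        String phrase = "I_NN have_VP worked_VP in_PP 12_CARD companies_NP";
        System.out.println(stripAnywhere(phrase));   // drops the mid-phrase 12_CARD
        System.out.println(stripPerUpdate(phrase));  // leaves the phrase as it is
    }
}
```

The difference only shows up for numbers of two or more digits in the middle of a phrase; for the four examples in the question both methods give the expected output. Note also that the original pattern requires a space after the tag, so a tagged number at the very end of the input is not matched by it.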

Upvotes: 1
