Vineeth Prabhakaran

Reputation: 641

Regular expression with conditional extraction

I have sentences like

1 1994 FORD 5640 2WD Tractor

2 AG-GATOR 1004 4x4 Tree Spade Truck

3 2004 ROSCO RB48 Broom

4 TENNANT 830II Street Sweeper

from which I need to extract words using a regex, like:

5640
1004
RB48
830II

That is, if a sentence contains a year (such as 1994 in the 1st sentence), I need to get the 4th word (5640); if there is no year (as in the 2nd sentence), I need to get the 3rd word (1004).

Can anyone suggest a regular expression to achieve this?

Upvotes: 0

Views: 486

Answers (4)

m.cekiera

Reputation: 5395

You can try with:

(?=^(?:.*\d{4}\s)?[-a-zA-Z]+\s([a-zA-Z0-9]+))


Which means:

  • (?= - positive lookahead for:
  • ^ - beginning of a line,
  • (?:.*\d{4}\s)? - optionally, any characters followed by four digits and a whitespace character (the year),
  • [-a-zA-Z]+\s - one or more letters or hyphens and a whitespace character,
  • ([a-zA-Z0-9]+) - one or more letters or digits (desired value)

This regex captures inside a lookahead, so the overall match consumes no text; it is just a zero-length position in the input. You can still get the value with group(1). Example in Java:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test{
    public static void main(String[] args){
        String[] array = {"1994 FORD 5640 2WD Tractor","AG-GATOR 1004 4x4 Tree Spade Truck","2004 ROSCO RB48 Broom",
                "TENNANT 830II Street Sweeper","4A 1998 BROCE RJ350 Broom"};
        // Compile once; the same pattern is reused for every element.
        Pattern pattern = Pattern.compile("(?=^(?:.*\\d{4}\\s)?[-a-zA-Z]+\\s([a-zA-Z0-9]+))");
        for(String element : array) {
            Matcher matcher = pattern.matcher(element);
            if (matcher.find()) {
                // group(1) holds the value captured inside the lookahead.
                System.out.println(matcher.group(1));
            }
        }
    }
}

Another way, but only for Java, would be to match directly with:

(?<=^(?:.{0,99}\d{4}\s)?[-a-zA-Z]{1,99}\s)[a-zA-Z0-9]+


This uses a positive lookbehind without a fixed length. It relies on a rather ugly construction such as .{0,99} (from zero to 99 characters). In most regex flavours you cannot use quantifiers in lookbehinds, but Java allows ? and intervals with explicit min and max values (e.g. {2,6}). It is not too elegant, but it works in this case.
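A minimal sketch of how the lookbehind variant could be used, assuming the same index-free input strings as the example above. The class name LookbehindDemo and the helper firstMatch are hypothetical names chosen for illustration. Because the prefix is checked by the lookbehind, the whole match (group() with no argument) is the wanted value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Bounded quantifiers ({0,99}, {1,99}) keep the lookbehind length finite,
    // which is what Java needs in order to accept it.
    private static final Pattern MODEL = Pattern.compile(
            "(?<=^(?:.{0,99}\\d{4}\\s)?[-a-zA-Z]{1,99}\\s)[a-zA-Z0-9]+");

    // Returns the first token whose prefix satisfies the lookbehind, or null.
    static String firstMatch(String line) {
        Matcher m = MODEL.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("1994 FORD 5640 2WD Tractor"));
        System.out.println(firstMatch("AG-GATOR 1004 4x4 Tree Spade Truck"));
    }
}
```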

Upvotes: 0

Pranav C Balan

Reputation: 115212

Use this regex:

\d+\s(?:\d{4}\s\S*?\s(\S+)|\S+\s(\S+))


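  1. \d+ for the index number
  2. \d{4}\s\S*?\s(\S+) for the first type (a year is present; the value lands in group 1)
  3. \S+\s(\S+) for the second type (no year; the value lands in group 2)

Update: for an alphanumeric index, use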

(?<=^|\n)\w+\s(?:\d{4}\s\S*?\s(\S+)|\S+\s(\S+))


(?<=^|\n) is a positive lookbehind: the match must start either at the beginning of the string or right after a newline.
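Since the two alternatives capture into different groups, the caller has to check which one took part in the match. A rough sketch in Java, using the first (numeric-index) regex from this answer; the class name AlternationDemo and the helper extract are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // Group 1 is set when a year is present, group 2 when it is not.
    private static final Pattern MODEL = Pattern.compile(
            "\\d+\\s(?:\\d{4}\\s\\S*?\\s(\\S+)|\\S+\\s(\\S+))");

    // Returns whichever capture group participated in the match, or null.
    static String extract(String line) {
        Matcher m = MODEL.matcher(line);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(extract("1 1994 FORD 5640 2WD Tractor"));         // 5640
        System.out.println(extract("2 AG-GATOR 1004 4x4 Tree Spade Truck")); // 1004
    }
}
```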

Upvotes: 1

Thomas

Reputation: 88707

Assuming the layout is somewhat constant (as it seems from your question) just make the year optional:

^\d+ (?:\d{4} )?\S+ (\S+)

Breakdown of the expression:

  • ^ start of the input
  • \d+ a sequence of digits followed by a space char
  • (?:\d{4} )? an optional sequence of 4 digits followed by a space char
  • \S+ a sequence of non-whitespace followed by a space char
  • (\S+) a sequence of non-whitespace as a capturing group - this is what you're after

If you want to support any whitespace in between and possibly any length use \s+ instead of just the space character.

Use classes Pattern and Matcher to apply the regex on each sentence and use group(1) on the matcher to extract the content of the group you're looking for.
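A short sketch of that Pattern/Matcher usage, applied to the four sentences from the question. The class name OptionalYearDemo and the helper extract are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalYearDemo {
    // index, optional four-digit year, one word to skip, then the captured word
    private static final Pattern MODEL =
            Pattern.compile("^\\d+ (?:\\d{4} )?\\S+ (\\S+)");

    // Returns the content of group 1, or null if the line does not fit the layout.
    static String extract(String line) {
        Matcher m = MODEL.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "1 1994 FORD 5640 2WD Tractor",
            "2 AG-GATOR 1004 4x4 Tree Spade Truck",
            "3 2004 ROSCO RB48 Broom",
            "4 TENNANT 830II Street Sweeper"
        };
        for (String line : lines) {
            System.out.println(extract(line)); // 5640, 1004, RB48, 830II
        }
    }
}
```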

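Edit: note that in some regex flavours \d matches any kind of Unicode digit (in Java this happens only when the UNICODE_CHARACTER_CLASS flag is set; by default \d is [0-9]). If you want to state explicitly that only ASCII digits 0-9 are allowed, use [0-9] instead.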

Depending on how much you want to restrict possible year numbers you might want to expand that expression as well, e.g. (19|20)[0-9]{2} instead of \d{4}.
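To illustrate the difference, here is a sketch that combines the suggestions above ([0-9], \s+ and the restricted year) into one stricter pattern and compares it with the loose one. The class and method names are hypothetical, and the line "5 3000 X200 Loader" is an invented input (not from the question) where the second word is four digits but is not a year.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictYearDemo {
    // Loose form: any four digits after the index are treated as a year.
    private static final Pattern LOOSE =
            Pattern.compile("^\\d+ (?:\\d{4} )?\\S+ (\\S+)");
    // Stricter sketch: ASCII digits, any whitespace run, years limited to 19xx/20xx.
    private static final Pattern STRICT =
            Pattern.compile("^[0-9]+\\s+(?:(?:19|20)[0-9]{2}\\s+)?\\S+\\s+(\\S+)");

    private static String extract(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    static String loose(String line)  { return extract(LOOSE, line); }
    static String strict(String line) { return extract(STRICT, line); }

    public static void main(String[] args) {
        String invented = "5 3000 X200 Loader";   // hypothetical input
        System.out.println(loose(invented));      // Loader (3000 consumed as a year)
        System.out.println(strict(invented));     // X200
        System.out.println(strict("3  2004\tROSCO  RB48 Broom")); // RB48
    }
}
```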

Upvotes: 1

Rick

Reputation: 1209

What about /\d{4}(?!.*\d{4})/g? The negative lookahead rejects any four-digit sequence that is followed later in the line by another one, and it does so without consuming characters.

EDIT: this regex matches the last 4 digit sequence in the text.
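A small Java sketch of this pattern, run against the question's four sentences (the slashes and the g flag are JavaScript-style delimiters and are dropped here; the class name LastFourDigitsDemo and the helper lastFourDigits are hypothetical). It returns the wanted value for the first two sentences only: the third contains no four-digit sequence other than the year, and the fourth contains none at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastFourDigitsDemo {
    // Four digits not followed, anywhere later in the line, by another four digits.
    private static final Pattern LAST_FOUR = Pattern.compile("\\d{4}(?!.*\\d{4})");

    static String lastFourDigits(String line) {
        Matcher m = LAST_FOUR.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastFourDigits("1 1994 FORD 5640 2WD Tractor"));         // 5640
        System.out.println(lastFourDigits("2 AG-GATOR 1004 4x4 Tree Spade Truck")); // 1004
        System.out.println(lastFourDigits("3 2004 ROSCO RB48 Broom"));              // 2004
        System.out.println(lastFourDigits("4 TENNANT 830II Street Sweeper"));       // null
    }
}
```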

Upvotes: 0
