Just_another_developer

Reputation: 5957

Java Regex to extract specific words

I am trying to extract every occurrence of 'and', 'a', 'the', 'an', and '&' from a block of text, along with every occurrence of digits.

I tried several different regexes for this purpose but failed to get an accurate result.

The digits are extracted fine, but I am unable to fetch the aforementioned strings with a regex.

My basic regex was:

 Pattern p = Pattern.compile("^[0-9]");

Then I tried different combinations, such as

 Pattern p = Pattern.compile("^[0-9](&)");
 Pattern p = Pattern.compile("^[0-9]+[&]");

to get the aforementioned strings, but to no avail.

Example of the text:

System requirements: iOS 6.0 and Android (varies) &
Version used in this guide: 2.2.4 (iPhone), 13.1.2 (Android)

Expected Result

 6.0,and,&,2.2.4,13.1.2

Upvotes: 0

Views: 114

Answers (2)

ohaal

Reputation: 5268

You are nowhere even close with your "attempts" and I almost feel bad for just handing you the solution, but if you really are "keen to learn new things" (as you say in your SO profile), have a look at a regex tutorial.

A basic use of alternation, grouping, quantifiers, and anchors (word boundaries) will solve your problem.

(\b(?:a|an|and|the)\b|&|\d+(?:\.\d+)*)

Explanation:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      a                        'a'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      an                       'an'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      and                      'and'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      the                      'the'
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    &                        '&'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \.                       '.'
--------------------------------------------------------------------------------
      \d+                      digits (0-9) (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
  )                        end of \1

For use in a Java string literal, you would have to escape every backslash (\ becomes \\):

(\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*)
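
As a minimal sketch of how the escaped pattern above might be used, the following runs it over the sample text from the question with `Pattern` and `Matcher`. The class name `ExtractTokens` and the helper `extract` are illustrative names only, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractTokens {
    // The Java-escaped pattern from the answer above, compiled once.
    private static final Pattern TOKENS =
        Pattern.compile("(\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*)");

    // Hypothetical helper: collects every match, in order of appearance.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKENS.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "System requirements: iOS 6.0 and Android (varies) &\n"
            + "Version used in this guide: 2.2.4 (iPhone), 13.1.2 (Android)";
        // Prints: 6.0,and,&,2.2.4,13.1.2
        System.out.println(String.join(",", extract(text)));
    }
}
```

Note that `find()` is used rather than `matches()`, because the goal is to pull every occurrence out of the text, not to test whether the whole text fits the pattern. The match is case-sensitive as written; passing `Pattern.CASE_INSENSITIVE` to `Pattern.compile` would also pick up "The" or "And".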

Upvotes: 1

karthik manchala

Reputation: 13640

You can use the following regex (shown with Java string escaping):

(\\ban?d?\\b|\\bthe\\b|\\B&\\B|[\\d.]+)

See DEMO
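
To check how this pattern behaves on inputs other than the sample text, a small sketch can run it next to the explicit alternation `(\b(?:a|an|and|the)\b|&|\d+(?:\.\d+)*)`. The class and method names here are hypothetical. One thing worth knowing: `an?d?` also accepts "ad", and `[\d.]+` accepts trailing or stand-alone dots.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternComparison {
    // The pattern from this answer, unchanged.
    static final Pattern LOOSE =
        Pattern.compile("(\\ban?d?\\b|\\bthe\\b|\\B&\\B|[\\d.]+)");
    // Explicit alternation with a stricter number rule, for comparison.
    static final Pattern STRICT =
        Pattern.compile("(\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*)");

    // Hypothetical helper: returns all matches of p in text, in order.
    static List<String> matches(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "an ad for v1.0.";
        System.out.println(matches(LOOSE, text));  // [an, ad, 1.0.]
        System.out.println(matches(STRICT, text)); // [an, 1.0]
    }
}
```

On the sample text from the question, both patterns give the same five matches; the difference only shows up on inputs containing words like "ad", sentence-ending periods, or an "&" directly attached to a word character (as in "R&D", which `\B&\B` skips).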

Upvotes: 0
