Reputation: 5957
I am trying to extract all occurrences of 'and', 'a', 'the', 'an', and '&' from a block of text, along with all the numbers.
I tried several different regexes for this but failed to get accurate results.
The digits are extracted fine, but I am unable to match the aforementioned strings with a regex.
My basic regex was:
Pattern p = Pattern.compile("^[0-9]");
Then I tried different combinations, such as
Pattern p = Pattern.compile("^[0-9](&)");
Pattern p = Pattern.compile("^[0-9]+[&]");
to match the aforementioned strings, but to no avail.
Example of the text:
System requirements: iOS 6.0 and Android (varies) &
Version used in this guide: 2.2.4 (iPhone), 13.1.2 (Android)
Expected result:
6.0,and,&,2.2.4,13.1.2
Upvotes: 0
Views: 114
Reputation: 5268
You are nowhere even close with your "attempts" and I almost feel bad for just handing you the solution, but if you really are "keen to learn new things" (as you say in your SO profile), have a look at a regex tutorial.
Basic use of alternation, grouping, quantifiers, and anchors (word boundaries) will solve your problem:
(\b(?:a|an|and|the)\b|&|\d+(?:\.\d+)*)
Explanation:
NODE                     EXPLANATION
--------------------------------------------------------------------------------
(                        group and capture to \1:
--------------------------------------------------------------------------------
  \b                     the boundary between a word char (\w)
                         and something that is not a word char
--------------------------------------------------------------------------------
  (?:                    group, but do not capture:
--------------------------------------------------------------------------------
    a                    'a'
--------------------------------------------------------------------------------
   |                     OR
--------------------------------------------------------------------------------
    an                   'an'
--------------------------------------------------------------------------------
   |                     OR
--------------------------------------------------------------------------------
    and                  'and'
--------------------------------------------------------------------------------
   |                     OR
--------------------------------------------------------------------------------
    the                  'the'
--------------------------------------------------------------------------------
  )                      end of grouping
--------------------------------------------------------------------------------
  \b                     the boundary between a word char (\w)
                         and something that is not a word char
--------------------------------------------------------------------------------
 |                       OR
--------------------------------------------------------------------------------
  &                      '&'
--------------------------------------------------------------------------------
 |                       OR
--------------------------------------------------------------------------------
  \d+                    digits (0-9) (1 or more times (matching
                         the most amount possible))
--------------------------------------------------------------------------------
  (?:                    group, but do not capture (0 or more
                         times (matching the most amount
                         possible)):
--------------------------------------------------------------------------------
    \.                   '.'
--------------------------------------------------------------------------------
    \d+                  digits (0-9) (1 or more times
                         (matching the most amount possible))
--------------------------------------------------------------------------------
  )*                     end of grouping
--------------------------------------------------------------------------------
)                        end of \1
For use in Java, you have to escape every \ in the string literal (write it as \\):
(\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*)
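To show how the escaped pattern might be used, here is a minimal sketch that runs it over the example text from the question with Matcher.find(). The class name TokenExtractor and the method extract are made up for illustration; only the pattern itself comes from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for the sake of the example.
class TokenExtractor {
    // The pattern from the answer, with every backslash doubled for Java.
    private static final Pattern TOKENS =
        Pattern.compile("(\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*)");

    // Collects every match in order of appearance.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKENS.matcher(text);
        while (m.find()) {          // find() scans forward; matches() would require the whole input to match
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "System requirements: iOS 6.0 and Android (varies) &\n"
                    + "Version used in this guide: 2.2.4 (iPhone), 13.1.2 (Android)";
        // prints: 6.0,and,&,2.2.4,13.1.2
        System.out.println(String.join(",", extract(text)));
    }
}
```

Note that the match is case-sensitive, so "Android" and "The" are not picked up; passing Pattern.CASE_INSENSITIVE as a second argument to Pattern.compile would change that if desired.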
Upvotes: 1
Reputation: 13640
You can use the following regex:
(\\ban?d?\\b|\\bthe\\b|\\B&\\B|[\\d.]+)
See DEMO
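This compact pattern gives the expected result on the example text, but it is looser than an explicit alternation: an?d? also accepts "ad", [\d.]+ accepts a lone ".", and \B&\B rejects an & that touches a word character. A small sketch to check this, where the class name PatternComparison, the method findAll, and the test strings are assumptions chosen for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison harness; not part of either answer.
class PatternComparison {
    // The pattern from this answer.
    static final Pattern COMPACT =
        Pattern.compile("(\\ban?d?\\b|\\bthe\\b|\\B&\\B|[\\d.]+)");

    // The explicit-alternation pattern from the other answer, for comparison.
    static final Pattern EXPLICIT =
        Pattern.compile("(\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*)");

    // Returns all matches of p in text, in order.
    static List<String> findAll(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll(COMPACT, "an ad."));   // prints [an, ad, .]
        System.out.println(findAll(EXPLICIT, "an ad."));  // prints [an]
        System.out.println(findAll(COMPACT, "R&D"));      // prints []
        System.out.println(findAll(EXPLICIT, "R&D"));     // prints [&]
    }
}
```

Which behavior is preferable depends on the input; for text like the example in the question, both patterns produce the same list.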
Upvotes: 0