Reputation: 141
I use the Java Pattern class to specify the regex as a string.
For example, I love being spider-man : "Peter Parker"
should list spider-man and "Peter Parker" as separate tokens. Thanks.
try {
BufferedReader br = new BufferedReader(new FileReader(f));
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
sb.append(line);
line = br.readLine();
}
String everything = sb.toString();
List<String> result = new ArrayList<String>();
Pattern pat = Pattern.compile("([\"'].*?[\"']|[^ ]+)");
PatternTokenizer pt = new PatternTokenizer(new StringReader(everything),pat,0);
while (pt.incrementToken()) {
result.add(pt.getAttribute(CharTermAttribute.class).toString());
}
}
catch (Exception e) {
throw new RuntimeException(e);
}
So I guess the reason "some word" is not working is that each token is itself a string. Any clues? Thank you.
Upvotes: 0
Views: 528
Reputation: 56809
Check whether this regex is what you need:
"([\"'].*?[\"']|(?<=[ :]|^)[a-zA-Z0-9-]+(?=[ :]|$))"
I assume that you don't have a (single/double) quote inside a (single/double) quoted string.
There is also an assumption about the delimiters: I only allow space and : to act as delimiters. Nothing will be matched in "foo_bar". If you want to add more delimiters, such as ; . , ? then add them to the character class in both the lookahead and lookbehind assertions, like this:
"([\"'].*?[\"']|(?<=[ :;.,?]|^)[a-zA-Z0-9-]+(?=[ :;.,?]|$))"
I have not tested it on every input, but I did test it on this one:
" sdfsdf \" sdfs sdfsdfs \" \"sdfsdf\" sdfsdf sdfsd dsfshj sdfsdf-sdf 'sdfsdfsdf sd f ' "
// I used replaceAll to check the captured group
.replaceAll("([\"'].*?[\"']|(?<=[ :]|^)[a-zA-Z0-9-]+(?=[ :]|$))", "X$1Y")
And it works fine for me.
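To check the stricter pattern against the sentence from the question, a minimal sketch along these lines could be used. The class name `StrictTokenDemo` and the method `tokenize` are made up for illustration; nothing beyond `java.util.regex` is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the original answer.
class StrictTokenDemo {
    // Either a quoted run, or a word of letters/digits/hyphens that is
    // bounded on both sides by a space, a colon, or the edge of the input.
    static final Pattern STRICT = Pattern.compile(
            "([\"'].*?[\"']|(?<=[ :]|^)[a-zA-Z0-9-]+(?=[ :]|$))");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = STRICT.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I love being spider-man : \"Peter Parker\""));
    }
}
```

With this pattern the lone : is treated as a delimiter and dropped, so the sentence yields I, love, being, spider-man and "Peter Parker" (quotes included in the last token).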
If you want more liberal capturing, still under the same assumption about quoting:
"([\"'].*?[\"']|[^ ]+)"
To extract matches:
Matcher m = Pattern.compile(regex).matcher(inputString);
List<String> tokens = new ArrayList<String>();
while (m.find()) {
tokens.add(m.group(1));
}
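Tying this back to the file-reading loop in the question: `readLine()` strips the line terminator, so appending lines straight into a `StringBuilder` glues the last word of one line to the first word of the next. A minimal sketch that runs the matcher on each line instead is shown here. The class `LineTokenizer` and method `tokenize` are hypothetical names, and the sketch assumes a quoted phrase never spans two lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the original answer.
class LineTokenizer {
    // The liberal pattern: a quoted run, or any run of non-space characters.
    static final Pattern LIBERAL = Pattern.compile("([\"'].*?[\"']|[^ ]+)");

    static List<String> tokenize(Reader source) {
        List<String> tokens = new ArrayList<>();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = LIBERAL.matcher(line);
                while (m.find()) {
                    tokens.add(m.group(1));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A StringReader stands in for new FileReader(f) from the question.
        Reader in = new StringReader("I love being\nspider-man : \"Peter Parker\"");
        System.out.println(tokenize(in));
    }
}
```

Unlike the stricter pattern, this one keeps the lone : as a token of its own, because `[^ ]+` accepts any non-space character.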
Upvotes: 1
Reputation: 124225
If it doesn't have to be a regex and the quoting in your String is well formed (quotes are properly paired, not interleaved like " ' some data " '), then you can do it in one pass, like this:
String data = "I love being spider-man : \"Peter Parker\" or 'photo reporter'";
List<String> tokens = new ArrayList<String>();
StringBuilder sb = new StringBuilder();
boolean inSingleQuote = false;
boolean inDoubleQuote = false;
for (char c : data.toCharArray()) {
    // Toggle the quote state whenever a quote character is seen.
    if (c == '\'') inSingleQuote = !inSingleQuote;
    if (c == '"') inDoubleQuote = !inDoubleQuote;
    // A space outside any quotes ends the current token.
    if (c == ' ' && !inSingleQuote && !inDoubleQuote) {
        tokens.add(sb.toString());
        sb.setLength(0);
    } else {
        sb.append(c);
    }
}
// Add the last token, which has no trailing space.
tokens.add(sb.toString());
System.out.println(tokens);
Output:
[I, love, being, spider-man, :, "Peter Parker", or, 'photo reporter']
Upvotes: 2