Shi Zhang

Reputation: 163

Java regex for splitting on spaces or capturing the content inside " "

Getting used to regex here.

I have a file in the structure of

word1 word2 word3 word4 word5 "word6" "word7"
word1 word2 word3 word4 word5 "word6" "word7"
word1 word2 word3 word4 word5 "word6" "word7"
...

which I want to capture into:

arr[0] = word1
arr[1] = word2
arr[2] = word3
arr[3] = word4
arr[4] = word5
arr[5] = word6
arr[6] = word7

My regex is: (?m)(.* )(.* )(.* )(.* )(.* )(".*") (".*")

Now I'm sure there is a more elegant way to write this where I don't have to repeat the same sequence multiple times.

My understanding is something like this should work?

(?:(.* )*|(".*")*)

I believe (?:(.* )|(".*")) means match EITHER .* or ".*", and the * at the end of (.* ) and (".*"), forming (.* )* and (".*")*, means match 0 or more times. This should do the same thing as my working regex, no?

Thoughts?

EDIT: After reading everything, I see that I was simply trying to shorten my regex by capturing (.*) or \"(.*)\" without specifying how many times the capture occurs, which is not possible. Thank you!

The corrected regex: (?m)(.*) (.*) (.*) (.*) (.*) \"(.*)\" \"(.*)\"
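For reference, here is a minimal sketch of how that pattern might be applied in Java. It assumes the file is read line by line, so it calls matches() on each line and leaves out the (?m) flag; the class name FixedLayoutParser and the method parse are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedLayoutParser {
    // Five bare words followed by two quoted words; the quotes stay outside the groups.
    private static final Pattern LINE =
            Pattern.compile("(.*) (.*) (.*) (.*) (.*) \"(.*)\" \"(.*)\"");

    // Returns the seven captured fields, or null when the line does not fit the layout.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] arr = new String[m.groupCount()];
        for (int i = 0; i < arr.length; i++) {
            arr[i] = m.group(i + 1); // group 0 is the whole match, so shift by one
        }
        return arr;
    }

    public static void main(String[] args) {
        String line = "word1 word2 word3 word4 word5 \"word6\" \"word7\"";
        // prints word1,word2,word3,word4,word5,word6,word7
        System.out.println(String.join(",", parse(line)));
    }
}
```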

Upvotes: 0

Views: 116

Answers (1)

Gangnus

Reputation: 24464

  1. A capturing group repeated with * or + still yields only one capture: the text of its last iteration. Unfortunately, that means each group has to be written out separately.
  2. Whitespace is matched by \s.
  3. This is enough:

     (.*)\s(.*)\s(.*)\s(.*)\s(.*)\s"(.*)"\s"(.*)"

Given your task, the " must stay outside the groups. Your regex is NOT working as intended: it captures the " into arr[5] and arr[6], and the trailing spaces into the other elements.
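The point about repeated groups can be checked with a small sketch. The pattern (?:(\w+)\s*)+ is my own illustration, not the asker's regex: it has a single capturing group inside a repeated non-capturing group, and after the match only the last word is available.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // One capturing group repeated with +; it only remembers its last iteration.
    private static final Pattern WORDS = Pattern.compile("(?:(\\w+)\\s*)+");

    // Returns whatever group 1 holds after a full match, or null if there is no match.
    static String lastIteration(String line) {
        Matcher m = WORDS.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The group ran three times, but only the final capture survives.
        System.out.println(lastIteration("word1 word2 word3")); // prints word3
    }
}
```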


If you want to read the words regardless of whether they are in "" or not, and the number of spaces between words can vary, then:

[\s"]*(\w+)[\s"]+(\w+)[\s"]+(\w+)[\s"]+(\w+)[\s"]+(\w+)[\s"]+(\w+)[\s"]+(\w+)[\s"]*

Admittedly, this is a simplified variant: written this way, it cannot check that a word has " on both sides.
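A short sketch of that lenient pattern in Java, assuming each line is matched on its own. The class name TolerantParser is hypothetical. The second call in main shows the limitation: a line with mismatched quotes is accepted as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TolerantParser {
    // Seven \w+ groups separated by any run of whitespace and/or double quotes.
    private static final Pattern LINE = Pattern.compile(
            "[\\s\"]*(\\w+)[\\s\"]+(\\w+)[\\s\"]+(\\w+)[\\s\"]+(\\w+)"
            + "[\\s\"]+(\\w+)[\\s\"]+(\\w+)[\\s\"]+(\\w+)[\\s\"]*");

    // Returns the seven words, or null when the line does not contain exactly seven.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] arr = new String[m.groupCount()];
        for (int i = 0; i < arr.length; i++) {
            arr[i] = m.group(i + 1);
        }
        return arr;
    }

    public static void main(String[] args) {
        // Extra spaces are tolerated.
        System.out.println(String.join(",",
                parse("word1   word2 word3 word4 word5 \"word6\"  \"word7\"")));
        // Mismatched quotes are tolerated too, since quotes are treated as separators.
        System.out.println(String.join(",",
                parse("word1 word2 word3 word4 word5 \"word6 word7\"")));
    }
}
```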


If you really want to accept an arbitrary number of words, use the split() function: split on whitespace with \\s+, and after that trim any leftover " and/or spaces from the elements.
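A minimal sketch of the split() approach, with a hypothetical class SplitParser. It assumes quoted words never contain spaces; if they did, the split would cut them apart.

```java
class SplitParser {
    // Splits on runs of whitespace, then strips one leading and one trailing quote.
    static String[] parse(String line) {
        String[] parts = line.trim().split("\\s+");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replaceAll("^\"|\"$", "");
        }
        return parts;
    }

    public static void main(String[] args) {
        String line = "word1 word2  word3 \"word4\" \"word5\"";
        // prints word1,word2,word3,word4,word5
        System.out.println(String.join(",", parse(line)));
    }
}
```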


It is impossible to split a line into an arbitrary number of groups with a single regex match alone; you need split() or something similar.
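One option for the "something similar" is a Matcher.find() loop, which applies a small pattern repeatedly instead of one pattern with many groups. The sketch below is my own suggestion rather than part of the original answer: the pattern and the class name TokenScanner are assumptions, and it treats a quoted run as one token even when it contains spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenScanner {
    // Either a double-quoted run (group 1, quotes excluded) or a run of non-whitespace (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Collects every token on the line, however many there are.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [word1, two words, word3]
        System.out.println(tokens("word1 \"two words\" word3"));
    }
}
```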

Upvotes: 1
