Reputation: 6874
I have a text file containing several series of values, laid out as follows:
Lots of textLots of textLots of textLots of textLots of textLots of textLots
of textLots of textLots of textLots of textLots of textLots of textLots of
textLots of textLots of textLots of textLots of textLots of textLots of
textLots of text
Wave amplitude (mean, 3.0 & 7.0 above LES) (mmHg)
43-152
35.9
N/A
N/A
N/A
43.5
21.9
N/A
37.3
N/A
40.9
N/A
Wave duration (mean at 3.0 & 7.0 above LES) (sec)
2.7-5.4
2.5
N/A
N/A
N/A
2.2
3.0
N/A
2.2
N/A
2.6
N/A
Onset velocity (between 11.0 & 3.0 above LES) (cm/s)
2.8-6.3
2.2
N/A
N/A
N/A
2.5
1.0
N/A
2.5
N/A
2.7
N/A
Some other textSome other textSome other textSome other textSome other textSome
other textSome other textSome other textSome other textSome other textSome
other textSome other textSome other textSome other textSome other textSome
other text
The rules are:
The first line of each block (the title) always contains a bracket somewhere, and brackets are not found elsewhere.
There is always an empty line at the end of each series of numbers (or series of N/As).
The values are all either numbers (with or without decimal points) or N/A.
I do not want to capture the first value after the title of each block (it usually contains a - or <).
I would like to capture the title and the subsequent values into one ArrayList.
The expected output for the first example would therefore be
[Wave amplitude (mean, 3.0 & 7.0 above LES) (mmHg),35.9,N/A,N/A,N/A,43.5,21.9,N/A,37.3,N/A,40.9,N/A]
I am stuck on the regex that would achieve this. Because the text I want to extract sits inside a much larger text file, I think I need a regex to pull out just the part I'm interested in. An alternative would be to first cut out the whole section by its start and end, but that would still rely on some regex, and I think that pattern would be more complex.
Upvotes: 2
Views: 109
Reputation: 5095
If you really want to use regex for parsing this, you can do it like this:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// desc = title line, then one skipped line (reference range), then data = value lines
String pattern = "(?<desc>.*\\(.*\\).*)\n.*[-<].*\n(?<data>(?:N/A\n|\\d+(?:\\.\\d+)?\n)+)";
String rawData = new String(Files.readAllBytes(Paths.get("indata.txt")));
Matcher seriesMatcher = Pattern.compile(pattern).matcher(rawData);
while (seriesMatcher.find()) {
    List<String> series = new ArrayList<>();
    series.add(seriesMatcher.group("desc").trim());
    series.addAll(Arrays.asList(seriesMatcher.group("data").split("\n")));
    System.out.println(series);
}
The regexp consists of several parts:
(?<desc>.*\\(.*\\).*)\n.*[-<].*\n(?<data>(?:N/A\n|\\d+(?:\\.\\d+)?\n)+)
---------------------  --------  --------------------------------------
description            ignore    data
description
= A line containing a matched pair of parentheses.
ignore
= A line containing a dash or a <, to be ignored.
data
= The entries, i.e. one or more lines, each either N/A
or a number with an optional decimal part.
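If a single multi-line regex is not a hard requirement, the same rules can also be applied with a plain line-by-line scan. The following is only a minimal sketch under the assumptions stated in the question: a bracket appears only in title lines, the line right after a title is the one to drop, and every value is either N/A or a plain number. The class and method names (SeriesScanner, parse, isValue, flush) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SeriesScanner {

    /** Returns one list per block: the title followed by its values. */
    public static List<List<String>> parse(String text) {
        List<List<String>> result = new ArrayList<>();
        List<String> current = null;  // block being collected, or null outside a block
        boolean skipNext = false;     // true right after a title line

        for (String raw : text.split("\\R")) {   // \R matches any line break, including \r\n
            String line = raw.trim();
            if (line.contains("(")) {
                // Assumption from the question: brackets only occur in title lines.
                flush(current, result);
                current = new ArrayList<>();
                current.add(line);
                skipNext = true;
            } else if (current != null && skipNext) {
                skipNext = false;                // drop the first value after the title
            } else if (current != null && isValue(line)) {
                current.add(line);
            } else {
                // An empty line or unrelated text ends the current block.
                flush(current, result);
                current = null;
            }
        }
        flush(current, result);
        return result;
    }

    private static boolean isValue(String line) {
        return line.equals("N/A") || line.matches("\\d+(\\.\\d+)?");
    }

    // Keeps a block only if it has at least one value besides the title.
    private static void flush(List<String> block, List<List<String>> result) {
        if (block != null && block.size() > 1) {
            result.add(block);
        }
    }

    public static void main(String[] args) {
        String sample = "Lots of text\n"
                + "Wave duration (mean at 3.0 & 7.0 above LES) (sec)\n"
                + "2.7-5.4\n2.5\nN/A\n3\n"
                + "\n"
                + "Some other text\n";
        // prints [[Wave duration (mean at 3.0 & 7.0 above LES) (sec), 2.5, N/A, 3]]
        System.out.println(parse(sample));
    }
}
```

Compared with one large pattern, this version does not depend on the exact line-ending style and still works when a block is followed directly by the next title without an empty line in between. The trade-off is more code and an explicit bit of state to track.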
Upvotes: 2