Sebastian Zeki

Reputation: 6874

How can I use regex to find series of data within a text file?

I have a text file that contains several series of data, as follows:

Lots of textLots of textLots of textLots of textLots of textLots of textLots
 of textLots of textLots of textLots of textLots of textLots of textLots of
 textLots of textLots of textLots of textLots of textLots of textLots of
 textLots of text

Wave amplitude (mean, 3.0 & 7.0 above LES) (mmHg)
43-152
35.9
N/A
N/A
N/A
43.5
21.9
N/A
37.3
N/A
40.9
N/A

    Wave duration (mean at 3.0 & 7.0 above LES) (sec)
2.7-5.4
2.5
N/A
N/A
N/A
2.2
3.0
N/A
2.2
N/A
2.6
N/A

    Onset velocity (between 11.0 & 3.0 above LES) (cm/s)
2.8-6.3
2.2
N/A
N/A
N/A
2.5
1.0
N/A
2.5
N/A
2.7
N/A

Some other textSome other textSome other textSome other textSome other textSome
 other textSome other textSome other textSome other textSome other textSome 
other textSome other textSome other textSome other textSome other textSome 
other text

The rules are:

  1. The first line of each series (the title) always contains a bracket somewhere, and brackets are not found anywhere else in the file.

  2. There is always an empty line at the end of each series of numbers (or series of N/As).

  3. The values are all either numbers (with or without decimal points) or N/A.

  4. I do not want to capture the first value after the title of each block (it is a range and usually contains a - or <).

I would like to capture the title and the subsequent values of each series in one ArrayList.

The expected output for the first example would therefore be:

[Wave amplitude (mean, 3.0 & 7.0 above LES) (mmHg),35.9,N/A,N/A,N/A,43.5,21.9,N/A,37.3,N/A,40.9,N/A]

I am stuck on the regex that would allow me to achieve this. Because the text I want to extract lies within a bigger text file, I think I need to use regex to extract just the part I'm interested in. An alternative would be to select just the start and end of the entire section I'm interested in, but that would still rely on some regex, and I think the pattern to do this would be more complex.

Upvotes: 2

Views: 109

Answers (1)

Per Huss

Reputation: 5095

If you really want to use regex for parsing this, you can do it like this:

import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

// Title line, one ignored line (the range), then the value lines.
String pattern = "(?<desc>.*\\(.*\\).*)\n.*[-<].*\n(?<data>(?:N/A\n|\\d+(?:\\.\\d+)?\n)+)";

String rawData = new String(Files.readAllBytes(Paths.get("indata.txt")));
Matcher seriesMatcher = Pattern.compile(pattern).matcher(rawData);
while (seriesMatcher.find()) {
    List<String> series = new ArrayList<>();
    series.add(seriesMatcher.group("desc").trim());  // title, without leading spaces
    series.addAll(Arrays.asList(seriesMatcher.group("data").split("\n")));
    System.out.println(series);
}

The regexp consists of three parts:

(?<desc>.*\\(.*\\).*)\n.*[-<].*\n(?<data>(?:N/A\n|\\d+(?:\\.\\d+)?\n)+)
--------------------- ---------- --------------------------------------
description           ignore     data

description = A line containing a matched pair of parentheses.
ignore = A line containing a dash or a <, to be ignored.
data = The entries, i.e. one or more lines that are each either N/A or a number, with or without a decimal point.

The pattern assumes that the file uses \n line endings and that the last value of each series is followed by a line break.
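To try the approach above without an input file, here is a minimal, self-contained sketch. The class name SeriesExtractor, the method parse, and the shortened sample text are made up for illustration and are not part of the original answer. It makes two assumptions that go beyond the pattern above: the line right after the title is skipped whatever it contains, and \R (available since Java 8) is used so that both \n and \r\n line endings are accepted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class SeriesExtractor {
    // \R matches any line break sequence, including \r\n.
    private static final Pattern SERIES = Pattern.compile(
        "(?<desc>.*\\(.*\\).*)\\R"                       // title line with parentheses
        + ".*\\R"                                         // next line (the range), skipped
        + "(?<data>(?:(?:N/A|\\d+(?:\\.\\d+)?)\\R)+)");   // N/A or number, one per line

    // Returns one list per series: the trimmed title followed by its values.
    static List<List<String>> parse(String text) {
        List<List<String>> result = new ArrayList<>();
        Matcher m = SERIES.matcher(text);
        while (m.find()) {
            List<String> series = new ArrayList<>();
            series.add(m.group("desc").trim());
            series.addAll(Arrays.asList(m.group("data").split("\\R")));
            result.add(series);
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened version of the sample from the question.
        String sample = String.join("\n",
            "Lots of text",
            "",
            "Wave amplitude (mean, 3.0 & 7.0 above LES) (mmHg)",
            "43-152", "35.9", "N/A", "43.5",
            "",
            "    Onset velocity (between 11.0 & 3.0 above LES) (cm/s)",
            "2.8-6.3", "2.2", "N/A", "1.0",
            "",
            "Some other text");
        parse(sample).forEach(System.out::println);
    }
}
```

Returning a list of lists rather than printing inside the loop keeps the parsing separate from whatever is done with the series afterwards. Because each value line must end with a line break, a file that stops immediately after the last value, with no trailing newline, would lose that value.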

Upvotes: 2
