Reputation: 1155
I have text like the following:
Grad/Med School University of Osteopathic Medicine and
Health Sci.
This was read from a PDF file into a String (Java) called pdfFileText. The above is just a small part of the total text.
I will also have a String called institution. In this case the value of institution is "University of Osteopathic Medicine and Health Sci."
In the PDF file, as you see above, the University name exceeded the line width so it wrapped to the next line.
What I want to do is verify pdfFileText.contains(institution). But since the institution is line-wrapped this will not work.
I tried to make a new String ins = institution.replaceAll(" ", "[ \n\r]+"); but that did not work. I also tried various numbers of backslashes, up to something like institution.replaceAll(" ", "[ \\\\n\\\\r]+"); or maybe more. But nothing seems to work.
What could be the correct regular expression to use? Or perhaps, contains() will not allow regular expressions? Would you suggest trying a pattern matcher? I would still be confused about what to replace the blank spaces with in a pattern.
Upvotes: 1
Views: 301
Reputation: 3554
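Replace each run of whitespace in the search string with \s+ (one or more whitespace characters, including line breaks), then search the text with find():
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String text = "Grad/Med School University of Osteopathic Medicine and\nHealth Sci. And more text.";
String pat = "University of Osteopathic Medicine and Health Sci";
// Every whitespace run in pat becomes \s+, so it also matches a line break.
Pattern regex = Pattern.compile(pat.replaceAll("\\s+", "\\\\s+"));
Matcher matcher = regex.matcher(text);
System.out.println(matcher.find()); // true
Note that find() looks for the pattern anywhere in the text, so no leading or trailing .* is needed. Pattern.MULTILINE would not help here: it only changes how ^ and $ behave, and . still does not match line breaks unless Pattern.DOTALL is set. If the search string can contain regex metacharacters such as ".", escape them (for example with Pattern.quote) before compiling.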
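As a variant of the same idea, here is a minimal sketch that also escapes regex metacharacters, so the trailing "." in "Health Sci." is matched literally. The class and method names (WrappedTextSearch, wrapTolerantPattern, containsWrapped) are made up for illustration, and the sample text is adapted from the question.

```java
import java.util.regex.Pattern;

final class WrappedTextSearch {

    /** Builds a pattern where each whitespace run in the phrase matches any whitespace run. */
    static Pattern wrapTolerantPattern(String phrase) {
        StringBuilder regex = new StringBuilder();
        for (String word : phrase.trim().split("\\s+")) {
            if (regex.length() > 0) {
                regex.append("\\s+"); // one or more spaces, tabs or line breaks
            }
            regex.append(Pattern.quote(word)); // treat the word as a literal
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true if the phrase occurs anywhere in the text, ignoring how it is wrapped. */
    static boolean containsWrapped(String text, String phrase) {
        return wrapTolerantPattern(phrase).matcher(text).find();
    }

    public static void main(String[] args) {
        String pdfFileText =
                "Grad/Med School University of Osteopathic Medicine and\nHealth Sci.\nMore text.";
        String institution = "University of Osteopathic Medicine and Health Sci.";
        System.out.println(containsWrapped(pdfFileText, institution)); // true
    }
}
```

If the same institution is checked against many documents, the compiled Pattern can be kept and reused rather than rebuilt on every call.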
Upvotes: 2
Reputation: 38777
You're doing it backwards. Remove the line endings from the input first:
pdfFileText.replaceAll("\\s+", " ").contains(institution)
If you cannot guarantee that institution will always be normalised, then pre-process that as well:
pdfFileText.replaceAll("\\s+", " ")
.contains(institution.replaceAll("\\s+", " "))
If testing shows this is too slow for your input size, implement your own contains that skips extra whitespace while matching.
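One possible shape for such a hand-written contains is sketched below. It scans the text directly instead of building a normalised copy. The names (WhitespaceTolerantContains, containsIgnoringWrap, matchesAt) are hypothetical, and it uses Character.isWhitespace, which is an assumption on my part and is not identical to the regex class \s for some Unicode characters.

```java
final class WhitespaceTolerantContains {

    /** Returns true if phrase occurs in text, with any whitespace run matching any other. */
    static boolean containsIgnoringWrap(String text, String phrase) {
        String needle = phrase.trim();
        if (needle.isEmpty()) {
            return true; // same convention as String.contains("")
        }
        for (int start = 0; start < text.length(); start++) {
            if (matchesAt(text, start, needle)) {
                return true;
            }
        }
        return false;
    }

    /** Tries to match the whole needle against text, beginning at index t. */
    private static boolean matchesAt(String text, int t, String needle) {
        int n = 0;
        while (n < needle.length()) {
            if (Character.isWhitespace(needle.charAt(n))) {
                // The text must have at least one whitespace character here.
                if (t >= text.length() || !Character.isWhitespace(text.charAt(t))) {
                    return false;
                }
                // Skip the whole whitespace run on both sides.
                while (t < text.length() && Character.isWhitespace(text.charAt(t))) {
                    t++;
                }
                while (n < needle.length() && Character.isWhitespace(needle.charAt(n))) {
                    n++;
                }
            } else {
                if (t >= text.length() || text.charAt(t) != needle.charAt(n)) {
                    return false;
                }
                t++;
                n++;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String pdfFileText =
                "Grad/Med School University of Osteopathic Medicine and\n   Health Sci. More text.";
        String institution = "University of Osteopathic Medicine and Health Sci.";
        System.out.println(containsIgnoringWrap(pdfFileText, institution)); // true
    }
}
```

This is a plain left-to-right scan with no allocation beyond the trimmed phrase. Whether it is actually faster than the replaceAll version is worth measuring on real input before adopting it.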
Upvotes: 3