Reputation: 320
I want to divide my document into paragraphs.
First I used Tika to extract the text from the document (PDF or DOC format).
After this I used split() to divide the text into lines:
String[] lines = handler.toString().split("\n"); // handler is the Tika handler that holds the whole text extracted from the document
Then I used regex to extract specific information (e.g. company name, designation, loyalty).
It works perfectly until a paragraph is broken across several lines, e.g.:
Worked in Lycatel B.O.S. (P) Ltd. India Office, Chennai as Telecom Billing Analyst from 22nd October 07 to 3rd June 08.
It will be divided into:
paragraph [1] : Worked in Lycatel B.O.S. (P) Ltd. India Office, Chennai as Telecom
paragraph [2] : Billing Analyst from 22nd October 07 to 3rd June 08.
Since I apply a Matcher to each paragraph:
Matcher matcher = pattern.matcher(paragraphs[i]);
The extracted data will be wrong because the 2 lines should be in the same paragraph.
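I tried to split the text on the full stop (.) instead:
String[] lines = handler.toString().split("\\."); // split() takes a regex, so the dot has to be escaped to match a literal full stop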
However, company names that contain a . will be split as well. For example:
Lycatel B.O.S. (P) Ltd.
How can I divide my text so that paragraph [i] runs up to the full stop (.) that ends it?
Upvotes: 1
Views: 656
Reputation: 24822
You can try using (?sm)^.*?\\.$ (written here as a Java string literal), but I doubt you can get a perfect solution to your problem with regex alone.
(?s) is the dotall flag; it makes . match line feeds as well.
(?m) is the multiline flag, so ^ and $ match at the start and end of each line (rather than only at the start and end of the string).
So with this regexp, we match as few characters as possible (line feeds included) until we reach a . that is at the end of its line. A . in the middle of a line, as in B.O.S. or Ltd., does not end the match.
You can try it on regex101.
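Here is a minimal sketch of how the pattern could be applied in Java. The class name ParagraphSplitter, the method splitParagraphs and the third sentence of the sample text are made up for illustration. The input is a plain String standing in for handler.toString(). It assumes a paragraph always ends with a . that is the last character on its line, so a line with trailing spaces after the dot, or a line that happens to wrap right after "Ltd.", would still be handled wrongly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphSplitter {
    // (?s): '.' also matches line breaks; (?m): '^' and '$' match at line boundaries.
    // The lazy '.*?' stops at the first full stop that sits at the end of a line.
    private static final Pattern PARAGRAPH = Pattern.compile("(?sm)^.*?\\.$");

    // Hypothetical helper: returns one entry per paragraph found in the text.
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        Matcher matcher = PARAGRAPH.matcher(text);
        while (matcher.find()) {
            // Join the wrapped lines of a paragraph with single spaces.
            String joined = matcher.group().replaceAll("\\s*\\n\\s*", " ").trim();
            paragraphs.add(joined);
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Sample text; in the question this would come from handler.toString().
        String text = "Worked in Lycatel B.O.S. (P) Ltd. India Office, Chennai as Telecom\n"
                + "Billing Analyst from 22nd October 07 to 3rd June 08.\n"
                + "Joined another company in July 08.";
        List<String> paragraphs = splitParagraphs(text);
        for (int i = 0; i < paragraphs.size(); i++) {
            System.out.println("paragraph [" + (i + 1) + "] : " + paragraphs.get(i));
        }
    }
}
```

With the sample text, the first two lines end up in the same paragraph, so the Matcher for company name and designation sees the whole sentence at once.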
Upvotes: 1