Abeer zaroor

Reputation: 320

Divide document into paragraphs

I want to divide my document into paragraphs.

First, I used Apache Tika to extract the text from the document (PDF or DOC).

Then I used split() to divide the text into lines:

String[] lines = handler.toString().split("\n"); // handler is the Tika content handler that holds the whole text extracted from the document

After that, I used regex to extract specific information (e.g. company name, designation, loyalty).

It works perfectly until a paragraph is spread over several lines, e.g.:

Worked in Lycatel B.O.S. (P) Ltd. India Office, Chennai as Telecom Billing Analyst from 22nd October 07 to 3rd June 08.

It is split into:

paragraph[1]: Worked in Lycatel B.O.S. (P) Ltd. India Office, Chennai as Telecom
paragraph[2]: Billing Analyst from 22nd October 07 to 3rd June 08.

Since I apply a Matcher to each paragraph separately:

Matcher matcher = pattern.matcher(paragraphs[i]);

The extracted data is wrong because the two lines should be in the same paragraph.

I tried splitting the text on the full stop (.) instead:

String[] lines = handler.toString().split("\\."); // split() takes a regex, so the . has to be escaped

However, company names that contain a . are split as well. For example:

Lycatel B.O.S. (P) Ltd.

How can I divide the text so that each paragraph[i] runs up to the full stop (.) that ends it?

Upvotes: 1

Views: 656

Answers (1)

Aaron

Reputation: 24822

You can try the regex (?sm)^.*?\\.$ (written here as it would appear in a Java string literal), but I doubt regex alone can give you a perfect solution to this problem.

(?s) is the dotall flag; it makes . match line feeds as well.
(?m) is the multiline flag, so ^ and $ match the start and end of each line (rather than only the start and end of the string).
So with this regex, starting at the beginning of a line, we lazily match as few characters (line feeds included) as needed until we reach a . that sits at the end of its line.

You can try it on regex101.
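
As a rough sketch of how this regex could be applied in Java: the class name ParagraphSplitter and the method splitParagraphs are hypothetical names chosen for illustration, and the third line of the sample text is made up. Collapsing the inner line breaks into single spaces is an extra assumption, so that the later per-paragraph Matcher sees one continuous line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or the original code.
final class ParagraphSplitter {

    // (?s): '.' also matches line terminators.
    // (?m): '^' and '$' match at the start and end of every line.
    private static final Pattern PARAGRAPH = Pattern.compile("(?sm)^.*?\\.$");

    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        Matcher matcher = PARAGRAPH.matcher(text);
        while (matcher.find()) {
            // Assumption: line breaks inside a paragraph are replaced by single spaces.
            paragraphs.add(matcher.group().replaceAll("\\s+", " ").trim());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // The first two lines come from the question; the third is invented sample data.
        String text = "Worked in Lycatel B.O.S. (P) Ltd. India Office, Chennai as Telecom\n"
                + "Billing Analyst from 22nd October 07 to 3rd June 08.\n"
                + "Worked as a developer somewhere else.\n";
        for (String paragraph : splitParagraphs(text)) {
            System.out.println(paragraph);
        }
    }
}
```

With this input, the two wrapped lines should come back as one paragraph, because none of the dots in "B.O.S." or "Ltd." is at the end of a line. The approach still breaks when a line happens to wrap right after an abbreviation such as "Ltd.", and a trailing space after the final full stop also prevents a match, which is why regex alone is unlikely to be a perfect solution.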

Upvotes: 1
