scc

Reputation: 385

PDFBox extracting paragraphs

I am new to PDFBox and want to extract a paragraph that contains particular words. I can already extract the whole PDF to a text file (opened in Notepad), but I have no idea how to extract a particular paragraph in my Java program. Can anyone help, or at least point me to some tutorials or examples? Thank you so much.

Upvotes: 15

Views: 16780

Answers (7)

feodal007

Reputation: 11

Maybe this helps. For more information, read the comments in the code.

/**
 * @param content PDF as a byte array
 * @return the collection with domain object Page
 */
fun extractPages(content: ByteArray): Collection<Page> = loadPdfDocument(content).use {
    val textStripper = PDFTextStripper().apply {
        sortByPosition = true; addMoreFormatting = true; paragraphStart = "\n"
    }
    List<Page>(size = it.pages.count) { pageIndex ->
        Page(
            pageIndex,
            extractText(it, textStripper, pageIndex),
            extractImages(it, pageIndex)
        )
    }
}

/**
 * @param pdDocument loaded PDF document
 * @param textStripper document text extractor
 * @param pageIndex current page index
 * @return text for current document page
 */
private fun extractText(pdDocument: PDDocument, textStripper: PDFTextStripper, pageIndex: Int): Collection<String> {
    with(textStripper) { startPage = pageIndex + 1; endPage = pageIndex + 1 }
    return textStripper
        .getText(pdDocument)
        .split(textStripper.paragraphStart)
        .let { groupLines(it) }
}

/**
 * Groups the lines as follows: a line that starts with an upper-case letter, a digit, or a symbol
 * starts a new paragraph; every other line is appended to the current paragraph.
 * @param lines that is a collection with all text lines from one page
 * @return collection grouped lines
 */
private fun groupLines(lines: Collection<String>): Collection<String> =
    mutableListOf<String>().apply {
        var currentParagraph: String? = null

        lines.forEach { line ->
            if (isNewParagraph(line)) {
                currentParagraph?.let { add(it) }
                currentParagraph = line
            } else currentParagraph = processLineStartsInLowerCase(currentParagraph, line)
        }

        currentParagraph?.let { add(it) }
    }

/**
 * Checks whether the current line should start a new paragraph.
 * @param line current line
 * @return true when the current line should start a new paragraph, false otherwise
 */
private fun isNewParagraph(line: String): Boolean = isUpperCase(line) || isDigit(line) || isSymbol(line)

/**
 * Paragraphs often start with a symbol such as '-'.
 * This method checks whether the first character is not a letter.
 * There is no need to check for a digit here; 'isDigit(…)' does that.
 * @param line current line
 * @return true when the first character is not a letter, false otherwise
 */
private fun isSymbol(line: String): Boolean = line.firstOrNull()?.isLetter()?.not() == true

/**
 * Chapters often start with a digit. This method checks whether the first character is a digit.
 * @param line current line
 * @return true when digit and false when not
 */
private fun isDigit(line: String): Boolean = line.firstOrNull()?.isDigit() == true

/**
 * Returns the line unchanged when currentGroup is null;
 * otherwise appends the line to currentGroup, separated by a space.
 * @param currentGroup the current line holder
 * @param line current line
 * @return a new value for a currentGroup
 */
private fun processLineStartsInLowerCase(currentGroup: String?, line: String): String =
    if (currentGroup == null) line
    else "$currentGroup $line"

/**
 * Checks whether the first character is an upper-case letter.
 * @param line one line
 * @return true when in upper case and false when not
 */
private fun isUpperCase(line: String): Boolean = line.firstOrNull()?.isUpperCase() == true

Upvotes: 0

Praveen Kumar K R

Reputation: 1860

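// Skips lines 0..linecount of the text file, concatenates the remaining lines,
// and passes the result to ParagraphDetector (a class that is not shown in this answer).
private static String getParagraphs(String filePath, int linecount) throws IOException {
    ParagraphDetector paragraphDetector = new ParagraphDetector();
    StringBuilder extracted = new StringBuilder();
    // try-with-resources closes the file even if reading fails
    try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
        LineIterator it = IOUtils.lineIterator(reader);
        for (int lineNumber = 0; it.hasNext(); lineNumber++) {
            it.next(); // consume the current line
            if (lineNumber == linecount) {
                // Collect everything after line number `linecount`
                while (it.hasNext()) {
                    extracted.append((String) it.next());
                }
            }
        }
    }
    return paragraphDetector.SentenceSplitter(extracted.toString());
}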

Upvotes: -1

aavos

Reputation: 137

public static void main(String[] args) throws InvalidPasswordException, IOException {
    File file = new File("File Path");
    // try-with-resources closes the document when done
    try (PDDocument document = PDDocument.load(file)) {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        // Marker string that the stripper writes at the start of each paragraph it detects
        pdfStripper.setParagraphStart("/t");
        pdfStripper.setSortByPosition(true);

        // Split the extracted text on the marker to get one chunk per paragraph
        for (String line : pdfStripper.getText(document).split(pdfStripper.getParagraphStart())) {
            System.out.println(line);
            System.out.println("********************************************************************");
        }
    }
}

Please try the code above. It works for sure with the PDFBox 2.0.8 JAR.

Upvotes: 3

wen li

Reputation: 1

You can first use PDFBox's getText function to get the text. Every line ends with '\n', so you cannot segment paragraphs simply by splitting on "\n". If a line satisfies the following condition:

line.length() > 2 && (int)line.charAt(line.length()-2) == 32

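then this line is the last line of its paragraph. Here 32 is the Unicode code point of the space character, so the condition tests whether the second-to-last character of the line is a space.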
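The condition above could be applied to the whole extracted text roughly as follows. This is only a sketch under the answer's assumptions: every line ends with a single '\n', and the last line of a paragraph has a space right before that terminator. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TrailingSpaceParagraphs {
    /** Groups lines into paragraphs; a line with a space right before its '\n' closes the paragraph. */
    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // The lookbehind keeps the '\n' terminator at the end of every line.
        for (String line : text.split("(?<=\n)")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(trimmed);
            // The check from the answer: second-to-last character is a space (code point 32).
            if (line.length() > 2 && line.charAt(line.length() - 2) == 32) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString()); // text after the last detected paragraph end
        }
        return paragraphs;
    }
}
```

If the text uses "\r\n" line endings, the second-to-last character is '\r' and the check never fires, so the terminators would need normalizing first.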

Upvotes: 0

ipavlic

Reputation: 4966

Text in PDF documents is absolutely positioned. So instead of words, lines and paragraphs, one only has absolutely positioned characters.

Let's say you have a paragraph:

Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit

Roughly speaking, in the PDF file it will be represented as the character N at some position, e a bit to the right of it, then q, u, e further to the right, and so on.

PDFBox tries to guess how the characters make up words, lines and paragraphs. It looks for runs of characters at approximately the same vertical position and for groups of characters that are close to each other and similar. It does that by extracting the text from the entire page and then processing it character by character to create text (it can also extract text from just one rectangular area inside a page). See the class PDFTextStripper (or PDFTextStripperByArea). For usage, see ExtractText.java in the PDFBox sources.
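To make the idea of guessing lines from positioned characters concrete, here is a toy sketch. It is not PDFBox's actual algorithm, and the Glyph type is hypothetical rather than a PDFBox class. It assumes each character comes with an x/y position and that y grows downward.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class GlyphLineGrouping {
    /** A positioned character. Hypothetical type for illustration, not a PDFBox class. */
    static final class Glyph {
        final char ch;
        final float x;
        final float y;

        Glyph(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Puts glyphs whose y differs from the line's first glyph by at most tolerance on one line. */
    static List<String> toLines(List<Glyph> glyphs, float tolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.y)); // top to bottom
        List<String> lines = new ArrayList<>();
        List<Glyph> line = new ArrayList<>();
        for (Glyph g : sorted) {
            if (!line.isEmpty() && g.y - line.get(0).y > tolerance) {
                lines.add(render(line));
                line = new ArrayList<>();
            }
            line.add(g);
        }
        if (!line.isEmpty()) {
            lines.add(render(line));
        }
        return lines;
    }

    /** Orders one line's glyphs left to right and concatenates their characters. */
    private static String render(List<Glyph> line) {
        line.sort(Comparator.comparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.ch);
        }
        return sb.toString();
    }
}
```

A real extractor also has to infer spaces between words, deal with columns and rotated text, and cope with fonts of different sizes, which is why the guesses are sometimes wrong.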

That means that you cannot extract paragraphs easily using PDFBox. It also means that PDFBox can and sometimes will get things wrong when extracting text (there are a lot of very different PDF documents out there).

What you can do is extract text from the entire page and then try and find your paragraph searching through that text. Regular expressions are usually well suited for such tasks (available in Java either through Pattern and Matcher classes, or convenience methods on String class).
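A minimal sketch of that search step, working on text that has already been extracted. It assumes paragraphs in the text are separated by blank lines, which may not hold for raw PDFBox output. ParagraphSearch and findParagraphs are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ParagraphSearch {
    // Assumption: paragraphs in the extracted text are separated by at least one blank line.
    private static final Pattern PARAGRAPH_BREAK = Pattern.compile("\\R\\s*\\R");

    /** Returns every paragraph of pageText that contains keyword as a whole word, ignoring case. */
    static List<String> findParagraphs(String pageText, String keyword) {
        Pattern word = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        List<String> matches = new ArrayList<>();
        for (String paragraph : PARAGRAPH_BREAK.split(pageText)) {
            if (word.matcher(paragraph).find()) {
                // Collapse the line breaks inside the paragraph into single spaces.
                matches.add(paragraph.trim().replaceAll("\\s+", " "));
            }
        }
        return matches;
    }
}
```

The page text would come from PDFTextStripper's getText, and a call such as ParagraphSearch.findParagraphs(pageText, "invoice") would return the matching paragraphs.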

Upvotes: 16

dprahut

Reputation: 41

After extracting the text, paragraphs can be reconstructed programmatically by considering the following points:

  1. Every line that starts with a lowercase letter should be joined with the previous line. A line that starts with a capital letter may also need to be joined with the previous line, e.g. inside a quoted expression.

  2. A line ending with ., ?, ! or " may be the end of a paragraph, but not always.

  3. Once a paragraph has been determined programmatically, check that it contains an even number of quotes. These may be plain double quotes or Unicode opening and closing double quotes.
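The three points above could be combined along these lines. This is a rough sketch, not a complete solution: it treats every line that ends a sentence as a possible paragraph end, which point 2 warns is not always right. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class HeuristicParagraphs {
    /** Joins extracted lines into paragraphs using the three heuristics. */
    static List<String> group(List<String> lines) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (current.length() > 0 && startsNewParagraph(current, line)) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line);
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    private static boolean startsNewParagraph(CharSequence previous, String line) {
        // Point 1: a line starting with a lowercase letter continues the previous line.
        if (Character.isLowerCase(line.charAt(0))) {
            return false;
        }
        // Point 2: the text so far must end with ., ?, !, " or a closing curly quote.
        char last = previous.charAt(previous.length() - 1);
        if (".?!\"\u201D".indexOf(last) < 0) {
            return false;
        }
        // Point 3: unbalanced quotes mean a quotation is still open.
        return hasBalancedQuotes(previous);
    }

    /** True when straight double quotes are even and curly opening/closing quotes match in number. */
    static boolean hasBalancedQuotes(CharSequence text) {
        int straight = 0;
        int opening = 0;
        int closing = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                straight++;
            } else if (c == '\u201C') {
                opening++;
            } else if (c == '\u201D') {
                closing++;
            }
        }
        return straight % 2 == 0 && opening == closing;
    }
}
```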

Upvotes: 0

Coder

Reputation: 29

I detected the start of a paragraph using the following approach. Read the page line by line. For each line:

  1. Find the last index of '.' (period) in the line.
  2. Compare this index with the length of the input line.
  3. If the index is less than the position of the last character, the line does not end with a period, so the current paragraph has not ended.
  4. If the period is the last character, the current paragraph has ended and the next line will be the beginning of a new paragraph.
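The steps above might look like this in Java. It is a sketch that assumes the lines have already been extracted and that trailing whitespace can be ignored. PeriodParagraphs is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;

final class PeriodParagraphs {
    /** Closes a paragraph whenever a line's last character is a period. */
    static List<String> group(List<String> lines) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line);
            // Steps 1-4: the paragraph ends when the last '.' is the last character of the line.
            if (line.lastIndexOf('.') == line.length() - 1) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString()); // trailing text without a final period
        }
        return paragraphs;
    }
}
```

A sentence that happens to end exactly at a line break inside a paragraph will also be treated as a paragraph end, so this works best on text with long paragraphs and few short sentences.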

Hope this helps.

Upvotes: 1
