Reputation: 1

How can I extract the background color of a table cell within a PDF file using Python?

I've been using the tabula-py, PyPDF2 and tika modules, but none of them seems to detect the background color of a table cell in a PDF file.

These colored cells carry important information in the context of my problem. I know, for example, that tabula-py is a wrapper around tabula-java, which does not provide cell color information. Is there an easy-to-follow solution in Python?

Thanks in advance.

Upvotes: 0

Answers (3)

toshi

Reputation: 2897

A user kindly reported that my previous solution did not work well.
That is true: pdfplumber's page.rects does not always detect table cells correctly.
Sometimes it only detects lines, rows, or columns.
So I propose another solution here: render the page to an image and take the most common pixel color inside each cell's bounding box.

from collections import Counter

import pdfplumber


def get_cell_color(image, cell: tuple[float, float, float, float]):
    """Return the most common RGB color inside the cell's bounding box."""
    # The image is rendered at pdfplumber's default resolution (72 DPI),
    # so the cell bbox in PDF points can be used as pixel coordinates.
    cropped_image = image.crop(cell)
    pixels = list(cropped_image.convert('RGB').getdata())
    color_counts = Counter(pixels)
    most_common = color_counts.most_common(1)
    return most_common[0][0]


def demo(page):
    """example method: print colored cells information"""
    page_image = page.to_image().original
    tables = page.find_tables()

    for table in tables:
        extracted_table = table.extract()
        for row_idx, row in enumerate(table.rows):
            for cell_idx, cell in enumerate(row.cells):
                if cell is None:
                    # pdfplumber uses None for cells it could not locate
                    continue
                cell_color = get_cell_color(page_image, cell)
                if cell_color != (255, 255, 255):
                    print(f"cell color: {cell_color}")
                    print(f"cell location: {cell}")
                    print(f"cell content: {extracted_table[row_idx][cell_idx]}")


with pdfplumber.open("/path/to/target.pdf") as pdf:
    page = pdf.pages[0]
    demo(page)
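
One thing that may trip up the pixel-counting approach is cell borders and anti-aliasing: in a small cell, black grid lines or off-white pixels can end up deciding the result. The following is only a sketch of how that could be softened, not part of pdfplumber. The helper names (inset_box, dominant_color, is_near_white) and the margin and tolerance values are my own assumptions, and the pixel list stands in for what getdata() would return.

```python
from collections import Counter


def inset_box(box, margin=2.0):
    """Shrink (x0, top, x1, bottom) by `margin` on each side to skip borders."""
    x0, top, x1, bottom = box
    if x1 - x0 <= 2 * margin or bottom - top <= 2 * margin:
        return box  # cell too small to shrink; keep it unchanged
    return (x0 + margin, top + margin, x1 - margin, bottom - margin)


def dominant_color(pixels):
    """Return the most frequent RGB tuple in a sequence of pixels."""
    return Counter(pixels).most_common(1)[0][0]


def is_near_white(rgb, tolerance=8):
    """Treat colors within `tolerance` of pure white as 'no background'."""
    return all(255 - channel <= tolerance for channel in rgb)


# Synthetic pixels (assumed data): a few border pixels, mostly off-white fill.
pixels = [(0, 0, 0)] * 3 + [(250, 252, 251)] * 10 + [(255, 255, 0)] * 2
color = dominant_color(pixels)
print(color, is_near_white(color))
print(inset_box((10, 20, 110, 50)))
```

In the demo above, one could crop with inset_box(cell) and replace the exact comparison against (255, 255, 255) with is_near_white(cell_color). Whether a 2-point margin is appropriate depends on the document.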

Upvotes: 1

toshi

Reputation: 2897

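I found a solution using pdfplumber. The idea is to find the rectangle in page.rects that encloses a cell and read its fill color (non_stroking_color). Here is some rough sample code.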

from typing import Optional

import pdfplumber
from pdfplumber.page import Page, Table


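def cmyk_to_rgb(cmyk: tuple[float, float, float, float]) -> tuple[float, float, float]:
    # DeviceCMYK -> DeviceRGB conversion as given in the PDF specification.
    # min() clamps the sum so a channel can never become negative.
    c, m, y, k = cmyk
    r = 255 * (1.0 - min(1.0, c + k))
    g = 255 * (1.0 - min(1.0, m + k))
    b = 255 * (1.0 - min(1.0, y + k))
    return r, g, b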


def to_bbox(rect: dict) -> tuple[float, float, float, float]:
    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])


def is_included(cell_box: tuple[float, float, float, float], rect_box: tuple[float, float, float, float]):
    c_left, c_top, c_right, c_bottom = cell_box
    r_left, r_top, r_right, r_bottom = rect_box
    return c_left >= r_left and c_top >= r_top and c_right <= r_right and c_bottom <= r_bottom


def find_rect_for_cell(cell: tuple[float, float, float, float], rects: list[dict]) -> Optional[dict]:
    return next((r for r in rects if is_included(cell, to_bbox(r))), None)


def get_cell_color(cell: tuple[float, float, float, float], page: Page) -> tuple[float, float, float]:
    # Assumes the fill color is CMYK (four components); cells without an
    # enclosing rect are reported as white.
    rect = find_rect_for_cell(cell, page.rects) if cell else None
    return cmyk_to_rgb(rect["non_stroking_color"]) if rect else (255, 255, 255)


pdf = pdfplumber.open("/path/to/target.pdf")
page = pdf.pages[0]
tables: list[Table] = page.find_tables()

# get RGB color of first(= top-left) cell of first table
print(get_cell_color(tables[0].rows[0].cells[0], page)) # => (r, g, b)
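
Note that the code above assumes the fill color is CMYK. As far as I know, a PDF can also specify fills as gray (one component) or RGB (three components), and how pdfplumber exposes non_stroking_color may differ between versions. Here is a hedged sketch of a normalising helper; color_to_rgb is a hypothetical name of mine, and it assumes the components are numbers in the range 0..1. Pattern and other special color spaces are not handled.

```python
def color_to_rgb(color):
    """Convert a gray, RGB or CMYK color (components in 0..1) to 0..255 RGB.

    Returns None when no fill color is set.
    """
    if color is None:
        return None
    if isinstance(color, (int, float)):
        color = (color,)  # a bare number is treated as a gray level
    components = tuple(float(c) for c in color)

    if len(components) == 1:  # DeviceGray
        r = g = b = components[0]
    elif len(components) == 3:  # DeviceRGB
        r, g, b = components
    elif len(components) == 4:  # DeviceCMYK, conversion from the PDF spec
        c, m, y, k = components
        r = 1.0 - min(1.0, c + k)
        g = 1.0 - min(1.0, m + k)
        b = 1.0 - min(1.0, y + k)
    else:
        raise ValueError(f"unsupported color: {color!r}")

    return tuple(round(255 * v) for v in (r, g, b))


print(color_to_rgb((1, 1, 0)))     # RGB input
print(color_to_rgb((0, 1, 1, 0)))  # CMYK input
print(color_to_rgb(1))             # gray input
```

With this, get_cell_color could call color_to_rgb(rect["non_stroking_color"]) instead of cmyk_to_rgb, at the cost of returning integers rather than floats.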

Upvotes: 0

Joris Schellekens

Reputation: 9012

Disclaimer: I am the author of borb, the library used in this answer.

About PDF: PDF is not so much a "what you see is what you get" format as it is a container for rendering instructions. That means a table is in fact just a collection of rendering instructions that draw something we humans interpret as a table. Something like:

go to location x, y
set the current stroke colour to black
set the current fill colour to blue
set the font to Helvetica, size 12
draw a line to x, y
move the pen up
go to x, y
render the string "Hello World"

Whenever a PDF library extracts tables from a PDF, it's important to keep in mind that this is a heuristic. It is based on assumptions such as "tables tend to have a large number of lines that intersect at 90-degree angles".
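
To make the heuristic concrete, here is a toy sketch that counts where horizontal and vertical line segments cross. This is not borb's actual algorithm; the function name, the segment representation and the tolerance are assumptions chosen purely for illustration.

```python
def count_crossings(horizontals, verticals, tol=0.5):
    """Count crossings between axis-aligned segments.

    horizontals: iterable of (x0, x1, y)
    verticals:   iterable of (x, y0, y1)
    """
    count = 0
    for hx0, hx1, hy in horizontals:
        for vx, vy0, vy1 in verticals:
            # The vertical's x must lie on the horizontal, and the
            # horizontal's y must lie on the vertical (within tolerance).
            if hx0 - tol <= vx <= hx1 + tol and vy0 - tol <= hy <= vy1 + tol:
                count += 1
    return count


# A 2x2 grid is drawn with three horizontal and three vertical rules.
grid_h = [(0, 100, y) for y in (0, 20, 40)]
grid_v = [(x, 0, 40) for x in (0, 50, 100)]
print(count_crossings(grid_h, grid_v))

# Two unrelated lines elsewhere on the page do not cross at all.
print(count_crossings([(0, 100, 200)], [(300, 0, 40)]))
```

A region with many such crossings is a plausible table candidate, whereas isolated lines (underlines, separators) produce few or none.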

I suggest you have a look at TableDetectionByLines in borb. It's a class that gathers the aforementioned rendering instructions and spits out the locations of tables in the PDF document.

You would use it like this:

import typing

from borb.pdf.canvas.layout.table.table import Table, TableCell
from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.toolkit.table.table_detection_by_lines import TableDetectionByLines

input_file = "/path/to/target.pdf"  # placeholder: path to your PDF

doc: typing.Optional[Document] = None
with open(input_file, "rb") as input_pdf_handle:
    # the listener receives the rendering instructions while the PDF is parsed
    l: TableDetectionByLines = TableDetectionByLines()
    doc = PDF.loads(input_pdf_handle, [l])

assert doc is not None
tables: typing.List[Table] = l.get_tables_for_page(0)

As it stands, this class does not track the stroke/fill colour. But you can easily subclass it, and modify it so it does.

For this, I would start at this particular line.
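
To illustrate what "tracking the fill colour" means in principle, here is a library-independent toy. It does not use borb's API: the instruction names (set_fill_rgb, fill_rect, draw_text) and the tuple format are hypothetical stand-ins for real PDF operators. The point is only that the fill colour is part of the current graphics state, so it has to be remembered at the moment a rectangle is painted.

```python
def collect_filled_rects(instructions):
    """Walk toy rendering instructions and pair each filled rectangle
    with the fill colour that was active when it was painted."""
    fill = (0.0, 0.0, 0.0)  # PDF's initial fill colour is black
    filled = []
    for op, *args in instructions:
        if op == "set_fill_rgb":
            fill = tuple(args)  # update the graphics state
        elif op == "fill_rect":
            filled.append((tuple(args), fill))  # (x, y, width, height), colour
        # any other instruction (text, strokes, ...) is ignored here
    return filled


instructions = [
    ("set_fill_rgb", 0.0, 0.0, 1.0),
    ("fill_rect", 10, 10, 80, 20),
    ("draw_text", "Hello World"),
    ("set_fill_rgb", 1.0, 1.0, 0.0),
    ("fill_rect", 10, 30, 80, 20),
]

for rect, colour in collect_filled_rects(instructions):
    print(rect, colour)
```

A subclass of the table detector would need to do something comparable: record the active fill colour for each painted rectangle, then match those rectangles to the detected cell locations.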

Upvotes: 1
