Reputation: 1
I've been using the tabula-py, PyPDF2 and tika modules, but none of them seems to detect the background color of a table cell in a PDF file.
These colored cells carry important information in the context of my problem. I know, for example, that tabula-py is a wrapper around tabula-java, which does not provide colored-cell information. Is there an easy-to-follow solution in Python?
Thanks in advance.
Upvotes: 0
Views: 2542
Reputation: 2897
A user kindly reported that my previous solution did not work well.
That's true, because pdfplumber's page.rects does not always detect the cells of a table correctly; sometimes it only detects lines, rows, or columns.
So I propose another solution here: render the page to an image and take the most common pixel color inside each cell.
import pdfplumber
from collections import Counter


def get_cell_color(image, cell: tuple[float, float, float, float]):
    """Return the most common RGB color inside the cell's bounding box."""
    cropped_image = image.crop(cell)
    pixels = list(cropped_image.convert('RGB').getdata())
    color_counts = Counter(pixels)
    most_common = color_counts.most_common(1)
    return most_common[0][0]


def demo(page):
    """example method: print colored cells information"""
    # The default rendering resolution is 72 DPI, so image pixels line up with PDF points.
    page_image = page.to_image().original
    tables = page.find_tables()
    for table in tables:
        extracted_table = table.extract()
        for row_idx, row in enumerate(table.rows):
            for cell_idx, cell in enumerate(row.cells):
                if cell is None:  # pdfplumber uses None for missing cells
                    continue
                cell_color = get_cell_color(page_image, cell)
                if cell_color != (255, 255, 255):
                    print(f"cell color: {cell_color}")
                    print(f"cell location: {cell}")
                    print(f"cell content: {extracted_table[row_idx][cell_idx]}")


with pdfplumber.open("/path/to/target.pdf") as pdf:
    page = pdf.pages[0]
    demo(page)
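The demo above compares the dominant color against exactly (255, 255, 255). A rendered page may contain near-white pixels, so a small tolerance can be useful. Here is a minimal sketch of that idea on synthetic pixel lists, using only the standard library; `dominant_color`, `is_near_white` and the tolerance of 10 are hypothetical names and values chosen for illustration, not part of pdfplumber.

```python
from collections import Counter

RGB = tuple[int, int, int]


def dominant_color(pixels: list[RGB]) -> RGB:
    """Return the most frequent color in a list of RGB pixels."""
    return Counter(pixels).most_common(1)[0][0]


def is_near_white(color: RGB, tolerance: int = 10) -> bool:
    """True if every channel is within `tolerance` of 255 (assumed threshold)."""
    return all(255 - channel <= tolerance for channel in color)


# Synthetic "cell": mostly light-yellow fill plus some black text pixels.
yellow_cell = [(255, 255, 153)] * 80 + [(0, 0, 0)] * 20
# Synthetic "cell": slightly off-white background plus some black text pixels.
plain_cell = [(254, 254, 254)] * 90 + [(0, 0, 0)] * 10

for name, pixels in (("yellow", yellow_cell), ("plain", plain_cell)):
    color = dominant_color(pixels)
    label = "background" if is_near_white(color) else "colored"
    print(name, color, label)
```

In the pdfplumber demo, the same check could replace the `cell_color != (255, 255, 255)` comparison.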
Upvotes: 1
Reputation: 2897
I found a solution using pdfplumber. Here is rough sample code.
from typing import Optional

import pdfplumber
from pdfplumber.page import Page, Table


def cmyk_to_rgb(cmyk: tuple[float, float, float, float]):
    # Clamp each sum to 1.0 so dark colors cannot produce negative channel values.
    r = 255 * (1.0 - min(1.0, cmyk[0] + cmyk[3]))
    g = 255 * (1.0 - min(1.0, cmyk[1] + cmyk[3]))
    b = 255 * (1.0 - min(1.0, cmyk[2] + cmyk[3]))
    return r, g, b


def to_bbox(rect: dict) -> tuple[float, float, float, float]:
    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])


def is_included(cell_box: tuple[float, float, float, float], rect_box: tuple[float, float, float, float]):
    # True if the cell lies entirely inside the rectangle.
    c_left, c_top, c_right, c_bottom = cell_box
    r_left, r_top, r_right, r_bottom = rect_box
    return c_left >= r_left and c_top >= r_top and c_right <= r_right and c_bottom <= r_bottom


def find_rect_for_cell(cell: tuple[float, float, float, float], rects: list[dict]) -> Optional[dict]:
    return next((r for r in rects if is_included(cell, to_bbox(r))), None)


def get_cell_color(cell: tuple[float, float, float, float], page: Page) -> tuple[float, float, float]:
    # Assumes the rectangle's fill color is given as CMYK; falls back to white if no rectangle is found.
    rect = find_rect_for_cell(cell, page.rects) if cell else None
    return cmyk_to_rgb(rect["non_stroking_color"]) if rect else (255, 255, 255)


pdf = pdfplumber.open("/path/to/target.pdf")
page = pdf.pages[0]
tables: list[Table] = page.find_tables()
# get RGB color of first (= top-left) cell of first table
print(get_cell_color(tables[0].rows[0].cells[0], page))  # => (r, g, b)
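The snippet above assumes non_stroking_color is always a CMYK tuple. PDF fills can also be gray or RGB, so here is a rough sketch that normalizes all three. It assumes the value arrives as a bare number or as a sequence of 1, 3 or 4 floats in the range 0..1 (the component counts of DeviceGray, DeviceRGB and DeviceCMYK); `pdf_color_to_rgb` is a hypothetical helper, not a pdfplumber function, and treating a missing color as white is my own choice.

```python
from typing import Optional, Sequence, Union


def pdf_color_to_rgb(
    color: Optional[Union[float, Sequence[float]]]
) -> tuple[int, int, int]:
    """Convert a gray, RGB or CMYK color (components in 0..1) to 0..255 RGB."""
    if color is None:
        return (255, 255, 255)  # assumption: no fill color means white
    if isinstance(color, (int, float)):
        color = (color,)  # a bare number is treated as a gray level
    if len(color) == 1:
        r = g = b = color[0]
    elif len(color) == 3:
        r, g, b = color
    elif len(color) == 4:
        c, m, y, k = color
        # Simple additive conversion, clamped so the result never goes negative.
        r = 1.0 - min(1.0, c + k)
        g = 1.0 - min(1.0, m + k)
        b = 1.0 - min(1.0, y + k)
    else:
        raise ValueError(f"unsupported number of color components: {len(color)}")
    # Clamp to 0..1 before scaling, in case the input is slightly out of range.
    return tuple(round(255 * max(0.0, min(1.0, v))) for v in (r, g, b))


print(pdf_color_to_rgb((0.0, 0.0, 1.0, 0.0)))  # CMYK yellow
print(pdf_color_to_rgb((1.0, 0.0, 0.0)))       # RGB red
print(pdf_color_to_rgb((0.5,)))                # mid gray
```

Colors from pattern or other special color spaces would need separate handling.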
Upvotes: 0
Reputation: 9012
Disclaimer: I am the author of borb, the library used in this answer.
About PDF: a PDF is not so much a "what you see is what you get" format as a container for rendering instructions. That means a table is in fact just a collection of rendering instructions that draw something we humans interpret as a table. Something like: move to a point, draw a line to another point, set the fill colour, fill a rectangle.
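To make the idea of rendering instructions concrete, here is a toy sketch in plain Python (it does not use borb). It walks a simplified content stream and records which fill colour was active when each rectangle was filled. The operators `rg` (set RGB fill colour), `re` (append rectangle), `f` (fill) and `S` (stroke) are real PDF operators, but the parser is an illustration only: it ignores strings, arrays, graphics state saves, transformations and everything else a real stream contains. `filled_rectangles` and `STREAM` are made-up names.

```python
def filled_rectangles(content: str):
    """Return (rectangle, fill_colour) pairs for each filled rectangle."""
    fill = (0.0, 0.0, 0.0)  # the initial fill colour is black
    operands: list[float] = []
    path_rects: list[tuple[float, ...]] = []
    painted = []
    for token in content.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass  # not a number, so treat the token as an operator
        if token == "rg" and len(operands) >= 3:
            fill = tuple(operands[-3:])              # set RGB fill colour
        elif token == "re" and len(operands) >= 4:
            path_rects.append(tuple(operands[-4:]))  # x, y, width, height
        elif token in ("f", "f*"):
            painted.extend((rect, fill) for rect in path_rects)
            path_rects = []
        elif token in ("S", "n"):
            path_rects = []                          # path ends without a fill
        operands = []  # every operator consumes its operands
    return painted


# A yellow filled rectangle (a "cell background") followed by a stroked black line.
STREAM = """
1 1 0 rg
50 700 100 20 re
f
0 0 0 RG
50 680 m
150 680 l
S
"""

for rect, colour in filled_rectangles(STREAM):
    print("rectangle", rect, "filled with", colour)
```

Nothing in the stream says "this is a table cell"; the coloured rectangle and the line are separate instructions that a table detector has to piece together.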
Whenever a PDF library extracts tables from a PDF, it's important to keep in mind that this is a heuristic. It's based on assumptions such as "tables tend to have a large number of lines that intersect at 90-degree angles".
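As a rough illustration of that kind of assumption (this is not borb's actual algorithm), the sketch below counts how often horizontal and vertical line segments cross. A detector could then treat a region with many such crossings as a table candidate. The segment format, the tolerance `eps` and the function name are all hypothetical.

```python
from itertools import product

Segment = tuple[float, float, float, float]  # x0, y0, x1, y1


def count_right_angle_crossings(segments: list[Segment], eps: float = 0.5) -> int:
    """Count crossings between axis-aligned horizontal and vertical segments."""
    horizontals = [s for s in segments if abs(s[1] - s[3]) <= eps]
    # Anything not classified as horizontal is tested for being vertical.
    verticals = [
        s for s in segments
        if abs(s[1] - s[3]) > eps and abs(s[0] - s[2]) <= eps
    ]
    count = 0
    for h, v in product(horizontals, verticals):
        hx0, hx1 = sorted((h[0], h[2]))
        vy0, vy1 = sorted((v[1], v[3]))
        x, y = v[0], h[1]  # the only point where the two segments can meet
        if hx0 - eps <= x <= hx1 + eps and vy0 - eps <= y <= vy1 + eps:
            count += 1
    return count


# A 2x2 grid: three horizontal and three vertical lines.
GRID = [
    (0, 0, 100, 0), (0, 20, 100, 20), (0, 40, 100, 40),  # horizontal
    (0, 0, 0, 40), (50, 0, 50, 40), (100, 0, 100, 40),   # vertical
]

print(count_right_angle_crossings(GRID))
```

Two parallel underlines would give zero crossings, while a ruled grid gives one crossing per line pair, which is what makes the count a usable signal.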
I suggest you have a look at TableDetectionByLines in borb. It's a class that gathers the aforementioned rendering instructions and spits out the locations of tables in the PDF document.
You would use it as such:
import typing

from borb.pdf.canvas.layout.table.table import Table, TableCell
from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.toolkit.table.table_detection_by_lines import TableDetectionByLines

input_file = "/path/to/target.pdf"  # placeholder path, replace with your own PDF

doc: typing.Optional[Document] = None
with open(input_file, "rb") as input_pdf_handle:
    # the listener collects rendering instructions while the PDF is parsed
    l: TableDetectionByLines = TableDetectionByLines()
    doc = PDF.loads(input_pdf_handle, [l])

assert doc is not None
tables: typing.List[Table] = l.get_tables_for_page(0)
As it stands, this class does not track the stroke/fill colour. But you can easily subclass it, and modify it so it does.
For this, I would start at this particular line.
Upvotes: 1