user34088
user34088

Reputation: 41

highlight Text in pdf file without using search_for()

I would like to highlight text in my pdf file by using PyMuPDF library. The method search_for() return the location of the searched words. the problem is this method ignore spaces. Upper / lower case.it works only for ASCII characters.

is there any solution to get the location\coordinate without using search_for()

my Code:

pattern=re.compile(r'(\[V2G2-\d{3}\])(\s{1,}\w(.+?)\.  )')
for m in re.finditer(pattern,text):
     macted.append(m.group())

def doHighleigh():
    pdf_document = fitz.open("ISO_15.pdf")
    page_num = pdf_document.page_count
    
    for i in range(page_num):
        page = pdf_document[i]

        for item in macted:
            search_instances = page.search_page_for(item,quad=True)
            
            for q in search_instances:
                highlight = page.add_highlight_annot(q)
                #RGB(127, 255, 255)
                highlight.set_colors({"stroke": (0.5, 1, 1), "fill": (0.75, 0.8, 0.95)})
                highlight.update()
    pdf_document.save(r"output.pdf")

it igone the sec. sentence because the spaces between the words. enter image description here

Upvotes: 1

Views: 961

Answers (1)

Jorj McKie
Jorj McKie

Reputation: 3110

Using the search method is just one way to get hold of coordinates required for highlighting. You can also use any of the page.get_text() variants returning text coordinates. Looking at your example, the "blocks" variant may be sufficient, or a combination of "words" and "blocks" extractions.

page.get_text("blocks") returns a list of items like (x0, y0, x1, y1, "line1\nline2\n, ...", blocknumber, blocktype). The first 4 items in the tuple are the coordinates of the enveloping rectangle.

page.get_text("words") You also can extract a list of words (strings containing no spaces) with similar items: (x0, y0, x1, y1, "wordstring", blocknumber, linenumber, wordnumber).

You could inspect the "words" for items matching the regex pattern and then highlight the respective block. Probably can even be done without regular expressions. Here is a snippet that may serve your intention:

def matches(word):
    if word.startswith("[V2G2-") and word.endswith(("]", "].")):
        return True
    return False

def add_highlight(page, rect):
    """Highlight annots have no fill color"""
    annot = page.add_highlight_annot(rect)
    annot.set_colors(stroke=(0.5,1,1))
    annot.update()

flags = fitz.TEXTFLAGS_TEXT  # need identical flags for all extractions
for page in doc:
    blocks = page.get_text("blocks", flags=flags)
    words = page.get_text("words", flags=flags)
    for word in words:
        blockn = word[-3]  # block number
        if matches(word[4]):
            block = blocks[blockn]  # get the containing block
            block_rect = fitz.Rect(block[:4])
            add_highlight(page, block_rect)

So the approach used here is: check if a block contains a matching word. If so, highlight it.

Upvotes: 1

Related Questions