kishan Kumar

Reputation: 21

How to detect selected text from a PDF using Python in a Django application?

I have a Python script that successfully detects selected text from a PDF file using the xdotool command-line tool and the pyperclip library. Here's the script:

import time
import os
import subprocess
import pyperclip  

def get_selected_text():
    subprocess.run(['xdotool', 'key', 'ctrl+c'])  
    time.sleep(1)  
    return pyperclip.paste()

if __name__ == "__main__":
    pdf_filepath = 'class3english.pdf'
    subprocess.Popen(['xdg-open', pdf_filepath])

    while True:
        selected_text = get_selected_text()
        if selected_text:
            print("Selected text:", selected_text)
        time.sleep(2)


# views.py (Django implementation)
# NOTE: the selected text from the PDF is only printed if I copy it first; only then is the selected word detected.

import time
import subprocess

import pyperclip
from django.shortcuts import render


def get_selected_text():
    subprocess.run(['xdotool', 'key', 'ctrl+c'])
    return pyperclip.paste()


def get_selected_text_view(request):
    while True:
        selected_text = get_selected_text()
        if selected_text:
            print("Selected text:", selected_text)
            time.sleep(2)        
            return render(request, 'grad_school/viewer.html')
        else:
            print("\nselected_text = ","NONE")
            time.sleep(2)
            return render(request, 'grad_school/viewer.html')

#viewer.html


{% load static %}
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>PDF Viewer</title>
    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
</head>
<body>
    <embed src="{% static 'class3english.pdf' %}" type="application/pdf" width="100%" height="600px" />
    <button id="readButton">Read</button>

    <script>
        $(document).ready(function() {
            $('#readButton').click(function() {
                $.ajax({
                    url: '{% url 'detectingWordSentences' %}', 
                    type: 'GET',
                    success: function(response) {
                       console.log(response);
                    },
                    error: function(xhr, status, error) {
                        console.error(error);
                    }
                });
            });
        });
    </script>
</body>
</html>

However, when I try to integrate this script into my Django application, it doesn't detect the selected text from the PDF. Instead, it captures text from the clipboard. I want to be able to detect text that the user has actively selected within the PDF viewer.

Is there a way to achieve this within a Django application? How can I modify my approach to capture only the selected text from a PDF when running within a Django environment?

Upvotes: 0

Views: 152

Answers (1)

M69k65y

Reputation: 647

The best option I've found would be to:

  1. Extract the text of the PDF and render it in your template, perhaps with some styling applied. My code sample uses the pypdf package.
  2. Copy the selected text using JavaScript. (See this answer for a detailed breakdown.)

This is what this approach would look like:

import time

from django.shortcuts import render
from pypdf import PdfReader

# Use `get_selected_text` as it is.

def get_selected_text_view(request):
    # Read the PDF file
    reader = PdfReader("path-to-your-file.pdf")

    # Extract the text of every page, keeping the layout where possible.
    extracted_text = [
        page.extract_text(extraction_mode="layout") for page in reader.pages
    ]

    # Check once for selected text; the view returns a response either way.
    selected_text = get_selected_text()
    if selected_text:
        print("Selected text:", selected_text)
    else:
        print("\nselected_text = ", "NONE")
    time.sleep(2)

    # Pass the extracted text as context data
    return render(request, 'pyperclip.html', {"pdf_text": extracted_text})

Update your template:

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>PDF Viewer</title>
        <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
    </head>
    <body>
        <!-- Display all text extracted from the PDF -->
        <div style="width: 100%; height: 600px;">
            {% for page in pdf_text %}
                <!-- To try and render the text as it may have been laid out in the PDF,
                we use the pre tag. -->
                <pre>{{ page }}</pre>
            {% endfor %}
        </div>
        <button id="readButton">Read</button>

        <script>
            $(document).ready(function() {
                $('#readButton').click(function() {
                    // Get the selected text.
                    let selectedText = document.getSelection().toString();
                    // writeText() returns a promise; call the server only
                    // after the clipboard has been updated.
                    navigator.clipboard.writeText(selectedText).then(function() {
                        $.ajax({
                            url: '{% url 'detectingWordSentences' %}',
                            type: 'GET',
                            success: function(response) {
                                // console.log(response);
                            },
                            error: function(xhr, status, error) {
                                // console.error(error);
                            }
                        });
                    });
                });
            });
        </script>
    </body>
</html>

The PDF text is extracted first and then displayed because text selected in an embedded PDF (either through the embed or iframe elements) is not recognised by the getSelection() function.
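
Since the selection is already available in the page through getSelection(), one alternative to going through the system clipboard is to send it to the server in the request body. Below is a minimal sketch of the server-side parsing, not a definitive implementation; the helper name `extract_selected_text` and the `selected_text` JSON key are assumptions made for illustration and are not part of the code above.

```python
import json


def extract_selected_text(raw_body: bytes, key: str = "selected_text") -> str:
    """Return the selection sent by the browser, or "" if absent or malformed.

    `key` is a hypothetical field name; it must match whatever the
    front-end JavaScript puts in its JSON payload.
    """
    try:
        payload = json.loads(raw_body.decode("utf-8"))
    except ValueError:
        # Covers both invalid UTF-8 and invalid JSON.
        return ""
    if not isinstance(payload, dict):
        return ""
    value = payload.get(key, "")
    if not isinstance(value, str):
        return ""
    # Collapse runs of whitespace, since text selected inside a <pre>
    # block can carry layout padding and line breaks.
    return " ".join(value.split())
```

In a Django view this could be called as `extract_selected_text(request.body)` for a POST request, with the button handler sending `JSON.stringify({selected_text: selectedText})` as the request body. Django's CSRF protection applies to POST requests, so the AJAX call would also need to include the CSRF token.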

Additionally, because browsers use different rendering engines, it may not be possible to reliably get the selected text. (See this post.)

Another approach you could explore is the PDF.js package. In my (limited) testing, I wasn't able to render the PDF as an embedded file, even when following the example included in the documentation; the output displayed in the browser was an image. However, on the demo site it is possible to get the selected text in the console using getSelection(), so I may simply be missing something.

Upvotes: 0
