BostonPlummer

Reputation: 127

Detect and remove references from a PDF when the section title "References" or "Bibliography" is not present - Python

I put together the following code to extract the manuscript text from an online (2-column) PDF. Since there is no clear section title such as "References" or "Bibliography", I have tried several times to detect the references and remove them from the extracted manuscript text, but without success. Do you have any suggestions on how to remove the references when the section title "References" or "Bibliography" is not present?

This is the code, which extracts the manuscript text from https://hal.science/hal-04206682/document:

import fitz  # PyMuPDF
import requests
import io

def extract_text_from_pdf(pdf_file):
    # Open the PDF file from the stream
    doc = fitz.open(stream=pdf_file, filetype="pdf")
    full_text = []

    for page_num in range(len(doc)):
        page = doc[page_num]
        blocks = page.get_text("dict")["blocks"]

        # Analyze the page width to detect column structure
        page_width = page.rect.width
        mid_x = page_width / 2  # Middle of the page for splitting columns

        left_column = []
        right_column = []

        for block in blocks:
            if "bbox" in block:
                x0, y0, x1, y1 = block["bbox"]  # Extract block bounding box

                # Classify blocks into left or right columns
                if x1 <= mid_x:
                    left_column.append(block)
                elif x0 >= mid_x:
                    right_column.append(block)

        # Sort blocks by their vertical position (top) within each column
        left_column.sort(key=lambda b: b["bbox"][1])
        right_column.sort(key=lambda b: b["bbox"][1])

        # Extract text from each column and concatenate
        page_text = []
        for column in [left_column, right_column]:
            for block in column:
                if "lines" in block:
                    block_text = ""
                    for line in block["lines"]:
                        for span in line["spans"]:
                            block_text += span["text"] + " "
                    page_text.append(block_text.strip())

        # Combine text from both columns into page text
        full_text.append("\n".join(page_text))

    return "\n\n".join(full_text)

# Fetch the PDF from the URL
url = 'https://hal.science/hal-04206682/document'

try:
    response = requests.get(url)
    response.raise_for_status()  # Raise an error if the request failed
    pdf_file = io.BytesIO(response.content)  # Load the PDF content into memory

    # Extract text from the PDF
    text = extract_text_from_pdf(pdf_file)
    print(text)
except requests.exceptions.RequestException as e:
    print(f"Error downloading the PDF: {e}")

A note, not related to my question, but it may be useful for readers: the code correctly extracts the text following the 2-column PDF structure. However, it does not distinguish between the "Figure" (caption) text and the main manuscript text, and therefore inserts the "Figure" text into the main manuscript text wherever a figure occurs. And I do not know how to remove the "Figure" texts.
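A minimal sketch of one possible filter, assuming that caption blocks either start with "Figure"/"Fig." or are set in a noticeably smaller font than the body text; the looks_like_caption helper and the body_size value are only illustrative guesses, not taken from this PDF:

def looks_like_caption(block, body_size=9.5):
    # Heuristic: treat a block as figure/caption text if its text starts
    # with "Figure"/"Fig." or if its average span font size is clearly
    # smaller than the assumed body-text size (body_size is a guess).
    if "lines" not in block:
        return False
    spans = [s for line in block["lines"] for s in line["spans"]]
    if not spans:
        return False
    text = " ".join(s["text"] for s in spans).strip()
    if text.startswith(("Figure", "Fig.")):
        return True
    avg_size = sum(s["size"] for s in spans) / len(spans)
    return avg_size < body_size - 1  # noticeably smaller than body text

Inside the column loop one could then skip any block for which looks_like_caption(block) is True.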

Upvotes: 0

Views: 86

Answers (2)

BostonPlummer

Reputation: 127

This is meant to be a comment for @EuanG, but given the space limits in the comment section, I am writing it here.

I tried to wrap what you wrote in a def, but it is not really working... I guess something is missing, but I am not able to figure out what.

import re

def remove_references(text):

    # Individual regex patterns (defined here but currently unused;
    # only the combined pattern below is actually applied)
    numeric_pattern = r"\[\d+\]|\(\d+\)"
    author_year_pattern = r"\b[A-Z][a-z]+ et al\., \d{4}|\(\w+, \d{4}\)"
    doi_url_pattern = r"\bdoi:|http[s]?://\S+"
    sequential_numeric_pattern = r"\[\d+(,\s*\d+)*\]|\(\d+(,\s*\d+)*\)"
    journal_style_pattern = r"[A-Za-z\s]+, \d{1,4}\([\d\-]+\):\d+\-\d+"

    # Combined pattern covering the common referencing styles
    common_references_pattern = r"^\[\d+\]|\(\d+\)|\b[A-Z][a-z]+ et al\., \d{4}|\(\w+, \d{4}\)|doi:|http[s]?://"

    # re.MULTILINE is needed so the ^ anchor matches at the start of every line
    cleaned_text = re.sub(common_references_pattern, "", text, flags=re.MULTILINE)

    return cleaned_text
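
For completeness, this is how I call it on the text extracted above (hypothetical usage, assuming text comes from extract_text_from_pdf):

# Hypothetical usage: clean the text produced by extract_text_from_pdf
cleaned = remove_references(text)
print(cleaned)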

Upvotes: 0

EuanG

Reputation: 1122

You can use some heuristics to detect and remove references, since they usually follow common patterns.

Import the re module for regular expressions.

import re

Then use the patterns below to identify the different citation/reference styles and deal with them as you like. Pick the regex that matches the PDF's style, or combine all of them (but be careful about false positives).

Numeric References (e.g. IEEE style: [1], [23], (1), (23)):

numeric_pattern = r"^\[\d+\]|\(\d+\)"

Author-Year References (APA, Harvard: Smith et al., 2020, (Smith, 2020)):

author_year_pattern = r"\b[A-Z][a-z]+ et al\., \d{4}|\(\w+, \d{4}\)"

Citation with DOIs or URLs (doi:, http:, https:):

doi_url_pattern = r"\bdoi:|http[s]?://"

Sequential References ([1, 2, 3], (1, 2, 3)):

sequential_numeric_pattern = r"^\[\d+(,\s*\d+)*\]|\(\d+(,\s*\d+)*\)"

Journal Style (Journal Name, Volume(Issue):Pages):

journal_style_pattern = r"[A-Za-z\s]+, \d{1,4}\([\d\-]+\):\d+\-\d+"

General Patterns for References (Combines common referencing styles):

common_references_pattern = r"^\[\d+\]|\(\d+\)|\b[A-Z][a-z]+ et al\., \d{4}|\(\w+, \d{4}\)|doi:|http[s]?://"
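
A minimal sketch of how the combined pattern could be used to drop an unlabelled reference list at the end of the extracted text (rather than only stripping in-text citations): scan the lines from the bottom and cut at the first long run of reference-like lines. The strip_trailing_references name and the min_run threshold are only illustrative assumptions:

import re

# Combined pattern from above, compiled once
ref_line_re = re.compile(
    r"^\[\d+\]|\(\d+\)|\b[A-Z][a-z]+ et al\., \d{4}|\(\w+, \d{4}\)|doi:|http[s]?://"
)

def strip_trailing_references(text, min_run=5):
    # Walk the lines from the bottom: reference-like lines extend the run,
    # blank lines are skipped, and the first "normal" text line ends the scan.
    lines = text.splitlines()
    cut = len(lines)  # index of the first line to drop
    run = 0
    for i in range(len(lines) - 1, -1, -1):
        line = lines[i].strip()
        if not line:
            continue
        if ref_line_re.search(line):
            run += 1
            cut = i
        else:
            break
    # Only cut if the trailing block is long enough to plausibly be a bibliography
    return "\n".join(lines[:cut]) if run >= min_run else text

Note that this stops as soon as it hits a non-matching, non-blank line, so wrapped reference entries whose continuation lines do not match any pattern may require loosening the patterns or the loop.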

Upvotes: 1
