Reputation: 127
I put together the following code to extract the manuscript text from an online two-column PDF. Since there is no clear section title such as "References" or "Bibliography", I tried several times to detect the references and remove them from the extracted main text, but without success. Do you have any suggestions on how to remove the references when no "References" or "Bibliography" section title is present?
This is the code, which extracts the manuscript text from https://hal.science/hal-04206682/document:
import fitz  # PyMuPDF
import requests
import io

def extract_text_from_pdf(pdf_file):
    # Open the PDF file from the stream
    doc = fitz.open(stream=pdf_file, filetype="pdf")
    full_text = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        blocks = page.get_text("dict")["blocks"]
        # Analyze the page width to detect column structure
        page_width = page.rect.width
        mid_x = page_width / 2  # Middle of the page for splitting columns
        left_column = []
        right_column = []
        for block in blocks:
            if "bbox" in block:
                x0, y0, x1, y1 = block["bbox"]  # Extract block bounding box
                # Classify blocks into left or right columns
                if x1 <= mid_x:
                    left_column.append(block)
                elif x0 >= mid_x:
                    right_column.append(block)
        # Sort blocks by their vertical position (top) within each column
        left_column.sort(key=lambda b: b["bbox"][1])
        right_column.sort(key=lambda b: b["bbox"][1])
        # Extract text from each column and concatenate
        page_text = []
        for column in [left_column, right_column]:
            for block in column:
                if "lines" in block:
                    block_text = ""
                    for line in block["lines"]:
                        for span in line["spans"]:
                            block_text += span["text"] + " "
                    page_text.append(block_text.strip())
        # Combine text from both columns into page text
        full_text.append("\n".join(page_text))
    return "\n\n".join(full_text)

# Fetch the PDF from the URL
url = 'https://hal.science/hal-04206682/document'
try:
    response = requests.get(url)
    response.raise_for_status()  # Raise an error if the request failed
    pdf_file = io.BytesIO(response.content)  # Load the PDF content into memory
    # Extract text from the PDF
    text = extract_text_from_pdf(pdf_file)
    print(text)
except requests.exceptions.RequestException as e:
    print(f"Error downloading the PDF: {e}")
A note, unrelated to my question, that may be useful for readers: the code correctly extracts the text following the 2-column PDF structure. However, it does not distinguish between the "Figure" text and the main manuscript text, so the "Figure" text ends up in the middle of the main text wherever a figure occurs. I do not know how to remove the "Figure" text.
Upvotes: 0
Views: 86
Reputation: 127
This is meant to be a comment for @EuanG, but given the length limit on comments, I am writing it here.
I tried to wrap what you wrote in a function, but it is not really working. I guess something is missing, but I cannot figure out what.
import re

def remove_references(text):
    # Define regex patterns
    numeric_pattern = r"\[\d+\]|\(\d+\)"
    author_year_pattern = r"\b[A-Z][a-z]+ et al\., \d{4}|\(\w+, \d{4}\)"
    doi_url_pattern = r"\bdoi:|http[s]?://\S+"
    sequential_numeric_pattern = r"\[\d+(,\s*\d+)*\]|\(\d+(,\s*\d+)*\)"
    journal_style_pattern = r"[A-Za-z\s]+, \d{1,4}\([\d\-]+\):\d+\-\d+"
    common_references_pattern = r"^\[\d+\]|\(\d+\)|\b[A-Z][a-z]+ et al\., \d{4}|\(\w+, \d{4}\)|doi:|http[s]?://"
    # Remove the references from the text using regex
    cleaned_text = re.sub(common_references_pattern, "", text)
    return cleaned_text
Upvotes: 0
Reputation: 1122
You can use some heuristics for detecting and removing references as they usually have common patterns or rules.
Import re for regular expressions.
import re
Then use these patterns to identify the different citation/reference styles and handle them as you like. Depending on the PDF's style, you can use the common reference regexes below (or just use all of them, but be careful, since broad patterns can also match ordinary text).
Numeric References (e.g. IEEE: [1], [23], (1), (23)):
numeric_pattern = r"^\[\d+\]|\(\d+\)"
Author-Year References (APA, Harvard: Smith et al., 2020, (Smith, 2020)):
author_year_pattern = r"\b[A-Z][a-z]+ et al\., \d{4}|\(\w+, \d{4}\)"
Citation with DOIs or URLs (doi:, http:, https:):
doi_url_pattern = r"\bdoi:|http[s]?://"
Sequential References ([1, 2, 3], (1, 2, 3)):
sequential_numeric_pattern = r"^\[\d+(,\s*\d+)*\]|\(\d+(,\s*\d+)*\)"
Journal Style (Journal Name, Volume, Page Numbers):
journal_style_pattern = r"[A-Za-z\s]+, \d{1,4}\([\d\-]+\):\d+\-\d+"
General Patterns for References (Combines common referencing styles):
common_references_pattern = r"^\[\d+\]|\(\d+\)|\b[A-Z][a-z]+ et al\., \d{4}|\(\w+, \d{4}\)|doi:|http[s]?://"
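One thing to watch: re.sub with these patterns only removes the matched token (for example "[1]"), not the rest of the reference entry, and "^" only matches at the very start of the text unless you pass re.MULTILINE or test each line separately. Below is a minimal sketch of one way to combine the ideas, not a definitive solution. It assumes that each reference-list entry starts on its own line with "[n]" or "(n)", which may not hold for every PDF. The names ENTRY_START and INLINE_CITATION and the sample text are made up for illustration.

```python
import re

# Assumption: each reference-list entry starts its own line with "[n]" or "(n)".
ENTRY_START = re.compile(r"^\s*(?:\[\d+\]|\(\d+\))\s")

# Inline citations such as "[1]", "[1, 2, 3]", "(Smith, 2020)" or
# "(Smith et al., 2020)". Bare "(1)" is left out on purpose because it
# often marks an equation number rather than a citation.
INLINE_CITATION = re.compile(
    r"[ \t]*(?:\[\d+(?:,\s*\d+)*\]|\([A-Z][a-z]+(?: et al\.)?, \d{4}\))"
)

def remove_references(text):
    """Drop lines that look like reference entries, then strip inline citations."""
    # Checking line by line lets "^" apply to every line, not just the first.
    kept = [line for line in text.splitlines() if not ENTRY_START.match(line)]
    return INLINE_CITATION.sub("", "\n".join(kept))

# Made-up sample text, only to show the behaviour.
sample = (
    "Deep models work well [1, 2].\n"
    "[1] A. Author, Some Journal, 2020.\n"
    "[2] B. Author, doi:10.1000/xyz\n"
    "As shown earlier (Smith, 2020), results hold."
)
print(remove_references(sample))
```

Removing whole lines only works if the extraction keeps each entry on its own line. If several entries end up merged into one line, you would need to split on the entry markers first.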
Upvotes: 1