nesta1990

Reputation: 295

Matching entities & individuals using entity recognition in R (spacyr)

I am relatively new to entity recognition, but I have been following this useful guide. I have a large text corpus of congressional policy debates from a Latin American country, translated into English.

My goal is to explore which Congress members mention a specific free trade agreement, "NAFTA", the most, along with their political affiliation. I will later analyze sentiment or views about this agreement by political affiliation, but for now I am working on the first task, and I am not sure whether entity recognition would be helpful. The main political parties are abbreviated "WP", "PRD", and "PAN".

Here is my current attempt:

# Install and load required packages
# install.packages("pdftools")
# install.packages("spacyr")
library(pdftools)
library(spacyr)
library(quanteda)
library(dplyr)
# spacy_initialize(model = "en_core_web_sm")
## successfully initialized (spaCy Version: 3.7.2, language model: en_core_web_sm)
pdf_files <- c("df1.pdf", "df2.pdf", "df3.pdf")

# Function to extract text from PDFs (pdf_text() returns one string per page)
extract_text_from_pdf <- function(pdf_files) {
  texts <- lapply(pdf_files, pdf_text)
  unlist(texts)
}

# Extract text from PDFs
pdf_texts <- extract_text_from_pdf(pdf_files)

# Process text using SpaCy
parsed_texts <- spacy_parse(pdf_texts)
parsed_texts

Here is a data example:

dput(parsed_texts[1:25, 1:7])

output:

structure(list(doc_id = c("text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1"
), sentence_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), token_id = 1:25, 
    token = c("12/19/23", ",", "4:41", "PM", "                                                ", 
    "about", ":", "blank", "\n\n\n\n ", "Parliament", "No", ":", 
    "               ", "14", "\n\n ", "Session", "No", ":", "                  ", 
    "1", "\n\n ", "Volume", "No", ":", "                   "), 
    lemma = c("12/19/23", ",", "4:41", "pm", "                                                ", 
    "about", ":", "blank", "\n\n\n\n ", "Parliament", "no", ":", 
    "               ", "14", "\n\n ", "Session", "no", ":", "                  ", 
    "1", "\n\n ", "volume", "no", ":", "                   "), 
    pos = c("NUM", "PUNCT", "NUM", "NOUN", "SPACE", "ADP", "PUNCT", 
    "ADJ", "SPACE", "PROPN", "NOUN", "PUNCT", "SPACE", "NUM", 
    "SPACE", "PROPN", "NOUN", "PUNCT", "SPACE", "NUM", "SPACE", 
    "NOUN", "NOUN", "PUNCT", "SPACE"), entity = c("CARDINAL_B", 
    "", "TIME_B", "TIME_I", "", "", "", "ORG_B", "ORG_I", "ORG_I", 
    "", "", "", "", "", "", "", "", "", "CARDINAL_B", "", "", 
    "", "", "")), row.names = c(NA, 25L), class = c("spacyr_parsed", 
"data.frame"))

Ideally, I would like an outcome like the one below, whereas currently both the political party and the trade agreement appear under the "entity" column in my data frame.

token      entity   political affiliation   congress_member_share_of_NAFTA_mentions
Rafael     NAFTA    WP                      3%
Martinez   NAFTA    WP                      7%
Martinez   NAFTA    WP                      7%
Alberto    NAFTA    PAN                     36%
Alberto    NAFTA    PAN                     36%
Rafael     NAFTA    PAN                     24%
Rafael     NAFTA    PAN                     24%
Alberto    NAFTA    PAN                     36%

Upvotes: 3

Views: 202

Answers (1)

jeffsdata

Reputation: 444

I don't think entity recognition is the right tool for this problem. First, do some analysis of the data to find out how NAFTA actually appears when people talk about it.

  • If it is only ever referenced by the literal text "NAFTA" (which is likely), then simply count the occurrences. You've got it easy!
  • If there are a handful of ways it's referenced (say it gets shortened to "NAF", or some people write "North American Free Trade Agreement" and others "North America Free Trade Agreement"), I'd just document all of them (there are probably only 5-6 variants) and count the occurrences of each.
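The counting approach above can be sketched with dplyr and stringr. Note the `speeches` data frame here is a hypothetical stand-in: in practice you would first have to split the parsed PDFs into speeches and attach speaker/party metadata, which the question's data does not yet have.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for one-row-per-speech text with
# speaker and party metadata (not present in the question's parsed output)
speeches <- data.frame(
  speaker = c("Rafael", "Martinez", "Alberto"),
  party   = c("WP", "WP", "PAN"),
  text    = c(
    "NAFTA will harm workers. NAFTA must be renegotiated.",
    "The North American Free Trade Agreement opens markets.",
    "I support NAFTA."
  ),
  stringsAsFactors = FALSE
)

# The handful of documented spellings, matched case-insensitively
nafta_regex <- regex(
  "NAFTA|North Americ(an|a) Free Trade Agreement",
  ignore_case = TRUE
)

# Count mentions per speaker, then each speaker's share of all mentions
mention_shares <- speeches %>%
  mutate(n_mentions = str_count(text, nafta_regex)) %>%
  group_by(speaker, party) %>%
  summarise(n_mentions = sum(n_mentions), .groups = "drop") %>%
  mutate(share = n_mentions / sum(n_mentions))

mention_shares
# Rafael has 2 mentions (50%); Martinez and Alberto have 1 each (25%)
```

The `share` column here is exactly the congress_member_share_of_NAFTA_mentions figure the question asks for, just computed from plain string matching rather than NER.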

Entity recognition is for when there's a whole class of words that needs to be tagged automatically. For example, after analyzing instances of NAFTA, you could go a step further and see which [Trade Goods] are being discussed in conjunction with the word "NAFTA" - like "Oil", "Wood", "Paper", etc. For that, you might create a custom entity to tag trade goods, do some manual training to get it going, and then manually stem/combine terms - "oil", "gas", and "petroleum" might all be the same thing. You could potentially use an LLM for that as well, to tag sentences for you.
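Before investing in a custom entity, a quick way to see which terms co-occur with "NAFTA" is quanteda's keyword-in-context view. A minimal sketch with made-up stand-in text (the real input would be the `pdf_texts` vector from the question):

```r
library(quanteda)

# Made-up stand-in for the question's pdf_texts character vector
pdf_texts <- c(
  "NAFTA lowered tariffs on oil and paper imports.",
  "Wood exports rose sharply after NAFTA took effect."
)

toks <- tokens(pdf_texts)

# Every occurrence of "NAFTA" with 5 tokens of context on either side
nafta_kwic <- kwic(toks, pattern = "NAFTA", window = 5)
nafta_kwic
```

Skimming the `pre` and `post` context columns of the result is usually enough to hand-build a first list of trade-goods terms worth tagging.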

Upvotes: 0
