nesta1990

Reputation: 295

Matching entities & individuals using entity recognition in R (spacyr)

I am relatively new to entity recognition, but I have been following this useful guide. I have a large text corpus of congressional policy debates from a Latin American country, translated into English.

My goal is to explore which Congress members mention a specific free trade agreement, "NAFTA", the most, along with their political affiliation. I will later analyze sentiment or views about this agreement by political affiliation, but for now I am working on the first task, and I am not sure whether entity recognition would be helpful. The main political parties are abbreviated "WP", "PRD", and "PAN".

Here is my current attempt:

# Install and load required packages
# install.packages("pdftools")
# install.packages("spacyr")
library(pdftools)
library(spacyr)
library(quanteda)
library(dplyr)
# spacy_initialize(model = "en_core_web_sm")
## successfully initialized (spaCy Version: 3.7.2, language model: en_core_web_sm)
pdf_files <- c("df1.pdf", "df2.pdf", "df3.pdf")

# Function to extract text from PDFs (pdf_text() returns one string per page)
extract_text_from_pdf <- function(pdf_files) {
  texts <- lapply(pdf_files, pdf_text)
  unlist(texts)
}

# Extract text from PDFs
pdf_texts <- extract_text_from_pdf(pdf_files)

# Process text using SpaCy
parsed_texts <- spacy_parse(pdf_texts)
parsed_texts

Here is a data example:

dput(parsed_texts[1:25, 1:7])

output:

structure(list(doc_id = c("text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1"
), sentence_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), token_id = 1:25, 
    token = c("12/19/23", ",", "4:41", "PM", "                                                ", 
    "about", ":", "blank", "\n\n\n\n ", "Parliament", "No", ":", 
    "               ", "14", "\n\n ", "Session", "No", ":", "                  ", 
    "1", "\n\n ", "Volume", "No", ":", "                   "), 
    lemma = c("12/19/23", ",", "4:41", "pm", "                                                ", 
    "about", ":", "blank", "\n\n\n\n ", "Parliament", "no", ":", 
    "               ", "14", "\n\n ", "Session", "no", ":", "                  ", 
    "1", "\n\n ", "volume", "no", ":", "                   "), 
    pos = c("NUM", "PUNCT", "NUM", "NOUN", "SPACE", "ADP", "PUNCT", 
    "ADJ", "SPACE", "PROPN", "NOUN", "PUNCT", "SPACE", "NUM", 
    "SPACE", "PROPN", "NOUN", "PUNCT", "SPACE", "NUM", "SPACE", 
    "NOUN", "NOUN", "PUNCT", "SPACE"), entity = c("CARDINAL_B", 
    "", "TIME_B", "TIME_I", "", "", "", "ORG_B", "ORG_I", "ORG_I", 
    "", "", "", "", "", "", "", "", "", "CARDINAL_B", "", "", 
    "", "", "")), row.names = c(NA, 25L), class = c("spacyr_parsed", 
"data.frame"))

Ideally, I would like an outcome like the one below, whereas currently both the political party and the trade agreement appear under the "entity" column in my data frame.

token      entity   political affiliation   congress_member_share_of_NAFTA_mentions
Rafael     NAFTA    WP                      3%
Martinez   NAFTA    WP                      7%
Martinez   NAFTA    WP                      7%
Alberto    NAFTA    PAN                     36%
Alberto    NAFTA    PAN                     36%
Rafael     NAFTA    PAN                     24%
Rafael     NAFTA    PAN                     24%
Alberto    NAFTA    PAN                     36%

Upvotes: 3

Views: 202

Answers (1)

jeffsdata

Reputation: 444

I don't think entity recognition is the right tool for this problem. First, do some analysis of the data to find out how NAFTA actually appears when people talk about it.

  • If it is only ever referenced by the literal text "NAFTA" (which is likely), then simply count the occurrences. You've got it easy!
  • If there are a handful of ways it's referenced (say it gets shortened to "NAF", or some people write "North American Free Trade Agreement" and others "North America Free Trade Agreement"), I'd just document all of them (there are probably only 5-6 variants) and count the occurrences of each.
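The counting approach above can be sketched with dplyr and stringr. Note the `speeches` data frame here is a hypothetical stand-in: in practice you would first have to split the parsed PDFs into speeches and attach speaker/party metadata, which the question's data does not yet have.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for one-row-per-speech text with
# speaker and party metadata (not present in the question's parsed output)
speeches <- data.frame(
  speaker = c("Rafael", "Martinez", "Alberto"),
  party   = c("WP", "WP", "PAN"),
  text    = c(
    "NAFTA will harm workers. NAFTA must be renegotiated.",
    "The North American Free Trade Agreement opens markets.",
    "I support NAFTA."
  ),
  stringsAsFactors = FALSE
)

# The handful of documented spellings, matched case-insensitively
nafta_regex <- regex(
  "NAFTA|North Americ(an|a) Free Trade Agreement",
  ignore_case = TRUE
)

# Count mentions per speaker, then each speaker's share of all mentions
mention_shares <- speeches %>%
  mutate(n_mentions = str_count(text, nafta_regex)) %>%
  group_by(speaker, party) %>%
  summarise(n_mentions = sum(n_mentions), .groups = "drop") %>%
  mutate(share = n_mentions / sum(n_mentions))

mention_shares
# Rafael has 2 mentions (50%); Martinez and Alberto have 1 each (25%)
```

The `share` column here is exactly the congress_member_share_of_NAFTA_mentions figure the question asks for, just computed from plain string matching rather than NER.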

Entity recognition is for when there's a whole class of words that needs to be tagged automatically. For example, after analyzing instances of NAFTA, you could go a step further and see which [Trade Goods] are being discussed in conjunction with the word "NAFTA" - like "Oil", "Wood", "Paper", etc. For that, you might create a custom entity to tag trade goods, do some manual training to get it going, and then manually stem/combine terms - "oil", "gas", and "petroleum" might all be the same thing. You could potentially use an LLM for that as well, to tag sentences for you.
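Before investing in a custom entity, a quick way to see which terms co-occur with "NAFTA" is quanteda's keyword-in-context view. A minimal sketch with made-up stand-in text (the real input would be the `pdf_texts` vector from the question):

```r
library(quanteda)

# Made-up stand-in for the question's pdf_texts character vector
pdf_texts <- c(
  "NAFTA lowered tariffs on oil and paper imports.",
  "Wood exports rose sharply after NAFTA took effect."
)

toks <- tokens(pdf_texts)

# Every occurrence of "NAFTA" with 5 tokens of context on either side
nafta_kwic <- kwic(toks, pattern = "NAFTA", window = 5)
nafta_kwic
```

Skimming the `pre` and `post` context columns of the result is usually enough to hand-build a first list of trade-goods terms worth tagging.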

Upvotes: 0
