Reputation: 295
I am relatively new to entity recognition, but I have been following this useful guide. I have a large text corpus of Congress policy debates from a Latin American country, translated into English.
My goal is to explore which Congress members mention a specific free trade agreement, "NAFTA", the most, along with their political affiliation. I will then analyze sentiment or views about this agreement by political affiliation, but right now I am working on the first task and I am not sure whether entity recognition would be helpful. The main political parties are abbreviated "WP", "PRD", and "PAN".
Here is my current attempt:
# Install and load the required packages
# install.packages("pdftools")
# install.packages("spacyr")
library(pdftools)
library(spacyr)
library(quanteda)
library(dplyr)
# spacy_initialize(model = "en_core_web_sm")
## successfully initialized (spaCy Version: 3.7.2, language model: en_core_web_sm)
pdf_files <- c("df1.pdf", "df2.pdf", "df3.pdf")
# Function to extract text from each PDF
extract_text_from_pdf <- function(pdf_files) {
  texts <- lapply(pdf_files, function(file) {
    pdf_text(file)  # character vector, one element per page
  })
  unlist(texts)
}
# Extract text from PDFs
pdf_texts <- extract_text_from_pdf(pdf_files)
# Parse the text with spaCy (tokens, POS tags, named entities)
parsed_texts <- spacy_parse(pdf_texts)
parsed_texts
Here is a data example:
dput(parsed_texts[1:25,(1:7)])
output:
structure(list(doc_id = c("text1", "text1", "text1", "text1",
"text1", "text1", "text1", "text1", "text1", "text1", "text1",
"text1", "text1", "text1", "text1", "text1", "text1", "text1",
"text1", "text1", "text1", "text1", "text1", "text1", "text1"
), sentence_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), token_id = 1:25,
token = c("12/19/23", ",", "4:41", "PM", " ",
"about", ":", "blank", "\n\n\n\n ", "Parliament", "No", ":",
" ", "14", "\n\n ", "Session", "No", ":", " ",
"1", "\n\n ", "Volume", "No", ":", " "),
lemma = c("12/19/23", ",", "4:41", "pm", " ",
"about", ":", "blank", "\n\n\n\n ", "Parliament", "no", ":",
" ", "14", "\n\n ", "Session", "no", ":", " ",
"1", "\n\n ", "volume", "no", ":", " "),
pos = c("NUM", "PUNCT", "NUM", "NOUN", "SPACE", "ADP", "PUNCT",
"ADJ", "SPACE", "PROPN", "NOUN", "PUNCT", "SPACE", "NUM",
"SPACE", "PROPN", "NOUN", "PUNCT", "SPACE", "NUM", "SPACE",
"NOUN", "NOUN", "PUNCT", "SPACE"), entity = c("CARDINAL_B",
"", "TIME_B", "TIME_I", "", "", "", "ORG_B", "ORG_I", "ORG_I",
"", "", "", "", "", "", "", "", "", "CARDINAL_B", "", "",
"", "", "")), row.names = c(NA, 25L), class = c("spacyr_parsed",
"data.frame"))
Ideally, I would like an outcome like the one below, whereas currently both the political party and the trade agreement sit together under the "entity" column of my data frame.
token     entity  political_affiliation  congress_member_share_of_NAFTA_mentions
Rafael    NAFTA   WP                     3%
Martinez  NAFTA   WP                     7%
Martinez  NAFTA   WP                     7%
Alberto   NAFTA   PAN                    36%
Alberto   NAFTA   PAN                    36%
Rafael    NAFTA   PAN                    24%
Rafael    NAFTA   PAN                    24%
Alberto   NAFTA   PAN                    36%
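For reference, once I have a data frame with one row per NAFTA mention (the mentions object and its columns below are hypothetical), I imagine the share column could be computed with dplyr roughly like this:
# Hypothetical input: one row per NAFTA mention, with speaker and party
mentions <- data.frame(
  congress_member       = c("Rafael", "Martinez", "Martinez", "Alberto"),
  political_affiliation = c("WP", "WP", "WP", "PAN")
)
# Count mentions per member/party, then divide by the grand total
mentions %>%
  count(congress_member, political_affiliation, name = "n_mentions") %>%
  mutate(share_of_NAFTA_mentions = n_mentions / sum(n_mentions))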
Upvotes: 3
Views: 202
Reputation: 444
I don't think entity recognition is the right tool for this problem. "NAFTA" is a single, fixed term, so you should start by analyzing the data directly to see how it appears when people are talking about it.
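For example, a keyword-in-context search with quanteda (which you already load) shows every occurrence of "NAFTA" with its surrounding words, no NER needed; a minimal sketch, reusing pdf_texts from your question:
# One row per occurrence of "NAFTA", with 10 tokens of context each side
toks <- tokens(corpus(pdf_texts))
kwic(toks, pattern = "NAFTA", window = 10)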
Entity recognition is useful when there is a whole class of words that needs to be tagged automatically. For example, after analyzing the instances of "NAFTA", you could go a step further and see which trade goods are discussed in conjunction with it, like "oil", "wood", or "paper". For that, you might create a custom entity type for trade goods, do some manual training to get it going, and then manually stem/combine terms, since "oil", "gas", and "petroleum" may all refer to the same thing. You could also potentially use an LLM to tag each sentence for you.
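Combining synonyms like that does not necessarily require training, either; a quanteda dictionary collapses them at lookup time (the keys and terms below are purely illustrative, and toks is the tokens object from above):
# Map synonymous trade-good terms onto a single key each
trade_goods <- dictionary(list(
  oil  = c("oil", "gas", "petroleum"),
  wood = c("wood", "timber", "lumber")
))
tokens_lookup(toks, dictionary = trade_goods)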
Upvotes: 0