Reputation: 11
Good afternoon,
I am trying to sort a large corpus of normative texts of different lengths, and to tag the parts of speech (POS). For that purpose, I was using the tm and udpipe libraries, and given the length of the database.
The other task I need to perform is to identify the entities. I tried the SpacyR library, but it does not correctly identify the name of the organizations, so I want to train a custom NER model based on a few documents from the corpus, which I have personally validated.
How could I "spacy_extract_entity()" with custom data? Or maybe with quanteda and spacyr?
Thanks in advance.
I have done the POS task in this way. I generated a couple of functions.
suppressMessages(suppressWarnings(library(pdftools)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(tm)))
# load the corpus
tm_corpus <- VCorpus(DirSource(
"working_path,
pattern = ".pdf"),readerControl = list(reader = readPDF, language = 'es-419'))
# load udpipe
library(udpipe)
dl <- udpipe_download_model(language = "spanish", overwrite = FALSE)
str(dl)
udmodel_spanish <- udpipe_load_model(file = dl$file_model)
# functions to annotate the corpus
f_udpipe_anot <- function(n){
txt <- as.character(tm_corpus[[n]]) %>% #lista simia
unlist()
y <- udpipe_annotate(udmodel_spanish, x = txt, trace = TRUE)
y <- as.data.frame(y)
}
pinkillazo <- function(desde, hasta){
resultado <- data.frame()
for (item in desde:hasta){
print(item)
resultado <- rbind(resultado, f_udpipe_anot(item))
}
return(resultado)
}
leyes_udpipe_POS <- pinkillazo(1,13) # here I got the annotated corpus as a dataframe
To identify the named entities, I have tried this:
spacyr::spacy_initialize(model = "es_core_news_sm")
quan_corpus <- corpus(tm_corpus)
POS_df_spacyr <- spacy_parse(quan_corpus, lemma = FALSE, entity = TRUE, tag = FALSE, pos = TRUE)
organiz <- spacy_extract_entity(
quan_corpus,
output = c("data.frame", "list"),
type = c("all", "named", "extended"),
multithread = TRUE,
)
I am getting the wrong organizations' names as well as other misspecifications. With multithread, I tought that this task could easen, but it's not the case.
Upvotes: 0
Views: 191
Reputation:
If you want to train your own named entity recognition model in R, you could use R packages crfsuite and R package nametagger which are respectively Conditional Random Fields and Maximum Entropy Models which can be used alongside the udpipe annotation.
If you want deep learning models, you might have to look into torch alongside tokenisers like sentencepiece and embedding techniques like word2vec to implement your own modelling flow (e.g. BiLSTM).
Upvotes: 0