How to extract entity names with spacyr using custom data?

Question

Good afternoon,

I am trying to sort a large corpus of normative texts of different lengths and to tag their parts of speech (POS). Given the size of the corpus, I have been using the tm and udpipe libraries for that purpose.

The other task I need to perform is to identify the named entities. I tried the spacyr library, but it does not correctly identify the names of organizations, so I want to train a custom NER model on a few documents from the corpus that I have personally validated.

How can I use spacy_extract_entity() with custom data? Or could this be done with quanteda and spacyr?

Thanks in advance.

I have done the POS task as follows, using a couple of helper functions.

suppressMessages(suppressWarnings(library(pdftools)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(tm)))

# load the corpus

tm_corpus <- VCorpus(
  DirSource("working_path", pattern = ".pdf"),
  readerControl = list(reader = readPDF, language = "es-419")
)

# load udpipe

library(udpipe)
dl <- udpipe_download_model(language = "spanish", overwrite = FALSE)
str(dl)
udmodel_spanish <- udpipe_load_model(file = dl$file_model)

# functions to annotate the corpus

f_udpipe_anot <- function(n){
  # extract the text of the n-th document as a character vector
  txt <- as.character(tm_corpus[[n]]) %>%
    unlist()
  y <- udpipe_annotate(udmodel_spanish, x = txt, trace = TRUE)
  # return the annotation as a data frame
  as.data.frame(y)
}

# annotate documents desde..hasta (from..to) and stack the results
pinkillazo <- function(desde, hasta){
  resultado <- data.frame()
  for (item in desde:hasta){
    print(item)  # progress indicator
    resultado <- rbind(resultado, f_udpipe_anot(item))
  }
  return(resultado)
}

leyes_udpipe_POS <- pinkillazo(1,13) # here I got the annotated corpus as a dataframe
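
The loop above grows the data frame with rbind() on every iteration. A minimal base-R sketch of the same collect-and-stack pattern, using lapply() and a single do.call(rbind, ...), might look like the following. The names annotate_all, toy_docs, and toy_annotate are hypothetical; toy_annotate is only a stand-in for f_udpipe_anot() so that the snippet runs without udpipe or the PDF corpus.

```r
# Hypothetical toy documents and a stand-in for f_udpipe_anot():
# returns one row per space-separated token (not a real annotation).
toy_docs <- c("La ley entra en vigor", "El ministerio publica el decreto")

toy_annotate <- function(n) {
  tokens <- strsplit(toy_docs[n], " ", fixed = TRUE)[[1]]
  data.frame(
    doc_id = paste0("doc", n),
    token_id = seq_along(tokens),
    token = tokens,
    stringsAsFactors = FALSE
  )
}

# Annotate documents `from`..`to` and stack the results once at the end,
# instead of calling rbind() inside the loop.
annotate_all <- function(from, to, annotate_fn) {
  pieces <- lapply(seq(from, to), annotate_fn)
  do.call(rbind, pieces)
}

result <- annotate_all(1, 2, toy_annotate)
print(result)
```

With the real corpus, the equivalent call would presumably be annotate_all(1, 13, f_udpipe_anot).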

To identify the named entities, I have tried this:

library(quanteda)
library(spacyr)

spacy_initialize(model = "es_core_news_sm")
quan_corpus <- corpus(tm_corpus)
POS_df_spacyr <- spacy_parse(quan_corpus, lemma = FALSE, entity = TRUE, tag = FALSE, pos = TRUE)

organiz <- spacy_extract_entity(
  quan_corpus,
  output = "data.frame",  # a single value: "data.frame" or "list"
  type = "all",           # a single value: "all", "named", or "extended"
  multithread = TRUE
)

I am getting the wrong organization names as well as other misclassifications. I thought that multithreading might make this task easier, but that is not the case.
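
To inspect which organization names are being returned, the entity table can be filtered by entity type and tabulated. The sketch below is only an illustration under stated assumptions: it assumes the data frame has the columns doc_id, text, and ent_type, and that organizations carry the label "ORG". The toy_entities rows are invented for the example and stand in for organiz.

```r
# Invented rows standing in for `organiz`; the column names are an
# assumption about the shape of the spacy_extract_entity() output.
toy_entities <- data.frame(
  doc_id = c("ley1.pdf", "ley1.pdf", "ley2.pdf", "ley2.pdf"),
  text = c("Ministerio de Salud", "Quito", "Ministerio de Salud", "Asamblea Nacional"),
  ent_type = c("ORG", "LOC", "ORG", "ORG"),
  stringsAsFactors = FALSE
)

# Keep only the entities labelled as organizations.
orgs <- toy_entities[toy_entities$ent_type == "ORG", ]

# Count how often each organization name occurs, most frequent first.
org_counts <- sort(table(orgs$text), decreasing = TRUE)
print(org_counts)
```

A frequency table like this makes it easier to compare the extracted names against the manually validated documents and to see which labels are wrong.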
