mgd_aus

Reputation: 13

Is there a way to keep between-word hyphens when lemmatizing using spacyr?

I'm using spacyr to lemmatise a corpus of speeches, then using quanteda to tokenise and analyze the results (via textstat_frequency()). My issue is that some key terms in the texts are hyphenated. When I tokenise with quanteda, these between-word hyphens are preserved and the hyphenated terms are treated as single tokens, which is my desired result. However, when I lemmatise first with spacyr, hyphenated words are not kept together. I've tried nounphrase_consolidate(), which does keep hyphenated words, but I find the results very inconsistent: sometimes a term of interest is kept on its own during consolidation, and in other instances it is combined into a larger noun phrase. This is suboptimal because my final step applies a dictionary of features with textstat_frequency(), some of which are hyphenated terms.

It seems there is a solution for this in spaCy, but I was curious whether there's a similar option in spacyr: SpaCy -- intra-word hyphens. How to treat them one word?

Thanks for any thoughts or suggestions. Code below. It doesn't make a difference whether I use remove_punct or not when tokenising.

library("spacyr")
library("quanteda")

test.sp <- spacy_parse(test.corpus, lemma = TRUE, entity = FALSE, pos = FALSE, tag = FALSE, nounphrase = TRUE)
test.sp$token <- test.sp$lemma
test.np <- nounphrase_consolidate(test.sp)
test.tokens.3 <- as.tokens(test.np)
test.tokens.3 <- tokens(test.tokens.3, remove_symbols = TRUE,
                        remove_numbers = TRUE,
                        remove_punct = TRUE,
                        remove_url = TRUE) %>%
  tokens_tolower() %>%
  tokens_select(pattern = stopwords("en"), selection = "remove")

Upvotes: 1

Views: 167

Answers (1)

Ken Benoit

Reputation: 14902

You should be able to rejoin the hyphenated words in quanteda, using tokens_compound().

library("quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
library("spacyr")

test.corpus <- c(d1 = "NLP is fast-moving.",
                 d2 = "A co-ordinated effort.")
test.sp <- spacy_parse(test.corpus, lemma = TRUE, entity = FALSE, pos = FALSE, tag = FALSE, nounphrase = TRUE)
#> Found 'spacy_condaenv'. spacyr will use this environment
#> successfully initialized (spaCy Version: 3.4.4, language model: en_core_web_sm)
#> (python options: type = "condaenv", value = "spacy_condaenv")
test.sp$token <- test.sp$lemma
test.np <- nounphrase_consolidate(test.sp)
test.tokens.3 <- as.tokens(test.np)

tokens_compound(test.tokens.3, pattern = phrase("* - *"), concatenator = "")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "NLP"         "be"          "fast-moving" "."          
#> 
#> d2 :
#> [1] "a_co-ordinated_effort" "."

Created on 2023-06-09 with reprex v2.0.2
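Since the end goal is counting hyphenated dictionary terms with textstat_frequency(), here is a minimal quanteda-only sketch of that final step. The token lists are assumed example data standing in for the lemmatised, compounded output above, and textstat_frequency() comes from the quanteda.textstats package:

```r
library("quanteda")
library("quanteda.textstats")

# Assumed example: tokens as they might look after lemmatising and compounding
toks <- as.tokens(list(d1 = c("NLP", "be", "fast-moving", "."),
                       d2 = c("a", "co-ordinated", "effort", ".")))

# Hyphenated dictionary entries match whole tokens, since the
# hyphenated terms are single tokens at this point
dict <- dictionary(list(hyph = c("fast-moving", "co-ordinated")))

# Keep only the dictionary features, then count them
freqs <- textstat_frequency(dfm(tokens_select(toks, pattern = dict)))
freqs
```

Because the hyphenated terms survive as single tokens, each dictionary entry appears as its own feature in the frequency table.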

Upvotes: 1
