Reputation: 89
I am using the quanteda package by Ken Benoit and Paul Nulty to work with textual data.
My corpus contains texts with full German sentences, and I want to work only with the nouns of each text. One trick in German is to keep only the capitalized words, but this fails at the beginning of a sentence, where every word is capitalized regardless of its part of speech.
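To illustrate the problem with the capitalization trick, here is a minimal base-R sketch on the second example sentence. The variable names (`sentence`, `words`, `capitalized`) are only illustrative, and the whitespace split is a crude stand-in for a real tokenizer.

```r
# Keep every word that starts with an upper-case letter
sentence <- "In Hamburg regnet es immer, das ist also so wie in London."

# Strip punctuation, then split on whitespace (a rough tokenizer, for illustration only)
words <- strsplit(gsub("[[:punct:]]", "", sentence), "\\s+")[[1]]

capitalized <- words[grepl("^[[:upper:]]", words)]
print(capitalized)
# [1] "In"      "Hamburg" "London"
```

The preposition "In" is kept only because it opens the sentence, which is exactly the false positive described above.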
library("quanteda")
Text1 <- "Halle an der Saale ist die grünste Stadt Deutschlands"
Text2 <- "In Hamburg regnet es immer, das ist also so wie in London."
Text3 <- "James Bond trinkt am liebsten Martini"
myCorpus <- corpus(c(Text1, Text2, Text3))
metadoc(myCorpus, "language") <- "german"
summary(myCorpus, showmeta = TRUE)
myDfm <- dfm(myCorpus, tolower = FALSE, remove_numbers = TRUE,
             remove = stopwords("german"), remove_punct = TRUE,
             remove_separators = TRUE)
topfeatures(myDfm, 20)
From this minimal example, I would like to retrieve: "Halle", "Saale", "Stadt", "Deutschland", "Hamburg", "London", "Martini", "James", "Bond".
I assume I need a dictionary that defines the parts of speech (verbs, nouns, etc.) and the proper names (James Bond, Hamburg, etc.). Or is there a built-in function or dictionary for this?
Bonus Question: Does the solution work for English texts too?
Upvotes: 3
Views: 1782
Reputation: 14902
You need some help from a part-of-speech tagger. Fortunately there is a great one, with a German language model, in the form of spaCy, and a package we wrote as a wrapper around it, spacyr. Installation instructions are at the spacyr page.
This code will do what you want:
txt <- c("Halle an der Saale ist die grünste Stadt Deutschlands",
         "In Hamburg regnet es immer, das ist also so wie in London.",
         "James Bond trinkt am liebsten Martini")
library("spacyr")
spacy_initialize(model = "de")
txtparsed <- spacy_parse(txt, tag = TRUE, pos = TRUE)
head(txtparsed, 20)
#    doc_id sentence_id token_id        token        lemma   pos   tag entity
#  1  text1           1        1        Halle        halle PROPN    NE  LOC_B
#  2  text1           1        1           an           an   ADP  APPR  LOC_I
#  3  text1           1        1          der          der   DET   ART  LOC_I
#  4  text1           1        1        Saale        saale PROPN    NE  LOC_I
#  5  text1           1        1          ist          ist   AUX VAFIN
#  6  text1           1        1          die          die   DET   ART
#  7  text1           1        1      grünste      grünste   ADJ  ADJA
#  8  text1           1        1        Stadt        stadt  NOUN    NN
#  9  text1           1        1 Deutschlands deutschlands PROPN    NE  LOC_B
# 10  text2           1        1           In           in   ADP  APPR
# 11  text2           1        1      Hamburg      hamburg PROPN    NE  LOC_B
# 12  text2           1        1       regnet       regnet  VERB VVFIN
# 13  text2           1        1           es           es  PRON  PPER
# 14  text2           1        1        immer        immer   ADV   ADV
# 15  text2           1        1            ,            , PUNCT    $,
# 16  text2           1        1          das          das  PRON   PDS
# 17  text2           1        1          ist          ist   AUX VAFIN
# 18  text2           1        1         also         also   ADV   ADV
# 19  text2           1        1           so           so   ADV   ADV
# 20  text2           1        1          wie          wie  CONJ KOKOM
(nouns <- with(txtparsed, subset(token, pos == "NOUN")))
# [1] "Stadt"
(propernouns <- with(txtparsed, subset(token, pos == "PROPN")))
# [1] "Halle" "Saale" "Deutschlands" "Hamburg" "London"
# [6] "James" "Bond" "Martini"
Here, you can see that most of the nouns you wanted are marked in the simpler pos field as proper nouns ("PROPN"), while "Stadt" is marked as a common noun ("NOUN"). The tag field is a more detailed, German-language tagset that you could also select from.
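As a rough sketch of selecting on either field, the snippet below picks out common and proper nouns together. To keep it runnable without spaCy installed, `parsed` is a hand-copied excerpt of the text1 rows from the output above, standing in for `txtparsed`; the names `parsed`, `allnouns_pos` and `allnouns_tag` are illustrative only.

```r
# Hand-copied excerpt (text1 only) of the spacy_parse() output shown above
parsed <- data.frame(
  token = c("Halle", "an", "der", "Saale", "ist", "die",
            "grünste", "Stadt", "Deutschlands"),
  pos   = c("PROPN", "ADP", "DET", "PROPN", "AUX", "DET",
            "ADJ", "NOUN", "PROPN"),
  tag   = c("NE", "APPR", "ART", "NE", "VAFIN", "ART",
            "ADJA", "NN", "NE"),
  stringsAsFactors = FALSE
)

# Common and proper nouns together, via the coarse pos field
allnouns_pos <- with(parsed, subset(token, pos %in% c("NOUN", "PROPN")))

# The same selection via the detailed tag field
# (in this tagset NN marks common nouns and NE marks proper nouns)
allnouns_tag <- with(parsed, subset(token, tag %in% c("NN", "NE")))

print(allnouns_pos)
# [1] "Halle"        "Saale"        "Stadt"        "Deutschlands"
```

On the real data, the same `%in%` filter can be applied to `txtparsed` directly.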
The lists of selected nouns can then be used in quanteda:
library("quanteda")
myDfm <- dfm(txt, tolower = FALSE, remove_numbers = TRUE,
             remove = stopwords("german"), remove_punct = TRUE)
head(myDfm)
# Document-feature matrix of: 3 documents, 14 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
#        features
# docs    Halle Saale grünste Stadt Deutschlands Hamburg
#   text1     1     1       1     1            1       0
#   text2     0     0       0     0            0       1
#   text3     0     0       0     0            0       0
head(dfm_select(myDfm, pattern = propernouns))
# Document-feature matrix of: 3 documents, 8 features (66.7% sparse).
# (showing first 3 documents and first 6 features)
#        features
# docs    Halle Saale Deutschlands Hamburg London James
#   text1     1     1            1       0      0     0
#   text2     0     0            0       1      1     0
#   text3     0     0            0       0      0     1
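The selection above keeps only the proper nouns, so "Stadt" is dropped. Below is a sketch of keeping common and proper nouns together. It assumes a recent quanteda version, where `dfm()` is meant to be called on a `tokens` object, and it copies the two noun vectors from the tagger output above instead of recomputing them with spacyr. The names `toks`, `allDfm` and `nounDfm` are illustrative.

```r
library("quanteda")

txt <- c("Halle an der Saale ist die grünste Stadt Deutschlands",
         "In Hamburg regnet es immer, das ist also so wie in London.",
         "James Bond trinkt am liebsten Martini")

# Copied from the tagger output above (assumption: same results as spacy_parse)
nouns <- "Stadt"
propernouns <- c("Halle", "Saale", "Deutschlands", "Hamburg", "London",
                 "James", "Bond", "Martini")

# Tokenize first, then build the dfm without lower-casing
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
allDfm <- dfm(toks, tolower = FALSE)

# Keep only features that exactly match one of the selected nouns
nounDfm <- dfm_select(allDfm, pattern = c(nouns, propernouns),
                      valuetype = "fixed", case_insensitive = FALSE)
featnames(nounDfm)
```

Note that this filters by word form, not by tagged token: a word tagged as a noun in one sentence is kept wherever it occurs with the same spelling and case.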
Upvotes: 7