Reputation: 1145
I need to create a sparse document-term matrix (DTM) for text classification. To prepare the text, I first need to remove (separate out) the POS tags from the text. My guess was to do it as shown below. I'm new to R and don't know how to negate a regex (see NOT! below).
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")
My guess at how it could work (this extracts the tags rather than the words):
(POSs <- regmatches(text, gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text)))
[[1]]
[1] "/KOUS" "/VVFIN" "./$."
[[2]]
[1] "/VVFIN" "/PTKVZ" ";/$."
[[3]]
[1] "-/TRUNC" "/APPR" "/KON"
[[4]]
[1] "/PIS" "/ADJD" "./$."
[[5]]
[1] "/NN" "!!!/NE"
But I don't know how to negate the expression, something like this (NOT! is a placeholder, not valid R), with the output I would like to get:
# VVV
(texts <- regmatches(text, NOT!(gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text))))
[[1]]
[1] "wenn" "ausläuft"
[[2]]
[1] "Kommt" "vor"
[[3]]
[1] "Durch" "und"
[[4]]
[1] "man" "zügig"
[[5]]
[1] "empfehlung"
Upvotes: 3
Views: 99
Reputation: 1145
One possibility is to eliminate the tags by searching for the POS tags and replacing each one with ''
(i.e. an empty string), then splitting the remaining text on spaces:
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")
(textlist <- strsplit(gsub('[[:punct:]]*/[[:alpha:][:punct:]]*', '', text), " "))
[[1]]
[1] "wenn" "ausläuft"
[[2]]
[1] "Kommt" "vor"
[[3]]
[1] "-RRB" "Durch" "und"
[[4]]
[1] "man" "zügig"
[[5]]
[1] "empfehlung"
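Note that [[3]] above still contains "-RRB", which the output wanted in the question does not. Two alternatives, offered as a minimal sketch rather than a definitive solution: first, regmatches() has an invert = TRUE argument that returns the pieces between the matches, which is the closest thing to "negating" the gregexpr() result; second, a token-based variant that splits on spaces, cuts each token at its last "/", and drops tokens whose word part starts with punctuation. The helper name strip_tags and the "starts with punctuation" rule are my assumptions, not part of the original post; that rule would also drop legitimate words that begin with a hyphen or quote.

```r
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.",
          "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.",
          "empfehlung/NN !!!/NE")

## Variant 1: "negate" the match with invert = TRUE.
pattern <- '[[:punct:]]*/[[:alpha:][:punct:]]*'
m <- gregexpr(pattern, text)
pieces <- regmatches(text, m, invert = TRUE)   # text between the matches
# The pieces keep surrounding spaces and may be empty, so tidy them up.
words <- lapply(pieces, function(p) {
  p <- trimws(p)
  p[nzchar(p)]
})

## Variant 2 (assumption: tokens are separated by single spaces and the
## tag follows the last "/" of each token).
strip_tags <- function(tok) {                  # hypothetical helper name
  word <- sub("/[^/]*$", "", tok)              # drop the last "/" and the tag
  word[!grepl("^[[:punct:]]", word)]           # drop punctuation-led tokens
}
texts <- lapply(strsplit(text, " ", fixed = TRUE), strip_tags)
```

Variant 1 gives the same result as the gsub() approach (including "-RRB" in the third element), while variant 2 also drops "-RRB-" because it judges the whole token instead of only the part the pattern matched.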
With the friendly help of rawr
Upvotes: 1