Reputation: 508
I am looking for a way to create POS tags for single words/tokens from a list I have in R. I know that the accuracy will decrease if I do it for single tokens instead of sentences but the data I have are "delete edits" from Wikipedia and people mostly delete single, unconnected words instead of whole sentences. I have seen this question a few times for Python but I haven't found a solution for it in R yet.
My data will look somewhat like this:
Tokens <- list(c("1976","green","Normandy","coast","[", "[", "template", "]","]","Fish","visting","England","?"))
And ideally, I would like to have something like this returned:
1976 CD
green JJ
Normandy NN
coast NN
[ x
[ x
template NN
] x
] x
Fish NN
visiting VBG
England NN
? x
I found some websites that do this online, but I doubt they are running anything in R. They also specifically state NOT to use them on single words/tokens.
My question, thus: Is it possible to do this in R with reasonable accuracy? What would the code look like if it does not incorporate sentence structure? Would it be easier to just compare the lists against a huge tagged dictionary?
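For the tagged-dictionary idea, what I have in mind is basically a lookup table. A rough sketch of that (tagged_lexicon is just a made-up named vector here, not a real resource) would be:
tagged_lexicon <- c(green = "JJ", coast = "NN", England = "NNP", Fish = "NN")
tags <- tagged_lexicon[Tokens[[1]]]                  # unknown tokens come back as NA
data.frame(token = Tokens[[1]], tag = unname(tags))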
Upvotes: 0
Views: 995
Reputation: 2448
In general, there is no decent POS tagger in native R, and all the available solutions rely on outside libraries. As one such solution, you can try our package spacyr, which uses spaCy as the backend. It's not on CRAN yet, but it will be soon.
https://github.com/kbenoit/spacyr
The sample code is like this:
library(spacyr)
spacy_initialize()
Tokens <- c("1976","green","Normandy","coast","[", "[", "template", "]","]",
"Fish","visting","England","?")
spacy_parse(Tokens, tag = TRUE)
and the output is like this:
doc_id sentence_id token_id token lemma pos tag entity
1 text1 1 1 1976 1976 NUM CD DATE_B
2 text2 1 1 green green ADJ JJ
3 text3 1 1 Normandy normandy PROPN NNP ORG_B
4 text4 1 1 coast coast NOUN NN
5 text5 1 1 [ [ PUNCT -LRB-
6 text6 1 1 [ [ PUNCT -LRB-
7 text7 1 1 template template NOUN NN
8 text8 1 1 ] ] PUNCT -RRB-
9 text9 1 1 ] ] PUNCT -RRB-
10 text10 1 1 Fish fish NOUN NN
11 text11 1 1 visting vist VERB VBG
12 text12 1 1 England england PROPN NNP GPE_B
13 text13 1 1 ? ? PUNCT .
Although the package can do more, you can find what you need in the tag field.
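For instance, assuming you keep the parse result returned above in a data frame, you can subset it down to just the token and tag columns:
parsed <- spacy_parse(Tokens, tag = TRUE)
parsed[, c("token", "tag")]   # just the tokens and their Penn Treebank tags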
NOTE: (2017-05-20)
The spacyr package is now on CRAN, but that version has some issues with non-ASCII characters. We recognized the issue after the CRAN submission and resolved it in the version on GitHub. If you are planning to use it for German texts, please install the latest master from GitHub.
devtools::install_github("kbenoit/spacyr", build_vignettes = FALSE)
This fix will be incorporated into the CRAN package in the next update.
NOTE2:
There are detailed instructions for installing spaCy and spacyr on Windows and Mac.
Windows: https://github.com/kbenoit/spacyr/blob/master/inst/doc/WINDOWS.md
Mac: https://github.com/kbenoit/spacyr/blob/master/inst/doc/MAC.md
Upvotes: 2
Reputation: 508
Here are the steps I took to make amatsuo_net's suggestion work for me:
Installing spaCy and the English language model for Anaconda:
Open the Anaconda prompt as admin
execute:
activate py36
conda config --add channels conda-forge
conda install spacy
python -m spacy link en_core_web_sm en
Using the wrapper in RStudio:
install.packages("fastmatch")
install.packages("RcppParallel")
library(fastmatch)
library(RcppParallel)
devtools::install_github("kbenoit/spacyr", build_vignettes = FALSE)
library(spacyr)
spacy_initialize(condaenv = "py36")
Tokens <- c("1976","green","Normandy","coast","[", "[", "template", "]","]","Fish","visting","England","?");Tokens
spacy_parse(Tokens, tag = TRUE)
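As a small extra step (not required): spacyr also provides spacy_finalize(), which shuts down the spaCy background Python process when you are done:
spacy_finalize()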
Upvotes: 1