Reputation: 508
I am looking for a way to create POS tags for single words/tokens from a list I have in R. I know that the accuracy will decrease if I do it for single tokens instead of sentences but the data I have are "delete edits" from Wikipedia and people mostly delete single, unconnected words instead of whole sentences. I have seen this question a few times for Python but I haven't found a solution for it in R yet.
My data will look somewhat like this:
Tokens <- list(c("1976","green","Normandy","coast","[", "[", "template", "]","]","Fish","visting","England","?"))
And ideally, I would like to have something like this returned:
1976 CD
green JJ
Normandy NN
coast NN
[ x
[ x
template NN
] x
] x
Fish NN
visiting VBG
England NN
? x
I found some websites that do this online, but I doubt they are running anything in R. They also specifically state NOT to use them on single words/tokens.
My question, thus: Is it possible to do this in R with reasonable accuracy? What would the code look like if it does not incorporate sentence structure? Would it be easier to just compare the lists against a huge tagged dictionary?
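For the tagged-dictionary idea, what I have in mind is basically a lookup table. A rough sketch of that (tagged_lexicon is just a made-up named vector here, not a real resource) would be:
tagged_lexicon <- c(green = "JJ", coast = "NN", England = "NNP", Fish = "NN")
tags <- tagged_lexicon[Tokens[[1]]]                  # unknown tokens come back as NA
data.frame(token = Tokens[[1]], tag = unname(tags))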
Upvotes: 0
Views: 995
Reputation: 2448
In general, there is no decent POS tagger in native R, and all the available solutions rely on outside libraries. As one such solution, you can try our package spacyr, which uses spaCy as the backend. It's not on CRAN yet, but it will be soon.
https://github.com/kbenoit/spacyr
The sample code is like this:
library(spacyr)
spacy_initialize()
Tokens <- c("1976","green","Normandy","coast","[", "[", "template", "]","]",
"Fish","visting","England","?")
spacy_parse(Tokens, tag = TRUE)
and the output is like this:
doc_id sentence_id token_id token lemma pos tag entity
1 text1 1 1 1976 1976 NUM CD DATE_B
2 text2 1 1 green green ADJ JJ
3 text3 1 1 Normandy normandy PROPN NNP ORG_B
4 text4 1 1 coast coast NOUN NN
5 text5 1 1 [ [ PUNCT -LRB-
6 text6 1 1 [ [ PUNCT -LRB-
7 text7 1 1 template template NOUN NN
8 text8 1 1 ] ] PUNCT -RRB-
9 text9 1 1 ] ] PUNCT -RRB-
10 text10 1 1 Fish fish NOUN NN
11 text11 1 1 visting vist VERB VBG
12 text12 1 1 England england PROPN NNP GPE_B
13 text13 1 1 ? ? PUNCT .
Although the package can do more, you can find what you need in the tag field.
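For instance, assuming you keep the parse result returned above in a data frame, you can subset it down to just the token and tag columns:
parsed <- spacy_parse(Tokens, tag = TRUE)
parsed[, c("token", "tag")]   # just the tokens and their Penn Treebank tags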
NOTE: (2017-05-20)
The spacyr package is now on CRAN, but that version has some issues with non-ASCII characters. We recognized the issue after the CRAN submission and resolved it in the version on GitHub. If you are planning to use it for German texts, please install the latest master from GitHub.
devtools::install_github("kbenoit/spacyr", build_vignettes = FALSE)
This fix will be incorporated into the CRAN package in the next update.
NOTE2:
There are detailed instructions for installing spaCy and spacyr on Windows and Mac.
Windows: https://github.com/kbenoit/spacyr/blob/master/inst/doc/WINDOWS.md
Mac: https://github.com/kbenoit/spacyr/blob/master/inst/doc/MAC.md
Upvotes: 2
Reputation: 508
Here are the steps I took to make amatsuo_net's suggestion work for me:
Installing spaCy and the English language model for Anaconda:
Open the Anaconda prompt as admin
execute:
activate py36
conda config --add channels conda-forge
conda install spacy
python -m spacy link en_core_web_sm en
Using the wrapper in RStudio:
install.packages("fastmatch")
install.packages("RcppParallel")
library(fastmatch)
library(RcppParallel)
devtools::install_github("kbenoit/spacyr", build_vignettes = FALSE)
library(spacyr)
spacy_initialize(condaenv = "py36")
Tokens <- c("1976","green","Normandy","coast","[", "[", "template", "]","]","Fish","visting","England","?");Tokens
spacy_parse(Tokens, tag = TRUE)
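As a small extra step (not required): spacyr also provides spacy_finalize(), which shuts down the spaCy background Python process when you are done:
spacy_finalize()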
Upvotes: 1