antecessor

Reputation: 2800

For loop when extracting keywords with udpipe in R

Let's start with a reproducible example, which is a data frame called key composed of 8 columns and 3 rows:

key <- structure(c("Make Professional Maps with QGIS and Inkscape", 
"Gain the skills to produce original, professional, and aesthetically pleasing maps using free software", 
"English", "Inkscape 101 for Beginners - Design Vector Graphics", 
"Learn how to create and design vector graphics for free!", "English", 
"Design & Create Vector Graphics With Inkscape 2016", "The Beginners Guide to designing and creating Vector Graphics with Inkscape. No Experience needed!", 
"English", "Design a Logo for Free in Inkscape", "Learn from an award winning, published logo design professional!", 
"English", "Inkscape - Beginner to Pro", "If you want to have a decent learning curve, you are new to the program or even in design, this course is for you.", 
"English", "Creating 2D Textures in Inkscape", "A guide to creating colorful and interesting textures in inkscape.", 
"English", "Vector Art in Inkscape - Icon Design | Make Vector Graphics", 
"Learn Icon Design by creating Vector Graphics using the .SVG and PNG format with the Free Software Inkscape!", 
"English", "Inkscape and Bootstrap 3 -> Responsive Web Design!", 
"Design responsive websites using Free tools Inkscape and Bootstrap 3! Mood Boards and Style Tiles to Mobile First!", 
"English"), .Dim = c(3L, 8L), .Dimnames = list(c("Title", "Short_Description", 
"Language"), c("1", "2", "4", "5", "6", "9", "13", "15")))

I would like to extract keywords from every column independently. For this purpose, I use the udpipe package in R.

As I want to run the functions on every column, I use a for loop.

Before starting, we create the model with English as reference (see this link for more info):

library(udpipe)
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

Ideally, my final output would be a data frame with 8 columns and as many rows as keywords were extracted.

I tried two methods:

Method 1: using dplyr

library(dplyr)
keywords <- list()
for(i in ncol(keywords_en_t)){
  keywords[[i]] <- keywords_en_t %>%
    udpipe_annotate(ud_model,s)
    as.data.frame()
}

Method 2:

key <- list()
stats <- list()
for(i in ncol(keywords_en_t)){
    key[[i]] <- as.data.frame(udpipe_annotate(ud_model, x = keywords_en_t[,i]))
    stats[[i]] <- subset(key[[i]], upos %in% "NOUN")
    stats <- txt_freq(x = stats$lemma)
}

Output

In both cases, either I get errors or the output is not what I expected.

As stated above, the output I expect is a data frame with 8 columns, with the keywords in the rows.

Any ideas?

Upvotes: 0

Views: 255

Answers (1)

phiver

Reputation: 23608

Unfortunately your code contains a lot of mistakes. Your loops don't go from 1 to the number of columns: for(i in ncol(...)) runs only once, with i equal to 8. Either use 1:ncol or seq_along. Your key data is a matrix, not a data.frame. You need to supply udpipe_annotate with a character vector. If you just supply key[, 8], you are also supplying the dimnames to udpipe_annotate, which might generate keywords you don't need. In method 1 you use udpipe_annotate(ud_model,s), but there is no s defined. In method 2 you assign stats[[i]] and immediately afterwards overwrite it by assigning to stats.
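The loop and matrix points can be checked without udpipe. The following is a minimal sketch on a small hypothetical matrix m (a stand-in, not the data from the question). It uses seq_len(ncol(m)), which is equivalent to 1:ncol(m) here, and unname() as one possible way to drop the names before annotating.

```r
# Hypothetical 3 x 4 character matrix, standing in for `key`
m <- matrix(letters[1:12], nrow = 3,
            dimnames = list(c("Title", "Short_Description", "Language"),
                            c("1", "2", "4", "5")))

# ncol(m) is a single number, so this loop body runs once, with i = ncol(m)
visited_wrong <- integer(0)
for (i in ncol(m)) visited_wrong <- c(visited_wrong, i)

# seq_len(ncol(m)) enumerates every column index
visited_right <- integer(0)
for (i in seq_len(ncol(m))) visited_right <- c(visited_right, i)

# A column taken from a matrix keeps the row names as element names
col_named <- m[, 4]
# unname() drops them, leaving a plain character vector
col_plain <- unname(m[, 4])

print(visited_wrong)
print(visited_right)
print(names(col_named))
print(is.matrix(m))
```

Printing visited_wrong next to visited_right makes the difference between the two loop headers visible before any model is loaded.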

To correct these things, I first transformed the data into a data.frame. Next I ran a loop to create a list of vectors containing the keywords. After this I created a data.frame of the keywords. This last part of the code takes into account the different lengths of the vectors.

You might want to check how you get your data, because it is more logical/tidy to have 3 columns ("Title", "Short_Description", "Language") and lots of rows.
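If the data cannot be changed at the source, one way to get that layout is to transpose the matrix. This is only a sketch on a hypothetical two-course matrix m with the same orientation as key; the values are made up for illustration.

```r
# Hypothetical small matrix with the same orientation as `key`:
# fields in rows, one course per column
m <- matrix(c("Course A", "About A", "English",
              "Course B", "About B", "English"),
            nrow = 3,
            dimnames = list(c("Title", "Short_Description", "Language"),
                            c("1", "2")))

# t() swaps rows and columns; as.data.frame() then gives one row per course
tidy <- as.data.frame(t(m), stringsAsFactors = FALSE)

print(names(tidy))
print(nrow(tidy))
print(tidy$Title)
```

With one row per course, a whole column such as tidy$Short_Description is already a character vector and could be passed to udpipe_annotate in a single call, with rownames(tidy) as the doc_id argument, instead of looping over columns.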

Code

# Transform key into a data.frame. Now it is a matrix. 
key <- as.data.frame(key, stringsAsFactors = FALSE)

library(udpipe)
# prevent downloading ud model if it already exists in the working directory
ud_model <- udpipe_download_model(language = "english", overwrite = FALSE)
ud_model <- udpipe_load_model(ud_model$file_model)

# prepare list with correct length
keywords <- vector(mode = "list", length = ncol(key))

for(i in 1:ncol(key)){
  temp <- as.data.frame(udpipe_annotate(ud_model, x = key[, i]))
  keywords[[i]] <- temp$lemma[temp$upos == "NOUN"]
}

# Transform list of vectors to data.frame.
# Index every vector up to the longest length, so shorter vectors are padded with NA.
keywords <- as.data.frame(sapply(keywords, '[', seq_len(max(lengths(keywords)))), stringsAsFactors = FALSE)

keywords

        V1        V2         V3     V4       V5       V6     V7      V8
1    skill beginners  beginners   logo learning       2d Design     web
2      map    design      guide  award    curve  Texture format  design
3 software    Vector experience   logo  program    guide   <NA>  design
4     <NA>  graphics       <NA> design   design  texture   <NA> website
5     <NA>    vector       <NA>   <NA>   course inkscape   <NA>    tool
6     <NA>   graphic       <NA>   <NA>     <NA>     <NA>   <NA>    <NA>
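
The NA padding in the last step comes from indexing each vector past its end. A minimal sketch of that step in isolation, using a hypothetical list kw of made-up keyword vectors and no udpipe model:

```r
# Hypothetical keyword vectors of unequal length, including an empty one
kw <- list(c("skill", "map", "software"),
           c("logo", "award"),
           character(0))

# The longest vector decides the number of rows
n <- max(lengths(kw))

# Indexing past the end of a vector returns NA, which pads the short ones;
# sapply then binds the equal-length results into a character matrix
padded <- sapply(kw, `[`, seq_len(n))

out <- as.data.frame(padded, stringsAsFactors = FALSE)
print(out)
```

Each list element becomes one column, so a column for which no nouns were found ends up entirely NA rather than being dropped.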

Upvotes: 1
