roody

Reputation: 2663

Repeating sequential functions when creating multiple lists/matrices/dataframes within `lapply`

This question is an elaboration on a previous question that I'd asked about repeating functions on sequentially-labeled dataframes.

In the past, I needed to make minor alterations to data.tables read in from a folder to R (e.g. changing dates, recoding).

Now, however, my goals are a bit more complex: I'd like to read in several text files from a folder, take a random sample from those character vectors, read the random sample into a corpus (using the package tm), and then generate a new data.frame that has a list of words/phrases and their frequencies.

The code I've developed so far is as follows:

library(tm)    # Corpus, tm_map, TermDocumentMatrix, stopwords
library(RWeka) # NGramTokenizer, Weka_control

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 5)) # Finds words or phrases of 1 to 5 tokens
files <- list.files("~/path/", full.names = TRUE, pattern = "\\.txt$") # Lists the .txt files in the folder

out <- lapply(seq_along(files), function(x) {
  df <- scan(files[x], what = "", sep = "\n") # Read the file in, one line per element
  df <- sample(df, size = 1500, replace = FALSE) # Take random sample of 1500 lines
  corpus <- Corpus(VectorSource(df)) # Create corpus
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, content_transformer(tolower)) # Wrapper keeps the documents as tm text documents
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)) # Create term document matrix
  m <- as.matrix(tdm)
  v <- sort(rowSums(m), decreasing = TRUE)
  d <- data.frame(word = names(v), freq = v) # Create new dataframe with words & their frequencies
})

Although this function works, I'm not sure how to access only the data.frames d while discarding the rest. Does out contain all of the objects created in lapply?

Upvotes: 0

Views: 555

Answers (2)

Abbas Ali

Reputation: 1

Thanks. However, this is what I got:

do.call('rbind', out)

Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match

I used

lapply(seq_along(d.names), 
       function(i,x) {assign(paste0("a",i),x[[i]], envir=.GlobalEnv)},
       x=out) 

I wanted to retain the original data frame names, though, so I did this:

lapply(seq_along(d.names), 
   function(i,x) {assign(paste0(d.names[i],i),x[[i]], envir=.GlobalEnv)},
   x=out) 

and it worked.

Appreciate your input

Upvotes: 0

Steve Weston

Reputation: 19677

The lapply function returns a list containing the values returned by the specified function. In your example, the function returns only the data frame that is assigned to d, so out will be a list containing only the d data frames. All of the other objects created by the function (such as tdm, m, and v) will be discarded, which seems to be what you want.
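A minimal sketch of that behaviour, using base R only. The toy function below is a hypothetical stand-in for the one in the question: it builds a few intermediate objects (v, m) and ends with a data frame d, and only d ends up in the result.

```r
# Hypothetical stand-in for the question's function: several
# intermediate objects are created, but only the last value is returned.
out <- lapply(1:3, function(i) {
  v <- seq_len(i)                         # intermediate, local to this call
  m <- matrix(v, nrow = 1)                # intermediate, local to this call
  d <- data.frame(n = i, total = sum(m))  # last expression: this is the return value
  d                                       # explicit return, same result as the assignment alone
})

length(out)              # one list element per input value
is.data.frame(out[[2]])  # each element is just the data frame d
names(out[[2]])          # only the columns of d; v and m are gone
```

Because v and m live only inside each function call, nothing needs to be deleted afterwards.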

You can access the data frames in out by indexing them, as in out[[1]], with lapply, as in lapply(out, function(d) d$word), or by combining them with do.call('rbind', out).
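The three access patterns can be sketched as follows. The file names and the word/freq values are made up for illustration; naming the list elements after the input files is one possible way to keep track of which data frame came from which file, not something the code in the question does.

```r
# Made-up stand-in for `out`: two data frames with the same columns.
files <- c("~/path/a.txt", "~/path/b.txt")  # hypothetical file names
out <- list(
  data.frame(word = c("the cat", "cat"), freq = c(3, 2), stringsAsFactors = FALSE),
  data.frame(word = c("dog", "a dog"), freq = c(5, 1), stringsAsFactors = FALSE)
)
names(out) <- basename(files)  # label each element with its source file

first <- out[[1]]                         # by position
same  <- out[["a.txt"]]                   # or by name
words <- lapply(out, function(d) d$word)  # apply a function to every data frame
combined <- do.call(rbind, out)           # stack them into one data frame
```

Note that rbind expects every data frame to have the same columns, which holds here because each one has exactly word and freq.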

Upvotes: 2
