Reputation: 2663
This question is an elaboration on a previous question that I'd asked about repeating functions on sequentially-labeled dataframes.
In the past, I needed to make minor alterations to data.tables read in from a folder to R (e.g. changing dates, recoding).
Now, however, my goals are a bit more complex: I'd like to read in several text files from a folder, take a random sample from each of those character vectors, read the random sample into a corpus (using the package tm), and then generate a new data.frame that has a list of words/phrases and their frequencies.
The code I've developed so far is as follows:
library(tm)     # Corpus, tm_map, TermDocumentMatrix
library(RWeka)  # NGramTokenizer, Weka_control

# Tokenizer that finds words and phrases of 1 to 5 words
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 5))
files <- list.files("~/path/", full.names = TRUE, pattern = "\\.txt$")  # List the .txt files
out <- lapply(seq_along(files), function(x) {
  df <- scan(files[x], what = "", sep = "\n")      # Read the file, one element per line
  df <- sample(df, size = 1500, replace = FALSE)   # Take random sample (file needs at least 1500 lines)
  corpus <- Corpus(VectorSource(df))               # Create corpus
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, content_transformer(tolower))  # Newer tm versions expect base functions to be wrapped
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))  # Create term document matrix
  m <- as.matrix(tdm)
  v <- sort(rowSums(m), decreasing = TRUE)
  d <- data.frame(word = names(v), freq = v)       # Create new dataframe with words & their frequencies
})
However, although this function works, I'm not sure how to access only the data.frames d while discarding the rest. Does out contain all of the objects created in lapply?
Upvotes: 0
Views: 555
Reputation: 1
Thanks. However, this is what I got when I tried combining them:
do.call('rbind', out)
Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match
So instead I used
lapply(seq_along(d.names),
  function(i,x) {assign(paste0("a",i),x[[i]], envir=.GlobalEnv)},
  x=out)
I wanted to retain the original dataframe names, though, so I did this:
lapply(seq_along(d.names),
  function(i,x) {assign(paste0(d.names[i],i),x[[i]], envir=.GlobalEnv)},
  x=out)
and it worked.
I appreciate your input.
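As an alternative to assign(), the results can stay in the list and simply be given names. The following is only a sketch: the two file paths and the tiny data frames are hypothetical stand-ins for the question's files and out, and deriving the names from the file names is an assumption about what "original dataframe names" means here.

```r
# Hypothetical stand-ins for the question's `files` and `out`
files <- c("~/path/alpha.txt", "~/path/beta.txt")
out <- list(
  data.frame(word = "a", freq = 2),
  data.frame(word = "b", freq = 1)
)

# Name each list element after its source file (directory and extension dropped)
names(out) <- tools::file_path_sans_ext(basename(files))

out$alpha       # access a data frame by name
out[["beta"]]

# If separate objects in the global environment are really wanted,
# list2env() creates one variable per named list element:
# list2env(out, envir = .GlobalEnv)
```

Keeping everything in one named list means later steps can still loop over the data frames with lapply, which is harder once they are scattered across the global environment.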
Upvotes: 0
Reputation: 19677
The lapply function returns a list containing the values returned by the specified function. In your example, the function returns only the data frame that is assigned to d, so out will be a list containing only the d data frames. All of the other objects created by the function (such as tdm, m, and v) will be discarded, which seems to be what you want.
You can access the data frames in out by indexing them, as in out[[1]], with lapply, as in lapply(out, function(d) d$word), or by combining them with do.call('rbind', out).
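A small self-contained illustration of this behaviour, using base R only. The function make_freq and the toy inputs are made up for the example and stand in for the tm pipeline in the question; like the original function, it ends with an assignment, whose value is what lapply collects.

```r
# Toy stand-in for the question's function: `v` is local to each call
# and is discarded; only the value of the last expression is returned.
make_freq <- function(words) {
  v <- sort(table(words), decreasing = TRUE)
  d <- data.frame(word = names(v), freq = as.vector(v),
                  stringsAsFactors = FALSE)
}

inputs <- list(a = c("x", "y", "x"), b = c("z", "z", "y", "z"))
out <- lapply(inputs, make_freq)

out[[1]]                                       # first data frame
word_lists <- lapply(out, function(d) d$word)  # word column of each data frame
combined <- do.call("rbind", out)              # all data frames stacked into one
```

Note that do.call("rbind", out) only works when every data frame in the list has the same columns, as they do here.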
Upvotes: 2