Reputation: 117
I am relatively new to R, so this may be a simple question. I searched extensively for an answer but couldn't find one.
I have a data frame in the form:
firstword nextword freq
a little 23
a great 46
a few 32
a good 15
about the 57
about how 34
about a 48
about it 27
by the 36
by his 52
by an 12
by my 16
This is just a tiny sample from my data set for illustration. My data frame has over a million rows. firstword and nextword are character type. Each firstword can have many nextwords associated with it, while some may have only one.
How do I generate another data frame from this that is sorted in descending order of freq within each firstword and contains at most the top 6 nextwords for each?
I tried the following code.
small = ddply(df, "firstword", summarise, nextword=nextword[order(freq,decreasing=T)[1:6]])
This works for a smaller subset of my data, but runs out of memory when I run it on the entire data set.
Upvotes: 3
Views: 204
Reputation: 92282
Here's a similarly efficient approach using the data.table package.
First, you don't need to sort freq within every group; sorting it once is enough and more efficient. So one way would be simply:
library(data.table)
setDT(df)[order(-freq), head(.SD, 6), by = firstword]  # head() avoids NA rows in groups with fewer than 6 rows
Another way (possibly more efficient) is to find the row indices using the .I special symbol (row index) and then subset:
indx <- df[order(-freq), head(.I, 6), by = firstword]$V1  # head() avoids NA indices in short groups
df[indx]
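To illustrate the "sort once, then take the first rows of each group" idea without any package, here is a minimal base R sketch. It is not the data.table code above, just the same technique under stated assumptions: the data is the first two groups from the question, and n_top is a hypothetical cutoff set to 2 so the effect is visible on such a small sample (the question uses 6).

```r
# Sample data copied from the question (first two groups only).
df <- data.frame(
  firstword = c("a", "a", "a", "a", "about", "about", "about", "about"),
  nextword  = c("little", "great", "few", "good", "the", "how", "a", "it"),
  freq      = c(23, 46, 32, 15, 57, 34, 48, 27),
  stringsAsFactors = FALSE
)

n_top <- 2  # hypothetical cutoff for illustration; the question asks for 6

# Sort once: by firstword, then by descending freq within each firstword.
sorted <- df[order(df$firstword, -df$freq), ]

# Position of each row within its group; ave() applies seq_along() per firstword.
rank_in_group <- ave(seq_len(nrow(sorted)), sorted$firstword, FUN = seq_along)

# Keep at most n_top rows per group. Groups with fewer rows are kept whole,
# so no NA padding is introduced.
top <- sorted[rank_in_group <= n_top, ]
rownames(top) <- NULL
print(top)
```

Because the filter is a logical comparison rather than a fixed index such as 1:6, a firstword with fewer than n_top rows simply contributes all of its rows.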
Upvotes: 5
Reputation: 5503
The dplyr package was created for handling large data sets like this. Try this:
library(dplyr)
df %>% group_by(firstword) %>% top_n(6, freq) %>% arrange(firstword, desc(freq))  # column is freq, not Freq; top_n() keeps ties
Upvotes: 3