Getting the top N sorted elements from a data.frame in R for large dataset

Question

I am relatively new to R, so this may be a simple question. I searched extensively for an answer but couldn't find one.

I have a data frame in the form:

firstword  nextword   freq
a          little     23
a          great      46
a          few        32
a          good       15
about      the        57
about      how        34
about      a          48 
about      it         27
by         the        36
by         his        52
by         an         12
by         my         16

This is just a tiny sample from my data set for illustration. My data frame has over a million rows. firstword and nextword are character columns. Each firstword can have many nextwords associated with it, while some have only one.

How do I generate another data frame from this that is sorted in descending order of freq within each firstword and contains at most the top 6 nextwords for each firstword?

I tried the following code.

small = ddply(df, "firstword", summarise, nextword = nextword[order(freq, decreasing = TRUE)[1:6]])

This works for a smaller subset of my data, but runs out of memory when I run it on the entire data set.

Koundy · Accepted Answer

The dplyr package is designed for this kind of work on large datasets. Try this:

library(dplyr)

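# The column is named freq (lower case). arrange() ignores grouping by default,
# so sort by firstword explicitly after picking the top rows in each group.
df %>% group_by(firstword) %>% top_n(6, freq) %>% arrange(firstword, desc(freq))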
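
For comparison, here is a minimal base R sketch of the same "top N per group" idea that needs no packages. It is not part of the answer above; it swaps in a different technique: sort once by firstword and descending freq, then number the rows within each firstword using ave() and keep the first n. The helper name top_n_per_group is hypothetical, and the data frame is rebuilt from the sample in the question. Unlike top_n(), which keeps ties, this never returns more than n rows per group.

```r
# Sample data copied from the question.
df <- data.frame(
  firstword = rep(c("a", "about", "by"), each = 4),
  nextword  = c("little", "great", "few", "good",
                "the", "how", "a", "it",
                "the", "his", "an", "my"),
  freq      = c(23, 46, 32, 15, 57, 34, 48, 27, 36, 52, 12, 16),
  stringsAsFactors = FALSE
)

# top_n_per_group is a hypothetical helper name, not from any package.
top_n_per_group <- function(data, n = 6) {
  # Sort once: by firstword, then by descending freq within each firstword.
  sorted <- data[order(data$firstword, -data$freq), ]
  # Number the rows within each firstword (1 = highest freq).
  rank_in_group <- ave(seq_len(nrow(sorted)), sorted$firstword, FUN = seq_along)
  # Keep at most n rows per firstword.
  result <- sorted[rank_in_group <= n, ]
  rownames(result) <- NULL
  result
}

# n = 2 here only so the effect is visible on the 12-row sample;
# the question asks for n = 6.
small <- top_n_per_group(df, n = 2)
print(small)
```

Because the whole table is sorted once rather than split into one piece per firstword, this may be lighter on memory than the ddply call in the question, though that has not been measured on the million-row data.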
