Live Free

Reputation: 117

Getting the top N sorted elements from a data.frame in R for large dataset

I am relatively new to R, so this may be a simple question. I searched extensively for an answer but couldn't find one.

I have a data frame in the form:

firstword  nextword   freq
a          little     23
a          great      46
a          few        32
a          good       15
about      the        57
about      how        34
about      a          48 
about      it         27
by         the        36
by         his        52
by         an         12
by         my         16

This is just a tiny sample of my data set for illustration. My data frame has over a million rows. firstword and nextword are of type character. Each firstword can have many nextwords associated with it, while some have only one.

How do I generate another data frame from this, sorted in descending order of freq within each 'firstword' and containing at most the top 6 nextwords per firstword?

I tried the following code.

small = ddply(df, "firstword", summarise, nextword=nextword[order(freq,decreasing=T)[1:6]])

This works for a smaller subset of my data, but runs out of memory when I run it on the entire data set.

Upvotes: 3

Views: 204

Answers (2)

David Arenburg

Reputation: 92282

Here's an efficient approach using the data.table package. First, you don't need to sort freq within every group; sorting it only once is enough and more efficient. So one way would be simply

library(data.table)
# head() returns at most 6 rows per group and adds no NA rows for groups with fewer than 6
setDT(df)[order(-freq), head(.SD, 6), by = firstword]

Another way (possibly more efficient) is to find the row indices using the .I special symbol and then subset by them

indx <- df[order(-freq), head(.I, 6), by = firstword]$V1
df[indx]
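
The "sort once, then keep the first few rows of each group" idea can also be sketched in base R, which may help to see what the data.table calls are doing. This is only an illustration on the sample rows from the question, not a benchmarked solution for a million rows; the function name top_n_per_group and the choice of n = 2 for the demo are assumptions made here.

```r
# Sample data copied from the question.
df <- data.frame(
  firstword = rep(c("a", "about", "by"), each = 4),
  nextword  = c("little", "great", "few", "good",
                "the", "how", "a", "it",
                "the", "his", "an", "my"),
  freq      = c(23, 46, 32, 15, 57, 34, 48, 27, 36, 52, 12, 16),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep at most n rows per firstword, highest freq first.
top_n_per_group <- function(d, n = 6L) {
  # Sort once: by firstword, then by descending freq.
  d <- d[order(d$firstword, -d$freq), ]
  # Position of each row inside its firstword group (1 = highest freq).
  rank_in_group <- ave(seq_len(nrow(d)), d$firstword, FUN = seq_along)
  # Groups with fewer than n rows are kept whole; no NA rows are created.
  d[rank_in_group <= n, ]
}

top2 <- top_n_per_group(df, n = 2L)
print(top2)
```

Calling top_n_per_group(df) with the default n = 6 corresponds to the "top 6 at most" requirement in the question.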

Upvotes: 5

Koundy

Reputation: 5503

The dplyr package is designed to handle large data sets efficiently. Try this:

library(dplyr)

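# The column is named freq (lower case). top_n() keeps ties, so a group can return more than 6 rows.
df %>% group_by(firstword) %>% top_n(6, freq) %>% arrange(firstword, desc(freq))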
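
In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which has an explicit with_ties argument. A minimal sketch on the sample rows from the question, assuming a recent dplyr is installed and using n = 2 only to keep the demo output short:

```r
library(dplyr)

# Sample data copied from the question.
df <- data.frame(
  firstword = rep(c("a", "about", "by"), each = 4),
  nextword  = c("little", "great", "few", "good",
                "the", "how", "a", "it",
                "the", "his", "an", "my"),
  freq      = c(23, 46, 32, 15, 57, 34, 48, 27, 36, 52, 12, 16),
  stringsAsFactors = FALSE
)

top2 <- df %>%
  group_by(firstword) %>%
  # Keep the n rows with the highest freq per group; drop ties beyond n.
  slice_max(freq, n = 2, with_ties = FALSE) %>%
  ungroup() %>%
  # Make the final ordering explicit rather than relying on defaults.
  arrange(firstword, desc(freq))

print(top2)
```

For the question as asked, n = 6 would replace n = 2.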

Upvotes: 3
