Reputation: 671

Need to fetch only first two highest records by grouping two columns of dataframe in R

I have a data.frame which contains 4 columns with 13 rows. Below is the sample data. [Column name is in uppercase and data is in lower case]

Sample input data:

NAME.  MARKS  MONTH COUNTRY
ram       20. jan   India
ranjith   40.  jan   India
naren.    80.  jan.  India
Amir.     90.  feb.   India
kumar.    60.  feb     India
azhar     80.  feb   India
mark      90.  feb.  US
Alex.     55   feb.  US
chris     20   feb   US
rakesh    60.  jan   US
Mona.     70.   jan.  US
mano.     90.  mar.   UK
Ron.       37.  mar.  UK

Expected Output:

NAME    MARKS. MONTH  COUNTRY
naren    80.    jan.    India
ranjith  40.    jan.    India
Amir.    90.    feb.    India
Azhar.   80.    feb.    India
mark.    90.    feb.     US
Alex     55.    feb.     US
Mona.    70.    jan.     US
Rakesh.   60.    jan.    US
mano.     90.    mar.    UK
Ron.      37.    mar.    UK

Question : From the input dataframe I want to select only the highest two mark values from each group called MONTH and COUNTRY. Sample output is given above.

Can anyone share the sample code to produce the correct output and assign it to new dataframe. Any method is preferable including sqldf.

Upvotes: 1

Answers (3)

Rentrop

Reputation: 21497

You can do it as follows using data.table. Thanks to @Arun for his suggestions on improving the answer.

require(data.table)
dat <- fread(txt)
dat[order(MARKS), tail(.SD, 2L), by=c("MONTH", "COUNTRY")]

Note that this only computes the order vector, and does not rearrange the entire data.table first before to perform the grouping operations (hence more memory efficient). .SD contains the subset of data for each group and is itself a data.table.

With too many groups, tail(.SD, 2L) could be slightly slower, In that case, we can use .I which returns the indices, and then do the subset one last time finally as follows:

ix = dat[order(MARKS), .(I=tail(.I, 2L)), by=c("MONTH", "COUNTRY")][, I]
dat[ix]

This results in:

    MONTH COUNTRY    NAME MARKS
 1:   jan   India ranjith    40
 2:   jan   India   naren    80
 3:   feb   India   kumar    60
 4:   feb   India   azhar    80
 5:   feb      US    Alex    55
 6:   feb      US   chris    20
 7:   feb   India  rakesh    60
 8:   feb   India    Mona    70
 9:   mar      UK    mano    90
10:   mar      UK     Ron    37

Where txt is your data without the ending .

txt <- "NAME  MARKS  MONTH COUNTRY
    ram       20 jan   India
    ranjith   40  jan   India
    naren    80  jan  India
    Amir     90  feb   India
    kumar    60  feb     India
    azhar     80  feb   India
    mark      90  feb  US
    Alex     55   feb  US
    chris     20   feb   US
    rakesh    60  jan   US
    Mona     70   jan  US
    mano     90  mar   UK
    Ron       37  mar  UK"

Upvotes: 2

akrun

Reputation: 886938

Here is an option with base R (no packages used). We extract the first 3 letters from 'MONTH' using substr (as there are some . in some cases). Using ave, we get the logical index based on the rank after grouping by 'COUNTRY' and 'MONTH', it can be used to subset the rows.

df1$MONTH <- substr(df1$MONTH, 1, 3)
df1[with(df1, as.logical(ave(MARKS, COUNTRY, MONTH,
                    FUN = function(x) rank(-x) %in% 1:2))),]

Upvotes: 0

alistaire

Reputation: 43334

In dplyr, you can group_by, arrange, and slice. With some cleaning:

library(dplyr)

       # take out .s
df %>% mutate_all(sub, pattern = '.', replacement = '', fixed = TRUE) %>% 
  # convert to numbers, if necessary
  mutate_all(type.convert, as.is = TRUE) %>% 
  # set grouping for following operations
  group_by(MONTH, COUNTRY) %>% 
  # sort by MARKS, descending
  arrange(desc(MARKS)) %>%
  # subset to top two rows of each group
  slice(1:2)

## Source: local data frame [10 x 4]
## Groups: MONTH, COUNTRY [5]
## 
##      NAME. MARKS MONTH COUNTRY
##      <chr> <int> <chr>   <chr>
## 1     Amir    90   feb   India
## 2    azhar    80   feb   India
## 3     mark    90   feb      US
## 4     Alex    55   feb      US
## 5    naren    80   jan   India
## 6  ranjith    40   jan   India
## 7     Mona    70   jan      US
## 8   rakesh    60   jan      US
## 9     mano    90   mar      UK
## 10     Ron    37   mar      UK

Upvotes: 1

Need to fetch only first two highest records by grouping two columns of dataframe in R

Answers (3)

Related Questions