Reputation: 671
I have a data.frame which contains 4 columns with 13 rows. Below is the sample data. [Column name is in uppercase and data is in lower case]
Sample input data:
NAME. MARKS MONTH COUNTRY
ram 20. jan India
ranjith 40. jan India
naren. 80. jan. India
Amir. 90. feb. India
kumar. 60. feb India
azhar 80. feb India
mark 90. feb. US
Alex. 55 feb. US
chris 20 feb US
rakesh 60. jan US
Mona. 70. jan. US
mano. 90. mar. UK
Ron. 37. mar. UK
Expected Output:
NAME MARKS. MONTH COUNTRY
naren 80. jan. India
ranjith 40. jan. India
Amir. 90. feb. India
Azhar. 80. feb. India
mark. 90. feb. US
Alex 55. feb. US
Mona. 70. jan. US
Rakesh. 60. jan. US
mano. 90. mar. UK
Ron. 37. mar. UK
Question : From the input dataframe I want to select only the highest two mark values from each group called MONTH and COUNTRY. Sample output is given above.
Can anyone share the sample code to produce the correct output and assign it to new dataframe. Any method is preferable including sqldf
.
Upvotes: 1
Views: 65
Reputation: 21497
You can do it as follows using data.table
. Thanks to @Arun for his suggestions on improving the answer.
require(data.table)
dat <- fread(txt)
dat[order(MARKS), tail(.SD, 2L), by=c("MONTH", "COUNTRY")]
Note that this only computes the order vector, and does not rearrange the entire data.table first before to perform the grouping operations (hence more memory efficient). .SD
contains the subset of data for each group and is itself a data.table.
With too many groups, tail(.SD, 2L)
could be slightly slower, In that case, we can use .I
which returns the indices, and then do the subset one last time finally as follows:
ix = dat[order(MARKS), .(I=tail(.I, 2L)), by=c("MONTH", "COUNTRY")][, I]
dat[ix]
This results in:
MONTH COUNTRY NAME MARKS
1: jan India ranjith 40
2: jan India naren 80
3: feb India kumar 60
4: feb India azhar 80
5: feb US Alex 55
6: feb US chris 20
7: feb India rakesh 60
8: feb India Mona 70
9: mar UK mano 90
10: mar UK Ron 37
Where txt
is your data without the ending .
txt <- "NAME MARKS MONTH COUNTRY
ram 20 jan India
ranjith 40 jan India
naren 80 jan India
Amir 90 feb India
kumar 60 feb India
azhar 80 feb India
mark 90 feb US
Alex 55 feb US
chris 20 feb US
rakesh 60 jan US
Mona 70 jan US
mano 90 mar UK
Ron 37 mar UK"
Upvotes: 2
Reputation: 886938
Here is an option with base R
(no packages used). We extract the first 3 letters from 'MONTH' using substr
(as there are some .
in some cases). Using ave
, we get the logical index based on the rank
after grouping by 'COUNTRY' and 'MONTH', it can be used to subset the rows.
df1$MONTH <- substr(df1$MONTH, 1, 3)
df1[with(df1, as.logical(ave(MARKS, COUNTRY, MONTH,
FUN = function(x) rank(-x) %in% 1:2))),]
Upvotes: 0
Reputation: 43334
In dplyr, you can group_by
, arrange
, and slice
. With some cleaning:
library(dplyr)
# take out .s
df %>% mutate_all(sub, pattern = '.', replacement = '', fixed = TRUE) %>%
# convert to numbers, if necessary
mutate_all(type.convert, as.is = TRUE) %>%
# set grouping for following operations
group_by(MONTH, COUNTRY) %>%
# sort by MARKS, descending
arrange(desc(MARKS)) %>%
# subset to top two rows of each group
slice(1:2)
## Source: local data frame [10 x 4]
## Groups: MONTH, COUNTRY [5]
##
## NAME. MARKS MONTH COUNTRY
## <chr> <int> <chr> <chr>
## 1 Amir 90 feb India
## 2 azhar 80 feb India
## 3 mark 90 feb US
## 4 Alex 55 feb US
## 5 naren 80 jan India
## 6 ranjith 40 jan India
## 7 Mona 70 jan US
## 8 rakesh 60 jan US
## 9 mano 90 mar UK
## 10 Ron 37 mar UK
Upvotes: 1