Reputation: 1484
I have a question about check which level is the most.
Here is my data:
Year Area V1 V2 V3
2014 USA 100 25 50
2014 USA 200 50 60
2014 USA 200 50 50
2014 USA 200 50 50
2014 USA 300 75 40
2014 ASIA 100 25 60
2014 ASIA 100 25 70
2014 ASIA 300 75 60
2014 ASIA 400 100 60
2014 ASIA 500 125 70
2015 USA 100 25 80
2015 USA 300 75 80
2015 USA 300 75 70
2015 USA 300 75 90
2015 USA 500 125 40
2015 ASIA 400 100 90
2015 ASIA 400 100 80
2015 ASIA 300 75 80
2016 USA 500 125 60
2016 USA 500 125 60
2016 ASIA 100 25 50
What I want is:
Year Area V1 V2 V3 Count
2014 USA 200 50 50 5
2015 USA 300 75 80 5
2016 USA 500 125 60 2
2014 ASIA 100 25 60 5
2015 ASIA 400 100 80 3
2016 ASIA 100 25 50 1
In V1
, there are 5 levels(100, 200, 300, 400, and 500).
In V2
, there are also 5 levels, which are 0.25*V1
.
In V3
, there are 6 levels.
The result I want is group by Year
and Area
. Moreover, V1
is the max count of levels. For example, in Year == 2014
and Area == USA
, V1
contains 1 level 100, 3 level 200, and 1 level 300. So, the result should be 200 because it is the most one. V2
and V3
are the same.
Any idea?
DATA
dt <- fread("Year Area V1 V2 V3
2014 USA 100 25 50
2014 USA 200 50 60
2014 USA 200 50 50
2014 USA 200 50 50
2014 USA 300 75 40
2014 ASIA 100 25 60
2014 ASIA 100 25 70
2014 ASIA 300 75 60
2014 ASIA 400 100 60
2014 ASIA 500 125 70
2015 USA 100 25 80
2015 USA 300 75 80
2015 USA 300 75 70
2015 USA 300 75 90
2015 USA 500 125 40
2015 ASIA 400 100 90
2015 ASIA 400 100 80
2015 ASIA 300 75 80
2016 USA 500 125 60
2016 USA 500 125 60
2016 ASIA 100 25 50")
Upvotes: 2
Views: 526
Reputation: 887991
We can get the Mode
function from @KenWilliam's answer here.
library(data.table)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Grouped by 'Year', 'Area' (using data.table
syntax as it is already a data.table
from reading with fread
), loop through the Subset of data.table (.SD
), get the element that has the highest frequency using Mode
, similarly get the frequency of columns using table
, find the sum
, collapse the multiple columns to a single one by taking the maximum of each row (pmax
) and concatenate (c
) with the list
of columns of 'Mode'
dt[, c(lapply(.SD, Mode),
Count = do.call(pmax, lapply(.SD, function(x) sum(table(x))))), by = .(Year, Area)]
# Year Area V1 V2 V3 Count
#1: 2014 USA 200 50 50 5
#2: 2014 ASIA 100 25 60 5
#3: 2015 USA 300 75 80 5
#4: 2015 ASIA 400 100 80 3
#5: 2016 USA 500 125 60 2
#6: 2016 ASIA 100 25 50 1
Upvotes: 2
Reputation: 7453
Using dplyr
we can:
dt %>%
# we count the number of item within Year x Area groups
group_by(Year, Area) %>%
mutate(Count=n()) %>%
# now we count the number of V1 levels within the Year x Area groups
group_by(Year, Area, V1) %>%
mutate(Count_V1=n()) %>%
arrange(desc(Count_V1)) %>%
# now we come back to Year x Area grouping and retain the most abundant entry
group_by(Year, Area) %>%
slice(1) %>%
ungroup() %>%
# cosmetics
arrange(Area, Year) -> dt2
Which results in:
dt2
# A tibble: 6 x 7
Year Area V1 V2 V3 Count Count_V1
<int> <fctr> <int> <int> <int> <int> <int>
1 2014 ASIA 100 25 60 5 2
2 2015 ASIA 400 100 90 3 2
3 2016 ASIA 100 25 50 1 1
4 2014 USA 200 50 60 5 3
5 2015 USA 300 75 80 5 3
6 2016 USA 500 125 60 2 2
But as stated in the comments, V2 and V3 values are confusing since they're not based on a clear criterium. From what I understand they can be removed, just like Count_V1
:
dt2 %>% select(-Count_V1, -V2, -V3)
# A tibble: 6 x 4
Year Area V1 Count
<int> <fctr> <int> <int>
1 2014 ASIA 100 5
2 2015 ASIA 400 3
3 2016 ASIA 100 1
4 2014 USA 200 5
5 2015 USA 300 5
6 2016 USA 500 2
Upvotes: 2