R: Better way of splitting this sample

Question

I'm a beginner in R, and pretty much everything I do comes from methodology I've learned in other languages. However, whenever I've looked for R-related answers here, the code structure was quite different from what I would have expected.

I have a data.table that contains panel data for individuals. I want to look at the mean outcome of a characteristic for each individual, and then split the sample in two: those whose mean outcome is above the median of the mean outcomes, and those whose mean outcome is below it.

Here's the structure of my data.table, yearly:

       user     wage year
1: 65122111     9.74 2003
2: 65122111     7.85 2004
3: 65122111    97.16 2005
4: 65122111    48.22 2006
5: 65122111    91.24 2007
6: 65122111     9.35 2008
7: 65122112    80.00 2007
8: 65122112     0.00 2008

And here's what I do:

## get mean wages
meanWages <- yearly[, list(meanWage = mean(wage)), by = user]
## split by median
highWage <- meanWages[meanWage > median(meanWages[, meanWage]), user]
lowWage <- meanWages[meanWage < median(meanWages[, meanWage]), user]
## split original sample
yearlyHigh <- yearly[is.element(user,highWage),]
yearlyLow <- yearly[is.element(user,lowWage),]

I suppose this is giving me what I expect (checking for correctness is quite cumbersome), but it seems very clunky and inefficient. What would be a more efficient and compact way of doing the same thing?

shadow · Accepted Answer

You can also use the dplyr package. Might not be as efficient, but it is very easy to read.

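yearly %>% 
  group_by(user) %>% 
  mutate(meanwage = mean(wage)) %>% 
  # drop the grouping first; within a single user, median(meanwage) equals
  # meanwage, so a grouped filter would keep every row
  ungroup() %>% 
  filter(meanwage >= median(meanwage))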

It is rarely helpful to actually split the data. Just group by the wage category and use groupwise operations instead.

yearly %>% 
  group_by(user) %>%
  mutate(meanwage = mean(wage)) %>%
  ungroup() %>%
  mutate(cat = ifelse(meanwage >= median(meanwage), "high", "low")) %>%
  group_by(cat) %>%
  do(data.table("further analyses here ..."))
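To make the "further analyses" placeholder concrete, here is a minimal sketch that uses summarise() for the groupwise step. It assumes the eight sample rows from the question as input; the summary columns (n_users, avg_wage) and the name by_cat are only illustrative and not part of the original answer.

```r
library(dplyr)

# Sample rows copied from the question
yearly <- data.frame(
  user = c(rep(65122111, 6), rep(65122112, 2)),
  wage = c(9.74, 7.85, 97.16, 48.22, 91.24, 9.35, 80.00, 0.00),
  year = c(2003:2008, 2007:2008)
)

by_cat <- yearly %>%
  group_by(user) %>%
  mutate(meanwage = mean(wage)) %>%   # per-user mean wage on every row
  ungroup() %>%                       # so the median is over the whole sample
  mutate(cat = ifelse(meanwage >= median(meanwage), "high", "low")) %>%
  group_by(cat) %>%
  summarise(
    n_users  = n_distinct(user),      # illustrative summary statistics
    avg_wage = mean(wage)
  )

print(by_cat)
```

Any other groupwise computation could take the place of the summarise() call; the point is that the high/low label stays a column rather than becoming two separate objects.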

Or just using data.table:

yearly[, meanwage := mean(wage), by=user]
yearly[, cat := ifelse(meanwage >= median(meanwage), "high", "low")]
yearly[, "further analyses here ...", by = cat]
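One thing to keep in mind: in the snippet above, median(meanwage) is computed over rows, so users with more yearly observations carry more weight, whereas the question takes the median over one mean per user. A minimal sketch of the per-user variant, assuming the sample rows from the question as data (the names cutoff and parts are illustrative, not from the original):

```r
library(data.table)

# Sample rows copied from the question
yearly <- data.table(
  user = c(rep(65122111, 6), rep(65122112, 2)),
  wage = c(9.74, 7.85, 97.16, 48.22, 91.24, 9.35, 80.00, 0.00),
  year = c(2003:2008, 2007:2008)
)

# Median of the per-user means: one value per user enters the median
cutoff <- yearly[, .(meanwage = mean(wage)), by = user][, median(meanwage)]

# Attach the per-user mean and the high/low label to every row
yearly[, meanwage := mean(wage), by = user]
yearly[, cat := ifelse(meanwage >= cutoff, "high", "low")]

# Groupwise analysis without splitting
print(yearly[, .(n_users = uniqueN(user), avg_wage = mean(wage)), by = cat])

# If separate tables really are needed, split() returns a named list
parts <- split(yearly, by = "cat")
```

Here parts$high and parts$low would play the role of yearlyHigh and yearlyLow from the question. Unlike the strict > and < comparisons there, the >= used here puts users exactly at the median into the "high" group instead of dropping them.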
