FooBar

Reputation: 16488

R: Better way of splitting this sample

I'm a beginner in R, and pretty much everything I do comes from typical methodology I've learned in other languages. However, whenever I've looked for R-related answers here, the code structure was quite different from what I would have expected.

I have a data.table that contains panel data for individuals. I want to compute each individual's mean outcome for a characteristic and then split the sample in two: those whose mean outcome is above the median of the mean outcomes, and those whose mean outcome is below it.

Here's the structure of my data.table, yearly:

       user     wage year
1: 65122111     9.74 2003
2: 65122111     7.85 2004
3: 65122111    97.16 2005
4: 65122111    48.22 2006
5: 65122111    91.24 2007
6: 65122111     9.35 2008
7: 65122112    80.00 2007
8: 65122112     0.00 2008

And here's what I do:

## get mean wages
meanWages <- yearly[, list(meanWage = mean(wage)), by=(user)]
## split by median
highWage <- meanWages[meanWage > median(meanWages[, meanWage]), user]
lowWage <- meanWages[meanWage < median(meanWages[, meanWage]), user]
## split original sample
yearlyHigh <- yearly[is.element(user,highWage),]
yearlyLow <- yearly[is.element(user,lowWage),]

I suppose this is giving me what I expect (checking for correctness is quite cumbersome), but it seems very clunky and inefficient. What would be a more efficient and compact way of doing the same thing?

Upvotes: 4

Views: 112

Answers (2)

shadow

Reputation: 22293

You can also use the dplyr package. It might not be as efficient, but it is very easy to read.

yearly %>% 
  group_by(user) %>% 
  mutate(meanwage = mean(wage)) %>% 
  ungroup() %>% 
  filter(meanwage >= median(meanwage))

It is rarely helpful to actually split the data. Just group by the wage category and use groupwise operations instead.

yearly %>% 
  group_by(user) %>%
  mutate(meanwage = mean(wage)) %>%
  ungroup %>%
  mutate(cat = ifelse(meanwage >= median(meanwage), "high", "low")) %>%
  group_by(cat) %>%
  do(data.table("further analyses here ..."))

Or just using data.table:

yearly[, meanwage := mean(wage), by=user]
yearly[, cat := ifelse(meanwage >= median(meanwage), "high", "low")]
yearly[, "further analyses here ...", by = cat]
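
As a rough sketch of what the data.table version could look like end to end, here is the same idea run on the sample rows from the question. Two things are my own assumptions rather than part of the code above: the cutoff is taken over one mean per user (which matches the question's median of mean wages, whereas a median over all rows gives users with more years more weight), and the per-category summary is only an arbitrary stand-in for "further analyses". The names cutoff and summaryByCat are purely illustrative.

```r
library(data.table)

# Sample rows copied from the question
yearly <- data.table(
  user = c(rep(65122111, 6), rep(65122112, 2)),
  wage = c(9.74, 7.85, 97.16, 48.22, 91.24, 9.35, 80.00, 0.00),
  year = c(2003:2008, 2007:2008)
)

# Per-user mean wage, added to the table by reference
yearly[, meanwage := mean(wage), by = user]

# Assumption: take the median over one row per user, so that users
# with many yearly observations do not dominate the cutoff
cutoff <- median(unique(yearly, by = "user")$meanwage)

# Label every row with its user's wage category
yearly[, cat := ifelse(meanwage >= cutoff, "high", "low")]

# Placeholder analysis: number of users and average wage per category
summaryByCat <- yearly[, .(nUsers = uniqueN(user), avgWage = mean(wage)), by = cat]
print(summaryByCat)
```

Note that := modifies yearly in place, so the meanwage and cat columns stay on the original table after this runs.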

Upvotes: 3

Yevgeny Tkach

Reputation: 667

You can try the following, although I can't be certain that it is the most efficient or compact approach.

yearly[, meanwage := mean(wage), by=user]
yearlyHigh <- yearly[meanwage >= median(meanwage)]
yearlyLow <- yearly[meanwage < median(meanwage)]
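
Since checking the result by eye is cumbersome, a few assertions can do it instead. The following is only a sketch on the sample rows from the question; the variable cutoff and the specific checks are my own additions, and the cutoff here is the row-level median, as in the code above.

```r
library(data.table)

# Sample rows copied from the question
yearly <- data.table(
  user = c(rep(65122111, 6), rep(65122112, 2)),
  wage = c(9.74, 7.85, 97.16, 48.22, 91.24, 9.35, 80.00, 0.00),
  year = c(2003:2008, 2007:2008)
)

yearly[, meanwage := mean(wage), by = user]

# Row-level median of the per-user means (illustrative variable name)
cutoff <- median(yearly$meanwage)
yearlyHigh <- yearly[meanwage >= cutoff]
yearlyLow  <- yearly[meanwage < cutoff]

# Every row must land in exactly one of the two halves
stopifnot(nrow(yearlyHigh) + nrow(yearlyLow) == nrow(yearly))

# No user may appear in both halves
stopifnot(length(intersect(yearlyHigh$user, yearlyLow$user)) == 0)
```

If wage contains missing values, meanwage becomes NA for that user and the first check fails, which is a useful signal that those rows were silently dropped from both halves.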

Upvotes: 3
