Reputation: 3407
(I have a feeling I'm gonna feel really dumb after I get an answer but I just could not figure this out.)
I have a data.frame with an empty column on the end. It will mostly be filled with NAs, but I want to populate some rows of it with a value. This column represents a guess at data that is missing from one of the columns in the data.frame.
My initial data.frame looks something like this:
Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A | 6 | 3 | 6 |
B | 7 | 3 | 7 |
C | 6.5 | 3 | N/A |median(df$MaxPlayers[df$MinPlayers ==3,])
D | 7 | 3 | 6 |
E | 7 | 3 | 5 |
F | 9.5 | 2 | 5 |
G | 6 | 2 | 4 |
H | 7 | 2 | 4 |
I | 6.5 | 2 | N/A |median(df$MaxPlayers[df$MinPlayers ==2,])
J | 7 | 2 | 2 |
K | 7 | 2 | 4 |
Notice that two of the rows have "N/A" for MaxPlayers. What I am attempting to do is use the information I have to make a guess at what MaxPlayers might be. If median(MaxPlayers) for 3-player games is 6, MaxPlayersGuess should equal 6 for games with MinPlayers == 3 and MaxPlayers == N/A. (I have attempted to indicate in code what value MaxPlayersGuess should get in the example above.)
The resulting data.frame would look like this:
Game | Rating | MinPlayers | MaxPlayers | MaxPlayersGuess
---------------------------------------------------------
A | 6 | 3 | 6 |
B | 7 | 3 | 7 |
C | 6.5 | 3 | N/A |6
D | 7 | 3 | 6 |
E | 7 | 3 | 5 |
F | 9.5 | 2 | 5 |
G | 6 | 2 | 4 |
H | 7 | 2 | 4 |
I | 6.5 | 2 | N/A |4
J | 7 | 2 | 2 |
K | 7 | 2 | 4 |
To share the results of one attempt:
gld$MaxPlayersGuess <- ifelse(is.na(gld$MaxPlayers), median(gld$MaxPlayers[gld$MinPlayers,]), NA)
Error in gld$MaxPlayers[gld$MinPlayers, ] :
incorrect number of dimensions
Upvotes: 0
Views: 1863
Reputation: 2496
If you want to use dplyr, you can try:
input:
df <- data.frame(
Game = LETTERS[1:11],
Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7),
MinPlayers = c(rep(3,5), rep(2,6)),
MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)
)
process:
library(dplyr)  # provides %>%, group_by(), mutate() and if_else()
df %>%
  group_by(MinPlayers) %>%
  mutate(MaxPlayers = if_else(is.na(MaxPlayers), median(MaxPlayers, na.rm = TRUE), MaxPlayers))
This groups the data by MinPlayers and then assigns each group's median value of MaxPlayers to the rows in that group where MaxPlayers is missing.
output:
Source: local data frame [11 x 4]
Groups: MinPlayers [2]
Game Rating MinPlayers MaxPlayers
<fctr> <dbl> <dbl> <dbl>
1 A 6.0 3 6
2 B 7.0 3 7
3 C 6.5 3 6
4 D 7.0 3 6
5 E 7.0 3 5
6 F 9.5 2 5
7 G 6.0 2 4
8 H 7.0 2 4
9 I 6.5 2 4
10 J 7.0 2 2
11 K 7.0 2 4
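The same grouped-median idea can be sketched in base R, writing the guess into a separate MaxPlayersGuess column as the question asks instead of overwriting MaxPlayers. This is only a sketch on the example data; the helper name group_median is made up for illustration. Note that df$MaxPlayers is a plain vector, so it takes a single index (x[i]); the trailing comma in the question's x[i, ] is what triggers "incorrect number of dimensions".

```r
# Same example data as above.
df <- data.frame(
  Game = LETTERS[1:11],
  Rating = c(6, 7, 6.5, 7, 7, 9.5, 6, 7, 6.5, 7, 7),
  MinPlayers = c(rep(3, 5), rep(2, 6)),
  MaxPlayers = c(6, 7, NA, 6, 5, 5, 4, 4, NA, 2, 4)
)

# ave() returns one value per row: the median of MaxPlayers within
# that row's MinPlayers group, ignoring NAs.
# (group_median is a hypothetical helper name.)
group_median <- ave(df$MaxPlayers, df$MinPlayers,
                    FUN = function(x) median(x, na.rm = TRUE))

# Fill the guess only where MaxPlayers is missing; leave NA elsewhere.
df$MaxPlayersGuess <- ifelse(is.na(df$MaxPlayers), group_median, NA)
```

With this data, MaxPlayersGuess is 6 for game C and 4 for game I, and NA for every other row.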
Upvotes: 1
Reputation: 1790
I think that you already have everything you need in @griffmer's answer. A less elegant but maybe more intuitive way could be a loop:
## Your data:
df <- data.frame(
Game = LETTERS[1:11],
Rating = c(6,7,6.5,7,7,9.5,6,7,6.5,7,7),
MinPlayers = c(rep(3,5), rep(2,6)),
MaxPlayers = c(6,7,NA,6,5,5,4,4,NA,2,4)
)
## Loop over rows:
df$MaxPlayersGuess <- vapply(seq_len(nrow(df)), function(ii){
if (is.na(df$MaxPlayers[ii])){
median(df$MaxPlayers[df$MinPlayers == df$MinPlayers[ii]],
na.rm = TRUE)
} else {
df$MaxPlayers[ii]
}
}, numeric(1))
which gives you
df
# Game Rating MinPlayers MaxPlayers MaxPlayersGuess
# 1 A 6.0 3 6 6
# 2 B 7.0 3 7 7
# 3 C 6.5 3 NA 6
# 4 D 7.0 3 6 6
# 5 E 7.0 3 5 5
# 6 F 9.5 2 5 5
# 7 G 6.0 2 4 4
# 8 H 7.0 2 4 4
# 9 I 6.5 2 NA 4
# 10 J 7.0 2 2 2
# 11 K 7.0 2 4 4
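The loop above recomputes the group median for every row with a missing value. A possible variation, sketched here under the assumption that the data looks like the example, computes each group's median once with tapply() and then looks it up by group. The names med and row_median are made up for illustration.

```r
# Same example data as above.
df <- data.frame(
  Game = LETTERS[1:11],
  Rating = c(6, 7, 6.5, 7, 7, 9.5, 6, 7, 6.5, 7, 7),
  MinPlayers = c(rep(3, 5), rep(2, 6)),
  MaxPlayers = c(6, 7, NA, 6, 5, 5, 4, 4, NA, 2, 4)
)

# One median per MinPlayers group, computed once.
# The result is named by group value ("2", "3").
med <- tapply(df$MaxPlayers, df$MinPlayers, median, na.rm = TRUE)

# Look up each row's group median by name and drop the names.
row_median <- as.numeric(med[as.character(df$MinPlayers)])

# Keep observed values; use the group median where MaxPlayers is NA.
df$MaxPlayersGuess <- ifelse(is.na(df$MaxPlayers), row_median, df$MaxPlayers)
```

This should produce the same MaxPlayersGuess column as the loop.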
Upvotes: 1
Reputation: 377
Updating relative to posted example.
Here's my tip of the day: sometimes it's easier to compute what you want first and then grab it when you need it, rather than pile up logical contingencies. You're trying to compute everything in one expression, and that's what makes it confusing, so break it into steps. You need the median of "MaxPlayer" for each group of "MinPlayer". Then you use that value wherever MaxPlayer is missing. Here's a simple way to do that.
#generate fake data
MinPlayer <- rep(3:2, each = 4)
MaxPlayer <- rep(2:5, each = 2, times = 2)
df <- data.frame(MinPlayer, MaxPlayer)
#replace some values of MaxPlayer with NA
df$MaxPlayer <- ifelse(df$MaxPlayer == 3, NA, df$MaxPlayer)
####STARTING DATA
# > df
# MinPlayer MaxPlayer
# 1 3 2
# 2 3 2
# 3 3 NA
# 4 3 NA
# 5 2 4
# 6 2 4
# 7 2 5
# 8 2 5
# 9 3 2
# 10 3 2
# 11 3 NA
# 12 3 NA
# 13 2 4
# 14 2 4
# 15 2 5
# 16 2 5
####STEP 1
#find the median of MaxPlayer for each group of MinPlayer (e.g., when MinPlayer == 1, 2 or whatever)
#just add a column to the data frame that has the right median value for each subset of MinPlayer in it and grab that value to use later.
library(plyr) #plyr is a great way to compute things across data subsets
df <- ddply(df, c("MinPlayer"), transform,
median.minp = median(MaxPlayer, na.rm = TRUE)) #ignore NAs in the median
####STEP 2
#anytime that MaxPlayer == NA, grab the median value to replace the NA, otherwise keep the MaxPlayer value
df$MaxPlayer <- ifelse(is.na(df$MaxPlayer), df$median.minp, df$MaxPlayer)
####STEP 3
#you had to compute an extra column you don't really want, so drop it now that you're done with it
df <- df[ , !(names(df) %in% "median.minp")]
####RESULT
# > df
# MinPlayer MaxPlayer
# 1 2 4
# 2 2 4
# 3 2 5
# 4 2 5
# 5 2 4
# 6 2 4
# 7 2 5
# 8 2 5
# 9 3 2
# 10 3 2
# 11 3 2
# 12 3 2
# 13 3 2
# 14 3 2
# 15 3 2
# 16 3 2
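If you would rather not load plyr, the same three steps can be sketched in base R with aggregate() and merge(). This is an illustrative alternative rather than the method used above; the name medians is made up, and it reuses the fake data and the median.minp column name from this answer.

```r
# Same fake data as above.
MinPlayer <- rep(3:2, each = 4)
MaxPlayer <- rep(2:5, each = 2, times = 2)
df <- data.frame(MinPlayer, MaxPlayer)
df$MaxPlayer <- ifelse(df$MaxPlayer == 3, NA, df$MaxPlayer)

# STEP 1: one median per MinPlayer group. The formula interface of
# aggregate() omits rows with NA by default (na.action = na.omit).
medians <- aggregate(MaxPlayer ~ MinPlayer, data = df, FUN = median)
names(medians)[2] <- "median.minp"

# Attach the group median to every row. merge() sorts by the key,
# so the row order changes, as it does with ddply().
df <- merge(df, medians, by = "MinPlayer")

# STEP 2: replace NA with the group median, otherwise keep MaxPlayer.
df$MaxPlayer <- ifelse(is.na(df$MaxPlayer), df$median.minp, df$MaxPlayer)

# STEP 3: drop the helper column.
df$median.minp <- NULL
```

One caveat with this variant: a group whose MaxPlayer values are all NA gets no row in medians, so merge() would drop that group's rows unless you pass all.x = TRUE.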
Old answer below here....
Please post a reproducible example!!
#fake data
this <- rep(1:2, each = 1, times = 2)
that <- rep(3:2, each = 1, times = 2)
df <- data.frame(this, that)
If you're just asking about basic indexing, e.g., finding values where something meets a condition, this returns the row indexes of the values matching the condition (look up ?which):
> which(df$this < df$that)
[1] 1 3
This returns the VALUES matching your condition rather than the row indexes: you use the row indexes returned by "which" to pull the corresponding values from the right column of your data frame (here "this").
> df[which(df$this < df$that), "this"]
[1] 1 1
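One detail matters once a column contains NAs, as MaxPlayers does in the question: indexing with the logical vector directly keeps an NA for every comparison that is NA, while which() drops those positions. A small sketch with a made-up vector x (not part of the data above):

```r
# Hypothetical vector with one missing value.
x <- c(6, 7, NA, 6, 5)

# Logical indexing: x > 5 is TRUE TRUE NA TRUE FALSE,
# and the NA position comes back as NA.
by_logical <- x[x > 5]        # 6 7 NA 6

# which() returns only the positions that are TRUE, so NA is dropped.
by_which <- x[which(x > 5)]   # 6 7 6
```

Which behaviour you want depends on whether missing values should carry through to the result.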
If you want to apply some calculation when "this" is less than "that" and add the result as a new column to your data frame, just use "ifelse". ifelse evaluates a logical test for every row, then returns one value where the test is TRUE and another where it is FALSE.
#if "this" is < "that", multiply by 2
df$result <- ifelse(df$this < df$that, df$this * 2, NA)
> df
this that result
1 1 3 2
2 2 2 NA
3 1 3 2
4 2 2 NA
Without a reproducible example, no more can be provided.
Upvotes: 2