Reputation: 981
I have a data frame, df, with 2 columns. When I create a 3rd column and try to update the first value only, it populates the entire column instead. Could someone explain why that is and what the solution for it is?
#Create data frame with 2 columns and 4 observations
name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)
#Create new column
df$occupation[1] <- "Builder"
The code above produces the following results:
df
name age occupation
1 Bob 45 Builder
2 Lauren 34 Builder
3 Joe 54 Builder
4 Chris 12 Builder
The desired result is:
df
name age occupation
1 Bob 45 Builder
2 Lauren 34 <NA>
3 Joe 54 <NA>
4 Chris 12 <NA>
Thank you!
Upvotes: 2
Views: 109
Reputation: 47146
Because occupation does not yet exist, it is created, recycling the first value. Here is how I would do this.
name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)
df$occupation <- NA
df$occupation[1] <- "Builder"
or
df <- data.frame(name, age, occupation=NA)
df$occupation[1] <- "Builder"
Note that
df <- data.frame(name, age)
df$occupation[2] <- "Builder"
also does not work as you might expect. It recycles c(NA, "Builder"), so rows 2 and 4 both get "Builder" (thanks to @joran for pointing this out).
Upvotes: 3
Reputation: 173567
I think this could use a little more clarification.
Consider the setup:
name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)
and now look at what happens when we do:
debugonce(`$<-.data.frame`)
> df$x[1] <- "a"
debugging in: `$<-.data.frame`(`*tmp*`, "x", value = "a")
debug: {
cl <- oldClass(x)
class(x) <- NULL
nrows <- .row_names_info(x, 2L)
if (!is.null(value)) {
N <- NROW(value)
if (N > nrows)
stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
"replacement has %d rows, data has %d"), N, nrows),
domain = NA)
if (N < nrows)
if (N > 0L && (nrows%%N == 0L) && length(dim(value)) <=
1L)
value <- rep(value, length.out = nrows)
else stop(sprintf(ngettext(N, "replacement has %d row, data has %d",
"replacement has %d rows, data has %d"), N, nrows),
domain = NA)
if (is.atomic(value) && !is.null(names(value)))
names(value) <- NULL
}
x[[name]] <- value
class(x) <- cl
return(x)
}
Note that this was called with value = "a". Since N (1) is less than nrows (4) and 4 %% 1 == 0, value is expanded with rep(value, length.out = nrows) before we eventually run x[[name]] <- value, so "a" is recycled along every row.
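The length check in the function body above can be sketched on its own. This is a simplified illustration, not the actual base R code: recycle_to_rows is a hypothetical helper name, and it skips the NULL and dim() handling of the real method.

```r
# Hypothetical helper mirroring the length check in `$<-.data.frame`
# (simplified: ignores NULL values and values with dimensions)
recycle_to_rows <- function(value, nrows) {
  n <- NROW(value)
  # Only lengths that divide the row count evenly are accepted
  if (n == 0L || n > nrows || nrows %% n != 0L) {
    stop(sprintf("replacement has %d rows, data has %d", n, nrows))
  }
  rep(value, length.out = nrows)
}

one <- recycle_to_rows("a", 4)         # length 1 divides 4
two <- recycle_to_rows(c(NA, "a"), 4)  # length 2 divides 4
print(one)
print(two)
```

A length-3 value with 4 rows fails the `nrows %% n` check, which is the case that comes up further down.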
That seems simple enough, but what happens when we do the following (be sure to wipe out the column x between each of these!):
debugonce(`$<-.data.frame`)
> df$x[2] <- "a"
debugging in: `$<-.data.frame`(`*tmp*`, "x", value = c(NA, "a"))
#Rest snipped...
Oho! This time it was called with value = c(NA, "a"), so contrary to RobertH's answer above we see that the recycling in fact yields:
> df
name age x
1 Bob 45 <NA>
2 Lauren 34 a
3 Joe 54 <NA>
4 Chris 12 a
Confused? What if we try:
debugonce(`$<-.data.frame`)
> df$x[3] <- "a"
debugging in: `$<-.data.frame`(`*tmp*`, "x", value = c(NA, NA, "a"))
Hmmm. This one ends in an error, because the recycling fails: the value has 3 elements, and 4 rows is not a multiple of 3.
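If you want to see the failure without stepping through the debugger, a small sketch like this captures the error message instead of stopping the script (the tryCatch wrapper and the msg variable are just for illustration):

```r
name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)

# Capture the error raised by `$<-.data.frame` instead of halting
msg <- tryCatch({
  df$x[3] <- "a"
  "no error"
}, error = function(e) conditionMessage(e))

print(msg)
# The failed assignment leaves df without a column x
print("x" %in% names(df))
```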
For completeness:
debugonce(`$<-.data.frame`)
> df$x[4] <- "a"
debugging in: `$<-.data.frame`(`*tmp*`, "x", value = c(NA, NA, NA, "a"))
And that one results in:
> df
name age x
1 Bob 45 <NA>
2 Lauren 34 <NA>
3 Joe 54 <NA>
4 Chris 12 a
So what's going on here? Well, remember that nonexistent columns of a data frame (or nonexistent elements of a list, really) are treated as NULL. And so we're referencing the 1st, 2nd, etc. element of NULL.
Now run:
> `[<-`(NULL,1,1)
[1] 1
> `[<-`(NULL,2,1)
[1] NA 1
> `[<-`(NULL,3,1)
[1] NA NA 1
> `[<-`(NULL,4,1)
[1] NA NA NA 1
and you can start to see how the various calls are being pieced together.
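To make that concrete, here is a rough two-step version of what df$x[2] <- "a" appears to do, written out by hand. It is a sketch of the idea, not the exact sequence R evaluates internally (R uses a temporary `*tmp*` variable, as the debug output shows).

```r
name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)

# The column does not exist yet, so df$x is NULL
print(is.null(df$x))

# Step 1: assign into element 2 of NULL, which pads with NA
value <- `[<-`(df$x, 2, "a")

# Step 2: hand that length-2 vector to `$<-`, which recycles it to 4 rows
df <- `$<-`(df, "x", value)

print(value)
print(df$x)
```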
Upvotes: 2
Reputation: 19867
This gives the expected output if you don't want to initialize the variable beforehand:
df[1,"occupation"] <- "Builder"
I have no idea why...
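One plausible reading, going by the other answers here: this form is handled by `[<-.data.frame` with an explicit row index, rather than first building a short vector from a NULL column and then recycling it. I have not traced the internals, so treat that as an assumption. A quick check of the behavior:

```r
name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)

# Row and column are both given, so only cell [1, "occupation"] is set
df[1, "occupation"] <- "Builder"

print(df$occupation)
```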
Upvotes: 1