Reputation: 981

Why does new data frame column populate all values?

I have a data frame, df, with 2 columns. When I create a 3rd column and try to update the first value only, it populates the entire column instead. Could someone explain why that is and what the solution for it is?

#Create data frame with 2 columns and 4 observations
name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)

#Create new column
df$occupation[1] <- "Builder"

The code above produces the following results:

    df
    name age occupation
1    Bob  45    Builder
2 Lauren  34    Builder
3    Joe  54    Builder
4  Chris  12    Builder

The desired result is:

 df
    name age occupation
1    Bob  45    Builder
2 Lauren  34       <NA>
3    Joe  54       <NA>
4  Chris  12       <NA>

Thank you!

Upvotes: 2

Answers (3)

Robert Hijmans

Reputation: 47146

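Because occupation does not exist yet, it is created on the fly, and the single value is recycled along every row. Here is how I would do this: create the full column first, then update the first value.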

name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)

#Create the full column first, then update the first value
df$occupation <- NA
df$occupation[1] <- "Builder"

#Or create the column together with the data frame
df <- data.frame(name, age, occupation=NA)
df$occupation[1] <- "Builder"

Note that

df <- data.frame(name, age)
df$occupation[2] <- "Builder"

also does not work as you might expect. It recycles c(NA, "Builder") along the rows (thanks to @joran for pointing this out).
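
As a minimal sketch of the initialize-first approach, the column can be created with NA_character_ instead of plain NA. That choice is my assumption, not part of the code above; it only makes the column character from the start instead of relying on coercion from logical.

```r
name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)

# Create the full-length column first; the length-1 NA is recycled to 4 rows.
df$occupation <- NA_character_

# occupation now exists with 4 elements, so this replaces only the first one.
df$occupation[1] <- "Builder"

print(df)
```

Because the column already has one element per row, the second assignment no longer goes through any recycling.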

Upvotes: 3

joran

Reputation: 173567

I think this could use a little more clarification.

Consider the setup:

name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)

and now look at what happens when we do:

debugonce(`$<-.data.frame`)
> df$x[1] <- "a"
debugging in: `$<-.data.frame`(`*tmp*`, "x", value = "a")
debug: {
    cl <- oldClass(x)
    class(x) <- NULL
    nrows <- .row_names_info(x, 2L)
    if (!is.null(value)) {
        N <- NROW(value)
        if (N > nrows) 
            stop(sprintf(ngettext(N, "replacement has %d row, data has %d", 
                "replacement has %d rows, data has %d"), N, nrows), 
                domain = NA)
        if (N < nrows) 
            if (N > 0L && (nrows%%N == 0L) && length(dim(value)) <= 
                1L) 
                value <- rep(value, length.out = nrows)
            else stop(sprintf(ngettext(N, "replacement has %d row, data has %d", 
                "replacement has %d rows, data has %d"), N, nrows), 
                domain = NA)
        if (is.atomic(value) && !is.null(names(value))) 
            names(value) <- NULL
    }
    x[[name]] <- value
    class(x) <- cl
    return(x)
}

Note that this was called with value = "a". Since N = 1 is less than nrows = 4 and 4 %% 1 == 0, value is expanded by rep(value, length.out = nrows) before x[[name]] <- value runs, so "a" is recycled along every row.

That seems simple enough, but what happens when we do (be sure to wipe out the column x between each of these!):

debugonce(`$<-.data.frame`)
> df$x[2] <- "a"
debugging in: `$<-.data.frame`(`*tmp*`, "x", value = c(NA, "a"))
#Rest snipped...

Oho! This time it was called with value = c(NA,"a"), so contrary to RobertH's answer above we see that the recycling in fact yields:

> df
    name age    x
1    Bob  45 <NA>
2 Lauren  34    a
3    Joe  54 <NA>
4  Chris  12    a

Confused? What if we try:

debugonce(`$<-.data.frame`)
> df$x[3] <- "a"
debugging in: `$<-.data.frame`(`*tmp*`, "x", value = c(NA, NA, "a"))

Hmmm. This one ends in an error, because the recycling fails: the value has 3 elements, and 4 rows is not a multiple of 3 (nrows %% N is 1, not 0).

For completeness:

debugonce(`$<-.data.frame`)
> df$x[4] <- "a"
debugging in: `$<-.data.frame`(`*tmp*`, "x", value = c(NA, NA, NA, "a"))

And that one results in:

> df
    name age    x
1    Bob  45 <NA>
2 Lauren  34 <NA>
3    Joe  54 <NA>
4  Chris  12    a
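
The four cases above all come down to the length check in the function body shown earlier. Here is a rough standalone sketch of that rule. recycle_value is a hypothetical helper of my own, not something in base R, and it leaves out the length(dim(value)) check from the real method.

```r
# Hypothetical helper mirroring the length check in `$<-.data.frame`
# (simplified: the check on dim(value) is omitted).
recycle_value <- function(value, nrows) {
  N <- NROW(value)
  # Recycling is only allowed when the value length divides the row count.
  if (N > nrows || N == 0L || nrows %% N != 0L) {
    stop(sprintf("replacement has %d rows, data has %d", N, nrows))
  }
  rep(value, length.out = nrows)
}

recycle_value("a", 4)                 # length 1 divides 4
recycle_value(c(NA, "a"), 4)          # length 2 divides 4
recycle_value(c(NA, NA, NA, "a"), 4)  # length 4 is used as is

# Length 3 does not divide 4, so this call fails; capture the error to inspect it.
res <- tryCatch(recycle_value(c(NA, NA, "a"), 4), error = function(e) e)
inherits(res, "error")
```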

So what's going on here? Well, remember that nonexistent columns of a data frame (or nonexistent elements of a list, really) are treated as NULL. And so we're referencing the 1st, 2nd, etc. element of NULL.

Now run:

> `[<-`(NULL,1,1)
[1] 1
> `[<-`(NULL,2,1)
[1] NA  1
> `[<-`(NULL,3,1)
[1] NA NA  1
> `[<-`(NULL,4,1)
[1] NA NA NA  1

and you can start to see how the various calls are being pieced together.
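
To make the piecing together concrete, here is a rough step-by-step sketch of how df$x[2] <- "a" appears to be evaluated, going by the debug output above. The intermediate names inner, value and df2 are mine, for illustration only; R does this internally through the temporary `*tmp*`.

```r
name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)

# Step 1: the column x does not exist yet, so extracting it yields NULL.
inner <- df$x

# Step 2: `[<-` on NULL pads with NA up to index 2.
value <- `[<-`(inner, 2, "a")

# Step 3: `$<-.data.frame` receives the length-2 value and recycles it
# along the 4 rows.
df2 <- `$<-`(df, "x", value = value)

print(value)
print(df2)
```

Seen this way, the data frame method never learns which index was used; it only receives the padded vector and applies its recycling rule to it.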

Upvotes: 2

scoa

Reputation: 19867

This gives the expected output if you don't want to initialize the variable before:

df[1,"occupation"] <- "Builder"

I have no idea why...
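
One plausible reading, offered as a sketch and not a definitive explanation: df[1, "occupation"] <- "Builder" is a single call to `[<-.data.frame`, so the method receives the row index together with the new column name and can fill the remaining rows with NA itself. With df$occupation[1], the index is applied to NULL first and the data frame method only sees the resulting vector. The explicit call form below is just for illustration.

```r
name <- c("Bob", "Lauren", "Joe", "Chris")
age <- c(45, 34, 54, 12)
df <- data.frame(name, age)

# One call to `[<-.data.frame`, which sees the row index 1 directly.
df[1, "occupation"] <- "Builder"
print(df)

# The same assignment written as an explicit function call on a fresh copy.
df2 <- `[<-`(data.frame(name, age), 1, "occupation", value = "Builder")
print(df2)
```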

Upvotes: 1
