Vinterwoo

Reputation: 3941

How can I replace omitted (not NA) data with a zero?

I have a data set that looks like this:

ByYear <- data.frame( V1 = c(2005,2006,2007,2008,2005,2006,2008,2006,2007,2005,2006,2007,2008),
                      V2 = c(0.5,0.2,1,1.6,2,5,8,4,3,6,8,6,5), 
                      V3 = c('A','A','A','A','B','B','B','C','C','D','D','D','D'))

Which gives me

> ByYear
   V1   V2  V3
1  2005 0.5 A
2  2006 0.2 A
3  2007 1.0 A
4  2008 1.6 A
5  2005 2.0 B
6  2006 5.0 B
7  2008 8.0 B
8  2006 4.0 C
9  2007 3.0 C
10 2005 6.0 D
11 2006 8.0 D
12 2007 6.0 D
13 2008 5.0 D

Some of the years are missing from V1. This is due to an error in how the data were entered. I know this is a touchy subject, but I know for a fact that in this case a missing year in V1 means the value in V2 should be 0.

Is there a way I can create a new data set that adds a row with a zero value to any missing year like so:

> ByYear
  V1   V2  V3
  2005 0.5 A
  2006 0.2 A
  2007 1.0 A
  2008 1.6 A
  2005 2.0 B
  2006 5.0 B
  2007 0.0 B
  2008 8.0 B
  2005 0.0 C
  2006 4.0 C
  2007 3.0 C
  2008 0.0 C
  2005 6.0 D
  2006 8.0 D
  2007 6.0 D
  2008 5.0 D

Thanks to everyone for all your help!

Upvotes: 0

Views: 334

Answers (5)

Richie Cotton

Reputation: 121187

Use table to find the missing year/group combinations.

Frequencies <- with(ByYear, as.data.frame(table(V1, V3)))
MissingValues <- subset(Frequencies, Freq == 0, c(V1, V3))

Set the value of V2 to 0 (or NA, or whatever you want), then append the result to your original dataset.

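MissingValues$V2 <- 0
# table() returns V1 as a factor; convert it back to numeric before appending
MissingValues$V1 <- as.numeric(as.character(MissingValues$V1))
rbind(ByYear, MissingValues)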
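
Put together as a self-contained sketch. The conversion of the factor columns and the final sort are additions to the steps above, and `Filled` is just a name chosen for illustration:

```r
ByYear <- data.frame(
  V1 = c(2005,2006,2007,2008,2005,2006,2008,2006,2007,2005,2006,2007,2008),
  V2 = c(0.5,0.2,1,1.6,2,5,8,4,3,6,8,6,5),
  V3 = c('A','A','A','A','B','B','B','C','C','D','D','D','D'),
  stringsAsFactors = FALSE
)

# Count how often each year/group pair occurs; zero counts are the gaps.
Frequencies <- with(ByYear, as.data.frame(table(V1, V3)))
MissingValues <- subset(Frequencies, Freq == 0, c(V1, V3))

# table() returns its dimensions as factors, so convert them back.
MissingValues$V1 <- as.numeric(as.character(MissingValues$V1))
MissingValues$V3 <- as.character(MissingValues$V3)
MissingValues$V2 <- 0

# Append the gap rows, then sort by group and year.
Filled <- rbind(ByYear, MissingValues)
Filled <- Filled[order(Filled$V3, Filled$V1), ]
rownames(Filled) <- NULL
```

Sorting by V3 and then V1 puts the zero rows into the positions shown in the question instead of leaving them at the bottom.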

Upvotes: 1

John

Reputation: 23768

As others have suggested, there are a number of ways to do this. Here's one using the fact that reshape will fill in holes in a complete matrix for you.

y <- reshape(ByYear, direction = 'wide', timevar = 'V1', v.names = 'V2', idvar = 'V3')
y <- reshape(y, direction = 'long')
y$V2[is.na(y$V2)] <- 0

Here's another that just uses replacement. First you construct a new data.frame (df) that contains a complete V1 and V3 but with V2 set to 0.

uV1 <- unique(ByYear$V1)
uV3 <- unique(ByYear$V3)
df <- data.frame(V1 = rep(uV1, length(uV3)), V3 = rep(uV3, each = length(uV1)), V2 = 0)

It also helps to create an interaction variable, so that each row has a single unique identifier.

df$i <- interaction(df$V1, df$V3)
ByYear$i <- interaction(ByYear$V1, ByYear$V3)

And now, in the new data.frame replace V2 with V2 from ByYear.

df$V2[match(ByYear$i, df$i)] <- ByYear$V2
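
The same complete-grid idea can also be sketched with expand.grid and merge, which avoids the helper column. This is a swapped-in variant rather than the method above, and the names `grid` and `full` are chosen only for illustration:

```r
ByYear <- data.frame(
  V1 = c(2005,2006,2007,2008,2005,2006,2008,2006,2007,2005,2006,2007,2008),
  V2 = c(0.5,0.2,1,1.6,2,5,8,4,3,6,8,6,5),
  V3 = c('A','A','A','A','B','B','B','C','C','D','D','D','D'),
  stringsAsFactors = FALSE
)

# Every year/group combination, one row each.
grid <- expand.grid(V1 = unique(ByYear$V1), V3 = unique(ByYear$V3),
                    stringsAsFactors = FALSE)

# Left join: combinations absent from ByYear come back with V2 = NA.
full <- merge(grid, ByYear, by = c("V1", "V3"), all.x = TRUE)
full$V2[is.na(full$V2)] <- 0

# Restore the question's column and row order.
full <- full[order(full$V3, full$V1), c("V1", "V2", "V3")]
rownames(full) <- NULL
```

Because merge matches on the key columns, this does not depend on the row order of ByYear.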

Upvotes: 0

Itamar

Reputation: 2151

A naive (non-vectorized) approach:

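for (grp in unique(ByYear$V3)) {
   for (year in min(ByYear$V1):max(ByYear$V1)) {
      if (!any(ByYear$V1 == year & ByYear$V3 == grp)) {
         # add a zero-value row for the year missing from this group
         ByYear <- rbind(ByYear, data.frame(V1 = year, V2 = 0, V3 = grp))
      }
   }
}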

Upvotes: 0

IRTFM

Reputation: 263489

I tried to come up with a simple set of tests to make Dirk's suggestion work, but the repeating sequence with missing values stymied me. Brute force seemed to have more promise. Identify the "interior" missing values first, and then identify the gaps at either end, where the increase-by-1 rule may break down.

# Assumes ByYear holds only the V1 and V2 columns, as in the output below
for(i in seq_along(ByYear$V1[2:nrow(ByYear)]) ) if(
        ByYear$V1[i+1] -ByYear$V1[i] > 1){
         ByYear <- rbind(ByYear[1:i, ], c(ByYear$V1[i]+1, 0), ByYear[(i+1):NROW(ByYear), ])}

for(i in seq_along(ByYear$V1[2:nrow(ByYear)]) ) if(  
        ByYear$V1[i] ==2007 & ByYear$V1[i+1] != 2008 ){
         ByYear <- rbind(ByYear[1:i, ], c(2008, 0), ByYear[(i+1):NROW(ByYear), ])}
# I think you need to fill in all the missing 2008's before the missing 2005's
for(i in seq_along(ByYear$V1[2:nrow(ByYear)]) ) if(
        ByYear$V1[i] ==2008 & ByYear$V1[i+1] != 2005 ){
         ByYear <- rbind(ByYear[1:i, ], c(2005, 0), ByYear[(i+1):NROW(ByYear), ])}

ByYear
      V1  V2
1   2005 0.5
2   2006 0.2
3   2007 1.0
4   2008 1.6
5   2005 2.0
6   2006 5.0
7   2007 0.0
71  2008 8.0
9   2005 0.0
8   2006 4.0
91  2007 3.0
12  2008 0.0
10  2005 6.0
11  2006 8.0
121 2007 6.0
13  2008 5.0

Upvotes: 0

Dirk is no longer here

Reputation: 368639

There are several ways to do this, and the simplest is just by indexing.

Let's create some data first:

R> X <- data.frame(year=seq(2000,2010,by=1), val=0)
R> V <- data.frame(year=c(2003,2005,2007), val=c(1:3))

And let's look at it

R> X
   year val
1  2000   0
2  2001   0
3  2002   0
4  2003   0
5  2004   0
6  2005   0
7  2006   0
8  2007   0
9  2008   0
10 2009   0
11 2010   0
R> V
  year val
1 2003   1
2 2005   2
3 2007   3

So now we want to inject V into X at the right spots. A boolean match of where V appears in X makes that possible:

R> X[ X$year %in% V$year, "val"] <- V$val

Look at the result of X$year %in% V$year in isolation:

R> X$year %in% V$year
 [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE

So now the result:

R> X
   year val
1  2000   0
2  2001   0
3  2002   0
4  2003   1
5  2004   0
6  2005   2
7  2006   0
8  2007   3
9  2008   0
10 2009   0
11 2010   0
R> 

Caveat: You need regular indices for this to work. Years, quarters or months work. Business days are harder, but there are other methods for those too.
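
One further assumption is worth spelling out: the logical index above only lines up because V lists its years in the same order as X. A minimal sketch using match(), which looks up each position explicitly and so tolerates an unsorted V (the shuffled V here is made up for illustration):

```r
X <- data.frame(year = seq(2000, 2010, by = 1), val = 0)
V <- data.frame(year = c(2007, 2003, 2005), val = c(3, 1, 2))  # deliberately unsorted

# match() returns, for each year in V, its row position in X.
pos <- match(V$year, X$year)
X$val[pos] <- V$val

# Show only the rows that were filled in.
X[X$val != 0, ]
```

Any year in V that does not occur in X would give an NA position, so check for that first if the two ranges can differ.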

Upvotes: 1
