Reputation: 2531
I have a data frame df with the following data:
ID P1 P2 Year Month A B
11084 23 43 2001 April 41.9 -99.99
67985 76 12 2001 May 6.9 123
11084 34 64 2001 June -999 -99.99
34084 56 77 2001 July NA -99.99
11043 90 54 2001 August NA -99.99
23084 55 32 2001 September 50.8 -99.99
11084 77 14 2001 October 0 -99.99
54328 89 56 2001 November -999 -99.99
I'm trying to add two new columns and fill 'Yes'/'No' values for the records with missing values. My expected output is:
ID P1 P2 Year Month A B A_miss B_miss
11084 23 43 2001 April 41.9 -99.99 No Yes
67985 76 12 2001 May 6.9 123 No No
11084 34 64 2001 June -999 -99.99 Yes Yes
34084 56 77 2001 July NA -99.99 Yes Yes
11043 90 54 2001 August NA -99.99 Yes Yes
23084 55 32 2001 September 50.8 -99.99 No Yes
11084 77 14 2001 October 0 -99.99 No Yes
54328 89 56 2001 November -999 -99.99 Yes Yes
I'm new to R. I was trying to achieve this using a simple for loop and if/else conditions in the following way:
for(i in length(df$A))
{
if(df$A[i] == -999 || df$A[i] == 'NA')
df$A_miss[i] <- 'Yes'
else
df$A_miss[i] <- 'No'
}
I first tried the loop on the 'A' column, but only the else part executes every time I run it, and 'No' is filled in throughout the 'A_miss' column. I can't figure out why the if part isn't working.
Where am I going wrong?
Upvotes: 1
Views: 87
Reputation: 734
Using the which command might increase the speed of the process:
df$A_miss <- 'No'   # default value for every row
df$A_miss[which(df$A==-999 | is.na(df$A))] <- 'Yes'
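One detail worth knowing when combining which() with comparisons: a comparison involving NA yields NA, and which() silently drops those positions, so entries that are still NA are never matched by a test such as != 'Yes'. A small sketch with a hypothetical vector flags (not part of the question's data) illustrates this:

```r
# Hypothetical vector: some entries already set, others still NA
flags <- c("Yes", NA, NA, "Yes")

flags != "Yes"          # comparisons with NA give NA, not TRUE
which(flags != "Yes")   # which() drops the NA positions, so nothing is selected

# Testing for NA explicitly finds the entries that are still unset
unset <- which(is.na(flags))
flags[unset] <- "No"
flags
```

This is why filling the column with a default value first, and then overwriting the matching rows, tends to be the more robust order.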
Upvotes: 0
Reputation: 23788
Your loop is not correctly defined. This one works:
for (i in 1:length(df$A)) {
if(df$A[i] == -999 || is.na(df$A[i]) )
df$A_miss[i] <- 'Yes'
else
df$A_miss[i] <- 'No'
}
The loop range should be written as (i in 1:length(df$A)), not as (i in length(df$A)). With the latter, the loop body runs only once, for the last row. Hope this helps.
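To see why the original loop touched only one row, a small sketch may help. It uses a hypothetical vector a that mirrors column A of the question. length(a) is a single number, so iterating over it visits just the last index, and comparing a missing value with the string 'NA' gives NA rather than TRUE:

```r
# Hypothetical vector mirroring column A of the question's data
a <- c(41.9, 6.9, -999, NA, NA, 50.8, 0, -999)

# length(a) is a single number, so this loop body runs exactly once
visited <- c()
for (i in length(a)) visited <- c(visited, i)
visited            # only the last index

# seq_along(a) (or 1:length(a)) visits every index
visited_all <- c()
for (i in seq_along(a)) visited_all <- c(visited_all, i)
visited_all

# Comparing a missing value with the string 'NA' does not give TRUE
cmp <- a[4] == 'NA'   # the result is NA
is.na(a[4])           # TRUE: the proper test for a missing value
```

seq_along() is used here instead of 1:length() because it also behaves sensibly for a zero-length vector, where 1:0 would count down from 1 to 0.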
PS: As you can see, the important correction pointed out by @Pascal has been implemented here.
PPS: The version below should be much faster than your code with the for loop:
df$A_miss <- 'No'
df$A_miss[which(df$A==-999 | is.na(df$A))] <- 'Yes'
(I just noticed that this solution is very similar to the one that had been suggested earlier by @Daniel Fischer)
Upvotes: 3
Reputation: 3364
A vectorized version:
df <- structure(list(ID = c(11084L, 67985L, 11084L, 34084L, 11043L,
23084L, 11084L, 54328L), P1 = c(23L, 76L, 34L, 56L, 90L, 55L,
77L, 89L), P2 = c(43L, 12L, 64L, 77L, 54L, 32L, 14L, 56L), Year = c(2001L,
2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L), Month = structure(c(1L,
5L, 4L, 3L, 2L, 8L, 7L, 6L), .Label = c("April", "August", "July",
"June", "May", "November", "October", "September"), class = "factor"),
A = c(41.9, 6.9, -999, NA, NA, 50.8, 0, -999), B = c(-99.99,
123, -99.99, -99.99, -99.99, -99.99, -99.99, -99.99), A_miss = c("No",
"No", "Yes", "Yes", "Yes", "No", "No", "Yes")), .Names = c("ID",
"P1", "P2", "Year", "Month", "A", "B", "A_miss"), row.names = c(NA,
-8L), class = "data.frame")
df$A_miss <- ifelse(df$A == -999 | is.na(df$A), "yes", "no")
df$B_miss <- ifelse(df$B == -99.99 | is.na(df$B), "yes", "no")
ID P1 P2 Year Month A B A_miss B_miss
1 11084 23 43 2001 April 41.9 -99.99 no yes
2 67985 76 12 2001 May 6.9 123.00 no no
3 11084 34 64 2001 June -999.0 -99.99 yes yes
4 34084 56 77 2001 July NA -99.99 yes yes
5 11043 90 54 2001 August NA -99.99 yes yes
6 23084 55 32 2001 September 50.8 -99.99 no yes
7 11084 77 14 2001 October 0.0 -99.99 no yes
8 54328 89 56 2001 November -999.0 -99.99 yes yes
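Since the same test is repeated for each column, it could be wrapped in a small helper. The sketch below is only an illustration: flag_missing() is a hypothetical function (not part of base R), and codes is an assumed lookup from column name to its missing-value code. The example data contains just columns A and B from the question.

```r
# Hypothetical helper: "yes" where x is NA or equals the sentinel code
flag_missing <- function(x, code) {
  ifelse(is.na(x) | x == code, "yes", "no")
}

# Small example data taken from columns A and B of the question
df <- data.frame(
  A = c(41.9, 6.9, -999, NA, NA, 50.8, 0, -999),
  B = c(-99.99, 123, -99.99, -99.99, -99.99, -99.99, -99.99, -99.99)
)

# Assumed mapping from column name to its missing-value code
codes <- c(A = -999, B = -99.99)

# Add one <column>_miss column per entry in codes
for (col in names(codes)) {
  df[[paste0(col, "_miss")]] <- flag_missing(df[[col]], codes[[col]])
}
df
```

Putting is.na(x) first in the condition is a readability choice; because TRUE | NA is TRUE in R, rows with NA are still flagged either way.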
Upvotes: 2
Reputation: 3380
Maybe you could try this, without any loop or if clause:
df$A_miss <- "no"   # new column, so the values in A are kept
df$A_miss[(df$A==-999)|(is.na(df$A))] <- "yes"
Upvotes: 0