Reputation: 33
In the midst of merging several data sets, I'm trying to remove all rows of a data frame that have a missing value for one particular variable (I want to keep the NAs in some of the other columns for the time being). I used the following line:
data.frame <- data.frame[!is.na(data.frame$year),]
This successfully removes all rows with NAs for year (and no others), but the other columns, which previously had data, are now entirely NA. In other words, non-missing values are being converted to NA. Any ideas as to what's going on here? I've tried these alternatives and got the same outcome:
data.frame <- subset(data.frame, !is.na(year))
data.frame$x <- ifelse(is.na(data.frame$year) == T, 1, 0);
data.frame <- subset(data.frame, x == 0)
Am I using is.na incorrectly? Are there any alternatives to is.na in this scenario? Any help would be greatly appreciated!
Edit: Here is code that should reproduce the issue:
#data
tc <- read.csv("http://dl.dropbox.com/u/4115584/tc2008.csv")
frame <- read.csv("http://dl.dropbox.com/u/4115584/frame.csv")
#standardize NA codes
tc[tc == "."] <- NA
tc[tc == -9] <- NA
#standardize spatial units
colnames(frame)[1] <- "loser"
colnames(frame)[2] <- "gainer"
frame$dyad <- paste(frame$loser,frame$gainer,sep="")
tc$dyad <- paste(tc$loser,tc$gainer,sep="")
drops <- c("loser","gainer")
tc <- tc[,!names(tc) %in% drops]
frame <- frame[,!names(frame) %in% drops]
rm(drops)
#merge tc into frame
data <- merge(tc, frame, by.x = "year", by.y = "dyad", all.x=T, all.y=T) #year column is duplicated in this process. I haven't had this problem with nearly identical code using other data.
rm(tc,frame)
#the first column in the new data frame is the duplicate year, which does not actually contain years. I'll rename it.
colnames(data)[1] <- "double"
summary(data$year) #shows 833 NA's
summary(data$procedur) #note that at this point there are non-NA values
#later, I want to create 20 year windows following the events in the tc data. For simplicity, I want to remove cases with NA in the year column.
new.data <- data[!is.na(data$year),]
#now let's see what the above operation did
summary(new.data$year) #missing years were successfully removed
summary(new.data$procedur) #this variable is now entirely NA's
Upvotes: 1
Views: 10567
Reputation: 118799
I think the actual problem is with your merge.
After you merge and have the data in data, if you do:
> table(data$procedur, useNA="always")
#    1    2    3    4    5    6   <NA>
#  122  112  356   59   39   19 192258
You can see that data$procedur has only 707 non-NA values (122 + 112 + 356 + 59 + 39 + 19). However, every one of these values sits in a row where data$year is NA:
> all(is.na(data$year[!is.na(data$procedur)]))
# [1] TRUE # every value of procedur occurs where year = NA
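So nothing is being converted to NA. The only rows that had a value for procedur were also the rows with NA in year, so dropping the rows with a missing year dropped every non-missing procedur value along with them.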
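The diagnosis above can be sketched on toy data. This is only an illustration: the data frame merged and its values are made up to mimic the shape of the merged result, where year and procedur are never both present in the same row.

```r
# Hypothetical stand-in for the merged data: rows that came from one table
# have procedur but no year, rows from the other have year but no procedur.
merged <- data.frame(
  year     = c(NA, NA, 1990, 1995),
  procedur = c(1, 3, NA, NA)
)

# Same check as above: is year missing wherever procedur is present?
no_overlap <- all(is.na(merged$year[!is.na(merged$procedur)]))

# Filtering on year keeps only rows 3 and 4, where procedur was already NA.
kept <- merged[!is.na(merged$year), ]
all_na_after <- all(is.na(kept$procedur))
```

Both no_overlap and all_na_after are TRUE here, which matches the symptom in the question: the filter works as intended, and the surviving rows never had a procedur value to begin with.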
To solve this problem, I think you should use merge as:
merge(tc, frame, all=TRUE) # with no by argument, merge joins on the columns common to both data frames
# this also avoids the duplicated year column.
Check if this merge gives you the desired result.
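As a rough sketch of how this behaves, here is the same call on two small made-up tables. The values are hypothetical, and the assumption is that both tables share the year and dyad columns, as they appear to in the question after the loser and gainer columns are dropped.

```r
# Hypothetical miniature versions of tc and frame (not the asker's data).
tc <- data.frame(year = c(1990, 1995), dyad = c("AB", "CD"),
                 procedur = c(1, 3), stringsAsFactors = FALSE)
frame <- data.frame(year = c(1990, 1995, 2000), dyad = c("AB", "CD", "EF"),
                    stringsAsFactors = FALSE)

# With no by argument, merge joins on the columns the two tables share
# (here year and dyad); all = TRUE also keeps the unmatched rows.
merged <- merge(tc, frame, all = TRUE)

# Rows with a year now keep their procedur values after filtering.
kept <- merged[!is.na(merged$year), ]
```

In this toy case merged has one year column and three rows, and the only NA in procedur belongs to the 2000/EF row that has no counterpart in tc.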
Upvotes: 2
Reputation: 569
Try complete.cases:
data.frame.clean <- data.frame[complete.cases(data.frame$year),]
...though, as noted above, you may want to pick a more descriptive name.
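One caveat, shown as a small sketch on made-up data (the data frame df is hypothetical): when complete.cases is given a single column, it selects the same rows as !is.na on that column, so it should give the same result as the original filter.

```r
# Hypothetical data frame with a missing year in the second row.
df <- data.frame(year = c(1990, NA, 2000), x = c("a", "b", NA),
                 stringsAsFactors = FALSE)

# On a single column the two row selectors agree element by element.
same_rows <- all(complete.cases(df$year) == !is.na(df$year))

# Rows 1 and 3 are kept; the NA in x (row 3) is left alone.
df.clean <- df[complete.cases(df$year), ]
```

Calling complete.cases(df) on the whole data frame instead would also drop row 3 because of the NA in x, which is not what the question asks for.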
Upvotes: 0