Ke Tian

Reputation: 177

How to remove "Not Available" in a data frame

I want to remove the "Not Available" in the following data frame, but when I change Number to numeric using the following code, the "Not Available" becomes 4:

c1 <- c("India", "America", "China", "Europe", "Japan")
c2 <- c(2.3, 3.5, "Not Available", 1.2, 1.2)
data <- data.frame(Name=c1, Number=c2)
data$Number <- as.numeric(data$Number)

The result is:

data

##      Name Number
## 1   India      2
## 2 America      3
## 3   China      4
## 4  Europe      1
## 5   Japan      1

How can I remove the "Not Available" rows in this data frame?

Upvotes: 3

Views: 5610

Answers (2)

akrun

Reputation: 887128

We could also read the dataset with na.strings = "Not Available" in read.csv/read.table, so that those entries are read as NA. The NA rows can then be removed with is.na, complete.cases, or na.omit (see ?is.na, ?complete.cases, ?na.omit).

df1 <- read.csv("file.csv", na.strings="Not Available")
res <- df1[complete.cases(df1$Number),]
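
A self-contained sketch of the same idea, in case it helps to try it without a file on disk. The inline `csv_text` is an assumption standing in for `file.csv`, whose contents are not shown here; it simply mirrors the data in the question.

```r
# Inline CSV standing in for file.csv (assumed contents, mirroring the question)
csv_text <- "Name,Number
India,2.3
America,3.5
China,Not Available
Europe,1.2
Japan,1.2"

# na.strings makes read.csv treat "Not Available" as NA,
# so the Number column is read as numeric rather than text
df1 <- read.csv(text = csv_text, na.strings = "Not Available")

# Keep only the rows where Number is not NA
res <- df1[complete.cases(df1$Number), ]
res
```

Because the NA handling happens at read time, no later as.numeric conversion is needed.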

Upvotes: 2

jbaums

Reputation: 27388

This is because:

  1. An R data.frame only allows a single class of data per column.
  2. When you create a data.frame, the default behaviour (in R versions before 4.0.0) is for character columns to be coerced to factor, which are stored as integer codes (corresponding to factor levels) with labels. Your c2 vector is a character vector since it has a character element ("Not Available"), and as such the Number column of data is a factor column. From R 4.0.0 onwards, stringsAsFactors defaults to FALSE, so the column stays character instead.
  3. When you coerce a factor directly to numeric, the resulting numbers indicate the factor levels.
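
The three points above can be checked directly. This is a minimal sketch that builds the factor explicitly with factor(), so it does not depend on the stringsAsFactors default of the R version in use; the names `f` and `codes` are only illustrative.

```r
c2 <- c(2.3, 3.5, "Not Available", 1.2, 1.2)

# Mixing numbers and a string yields a character vector
class(c2)

# Build the factor explicitly so the result does not depend on the
# R version's stringsAsFactors default
f <- factor(c2)

# The levels are the sorted unique strings
levels(f)

# as.numeric() on a factor returns the integer level codes, not the labels
codes <- as.numeric(f)
codes
```

The codes come out as 2, 3, 4, 1, 1, which is the Number column shown in the question: "Not Available" sorts last among the four levels, so it gets code 4.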

To achieve the behaviour you're after, you can either prevent the character data from being coerced to a factor when creating the data.frame:

data <- data.frame(Name=c1, Number=c2, stringsAsFactors=FALSE)
data$Number <- as.numeric(data$Number)

data
##      Name Number
## 1   India    2.3
## 2 America    3.5
## 3   China     NA
## 4  Europe    1.2
## 5   Japan    1.2

Alternatively, you can coerce the factor to numeric via character:

data$Number <- as.numeric(as.character(data$Number))
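
A small sketch of the factor-to-numeric conversion on a standalone factor, for illustration. The second form, as.numeric(levels(f))[f], is the variant mentioned in ?factor; suppressWarnings() is used here only to hide the "NAs introduced by coercion" warning, and the variable names are illustrative.

```r
f <- factor(c("2.3", "3.5", "Not Available", "1.2", "1.2"))

# Convert the labels, not the level codes; "Not Available" becomes NA
num <- suppressWarnings(as.numeric(as.character(f)))
num

# Variant from ?factor: convert each level once, then index by the factor
num2 <- suppressWarnings(as.numeric(levels(f))[f])
identical(num, num2)
```

Both forms give the same values; the second converts only the unique levels, which may matter for long vectors with few distinct values.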

Neither of these options will "remove the Not Available rows", as you've requested. They just convert the "Not Available" elements (and any other "text" elements of the Number column) to NA. To remove the rows containing "Not Available", convert the column first and then drop the NA rows:

data <- data.frame(Name=c1, Number=c2, stringsAsFactors=FALSE)
data$Number <- as.numeric(data$Number)
na.omit(data)

or, using your original data object:

data <- data.frame(Name=c1, Number=c2)
data[data$Number != 'Not Available', ]
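
Note that filtering on the string leaves Number as a factor or character column (depending on the R version), so a conversion is still needed if numeric values are wanted. A minimal sketch, assuming the same c1 and c2 as in the question; `kept` is just an illustrative name.

```r
c1 <- c("India", "America", "China", "Europe", "Japan")
c2 <- c(2.3, 3.5, "Not Available", 1.2, 1.2)
data <- data.frame(Name = c1, Number = c2)

# Drop the rows whose Number is the literal string "Not Available"
kept <- data[data$Number != "Not Available", ]

# Number is still a factor or character column here, so convert via
# character to get the actual numbers rather than level codes
kept$Number <- as.numeric(as.character(kept$Number))
str(kept)
```

Going through as.character() works whether the column started out as a factor or as character, so the same lines run on old and new R versions.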

Upvotes: 5
