Reputation: 451
My dataframe contains entries 'n/a' that cannot be detected by na.omit(). I am aware that the mutate function in the tidyverse package is able to transform the 'n/a' entries into proper null values and to be dropped. Here is what I have attempted but an error occurs: Error in replace(value, value == "n/a", NA) : object 'value' not found. Thanks in advance!
> head(data)
customer_id gender past_3_years_bike_related_purchases DOB
1 1 F 93 19644
2 2 Male 81 29571
3 5 Female 56 28258
4 8 Male 31 22735
5 9 Female 97 26733
6 12 Male 58 34536
job_industry_category wealth_segment owns_car tenure state
1 Health Mass Customer Yes 11 New South Wales
2 Financial Services Mass Customer Yes 16 New South Wales
3 n/a Affluent Customer Yes 8 New South Wales
4 n/a Mass Customer No 7 New South Wales
5 Argiculture Affluent Customer Yes 8 New South Wales
6 Manufacturing Mass Customer No 8 QLD
data %>%
mutate(value = replace(value, value == "n/a", NA)) %>%
drop_na()
Upvotes: 0
Views: 10691
Reputation: 101753
A data.table
option
> setDT(df)[!"n/a", on = .(job_industry_category)]
customer_id gender past_3_years_bike_related_purchases DOB
1: 1 Female 93 19644
2: 2 Male 81 29571
3: 9 Female 97 26733
4: 12 Male 58 34536
job_industry_category wealth_segment owns_car tenure state
1: Health Mass Customer Yes 11 New South Wales
2: Financial Services Mass Customer Yes 16 New South Wales
3: Argiculture Affluent Customer Yes 8 New South Wales
4: Manufacturing Mass Customer No 8 QLD
Data
> dput(df)
structure(list(customer_id = c(1L, 2L, 5L, 8L, 9L, 12L), gender = c("Female",
"Male", "Female", "Male", "Female", "Male"), past_3_years_bike_related_purchases = c(93L,
81L, 56L, 31L, 97L, 58L), DOB = c(19644L, 29571L, 28258L, 22735L,
26733L, 34536L), job_industry_category = c("Health", "Financial Services",
"n/a", "n/a", "Argiculture", "Manufacturing"), wealth_segment = c("Mass Customer",
"Mass Customer", "Affluent Customer", "Mass Customer", "Affluent Customer",
"Mass Customer"), owns_car = c("Yes", "Yes", "Yes", " No", "Yes",
" No"), tenure = c(11L, 16L, 8L, 7L, 8L, 8L), state = c("New South Wales",
"New South Wales", "New South Wales", "New South Wales", "New South Wales",
"QLD")), class = "data.frame", row.names = c(NA, -6L))
Upvotes: 1
Reputation: 887223
We can use na_if
to convert to the elements to NA
and use drop_na
library(dplyr)
library(tidyr)
data %>%
mutate(value = na_if(job_industry_category, "n/a")) %>%
drop_na()
Upvotes: 1
Reputation: 10855
The n/a
values can also be converted to values that work with na.omit()
when the data is read into R by use of the na.strings()
argument.
For example, if we take the data from the original post and convert it to a pipe separated values file, we can use na.strings()
to include n/a
as a missing value with read.csv()
, and then use na.omit()
to subset the data.
textData <- "customer_id|gender|past_3_years_bike_related_purchases|DOB|job_industry_category|wealth_segment|owns_car|tenure|state
1|Female| 93|19644|Health|Mass Customer|Yes|11|New South Wales
2|Male| 81|29571|Financial Services|Mass Customer|Yes|16|New South Wales
5|Female| 56|28258|n/a|Affluent Customer|Yes|8|New South Wales
8|Male| 31|22735|n/a|Mass Customer| No|7|New South Wales
9|Female| 97|26733|Argiculture|Affluent Customer|Yes| 8|New South Wales
12|Male| 58|34536|Manufacturing|Mass Customer| No| 8|QLD"
data <- read.csv(text = textData,
header = TRUE,
na.strings = c("n/a","na"),
sep="|")
data
> data
customer_id gender past_3_years_bike_related_purchases DOB job_industry_category
1 1 Female 93 19644 Health
2 2 Male 81 29571 Financial Services
3 5 Female 56 28258 <NA>
4 8 Male 31 22735 <NA>
5 9 Female 97 26733 Argiculture
6 12 Male 58 34536 Manufacturing
wealth_segment owns_car tenure state
1 Mass Customer Yes 11 New South Wales
2 Mass Customer Yes 16 New South Wales
3 Affluent Customer Yes 8 New South Wales
4 Mass Customer No 7 New South Wales
5 Affluent Customer Yes 8 New South Wales
6 Mass Customer No 8 QLD
As we can see from the output, rows 3 and 4 now have <NA>
for job_industry_category
.
# now omit missing values
na.omit(data)
...and now the rows with <NA>
values are removed from the data frame.
> na.omit(data)
customer_id gender past_3_years_bike_related_purchases DOB job_industry_category
1 1 Female 93 19644 Health
2 2 Male 81 29571 Financial Services
5 9 Female 97 26733 Argiculture
6 12 Male 58 34536 Manufacturing
wealth_segment owns_car tenure state
1 Mass Customer Yes 11 New South Wales
2 Mass Customer Yes 16 New South Wales
5 Affluent Customer Yes 8 New South Wales
6 Mass Customer No 8 QLD
Upvotes: 2
Reputation: 389047
You need to use the column name in which you want to detect "n/a" values.
library(dplyr)
library(tidyr)
data %>%
mutate(value = replace(job_industry_category,
job_industry_category == "n/a", NA)) %>%
drop_na()
You can also do this without converting those values to actual NA
.
data %>% filter(job_industry_category != "n/a")
#Base R :
subset(data, job_industry_category != "n/a")
Upvotes: 5