Reputation: 35
I am working with a dataset in which some variables contain NA values. When I run a regression, the observations with NA are dropped. I subset my data with the following code:
jsubset= jtrain %>% select(2,11,23,24,28)
Which returned the following output: jsubset
A tibble: 471 x 5
fcode d89 cgrant clemploy cgrant_1
<dbl> <dbl> <dbl> <dbl> <dbl>
1 410032 0 0 NA NA
2 410032 0 0 0.270 0
3 410032 1 0 -0.0630 0
4 410440 0 0 NA NA
5 410440 0 0 0.0800 0
6 410440 1 0 0.0741 0
7 410495 0 0 NA NA
8 410495 0 0 0.223 0
9 410495 1 0 -0.0408 0
10 410500 0 0 NA NA
... with 461 more rows
How would I separate the fcodes that have values in all of the remaining columns from the fcodes that have NA in some of them? Or is it easier to count the observations used in and dropped from the regression? I know Stata's e(sample) makes this easy, but I am trying to replicate it in R.
Upvotes: 1
Views: 157
Reputation: 12410
You can use filter() with if_any() for both cases. Taking your sample:
library(dplyr)
df <- read.csv(text="
fcode d89 cgrant clemploy cgrant_1
410032 0 0 NA NA
410032 0 0 0.270 0
410032 1 0 -0.0630 0
410440 0 0 NA NA
410440 0 0 0.0800 0
410440 1 0 0.0741 0
410495 0 0 NA NA
410495 0 0 0.223 0
410495 1 0 -0.0408 0
410500 0 0 NA NA", sep="")
df_no_na <- df %>%
filter(!if_any(everything(), is.na))
df_na <- df %>%
filter(if_any(everything(), is.na))
df_no_na:
fcode d89 cgrant clemploy cgrant_1
1 410032 0 0 0.2700 0
2 410032 1 0 -0.0630 0
3 410440 0 0 0.0800 0
4 410440 1 0 0.0741 0
5 410495 0 0 0.2230 0
6 410495 1 0 -0.0408 0
df_na:
fcode d89 cgrant clemploy cgrant_1
1 410032 0 0 NA NA
2 410440 0 0 NA NA
3 410495 0 0 NA NA
4 410500 0 0 NA NA
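The filters above split rows, not firms. If you instead want whole fcodes sorted into "no NA in any row" versus "at least one row with NA", one option is to flag incomplete rows with complete.cases() and then look up which fcodes they belong to. A minimal base R sketch: the toy data frame and fcode 999999 are made up for illustration, since every fcode in your sample has an NA row.

```r
# Toy data: three rows per firm.
# fcode 999999 is hypothetical and does not appear in the question's data.
toy <- data.frame(
  fcode    = c(410032, 410032, 410032, 999999, 999999, 999999),
  d89      = c(0, 0, 1, 0, 0, 1),
  clemploy = c(NA, 0.270, -0.0630, 0.10, 0.05, -0.02),
  cgrant_1 = c(NA, 0, 0, 0, 0, 0)
)

# TRUE for rows that contain at least one NA
row_has_na <- !complete.cases(toy)

# Firms with at least one incomplete row, and firms with none
fcodes_with_na  <- unique(toy$fcode[row_has_na])
fcodes_complete <- setdiff(unique(toy$fcode), fcodes_with_na)

# All rows belonging to each group of firms
toy_any_na   <- toy[toy$fcode %in% fcodes_with_na, ]
toy_complete <- toy[toy$fcode %in% fcodes_complete, ]
```

Here toy_any_na keeps all three rows of the firm that has a missing value, including its complete rows, which is the difference from the row-level filter.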
Upvotes: 1
Reputation: 3228
This should be straightforward if you have a proper data frame. First, I copied your data into a text file, named it slack.txt, and read it into a data frame:
library(dplyr) # provides %>%, arrange() and filter_all() used below
slack <- read.table("slack.txt")
colnames(slack) <- c("fcode",
"d89",
"cgrant",
"clemploy",
"cgrant_1")
slack <- data.frame(slack)
Then just arrange the values. arrange() always sorts NA values to the end, so the rows with missing values are grouped at the bottom:
slack %>%
arrange(cgrant_1)
Giving you this:
fcode d89 cgrant clemploy cgrant_1
1 3 410032 1 0 -0.0630
2 9 410495 1 0 -0.0408
3 6 410440 1 0 0.0741
4 5 410440 0 0 0.0800
5 8 410495 0 0 0.2230
6 2 410032 0 0 0.2700
7 1 410032 0 0 NA
8 4 410440 0 0 NA
9 7 410495 0 0 NA
10 10 410500 0 0 NA
If you just want to get rid of the NA values:
slack %>%
na.omit()
Which gives you:
fcode d89 cgrant clemploy cgrant_1
2 2 410032 0 0 0.2700
3 3 410032 1 0 -0.0630
5 5 410440 0 0 0.0800
6 6 410440 1 0 0.0741
8 8 410495 0 0 0.2230
9 9 410495 1 0 -0.0408
You can also split the data into two new data frames with the following code:
# NA Removed Dataframe:
no_na_slack <- slack %>%
na.omit()
# Is NA Dataframe:
is_na_slack <- slack %>%
filter_all(any_vars(is.na(.)))
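On the e(sample) part of the question: lm() applies na.omit to the model variables by default, and the fitted object remembers which rows it dropped. The sketch below shows one way to count used and dropped observations and to build a per-row flag. The regression formula is my assumption (the question does not show the model), and the data frame just retypes the ten sample rows.

```r
# The ten sample rows from the question, retyped as a data frame
df <- data.frame(
  fcode    = c(410032, 410032, 410032, 410440, 410440, 410440,
               410495, 410495, 410495, 410500),
  d89      = c(0, 0, 1, 0, 0, 1, 0, 0, 1, 0),
  cgrant   = 0,
  clemploy = c(NA, 0.270, -0.0630, NA, 0.0800, 0.0741,
               NA, 0.223, -0.0408, NA),
  cgrant_1 = c(NA, 0, 0, NA, 0, 0, NA, 0, 0, NA)
)

# Hypothetical model. With the default na.action (na.omit), lm() drops
# every row that has an NA in any variable of the formula.
# On this tiny sample cgrant and cgrant_1 are constant, so their
# coefficients come back NA; only the sample bookkeeping matters here.
fit <- lm(clemploy ~ d89 + cgrant + cgrant_1, data = df)

n_used    <- nobs(fit)                    # observations used in the fit
dropped   <- as.integer(na.action(fit))   # row positions dropped for NA
n_dropped <- length(dropped)

# Logical flag per row, playing the role of Stata's e(sample)
df$in_sample <- !(seq_len(nrow(df)) %in% dropped)

used_rows    <- df[df$in_sample, ]
dropped_rows <- df[!df$in_sample, ]
```

Because the flag comes from the fitted model, it only reflects NA values in variables that appear in the formula, which can differ from running na.omit() on the whole data frame.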
Upvotes: 1