MNK2008-

Reputation: 35

How to count observations used and deleted when running a regression in R

I am working with a dataset that has multiple variables, some of which contain NA values. When I run a regression, the observations with NA are dropped. I subset my data with the following code:

jsubset <- jtrain %>% select(2, 11, 23, 24, 28)

Printing jsubset returns the following output:

# A tibble: 471 x 5
    fcode   d89 cgrant clemploy cgrant_1
    <dbl> <dbl>  <dbl>    <dbl>    <dbl>
 1 410032     0      0  NA            NA
 2 410032     0      0   0.270         0
 3 410032     1      0  -0.0630        0
 4 410440     0      0  NA            NA
 5 410440     0      0   0.0800        0
 6 410440     1      0   0.0741        0
 7 410495     0      0  NA            NA
 8 410495     0      0   0.223         0
 9 410495     1      0  -0.0408        0
10 410500     0      0  NA            NA
# ... with 461 more rows

How can I separate the fcode rows that have values in all of the remaining columns from the fcode rows that have NA in at least one of them? Or is it easier to count the observations used and dropped by the regression directly? In Stata, e(sample) makes this easy, and I am trying to replicate it in R.

Upvotes: 1

Views: 157

Answers (2)

Mr. T

Reputation: 12410

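You can use filter() with if_any() (dplyr 1.0.4 or later) for both cases: negate it to keep only the complete rows, or use it as is to keep the rows with at least one NA. Taking your sample: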

library(dplyr)
df <- read.csv(text="
 fcode   d89 cgrant clemploy cgrant_1
 410032     0      0  NA            NA
 410032     0      0   0.270         0
 410032     1      0  -0.0630        0
 410440     0      0  NA            NA
 410440     0      0   0.0800        0
 410440     1      0   0.0741        0
 410495     0      0  NA            NA
 410495     0      0   0.223         0
 410495     1      0  -0.0408        0
 410500     0      0  NA            NA", sep="")


df_no_na <- df %>%      
        filter(!if_any(everything(), is.na))

df_na <- df %>%     
        filter(if_any(everything(), is.na))

df_no_na:

   fcode d89 cgrant clemploy cgrant_1
1 410032   0      0   0.2700        0
2 410032   1      0  -0.0630        0
3 410440   0      0   0.0800        0
4 410440   1      0   0.0741        0
5 410495   0      0   0.2230        0
6 410495   1      0  -0.0408        0

df_na:

   fcode d89 cgrant clemploy cgrant_1
1 410032   0      0       NA       NA
2 410440   0      0       NA       NA
3 410495   0      0       NA       NA
4 410500   0      0       NA       NA
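Since the question also asks for counts, the same split can be tallied in base R. This is a minimal sketch on the ten sample rows, rebuilt inline so it runs on its own; the names `complete`, `n_used`, `n_dropped` and `per_firm` are illustrative and do not come from the question.

```r
# Sample rows from the question, rebuilt inline so the snippet is self-contained
df <- read.table(header = TRUE, text = "
fcode  d89 cgrant clemploy cgrant_1
410032   0      0       NA       NA
410032   0      0   0.270         0
410032   1      0  -0.0630        0
410440   0      0       NA       NA
410440   0      0   0.0800        0
410440   1      0   0.0741        0
410495   0      0       NA       NA
410495   0      0   0.223         0
410495   1      0  -0.0408        0
410500   0      0       NA       NA")

# complete.cases() is TRUE for rows that have no NA in any column
complete <- complete.cases(df)

n_used    <- sum(complete)   # rows a complete-case analysis would keep
n_dropped <- sum(!complete)  # rows it would discard

# Per-firm breakdown: complete vs. incomplete rows for each fcode
per_firm <- table(fcode = df$fcode, complete = complete)
print(per_firm)
```

A firm such as 410500 in this sample shows up with zero complete rows, which is one way to spot fcode values that would vanish from the regression entirely.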

Upvotes: 1

Shawn Hemelstrand

Reputation: 3228

Arranging by NA values

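This should be straightforward once the data is in a proper data frame. First, I copied your data into a text file, named it, then read it in as a data frame. Make sure the file holds only the five data columns: if the row numbers from the tibble printout are copied along, every column name shifts one place to the right.

library(dplyr)

# read.table() already returns a data frame, so no extra conversion is needed
slack <- read.table("slack.txt")
colnames(slack) <- c("fcode",
                     "d89",
                     "cgrant",
                     "clemploy",
                     "cgrant_1")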

Then arrange by a column that contains NA values. arrange() always sorts NA to the end, so the incomplete rows are grouped at the bottom:

slack %>% 
  arrange(clemploy)

Giving you this:

    fcode d89 cgrant clemploy cgrant_1
1  410032   1      0  -0.0630        0
2  410495   1      0  -0.0408        0
3  410440   1      0   0.0741        0
4  410440   0      0   0.0800        0
5  410495   0      0   0.2230        0
6  410032   0      0   0.2700        0
7  410032   0      0       NA       NA
8  410440   0      0       NA       NA
9  410495   0      0       NA       NA
10 410500   0      0       NA       NA

Removing NA values

If you just want to get rid of the NA values:

slack %>% 
  na.omit()

Which gives you (the row names on the left are the original row numbers):

   fcode d89 cgrant clemploy cgrant_1
2 410032   0      0   0.2700        0
3 410032   1      0  -0.0630        0
5 410440   0      0   0.0800        0
6 410440   1      0   0.0741        0
8 410495   0      0   0.2230        0
9 410495   1      0  -0.0408        0
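If you also need to know which rows were removed, na.omit() records them in an attribute of its result. This is a minimal sketch, assuming the same ten sample rows (rebuilt inline here rather than read from slack.txt); `cleaned` and `dropped` are illustrative names.

```r
# Same ten sample rows, rebuilt inline so the snippet runs on its own
slack <- read.table(header = TRUE, text = "
fcode  d89 cgrant clemploy cgrant_1
410032   0      0       NA       NA
410032   0      0   0.270         0
410032   1      0  -0.0630        0
410440   0      0       NA       NA
410440   0      0   0.0800        0
410440   1      0   0.0741        0
410495   0      0       NA       NA
410495   0      0   0.223         0
410495   1      0  -0.0408        0
410500   0      0       NA       NA")

cleaned <- na.omit(slack)

# na.omit() stores the positions of the removed rows in the "na.action" attribute
dropped <- as.integer(attr(cleaned, "na.action"))

n_kept    <- nrow(cleaned)    # observations that survive
n_dropped <- length(dropped)  # observations that were removed

# The removed rows themselves, looked up by position in the original data
removed_rows <- slack[dropped, ]
```

Because the positions refer to the original data frame, `slack[dropped, ]` gives back the incomplete rows without filtering a second time.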

Subset data with NA values removed

You can also subset that data into a new dataframe with the following code:

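# Data frame with the NA rows removed:
no_na_slack <- slack %>% 
      na.omit()

# Data frame with only the rows that contain at least one NA
# (if_any() replaces the superseded filter_all(any_vars(...)) idiom):
is_na_slack <- slack %>% 
      filter(if_any(everything(), is.na))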
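To count what a regression actually used, closer to Stata's e(sample), you can ask the fitted model itself. This is a sketch under assumptions: the question does not give a formula, so `clemploy ~ d89 + cgrant + cgrant_1` is hypothetical, and `fit`, `dropped` and `in_sample` are illustrative names. nobs() and na.action() come from base R's stats package, and lm() drops incomplete rows under R's default `na.action` option (na.omit).

```r
# Same ten sample rows, rebuilt inline so the snippet runs on its own
slack <- read.table(header = TRUE, text = "
fcode  d89 cgrant clemploy cgrant_1
410032   0      0       NA       NA
410032   0      0   0.270         0
410032   1      0  -0.0630        0
410440   0      0       NA       NA
410440   0      0   0.0800        0
410440   1      0   0.0741        0
410495   0      0       NA       NA
410495   0      0   0.223         0
410495   1      0  -0.0408        0
410500   0      0       NA       NA")

# Hypothetical formula; in this tiny sample cgrant and cgrant_1 are constant,
# so lm() reports NA for their coefficients, which does not affect the counts
fit <- lm(clemploy ~ d89 + cgrant + cgrant_1, data = slack)

n_used <- nobs(fit)                      # observations used in the fit

# Positions of rows dropped for NA (empty if nothing was dropped)
dropped   <- as.integer(na.action(fit))
n_dropped <- length(dropped)

# Logical flag per original row, analogous to Stata's e(sample)
in_sample <- !(seq_len(nrow(slack)) %in% dropped)
slack$in_sample <- in_sample
```

Only the variables named in the formula decide which rows are dropped, so the counts can differ from a plain na.omit() on a data frame that carries extra columns.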

Upvotes: 1
