Nihat

Reputation: 59

R imputation for strings in selected columns of large dataset

I am struggling with multiple data-imputation packages in R and need your advice.

I have a dataset with 150,000 rows and 270 columns.

Every column has some missing data, but I need to impute only 7 of them. All of the columns should be considered when filling those 7 columns. There is no numerical data, only strings.

I have tried to use MICE, but it takes too long and does not give any result because the run breaks off before finishing. I believe I am coding it completely wrong.

A                  |  B          |  C           |  D        |  E       | 
------------------------------------------------------------------------
DEEP DIGGING ALL   |  1989       |  Digging     |  Sumer    |  Cups    |
SURFACE DIGGING    |  1989       |  N/A         |  Sumer    |  Glasses |
CLAIMS OFFSHORE    |  1990       |  N/A         |  Assyria  |  N/A     | 
OFFSHORE CLAIMS    |  1990       |  Offshore    |  Assyria  |  N/A     |  
CLAIMS OFFSHORE    |  1990       |  Offshore    |  Assyria  |  Cups    |
OFFSHORE CLAIMS    |  1990       |  Offshore    |  Assyria  |  Cups    |

What I am trying to get is a table in which column "C" is imputed based on all of the columns, while the N/As in column "E" are left as they are.

Desired result:

A                  |  B          |  C           |  D        |  E       | 
------------------------------------------------------------------------
DEEP DIGGING ALL   |  1989       |  Digging     |  Sumer    |  Cups    |
SURFACE DIGGING    |  1989       |  Digging     |  Sumer    |  Glasses |
CLAIMS OFFSHORE    |  1990       |  Offshore    |  Assyria  |  N/A     | 
OFFSHORE CLAIMS    |  1990       |  Offshore    |  Assyria  |  N/A     |  
CLAIMS OFFSHORE    |  1990       |  Offshore    |  Assyria  |  Cups    |
OFFSHORE CLAIMS    |  1990       |  Offshore    |  Assyria  |  Cups    |

I'm not sure whether "MICE" is the right path to take, but I did not get anywhere with my attempts in "missForest" either, so I really depend on your help.

Many thanks in advance!

Upvotes: 0

Views: 205

Answers (1)

akrun

Reputation: 886938

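One option is fill() from tidyr: group the rows by a column that ties related records together (here B) and copy the nearest non-missing value of C within each group. fill() only treats real NA values as missing, so literal "N/A" strings would have to be converted to NA first.

library(dplyr)
library(tidyr)

df1 %>%
   group_by(B) %>%                       # fill only among rows that share the same B
   fill(C, .direction = 'updown') %>%    # fill upward first, then downward
   ungroup()                             # drop the grouping afterwards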
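To see the approach end to end, here is a minimal sketch on the sample table from the question. It assumes the missing entries are stored as the literal string "N/A", and the vector name target_cols is only an illustrative placeholder for the columns to be filled (the 7 columns in the real data). Column E is deliberately left out, so its "N/A" values stay untouched.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question; "N/A" is assumed to be a literal string.
df1 <- data.frame(
  A = c("DEEP DIGGING ALL", "SURFACE DIGGING", "CLAIMS OFFSHORE",
        "OFFSHORE CLAIMS", "CLAIMS OFFSHORE", "OFFSHORE CLAIMS"),
  B = c(1989, 1989, 1990, 1990, 1990, 1990),
  C = c("Digging", "N/A", "N/A", "Offshore", "Offshore", "Offshore"),
  D = c("Sumer", "Sumer", "Assyria", "Assyria", "Assyria", "Assyria"),
  E = c("Cups", "Glasses", "N/A", "N/A", "Cups", "Cups"),
  stringsAsFactors = FALSE
)

# Hypothetical name: the columns that should be filled.
target_cols <- c("C")

filled <- df1 %>%
  # Turn the "N/A" strings into real NA, but only in the target columns.
  mutate(across(all_of(target_cols), ~ na_if(.x, "N/A"))) %>%
  # Fill within rows that share the same B value.
  group_by(B) %>%
  fill(all_of(target_cols), .direction = "updown") %>%
  ungroup()

print(filled)
```

Keep in mind that this copies neighbouring values within a group rather than fitting a model on all 270 columns, so it is only appropriate when the grouping column really determines the missing value. Groups in which every value of a target column is missing remain NA.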

Upvotes: 1
