gogo88
gogo88

Reputation: 309

Replace NAs with the values selected by another column

I have a table with missing values in the first column. There is relationship between the first and second column namely, There is a unique NAME in the first column corresponding to many IDs in the second column.

E.g:

The NA in NAME column needs to be replaced with one of the existing values in its column (A or B or C). The selection is done by its ID from the second column. E.g the first NA has ID3. Looking up in the table, ID3 correspond to A therefore, NA will be replaced with A.
The third column doesn’t play any roll at this step but I will need it as well for the next operations.

This is how it currently looks out:

NAME |    ID     |    Value
A    |    ID1    |    V1
A    |    ID2    |    V2
A    |    ID3    |    V3
B    |    ID4    |    V4
C    |    ID5    |    V5
C    |    ID6    |    V6
NA   |    ID3    |    V7
NA   |    ID1    |     V8
NA   |   ID5     |    V9
NA   |   ID4     |    V10
NA   |   ID5     |    V11

This is how it should look like

Name |    ID     |    Value
A    |    ID1    |    V1
A    |    ID2    |    V2
A    |    ID3    |    V3
B    |    ID4    |    V4
C    |    ID5    |    V5
C    |    ID6    |    V6
A    |    ID3    |    V7
A    |    ID1    |    V8
C    |    ID5    |    V9
B    |    ID4    |    V10
C    |    ID5    |    V11

I've got stuck between different approaches with dplyr and data table and can't make any progress. Any idea of how I can move on? Thank you.

Upvotes: 1

Views: 70

Answers (2)

Serhii Kushchenko
Serhii Kushchenko

Reputation: 938

First, you must somehow feed your data into R. This process is beyound the scope of the answer.

temp
#    NAME  ID Value
# 1     A ID1    V1
# 2     A ID2    V2
# 3     A ID3    V3
# 4     B ID4    V4
# 5     C ID5    V5
# 6     C ID6    V6
# 7  <NA> ID3    V7
# 8  <NA> ID1    V8
# 9  <NA> ID5    V9
# 10 <NA> ID4   V10
# 11 <NA> ID5   V11

The following solution does not use dplyr, only basic R with do.call.

temp2 <- table (temp$NAME, temp$ID)
temp2

#    ID1 ID2 ID3 ID4 ID5 ID6
#  A   1   1   1   0   0   0
#  B   0   0   0   1   0   0
#  C   0   0   0   0   1   1

You cannot use the table directly. First, you must convert it into the dataframe.

temp3 <- as.data.frame.matrix(temp2)
temp3

#    ID1 ID2 ID3 ID4 ID5 ID6
# A   1   1   1   0   0   0
# B   0   0   0   1   0   0
# C   0   0   0   0   1   1

For example, you want to replace the NA corresponding to ID5 with "C".

rownames (temp3[which(temp3$ID5 == 1),])
# [1] "C"

Now let's put this into a function.

returnName <- function (df, id)
{
    rowTemp <- do.call ('$', list(df, id))
    name <- rownames (df[which(rowTemp == 1),])
    return (name)
}

temp$NAME <- sapply (temp$ID, returnName, df = temp3)

temp
#    NAME  ID Value
# 1     A ID1    V1
# 2     A ID2    V2
# 3     A ID3    V3
# 4     B ID4    V4
# 5     C ID5    V5
# 6     C ID6    V6
# 7     A ID3    V7
# 8     A ID1    V8
# 9     C ID5    V9
# 10    B ID4   V10
# 11    C ID5   V11

Upvotes: 0

AntoniosK
AntoniosK

Reputation: 16121

Here's a dplyr solution:

library(dplyr)

# example data
dt = read.table(text = "
NAME     ID         Value
A        ID1        V1
A        ID2        V2
A        ID3        V3
B        ID4        V4
C        ID5        V5
C        ID6        V6
NA       ID3        V7
NA       ID1         V8
NA      ID5         V9
NA      ID4         V10
NA      ID5         V11 
", header=T, stringsAsFactors=F)

dt %>%
  group_by(ID) %>%                               # for each ID
  mutate(NAME = unique(NAME[!is.na(NAME)])) %>%  # get the non-NA value of NAME
  ungroup()                                      # forget the grouping


# # A tibble: 11 x 3
#   NAME  ID    Value
#   <chr> <chr> <chr>
# 1 A     ID1   V1   
# 2 A     ID2   V2   
# 3 A     ID3   V3   
# 4 B     ID4   V4   
# 5 C     ID5   V5   
# 6 C     ID6   V6   
# 7 A     ID3   V7   
# 8 A     ID1   V8   
# 9 C     ID5   V9   
#10 B     ID4   V10  
#11 C     ID5   V11 

Note: This will work with factor variables as well. For the specific example you posted you don't really need unique(.), but this is a more general case (i.e. if you have multiple rows for a specific ID and non-NA NAME value).

Upvotes: 1

Related Questions