Reputation: 309
I have a table with missing values in the first column. There is relationship between the first and second column namely, There is a unique NAME in the first column corresponding to many IDs in the second column.
E.g:
The NA in NAME column needs to be replaced with one of the existing values in its column (A or B or C). The selection is done by its ID from the second column.
E.g the first NA has ID3. Looking up in the table, ID3 correspond to A therefore, NA will be replaced with A.
The third column doesn’t play any roll at this step but I will need it as well for the next operations.
This is how it currently looks out:
NAME | ID | Value
A | ID1 | V1
A | ID2 | V2
A | ID3 | V3
B | ID4 | V4
C | ID5 | V5
C | ID6 | V6
NA | ID3 | V7
NA | ID1 | V8
NA | ID5 | V9
NA | ID4 | V10
NA | ID5 | V11
This is how it should look like
Name | ID | Value
A | ID1 | V1
A | ID2 | V2
A | ID3 | V3
B | ID4 | V4
C | ID5 | V5
C | ID6 | V6
A | ID3 | V7
A | ID1 | V8
C | ID5 | V9
B | ID4 | V10
C | ID5 | V11
I've got stuck between different approaches with dplyr and data table and can't make any progress. Any idea of how I can move on? Thank you.
Upvotes: 1
Views: 70
Reputation: 938
First, you must somehow feed your data into R. This process is beyound the scope of the answer.
temp
# NAME ID Value
# 1 A ID1 V1
# 2 A ID2 V2
# 3 A ID3 V3
# 4 B ID4 V4
# 5 C ID5 V5
# 6 C ID6 V6
# 7 <NA> ID3 V7
# 8 <NA> ID1 V8
# 9 <NA> ID5 V9
# 10 <NA> ID4 V10
# 11 <NA> ID5 V11
The following solution does not use dplyr
, only basic R with do.call
.
temp2 <- table (temp$NAME, temp$ID)
temp2
# ID1 ID2 ID3 ID4 ID5 ID6
# A 1 1 1 0 0 0
# B 0 0 0 1 0 0
# C 0 0 0 0 1 1
You cannot use the table directly. First, you must convert it into the dataframe.
temp3 <- as.data.frame.matrix(temp2)
temp3
# ID1 ID2 ID3 ID4 ID5 ID6
# A 1 1 1 0 0 0
# B 0 0 0 1 0 0
# C 0 0 0 0 1 1
For example, you want to replace the NA corresponding to ID5 with "C".
rownames (temp3[which(temp3$ID5 == 1),])
# [1] "C"
Now let's put this into a function.
returnName <- function (df, id)
{
rowTemp <- do.call ('$', list(df, id))
name <- rownames (df[which(rowTemp == 1),])
return (name)
}
temp$NAME <- sapply (temp$ID, returnName, df = temp3)
temp
# NAME ID Value
# 1 A ID1 V1
# 2 A ID2 V2
# 3 A ID3 V3
# 4 B ID4 V4
# 5 C ID5 V5
# 6 C ID6 V6
# 7 A ID3 V7
# 8 A ID1 V8
# 9 C ID5 V9
# 10 B ID4 V10
# 11 C ID5 V11
Upvotes: 0
Reputation: 16121
Here's a dplyr
solution:
library(dplyr)
# example data
dt = read.table(text = "
NAME ID Value
A ID1 V1
A ID2 V2
A ID3 V3
B ID4 V4
C ID5 V5
C ID6 V6
NA ID3 V7
NA ID1 V8
NA ID5 V9
NA ID4 V10
NA ID5 V11
", header=T, stringsAsFactors=F)
dt %>%
group_by(ID) %>% # for each ID
mutate(NAME = unique(NAME[!is.na(NAME)])) %>% # get the non-NA value of NAME
ungroup() # forget the grouping
# # A tibble: 11 x 3
# NAME ID Value
# <chr> <chr> <chr>
# 1 A ID1 V1
# 2 A ID2 V2
# 3 A ID3 V3
# 4 B ID4 V4
# 5 C ID5 V5
# 6 C ID6 V6
# 7 A ID3 V7
# 8 A ID1 V8
# 9 C ID5 V9
#10 B ID4 V10
#11 C ID5 V11
Note:
This will work with factor variables as well.
For the specific example you posted you don't really need unique(.)
, but this is a more general case (i.e. if you have multiple rows for a specific ID and non-NA
NAME value).
Upvotes: 1