user3833190
user3833190

Reputation:

How to matching missing IDs?

I have a large table with 50000 obs. The following mimic the structure:

ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B",NA,"D","E",NA,"G","H","I")
b <- c(11,2233,12,2,22,13,23,23,100)
c <- c(12,10,12,23,16,17,7,9,7)
df <- data.frame(ID ,a,b,c)

Where there are some missing values on the vector "a". However, I have some tables where the ID and the missing strings are included:

ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B","C","D","E","F","G","H","I")

key <- data.frame(ID,a)

Is there a way to include the missing strings from key into the column a using the ID?

Upvotes: 6

Views: 96

Answers (3)

David Arenburg
David Arenburg

Reputation: 92302

Another options is to use data.tables fast binary join and update by reference capabilities

library(data.table)
setkey(setDT(df), ID)[key, a := i.a]
df
#    ID a    b  c
# 1:  1 A   11 12
# 2:  2 B 2233 10
# 3:  3 C   12 12
# 4:  4 D    2 23
# 5:  5 E   22 16
# 6:  6 F   13 17
# 7:  7 G   23  7
# 8:  8 H   23  9
# 9:  9 I  100  7

If you want to replace only the NAs (not all the joined cases), a bit more complicated implemintation will be

setkey(setDT(key), ID)
setkey(setDT(df), ID)[is.na(a), a := key[.SD, a]]

Upvotes: 2

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193637

You can just use match; however, I would recommend that both your datasets are using characters instead of factors to prevent headaches later on.

key$a <- as.character(key$a)
df$a <- as.character(df$a)

df$a[is.na(df$a)] <- key$a[match(df$ID[is.na(df$a)], key$ID)]
df
#   ID a    b  c
# 1  1 A   11 12
# 2  2 B 2233 10
# 3  3 C   12 12
# 4  4 D    2 23
# 5  5 E   22 16
# 6  6 F   13 17
# 7  7 G   23  7
# 8  8 H   23  9
# 9  9 I  100  7

Of course, you could always stick with factors and factor the entire "ID" column and use the labels to replace the values in column "a"....

factor(df$ID, levels = key$ID, labels = key$a)
## [1] A B C D E F G H I
## Levels: A B C D E F G H I

Assign that to df$a and you're done....

Upvotes: 1

MarkeD
MarkeD

Reputation: 2631

Named vectors make nice lookup tables:

lookup <- a
names(lookup) <- as.character(ID)

lookup is now a named vector, you can access each value by lookup[ID] e.g. lookup["2"] (make sure the number is a character, not numeric)

## should give you a vector of a as required.
lookup[as.character(ID_from_big_table)]

Upvotes: 0

Related Questions