Reputation: 81
short version: How do I replace values within a data frame with a string found within another data frame?
longer version: I'm a biologist working with many species of bees. I have a data set with many thousands of bees. Each row has a unique bee ID # along with all the relevant info about that specimen (data of capture, GPS location, etc). The species information for each bee has not been entered because it takes a long time to ID them. When IDing, I end up with boxes of hundred of bees, all of the same species. I enter these into a separate data frame. I am trying to write code that will update the original data file with species information (family, genus, species, sex, etc) as I ID the bees. Currently, in the original data file, the species info is blank and is interpreted as NA within R. I want to have R find all unique bee ID #'s and fill in the species info, but I am having trouble figuring out how to replace the NA values with a string (e.g. "Andrenidae")
Here is a simple example of what I am trying to do:
rawData<-data.frame(beeID=c(1:20),family=rep(NA,20))
speciesInfo<-data.frame(beeID=seq(1,20,3),family=rep("Andrenidae",7))
rawData[rawData$beeID == 4,"family"] <- speciesInfo[speciesInfo$beeID == 4,"family"]
So, I am replacing things as I want, but with a number rather than the family name (a string). What I would eventually like to do is write a little loop to add in all the species info, e.g.:
for (i in speciesInfo$beeID){
rawData[rawData$beeID == i,"family"] <- speciesInfo[speciesInfo$beeID == i,"family"]
}
Thanks in advance for any advice!
Cheers,
Zak
EDIT:
I just noticed that the first two methods below add a new column each time, which would cause problems if I needed to add species info multiple times (which I typically do). For example:
rawData<-data.frame(beeID=c(1:20),family=rep(NA,20))
Andrenidae<-data.frame(beeID=seq(1,20,3),family=rep("Andrenidae",7))
Halictidae<-data.frame(beeID=seq(1,20,3)+1,family=rep("Halictidae",7))
# using join
library(plyr)
rawData <- join(rawData, Andrenidae, by = "beeID", type = "left")
rawData <- join(rawData, Halictidae, by = "beeID", type = "left")
# using merge
rawData <- merge(x=rawData,y=Andrenidae,by='beeID',all.x=T,all.y=F)
rawData <- merge(x=rawData,y=Halictidae,by='beeID',all.x=T,all.y=F)
Is there a way to either collapse the columns so that I have one, unified data frame? Or a way to update the rawData rather than adding a new column each time? Thanks in advance!
Upvotes: 8
Views: 4907
Reputation: 115382
A data.table
solution that will be memory and time efficient.
stringsAsFactors = F
for rbindlist (a super-fast version of do.call(rbind,list) / rbind)rawData
object and removed family.Create the data -
rawData <- data.frame(beeID = c(1:20), other_stuff = sample(letters, 20), stringsAsFactors = F)
Andrenidae <- data.frame(beeID = seq(1, 20, 3), family = rep("Andrenidae", 7), stringsAsFactors = F)
Halictidae <- data.frame(beeID = seq(1, 20 , 3)+ 1, family = rep("Halictidae", 7), stringsAsFactors = F)
library(data.table)
# convert to data.table
rawDT <- as.data.table(rawData)
# combine the list of Species-specific data.frames into a large data.table
speciesInfo <- rbindlist(list(Andrenidae, Halictidae))
# set the keys, to allow efficient use of data.table and its merging
# abilities. The keys are the same for both
setkeyv(rawDT, 'beeID')
setkeyv(speciesInfo, 'beeID')
# merge by key
speciesInfo[rawDT, nomatch = NA]
## beeID family other_stuff
## 1: 1 Andrenidae s
## 2: 2 Halictidae x
## 3: 3 NA i
## 4: 4 Andrenidae e
## 5: 5 Halictidae v
## 6: 6 NA q
## 7: 7 Andrenidae w
## 8: 8 Halictidae c
## 9: 9 NA u
## 10: 10 Andrenidae z
## 11: 11 Halictidae y
## 12: 12 NA a
## 13: 13 Andrenidae l
## 14: 14 Halictidae r
## 15: 15 NA h
## 16: 16 Andrenidae o
## 17: 17 Halictidae n
## 18: 18 NA g
## 19: 19 Andrenidae p
## 20: 20 Halictidae m
or
rawDT[speciesInfo]
## beeID other_stuff family
## 1: 1 s Andrenidae
## 2: 2 x Halictidae
## 3: 4 e Andrenidae
## 4: 5 v Halictidae
## 5: 7 w Andrenidae
## 6: 8 c Halictidae
## 7: 10 z Andrenidae
## 8: 11 y Halictidae
## 9: 13 l Andrenidae
## 10: 14 r Halictidae
## 11: 16 o Andrenidae
## 12: 17 n Halictidae
## 13: 19 p Andrenidae
## 14: 20 m Halictidae
Which ever is the data you are interested in
Upvotes: 2
Reputation: 163
Here is a function I think will work for you. This uses match
to find and index of values in your annotation dataframe, and then replaces the values in the rawData.
replaceID <- function(to,from,mergeBy,values){
x <- match(from[,mergeBy],to[,mergeBy])
to[,values][x] <- as.character(from[,values])
return(to)
}
> rawData <- replaceID(rawData,Halictidae,"beeID","family")
> rawData
beeID family
1 1 <NA>
2 2 Halictidae
3 3 <NA>
4 4 <NA>
5 5 Halictidae
6 6 <NA>
7 7 <NA>
8 8 Halictidae
9 9 <NA>
10 10 <NA>
11 11 Halictidae
12 12 <NA>
13 13 <NA>
14 14 Halictidae
15 15 <NA>
16 16 <NA>
17 17 Halictidae
18 18 <NA>
19 19 <NA>
20 20 Halictidae
Upvotes: 4
Reputation: 32986
Another option is to use ?join
in package plyr
library(plyr)
#Adding family ahead of time was unnecessary so I'll remove it alongside the join.
join(rawData, speciesInfo, by = "beeID", type = "left")[,-2]
beeID family
1 1 Andrenidae
2 2 <NA>
3 3 <NA>
4 4 Andrenidae
5 5 <NA>
6 6 <NA>
7 7 Andrenidae
8 8 <NA>
9 9 <NA>
10 10 Andrenidae
11 11 <NA>
12 12 <NA>
13 13 Andrenidae
14 14 <NA>
15 15 <NA>
16 16 Andrenidae
17 17 <NA>
18 18 <NA>
19 19 Andrenidae
20 20 <NA>
# If you anticipate adding new species over time,
# simply rbind those into a single reference data.frame to merge with your rawData.
# Like so:
library(plyr)
rawData <- join(rawData, rbind(Andrenidae, Halictidae), by = "beeID", type = "left")
# To keep you code clean, you could do this step ahead of time
species_list <- rbind(Andrenidae, Halictidae)
rawData <- join(rawData, species_list, by = "beeID", type = "left")
Upvotes: 3
Reputation: 57210
You could use the merge function, e.g. :
rawData <- data.frame(beeID=c(1:20),family=rep(NA,20))
speciesInfo <- data.frame(beeID=seq(1,20,3),
family=c(rep('Halictidae',4), rep("Andrenidae",3)))
merged <- merge(x=rawData,y=speciesInfo,by='beeID',all.x=T,all.y=F)
merged$family.x <- NULL # remove the family.x column
names(merged) <- c('beeID','family') # rename the columns
N.B.
It is not necessary to initialize rawData
with the family
column.
Merge function will add it automatically, e.g. :
rawData <- data.frame(beeID=c(1:20))
speciesInfo <- data.frame(beeID=seq(1,20,3),
family=c(rep('Halictidae',4), rep("Andrenidae",3)))
merged <- merge(x=rawData,y=speciesInfo,by='beeID',all.x=T,all.y=F)
> merged
beeID family
1 1 Halictidae
2 2 <NA>
3 3 <NA>
4 4 Halictidae
5 5 <NA>
6 6 <NA>
7 7 Halictidae
8 8 <NA>
9 9 <NA>
10 10 Halictidae
11 11 <NA>
12 12 <NA>
13 13 Andrenidae
14 14 <NA>
15 15 <NA>
16 16 Andrenidae
17 17 <NA>
18 18 <NA>
19 19 Andrenidae
20 20 <NA>
Upvotes: 2