Identicon
Identicon

Reputation: 129

How to merge three data frames using two variables based on conditions in R

R users, I want to merge data from three different data frames (studentsPublic, studentsPrivate, studentsState) into one data frame called Final_Desired_df. Final_Desired_df. I want to use either students' email addresses or their social security numbers (ssn) . The example below illustrates what I need and also a description of the Final_Desired_df. Thanks in advance for assistance.

studentsPublic = randomNames::randomNames(10)
emailPublic = c('[email protected]', NA, '[email protected]', '[email protected]', '[email protected]',NA, NA, '[email protected]', '[email protected]', '[email protected]')
examPublic = rnorm(10, mean=15, sd=5)
d1_PublicSchool = data.frame(studentsPublic, emailPublic, examPublic)

studentsPrivate = randomNames::randomNames(10)
emailPivate = c('[email protected]', NA, NA, NA, '[email protected]',NA, NA, '[email protected]', '[email protected]', NA)
ssnPrivate = c(NA, 12, 34, NA,45, 67, NA, 32, 23, NA )
exanPrivate = rnorm(10, mean=15, sd=5)
d2_PrivateSchool = data.frame(studentsPrivate, emailPivate, ssnPrivate, exanPrivate)

studentsState = randomNames::randomNames(30)
emailState = c('[email protected]', NA, '[email protected]', '[email protected]', '[email protected]',NA, NA, '[email protected]', '[email protected]', '[email protected]')
ssnState = c(NA, 12, 34, NA,45, 67, NA, 32, 23, NA)
sexState = rep(c('male', 'female'), 15,15)
d3_StateSchools = data.frame(studentsState, emailState, ssnState, sexState)

Final_Desired_df = should include all students from d1_PublicSchool whose email addresses are in the d3_StateSchools; and all students from d2_PrivateSchool whose either their emailPivate are in the d3_StateSchools OR their ssnPrivate are in the d3_StateSchools.

Thanks in advance.

Upvotes: 0

Views: 42

Answers (1)

dmuenzel
dmuenzel

Reputation: 101

How about this? I had to rename the columns to concatenate to final data frame and added a final step of removing duplicate rows.

# all students from d1_PublicSchool whose email addresses are in the d3_StateSchools
students_from_d1_in_d3_email<-d1_PublicSchool[which(d1_PublicSchool$emailPublic %in% d3_StateSchools$emailState),]

# add missing column of ssn as NAs
students_from_d1_in_d3_email<-cbind(students_from_d1_in_d3_email$studentsPublic,students_from_d1_in_d3_email$emailPublic,"ssn"=NA,students_from_d1_in_d3_email$examPublic)

# adjust column names to match
colnames(students_from_d1_in_d3_email)<-c("name","email","ssn","exam")

# all students from d2_PrivateSchool whose emailPivate are in the d3_StateSchools
students_from_d2_in_d3_email<-d2_PrivateSchool[which(d2_PrivateSchool$emailPivate %in% d3_StateSchools$emailState),]

# adjust column names to match
colnames(students_from_d2_in_d3_email)<-c("name","email","ssn","exam")

# all students from d2_PrivateSchool whose ssnPrivate are in the d3_StateSchools
students_from_d2_in_d3_SSN<-d2_PrivateSchool[which(d2_PrivateSchool$ssnPrivate %in% d3_StateSchools$ssnState),]

# adjust column names to match
colnames(students_from_d2_in_d3_SSN)<-c("name","email","ssn","exam")

# Final dataframe
Final_Desired_df<-rbind(students_from_d1_in_d3_email,students_from_d2_in_d3_email,students_from_d2_in_d3_SSN)


# Remove duplicate students in final dataframe
Final_Desired_df<-unique(Final_Desired_df)

Upvotes: 1

Related Questions