agodinezmm
agodinezmm

Reputation: 1

Merging two csv files to remove duplicates from output file using dplyr - R

I have two CSV files: One file contains contact information for survey respondents which we have yet to survey, and another csv files contains information for survey respondents we have already contacted. I need to make a new csv file of contacts we have not contacted and I believe I can do this by merging the two files in R and creating a new file which excludes the already contacted survey respondents through R. I am very new to R and want to sharpen my skills, but this task is a little over my head and would appreciate any help or advice

Upvotes: 0

Views: 710

Answers (1)

Zoe
Zoe

Reputation: 1000

So, since I don't know your data, let's assume you have the data already read in using read.csv(). df1 is the first dataset containing all people and their contact info. df2 contains their answers. Both share the same ID (without unique ID, this won't work, but maybe you have to rename a column or two) which is stored in the column "customer_ID".

# this is my dummy data
df1 <- data.frame("customer_ID" = 1:100,
                  "address" = rep(c("saturn", "mars"), 25))
df2 <- data.frame("customer_ID" = c(1:25, 75:99),
                  "likes_apples" = rep(c(TRUE, FALSE), 50))

You could extract the ID's from both tables and combine them. I converted it into a data frame so we can use dplyr.

df_combined <- data.frame("customer_ID" = c(df1$customer_ID, df2$customer_ID))

When you have the IDs together, you can group the data by "customer_ID" and count the number of data points per group. You store only those IDs which occur only once.

once_only <- df_combined %>%
  group_by(customer_ID) %>%
  filter(n() == 1)

You can then filter the data frame with the contact info on whether the ID is contained in the filtered data points:

df1[df1$customer_ID %in% once_only$customer_ID,]

I bet there are many better ways to do this, but I hope it helped anyway!

Edit: Okay, so I obviously learned something new from the comments. The way easier method would be:

anti_join(df1, df2, by="customer_ID")

Upvotes: 2

Related Questions