Reputation: 87

Find Duplicates in R based on multiple characters

I can't seem to remember how to code this properly in R.

I want to remove duplicates within a CSV file based on multiple entries: first name and last name, which are stored in separate columns.

I can write file[!duplicated(file$First.Name), ], but that only looks at the first name; I want it to look at the last name simultaneously.

If this is my starting file:

    Steve Jones
    Eric Brown
    Sally Edwards
    Steve Jones
    Eric Davis

I want the output to be

    Steve Jones
    Eric Brown
    Sally Edwards
    Eric Davis

Only rows where both the first name and the last name match an earlier row should be removed.

Upvotes: 0

Answers (4)

Sven Hohenstein

Reputation: 81733

You can use

file[!duplicated(file[c("First.Name", "Last.Name")]), ]
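As a minimal sketch of how this works: `duplicated()` on a data frame returns TRUE for each row that repeats an earlier row, so restricting it to the two name columns compares both names at once. The data frame name `file` and the column names follow the question; the `Age` column is a hypothetical extra column, added only to show that other columns are ignored.

```r
# Sample data from the question; Age is a hypothetical extra column
file <- data.frame(
  First.Name = c("Steve", "Eric", "Sally", "Steve", "Eric"),
  Last.Name  = c("Jones", "Brown", "Edwards", "Jones", "Davis"),
  Age        = c(30, 41, 25, 52, 38),
  stringsAsFactors = FALSE
)

# TRUE for rows whose (First.Name, Last.Name) pair appeared in an earlier row
dup <- duplicated(file[c("First.Name", "Last.Name")])

# Keep only the first occurrence of each name pair
deduped <- file[!dup, ]
print(deduped)
```

The second "Steve Jones" row is dropped even though its `Age` differs, because only the two name columns are compared.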

Upvotes: 1

akrun

Reputation: 887851

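If the names are stored in a single column, use sub to remove the leading substring (i.e. the first name) and the whitespace that follows it, then build a logical vector with !duplicated(...) on the result and use it to subset the rows of the dataset. Note that this compares the last names only.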

df1[!duplicated(sub("\\w+\\s+", "", df1$Col1)),,drop=FALSE]
#           Col1
#1   Steve Jones
#2    Eric Brown
#3 Sally Edwards
#5    Eric Davis
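The `sub()` call above leaves only the last name, so two different people who share a last name would be collapsed into one row. If the comparison should be on the full name instead, a sketch along these lines may be closer to what the question asks for. The names `df1` and `Col1` follow the answer, the data is the sample from the question, and the `trimws()` step is an added assumption to guard against stray leading or trailing spaces.

```r
df1 <- data.frame(
  Col1 = c("Steve Jones", "Eric Brown", "Sally Edwards",
           "Steve Jones", "Eric Davis"),
  stringsAsFactors = FALSE
)

# Compare the whole string rather than the last name alone
full_name <- trimws(df1$Col1)

# drop = FALSE keeps the result a data frame even with one column
res <- df1[!duplicated(full_name), , drop = FALSE]
print(res)
```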

If the names are in two columns and the dataset has only those two columns, call duplicated directly on the dataset to get the logical vector, negate it, and use it to subset the rows.

df1[!duplicated(df1), , drop=FALSE]
#  first.name second.name
#1      Steve       Jones
#2       Eric       Brown
#3      Sally     Edwards
#5       Eric       Davis

Upvotes: 1

Kunal Puri

Reputation: 3427

Here is a solution using data.table, which may perform better on large datasets (assuming First Name and Last Name are stored in separate columns):

> df <- read.table(text = 'Steve Jones
+     Eric Brown
+     Sally Edwards
+     Steve Jones
+     Eric Davis')

> colnames(df) <- c("First.Name","Last.Name")
> df
  First.Name Last.Name
1      Steve     Jones
2       Eric     Brown
3      Sally   Edwards
4      Steve     Jones
5       Eric     Davis

The data.table-specific code begins here:

> library(data.table)
> dt <- setDT(df)
> unique(dt,by=c('First.Name','Last.Name'))
   First.Name Last.Name
1:      Steve     Jones
2:       Eric     Brown
3:      Sally   Edwards
4:       Eric     Davis

Upvotes: 1

Zahiro Mor

Reputation: 1718

Try:

File[!duplicated(paste(File$First.Name, File$Last.Name)), ]
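One caveat with the `paste()` approach: with the default space separator, different first/last name combinations can collapse into the same key. The sketch below uses hypothetical data to show this, together with a separator (`"|"`, chosen arbitrarily and assumed not to occur in any name) that keeps the pairs distinct.

```r
# Hypothetical data: rows 1 and 2 are different people
File <- data.frame(
  First.Name = c("Mary Ann", "Mary", "Steve", "Steve"),
  Last.Name  = c("Lee", "Ann Lee", "Jones", "Jones"),
  stringsAsFactors = FALSE
)

# Default separator: rows 1 and 2 both become "Mary Ann Lee"
key_space <- paste(File$First.Name, File$Last.Name)

# A separator assumed absent from the data keeps the pairs distinct
key_safe <- paste(File$First.Name, File$Last.Name, sep = "|")

wrong <- File[!duplicated(key_space), ]  # also drops row 2
right <- File[!duplicated(key_safe), ]   # drops only the repeated Steve Jones
print(right)
```

Passing the columns as a data frame to `duplicated()` avoids having to choose a separator at all.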

Upvotes: 0
