user1236418
user1236418

Reputation: 173

How do I remove duplicated SNPs using PLink?

I am working with PLINK to analyse genome-wide data.

Does anyone know how to remove duplicated SNPs?

Upvotes: 6

Views: 18750

Answers (5)

Debashruti
Debashruti

Reputation: 1

I was just tackling a similar problem. A few SNPs in my .bim file were present in duplicates, triplets or quadruplets (PLINK recognises all as duplicate variants). Firsly, you can recognise which are the problematic positions using PLINK 1.9 functionality --list-duplicate-vars. Secondly, to keep one copy and remove subsequent repetitions we need to use PLINK2 functionality: --rm-dup force-first

Worked like a charm for me.

plink2 --bfile inputfile --rm-dup force-first --make-bed --out file_cleaned_data

Upvotes: 0

Yuri Plotkin
Yuri Plotkin

Reputation: 121

A couple of others ideas that might be of help/interest:

  1. You can also remove vcf duplicates using bcftools with the command bcftools norm -D, --remove-duplicates bcftools documentation can be found at https://samtools.github.io/bcftools/bcftools.html

  2. In the spirit of also just using Unix to remove duplicates, I've previously used the following (input is a compressed vcf file) gunzip -c input.vcf.gz | grep "^[^##]" | cut -f3 | sort | uniq -d > plink.dupvar plink.dupvar is the filename the PLINK program looks for when performing the duplication removal step.

Upvotes: 2

Benjamatic
Benjamatic

Reputation: 51

In PLINK 1.9, use --list-duplicate-vars suppress-first, which will list duplicates, and remove one (the first one), leaving the other intact. I've know this to slip up though.

Instead of using --exclude as Davy suggested, you can also use --extract, keeping rather than getting rid of a list of SNPs. There's an easy method on any Unix based system (assuming your data are in PED/MAP format and cut up by chromossome):

for i in {1..22}; do
  cat yourfile_chr${i}.map | grep "$i" | cut -f -4 | uniq | cut -f -2 | keepers_chr${i}.txt;
done

This will create a keepers_chr.txt file with SNP IDs for SNPs at unique locations. Then run PLINK feeding it your original file(s) and use --extract keepers_chr, with --make-bed --out unique_file

Upvotes: 5

user2765374
user2765374

Reputation: 1

With R is easier, although you have to use a TPED file. Once you manage to get a TPED file just copy and paste this into a R console:

a = read.table("yourfile.TPED",sep = " ",header=FALSE)
b = a[!duplicated(a$V2),]
write.table(b,file="newfile.TPED",sep=" ",quote = FALSE,col.names = FALSE, row.names=FALSE)

The newfile.TPED without duplicates will apper in the R working directory. HINT: you can change the yourfile.TPED and newfile.TPED part of the script for the actual name of you file.

Upvotes: 0

Davy Kavanagh
Davy Kavanagh

Reputation: 4939

There is no command to do it automatically that I am aware of, but the way I have done it in the past is to get a list of SNPs that are duplicated, change the duplicates to rs1001.dup for example, then run --update-allele --update-name and then create a list of the duplicates, so all the entries will have .dup at the end of their names, and then run --extract duplicateSNPs.txt --make-bed --out yourfilename.dups.removed

Getting a list of SNPs that are duplicated shouldn't be too hard if you are familiar with R. Sorry to give you a "well just learn X!!!" answer

Upvotes: 3

Related Questions