Reputation: 173
I am working with PLINK to analyse genome-wide data.
Does anyone know how to remove duplicated SNPs?
Upvotes: 6
Views: 18750
Reputation: 1
I was just tackling a similar problem. A few SNPs in my .bim file were present two, three or even four times (PLINK treats them all as duplicate variants). First, you can identify the problematic positions with the PLINK 1.9 flag --list-duplicate-vars. Second, to keep one copy and remove the subsequent repeats, use the PLINK 2 flag --rm-dup force-first.
plink2 --bfile inputfile --rm-dup force-first --make-bed --out file_cleaned_data
Worked like a charm for me.
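Since not everyone has plink2 to hand, here is a coreutils-only sketch of what the two steps do, run on an invented toy .bim (six tab-separated columns: chr, variant ID, cM, bp, allele 1, allele 2). This is my reading of the force-first semantics (keep the first occurrence of each ID), not plink2 itself:

```shell
# Toy .bim; rs2 appears twice -- the situation --rm-dup resolves.
printf '1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tG\tA\n' > toy.bim

# Step 1, detection: list IDs occurring more than once.
cut -f 2 toy.bim | sort | uniq -d
# -> rs2

# Step 2, the force-first idea: keep only the first line per ID.
awk '!seen[$2]++' toy.bim > toy_dedup.bim
cut -f 2 toy_dedup.bim
# -> rs1, rs2, rs3 (one per line)
```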
Upvotes: 0
Reputation: 121
A couple of other ideas that might be of help/interest:
You can also remove duplicates from a VCF using bcftools, with bcftools norm -D (long form --remove-duplicates).
bcftools documentation can be found at https://samtools.github.io/bcftools/bcftools.html
In the spirit of just using Unix to remove duplicates, I've previously used the following (input is a compressed VCF file): gunzip -c input.vcf.gz | grep -v "^#" | cut -f3 | sort | uniq -d > plink.dupvar
plink.dupvar is the filename the PLINK program looks for when performing the duplication removal step.
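To see the pipeline in action on an invented toy VCF (uncompressed here just for the sketch; real input would come from gunzip -c as in the one-liner):

```shell
# Toy VCF: header lines start with '#'; rs2 is listed twice.
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\n' >> toy.vcf
printf '1\t100\trs1\tA\tG\n1\t200\trs2\tC\tT\n1\t200\trs2\tC\tT\n1\t300\trs3\tG\tA\n' >> toy.vcf

# Same pipeline, minus the decompression step: drop headers, take the
# ID column, and let uniq -d report IDs that occur more than once.
grep -v "^#" toy.vcf | cut -f3 | sort | uniq -d > plink.dupvar
cat plink.dupvar
# -> rs2
```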
Upvotes: 2
Reputation: 51
In PLINK 1.9, use --list-duplicate-vars suppress-first, which lists duplicates but omits the first variant in each group from the report, so that excluding the listed IDs keeps one copy of each. I've known this to slip up, though.
Instead of using --exclude as Davy suggested, you can also use --extract, which keeps the listed SNPs rather than removing them. There's an easy way to build the list on any Unix-based system (assuming your data are in PED/MAP format and split by chromosome):
for i in {1..22}; do
  cut -f 1-4 yourfile_chr${i}.map | uniq | cut -f 2 > keepers_chr${i}.txt;
done
This will create a keepers_chr${i}.txt file per chromosome, holding the IDs of SNPs at unique locations. Then run PLINK on your original file(s) with --extract keepers_chr${i}.txt and --make-bed --out unique_file.
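One iteration of the loop, worked by hand on an invented toy .map (four tab-separated columns: chr, SNP ID, cM, bp), to show the intent; note that --extract expects a single column of variant IDs, hence the final cut down to field 2:

```shell
# Toy .map for one chromosome; the rs2 line appears twice.
printf '1\trs1\t0\t100\n1\trs2\t0\t200\n1\trs2\t0\t200\n1\trs3\t0\t300\n' > toy_chr1.map

# Collapse identical rows (duplicates are adjacent in a position-sorted
# .map), then keep only the SNP ID column for --extract.
cut -f 1-4 toy_chr1.map | uniq | cut -f 2 > keepers_chr1.txt
cat keepers_chr1.txt
# -> rs1, rs2, rs3 (one per line)
```

Beware that uniq only collapses fully identical rows, so two variants at the same position with different IDs would both survive this filter.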
Upvotes: 5
Reputation: 1
With R it is easier, although you have to use a TPED file. Once you manage to get a TPED file, just copy and paste this into an R console:
a = read.table("yourfile.TPED", sep = " ", header = FALSE)
b = a[!duplicated(a$V2), ]  # V2 is the SNP ID column; keeps the first occurrence
write.table(b, file = "newfile.TPED", sep = " ", quote = FALSE, col.names = FALSE, row.names = FALSE)
The newfile.TPED without duplicates will appear in the R working directory.
HINT: change the yourfile.TPED and newfile.TPED parts of the script to the actual names of your files.
Upvotes: 0
Reputation: 4939
There is no command to do it automatically that I am aware of, but the way I have done it in the past is: get a list of SNPs that are duplicated, rename the duplicates to e.g. rs1001.dup, and apply the renaming with --update-name. All the duplicate entries will then have .dup at the end of their names, so you can list them and run --exclude duplicateSNPs.txt --make-bed --out yourfilename.dups.removed
Getting a list of SNPs that are duplicated shouldn't be too hard if you are familiar with R. Sorry to give you a "well just learn X!!!" answer.
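If R isn't to hand, the first step (listing the duplicated IDs) can also be done with awk; a sketch on an invented toy .bim, where the ID is column 2:

```shell
# Toy .bim (chr, ID, cM, bp, A1, A2); rs2 occurs twice.
printf '1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n1\trs2\t0\t200\tC\tT\n' > mini.bim

# Print each ID from its second occurrence onwards, then dedupe the
# report itself -- this is the raw material for the renaming step.
awk 'seen[$2]++ { print $2 }' mini.bim | sort -u > duplicateSNPs.txt
cat duplicateSNPs.txt
# -> rs2
```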
Upvotes: 3