Bahareh

Reputation: 1

Converting PLINK binary files into a Python dataframe

I'm working with a genetic dataset (roughly 23,000 samples and 300,000 SNPs as features). My data comes as PLINK binary format files (.bed, .bim, .fam).

My aim is to convert them into (pandas) dataframes and then start my predictive analysis in Python (it's a machine learning project).

I was advised to combine all three binary files into one VCF (variant call format) file. Using PLINK, the resulting VCF file is 26 GB. There are Python packages and code snippets for converting VCF files into pandas dataframes, but my remote system's memory is limited (15 GiB). Due to the nature of the dataset, I can only work on university computers.

My question is: given all these limitations, how do I convert my dataset into a dataframe that I can use for machine learning? Let me know if you need more details.

Upvotes: 0

Views: 650

Answers (1)

Daniel King

Reputation: 510

Why are you trying to convert it to a VCF?

Unfortunately, I don't think you can load the whole dataset into Python. 23,000 samples by 300,000 variants is ~1.7 GB if each genotype is 2 bits; however, I suspect your machine learning algorithm will expect 32-bit or 64-bit floating point numbers. Using 64-bit floats, you'll need 55 GB.
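For reference, the back-of-the-envelope arithmetic in plain Python:

n_samples, n_variants = 23_000, 300_000
n_genotypes = n_samples * n_variants   # 6.9 billion genotypes
print(n_genotypes * 2 / 8 / 1e9)       # 2 bits each: ~1.7 GB
print(n_genotypes * 8 / 1e9)           # 64-bit floats: ~55 GB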

You might try using the Python library Hail (disclaimer: I'm a Hail maintainer). You can stream through the data row by row.

import hail as hl

# Lazily reads the PLINK fileset; nothing is loaded into memory yet.
mt = hl.import_plink(bed='...bed', bim='...bim', fam='...fam')
# Preview the first few rows.
mt.show()

You can use Hail to filter to a smaller set of useful variants and then dump those into your machine learning system. For example, you can filter to relatively rare variants:

# Compute per-variant QC metrics, including allele frequencies (AF).
mt = hl.variant_qc(mt)
mt = mt.filter_rows(
    (mt.variant_qc.AF[0] < 0.1) | (mt.variant_qc.AF[0] > 0.9)
)
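This keeps sites where either allele is relatively rare: variant_qc.AF holds one frequency per allele, including the reference, so AF[0] < 0.1 means the reference allele is rare and AF[0] > 0.9 means the alternate allele is rare.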

import numpy as np
# n_alt_alleles() turns each genotype call into a 0/1/2 count;
# collect() pulls the entries back as a flat, variant-major list.
dataset = np.array(hl.float(mt.GT.n_alt_alleles()).collect())
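If the filtered data is small enough to collect, here is a minimal sketch of reshaping it into the usual samples-by-features layout for a machine learning library. One caveat, as an assumption on my part: this presumes no missing genotypes, since collect() returns None for missing calls, which you would need to impute or drop first.

n_variants = mt.count_rows()
n_samples = mt.count_cols()
# The flat list is variant-major: reshape to (variants, samples),
# then transpose so rows are samples and columns are SNP features.
X = dataset.reshape(n_variants, n_samples).T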

Upvotes: 0
