Reputation: 387
I'm trying to access genome data within SciKit Allele, a tool used for genome data, based on Numpy.
I'm not great with python, but am trying to iterate through each variant and extract the relevant column in the array to then create nodes in a Neo4j database using the Neo4j Rest Client.
The code below generates an array of all variants and all data types:
variants = allel.VariantChunkedTable(callset[chrom]['variants'], names=['AC','AF_AFR', 'AF_AMR', 'AF_ASN', 'AF_EUR', 'AF_MAX', 'CGT', 'CLR', 'CSQ', 'DP', 'DP4', 'ESP_MAF', 'FILTER_LowQual', 'FILTER_MinHWE', 'FILTER_MinVQSLOD', 'FILTER_PASS', 'HWE', 'ICF', 'ID', 'IS', 'PC2', 'PCHI2', 'POS', 'PR', 'QCHI2', 'QUAL', 'REF', 'ALT', 'INDEL', 'SHAPEIT', 'SNP_ID', 'TYPE', 'UGT', 'VQSLOD', 'dbSNPmismatch', 'is_snp', 'numalt', 'svlen'], index='POS')
I (think I) have declared the variables in an array form as follows:
pos = variants['POS'][:]
alt = variants['ALT'][:]
dp = variants['DP'][:]
ac = variants['AC'][:]
type = variants['TYPE'][:]
svlen = variants['svlen'][:]
qual = variants['QUAL'][:]
vq = variants['VQSLOD'][:]
These variables create arrays, such as:
In: pos
Out: array([ 28590, 50481, 52152, ..., 249225077, 249229702,
249231222], dtype=int32)
I'm now trying to access the variables for each row, but can't seem to work out how to do so. My current attempt looks like this (for the first 10 rows):
for variant in variants[0:10]:
a1 = db.nodes.create(pos=pos[variant], bp=alt[variant][0], DP=dp[variant], AC=ac[variant][0], type=type[variant][0], svlen=svlen[variant][0], qual=qual[variant], vqslod=vq[variant])
a1.relationships.create("belongs_to", c1)
Unfortunately, this comes up with the following error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Can anyone help me figure out how to get the specific variable for each attribute?
Upvotes: 0
Views: 61
Reputation: 458
Since I wasn't able to follow your code without a reproducible example, I had to create one based on the scikit-allel documentation:
https://scikit-allel.readthedocs.io/en/stable/model/chunked.html#variantchunkedtable
import h5py
import allel
import os
#cleanup
h5_file = 'callset.h5'
os.remove(h5_file) if os.path.exists(h5_file) else None
chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3', b'chr3']
pos = [2, 7, 3, 9, 6, 11]
dp = [35, 12, 78, 22, 99, 96]
qd = [4.5, 6.7, 1.2, 4.4, 2.8, 3.2]
ac = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)]
with h5py.File(h5_file, mode='w') as h5f:
h5g = h5f.create_group('/3L/variants')
h5g.create_dataset('CHROM', data=chrom, chunks=True)
h5g.create_dataset('POS', data=pos, chunks=True)
h5g.create_dataset('DP', data=dp, chunks=True)
h5g.create_dataset('QD', data=qd, chunks=True)
h5g.create_dataset('AC', data=ac, chunks=True)
callset = h5py.File(h5_file, mode='r')
variants = allel.VariantChunkedTable(callset['/3L/variants'],
names=['CHROM', 'POS', 'AC', 'QD', 'DP'])
So, the variants
variable (with just 6 rows in this example) looks like that:
>>> variants
<VariantChunkedTable shape=(6,) dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('AC', '<i8', (2,)), ('QD', '<f8'), ('DP', '<i8')]
nbytes=264 cbytes=264 cratio=1.0 values=h5py._hl.group.Group>
CHROM POS AC QD DP
0 b'chr1' 2 [1 2] 4.5 35
1 b'chr1' 7 [3 4] 6.7 12
2 b'chr2' 3 [5 6] 1.2 78
3 b'chr2' 9 [7 8] 4.4 22
4 b'chr3' 6 [ 9 10] 2.8 99
5 b'chr3' 11 [11 12] 3.2 96
You have defined the pos
, alt
, dp
, etc. arrays correctly (i.e. pos = variants['POS'][:]
, etc.)
Then, in your loop, I assume your goal is to iterate through the first 10 rows of the variants
variable, get some values from each row, e.g. pos[variant]
, ac[variant][0]
, dp[variant]
and create a new node in the GraphDatabase with these attributes.
The way you've written the loop currently, you get a full row from variants
at each iteration and you try to use it as an index to access elements from the pos
, alt
, ... arrays, which throws the error.
The right way to do it is to iterate through numeric indices; in my example, to iterate over all 6 rows of the variants
variable you should run:
for i in range(len(variants)):
print(f"> Row {i}")
print(pos[i])
print(dp[i])
print(ac[i][0])
The pos[i], dp[i] etc. values can then be fed to db.nodes.create
in name=value
pairs.
Of course for the first 10 rows you just need to use for i in range(10)
.
Upvotes: 1