IndexError when assigning variables to Neo4jRestClient

Question

I'm trying to access genome data within SciKit Allele, a tool used for genome data, based on Numpy.

I'm not great with python, but am trying to iterate through each variant and extract the relevant column in the array to then create nodes in a Neo4j database using the Neo4j Rest Client.

The code below generates an array of all variants and all data types:

variants = allel.VariantChunkedTable(callset[chrom]['variants'], names=['AC','AF_AFR', 'AF_AMR', 'AF_ASN', 'AF_EUR', 'AF_MAX', 'CGT', 'CLR', 'CSQ', 'DP', 'DP4', 'ESP_MAF', 'FILTER_LowQual', 'FILTER_MinHWE', 'FILTER_MinVQSLOD', 'FILTER_PASS', 'HWE', 'ICF', 'ID', 'IS', 'PC2', 'PCHI2', 'POS', 'PR', 'QCHI2', 'QUAL', 'REF', 'ALT', 'INDEL', 'SHAPEIT', 'SNP_ID', 'TYPE', 'UGT', 'VQSLOD', 'dbSNPmismatch', 'is_snp', 'numalt', 'svlen'], index='POS')

I (think I) have declared the variables in an array form as follows:

pos = variants['POS'][:]
alt = variants['ALT'][:]
dp = variants['DP'][:]
ac = variants['AC'][:]
type = variants['TYPE'][:]
svlen = variants['svlen'][:]
qual = variants['QUAL'][:]
vq = variants['VQSLOD'][:]

These variables create arrays, such as:

In: pos
Out: array([    28590,     50481,     52152, ..., 249225077, 249229702,
       249231222], dtype=int32)

I'm now trying to access the variables for each row, but can't seem to work out how to do so. My current attempt looks like this (for the first 10 rows):

for variant in variants[0:10]:
    a1 = db.nodes.create(pos=pos[variant], bp=alt[variant][0], DP=dp[variant], AC=ac[variant][0], type=type[variant][0], svlen=svlen[variant][0], qual=qual[variant], vqslod=vq[variant])
    a1.relationships.create("belongs_to", c1)

Unfortunately, this comes up with the following error:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Can anyone help me figure out how to get the specific variable for each attribute?

dvitsios · Accepted Answer

Since I wasn't able to follow your code without a reproducible example, I had to create one based on the scikit-allel documentation:

https://scikit-allel.readthedocs.io/en/stable/model/chunked.html#variantchunkedtable

import h5py
import allel
import os

#cleanup
h5_file = 'callset.h5'
os.remove(h5_file) if os.path.exists(h5_file) else None

chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3', b'chr3']
pos = [2, 7, 3, 9, 6, 11]
dp = [35, 12, 78, 22, 99, 96]
qd = [4.5, 6.7, 1.2, 4.4, 2.8, 3.2]
ac = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)]

with h5py.File(h5_file, mode='w') as h5f:
    h5g = h5f.create_group('/3L/variants')
    h5g.create_dataset('CHROM', data=chrom, chunks=True)
    h5g.create_dataset('POS', data=pos, chunks=True)
    h5g.create_dataset('DP', data=dp, chunks=True)
    h5g.create_dataset('QD', data=qd, chunks=True)
    h5g.create_dataset('AC', data=ac, chunks=True)


callset = h5py.File(h5_file, mode='r')
variants = allel.VariantChunkedTable(callset['/3L/variants'],
                                names=['CHROM', 'POS', 'AC', 'QD', 'DP'])

So, the variants variable (with just 6 rows in this example) looks like that:

>>> variants

    CHROM   POS AC      QD  DP
0   b'chr1' 2   [1 2]   4.5 35
1   b'chr1' 7   [3 4]   6.7 12
2   b'chr2' 3   [5 6]   1.2 78
3   b'chr2' 9   [7 8]   4.4 22
4   b'chr3' 6   [ 9 10] 2.8 99
5   b'chr3' 11  [11 12] 3.2 96

You have defined the pos, alt, dp, etc. arrays correctly (i.e. pos = variants['POS'][:], etc.)

Then, in your loop, I assume your goal is to iterate through the first 10 rows of the variants variable, get some values from each row, e.g. pos[variant], ac[variant][0], dp[variant] and create a new node in the GraphDatabase with these attributes.

The way you've written the loop currently, you get a full row from variants at each iteration and you try to use it as an index to access elements from the pos, alt, ... arrays, which throws the error.

The right way to do it is to iterate through numeric indices; in my example, to iterate over all 6 rows of the variants variable you should run:

for i in range(len(variants)):
    print(f"> Row {i}")
    print(pos[i])
    print(dp[i])
    print(ac[i][0])

The pos[i], dp[i] etc. values can then be fed to db.nodes.create in name=value pairs. Of course for the first 10 rows you just need to use for i in range(10).

IndexError when assigning variables to Neo4jRestClient

Answers (1)

Related Questions