Reputation: 1
The XML schema for the BLAST output is shown here: https://www.ncbi.nlm.nih.gov/data_specs/schema_alt/NCBI_BlastOutput2.xsd (outfmt 16 on the BLAST+ command line).
The aim is to:
So far I have managed to filter on one search term and one blacklist item, but I had to use some very strange pathing to get there.
In the .xml document it is a simple path, for example Hit/description/HitDescr/title. You can see below that I had to repeatedly use //*:title[1] or similar, even after I'd pulled an item out and even when there's only one. This means the code breaks if I want to use 'Search' instead of 'Hit' and pull out the query name for the .csv. I get an error saying title should be an item, not a series, but I've specified title as [1] and it's doing my head in. The error also occurs for bit-score if the database has more than one .xml file in it, for some strange reason. It worked for a database with exactly one .xml file in it.
declare namespace blast = "http://www.ncbi.nlm.nih.gov";
declare variable $searchTerm as xs:string external := "virus";
declare variable $blacklist as xs:string external := "Phage";
declare variable $bitscore as xs:int external := 50;

(: options passed to csv:serialize :)
let $options := map {
  'format': 'xquery',
  'header': true(),
  'separator': 'comma'
}
let $hits := //*:Hit
(: keep hits with a title containing the search term, a title not containing
   the blacklist term, and a bit-score above the threshold :)
let $hasParams :=
  for $hit in $hits
  where $hit//*:title[1][not(text() contains text {$blacklist})]
    and $hit//*:title[1][text() contains text {$searchTerm}]
    and $hit//*:bit-score[1][data() > $bitscore]
  return $hit
(: one record per remaining hit: title and bit-score :)
let $data := map {
  'names': ['species name', 'bitscore'],
  'records': (
    for $entry in $hasParams
    return [string($entry//*:title), string($entry//*:bit-score)]
  )
}
return file:write(
  '/tmp/output.csv',
  csv:serialize($data, $options)
)
This works fine and forms a basis for building the .csv that I was previously producing with Python (which was slow, because I have an entire folder of .xml files to process at a time). It just seems wrong.
Reputation: 163342
I suspect (but it's a bit of a guess, because you don't describe the problem very precisely) that you're making the common mistake of writing $hit//*:title[1] when you meant ($hit//*:title)[1]. The former expression selects every title that is the first title child of its parent; the latter selects the first title among all descendants of $hit.
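To illustrate the difference, here is a minimal sketch on a made-up, un-namespaced fragment. It is not real BLAST output; the element names only mimic the Hit/description/HitDescr/title path from the question, and the two titles are invented.

```xquery
(: Hypothetical, simplified stand-in for a single Hit; real BLAST XML2
   documents are namespaced, which is why the question uses *: in its paths. :)
let $hit :=
  <Hit>
    <description>
      <HitDescr><title>Tobacco mosaic virus</title></HitDescr>
      <HitDescr><title>Enterobacteria phage T4</title></HitDescr>
    </description>
  </Hit>
return (
  (: [1] binds to the step "title", so it is evaluated per parent:
     each HitDescr contributes its first title, giving two nodes :)
  count($hit//title[1]),      (: 2 :)
  (: the parentheses build the full sequence first, then [1] picks one node :)
  count(($hit//title)[1]),    (: 1 :)
  string(($hit//title)[1])    (: "Tobacco mosaic virus" :)
)
```

If that is indeed the cause, the record line in the question would presumably become [string(($entry//*:title)[1]), string(($entry//*:bit-score)[1])]. The same reasoning would apply to bit-score whenever a hit contains more than one of them, which might explain why the error only appears with some inputs, though that part is a guess.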