Reputation: 549
I am working on an XQuery library for getting simple geospatial information from GPS files (it's called GPXQuery and available at GitHub). GPX files often contain tracks of GPS coordinates, and can get quite big. My biggest test file has 20'000 points in it. GPX is very simple:
<gpx version="1.1" xmlns="http://www.topografix.com/GPX/1/1">
<trk>
<name>Berkeley Test Walk #1</name>
<trkseg>
<trkpt lon="-122.26794633083045" lat="37.878523925319314">
<ele>78.4000015258789</ele>
There is a long sequence of <trkpt>
elements, representing all the recorded GPS coordinates. I want to be able to process at least 100'000, hopefully more.
My first slightly complicated function calculates the distance of a recorded GPS track. The math does not matter here. The problem is that I run into stack issues. For my 20'000 point example, my standard Saxon setup already stops. I am sure that this could be fixed with more generous memory allocation, but I am wondering if something more fundamental may be going on.
My function should qualify for tail recursion optimization, but that's a bit hard to tell and may differ from product to product. Here's the function(s), and they are invoked by gpxquery:trk-distance($GPX/gpx:gpx)[1]
to get the distance of the first GPX track in a given GPX document $GPX
:
module namespace gpxquery = "https://github.com/dret/GPXQuery";
declare namespace xsd = "http://www.w3.org/2001/XMLSchema";
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";
declare namespace gpx = "http://www.topografix.com/GPX/1/1";
declare variable $gpxquery:earth-radius := 6371000.0;
declare function gpxquery:trk-distance($gpx as element(gpx:gpx))
as xsd:float*
{
for $trk in 1 to count($gpx/gpx:trk)
return sum(gpxquery:trk-distance-recurse($gpx/gpx:trk[$trk]/gpx:trkseg/gpx:trkpt))
};
declare function gpxquery:trk-distance-recurse($trkpts as element(gpx:trkpt)*)
as xsd:float*
{
if ( count($trkpts) le 1 )
then 0
else (
gpxquery:distance-between-points($trkpts[1]/@lat, $trkpts[1]/@lon, $trkpts[2]/@lat, $trkpts[2]/@lon) ,
gpxquery:trk-distance-recurse($trkpts[position() gt 1])
)
};
declare function gpxquery:distance-between-points($lat1 as xsd:float, $lon1 as xsd:float, $lat2 as xsd:float, $lon2 as xsd:float)
as xsd:float
{
let $dlat := ($lat2 - $lat1) * math:pi() div 180
let $dlon := ($lon2 - $lon1) * math:pi() div 180
let $rlat1 := $lat1 * math:pi() div 180
let $rlat2 := $lat2 * math:pi() div 180
let $a := math:sin($dlat div 2) * math:sin($dlat div 2) + math:sin($dlon div 2) * math:sin($dlon div 2) * math:cos($rlat1) * math:cos($rlat2)
let $c := 2 * math:atan2(math:sqrt($a), math:sqrt(1-$a))
return xsd:float($c * $gpxquery:earth-radius)
};
Is there anything I should/could do differently in terms of code structure and algorithm to avoid these memory issues for bigger files? Or does this look like a reasonable approach for the general problem, and I whoever is using the library simply has to make sure that the runtime environment can deal with the requirements of deeply nested recursive calls?
Any feedback from people working with recursive functions and running into similar issues would be greatly appreciated.
Upvotes: 3
Views: 609
Reputation: 549
the non-recursive version: shorter, faster, and less memory required:
declare function gpxquery:trk-distance($gpx as element(gpx:gpx))
as xsd:double*
{
for $trk in 1 to count($gpx/gpx:trk)
let $trkpts := $gpx/gpx:trk[$trk]/gpx:trkseg/gpx:trkpt
return sum(
for $i in 1 to count($trkpts)-1
return gpxquery:haversine($trkpts[$i]/@lat, $trkpts[$i]/@lon, $trkpts[$i+1]/@lat, $trkpts[$i+1]/@lon)
)
};
declare function gpxquery:haversine($lat1 as xsd:double, $lon1 as xsd:double, $lat2 as xsd:double, $lon2 as xsd:double)
as xsd:double
{
(: This is the Haversine formula as described by http://stackoverflow.com/questions/365826/calculate-distance-between-2-gps-coordinates :)
let $dlat := ($lat2 - $lat1) * math:pi() div 180
let $dlon := ($lon2 - $lon1) * math:pi() div 180
let $rlat1 := $lat1 * math:pi() div 180
let $rlat2 := $lat2 * math:pi() div 180
let $a := math:sin($dlat div 2) * math:sin($dlat div 2) + math:sin($dlon div 2) * math:sin($dlon div 2) * math:cos($rlat1) * math:cos($rlat2)
let $c := 2 * math:atan2(math:sqrt($a), math:sqrt(1-$a))
return xsd:double($c * 6371000.0)
};
Upvotes: 0
Reputation: 163595
Saxon doesn't identify this function as tail recursive, because it applies an operator (the comma operator) to the result of the recursion, and any processing applied to the result disqualifies it as a tail call.
Interestingly if you rewrite it as an XSLT named template then it probably will qualify as tail recursive. That's because XSLT named templates (in Saxon) are natively evaluated in "push" mode - they write their results sequentially to an output sequence - which means that the "," operation is in effect implicit. With a bit of effort one could probably devise a similar strategy for functions.
But I'm trying to understand why this recursion is even necessary.
As far as I can see the algorithm you are implementing is to take a sequence $S and compute something along the lines of
f($S[1], $S[2]) + f($S[2], $S[3]) + f($S[3], $S[4]) ...
that is, you're applying a function to successive pairs of adjacent values and then computing the sum of these function applications.
You could write this as
sum(for-each-pair($S, tail($S), $f))
where $f is the function to be applied.
Or in more conventional XQuery 1.0 style you could write something like
sum(
for $i in 1 to count($S)-1
return f($S[i], $S[i+1])
)
Upvotes: 4