Reputation: 3744
I have a data dump; the following is one row of it:
{,lat:26.3832456,distance:678.4075116373302,lon:120.4731951,address:tourism:viewpoint,},{,lat:26.3830149,distance:622.2862561842148,lon:120.473753,address:name:xe7,xbe,x85,xe6,xbc,xa2,xe5,x9d,xaa,tourism:viewpoint,},{,lat:26.3833609,distance:363.7364243757184,lon:120.4763708,address:name:xe5,x9c,x8b,xe4,xb9,x8b,xe5,x8c,x97,xe7,x96,x86,tourism:viewpoint,},{,lat:26.3823648,distance:223.60523114628876,lon:120.4821298,address:name:xe5,x90,x8e,xe6,xbe,xb3,natural:bay,},{,lat:26.3788243,distance:470.02293394005875,lon:120.480733,address:name:xe5,x90,x8e,xe6,xbe,xb3,xe5,xb1,xb1,source:GNS,natural:peak,},{,lat:26.3750042,distance:893.4290785528082,lon:120.4808826,address:name:xe8,x93,xae,xe8,x8a,xb1,xe5,x9c,x92,source:GNS,natural:peak,},{,lat:26.3763331,distance:742.92090763674,lon:120.4795115,address:name:xe8,xa5,xbf,xe5,xbc,x95,xe5,xb3,xb6,place:hamlet,source:GNS,},{,lat:26.378645,distance:623.327734488774,lon:120.4839399,address:source:PGS,natural:coastline,},{,lat:26.3801244,distance:418.6308872217763,lon:120.4772875,address:highway:residential,},{,lat:26.3791422,distance:434.6736862343828,lon:120.4792953,address:highway:residential,},{,lat:26.3779802,distance:739.2129423740619,lon:120.4751349,address:highway:unclassified,},{,lat:26.3770924,distance:675.0424314750977,lon:120.4815607,address:highway:residential,},{,lat:26.3760869,distance:798.0261247167285,lon:120.4821517,address:highway:path,},{,lat:26.3766434,distance:737.1372670528466,lon:120.4821003,address:highway:path,},{,lat:26.3813278,distance:384.84440601318613,lon:120.4766175,address:highway:path,},{,lat:26.3755092,distance:833.3985359252805,lon:120.4802778,address:highway:road,},{,lat:26.3785345,distance:496.6253230490143,lon:120.4799081,address:highway:road,}
The part within each pair of braces (i.e., "{...}") represents information about one entity. I need to compare the distance field of each pair of braces, and then display the content of the braces with the least distance. For instance, for the row above, I want to output the following:
{,lat:26.3823648,distance:223.60523114628876,lon:120.4821298,address:name:xe5,x90,x8e,xe6,xbe,xb3,natural:bay,}
as this is the one with the least value of the distance field.
How can I do this? I have written the following code just to extract all the distances so that I can compare them, but even that does not work:
require 'rubygems'
require 'mechanize'
require 'csv'
CSV.open('Output.csv', "wb") do |csv|
  CSV.foreach('Original.csv', :headers=>true) do |row|
    vector = row.split(",")
    dist = vector.match("^.*\/distance:\/(.*)\/")
    csv << dist
  end
end
My idea was to extract all the distances, compare them, find the smallest, go back to the original string to locate the braces with that particular distance, and then output the content in those braces. But this seems like a kind of convoluted way of doing this. Is there a more elegant way to output the brace with the smallest distance? Thanks.
Upvotes: 0
Views: 48
Reputation: 110665
Let str be a variable holding the given string.
The first step is to split the string on commas that are preceded by a right brace and followed by a left brace:
r0 = /
     (?<=}) # match a right brace in a positive lookbehind
     ,      # match a comma
     (?={)  # match a left brace in a positive lookahead
     /x     # free-spacing regex definition mode
arr = str.split(r0)
#=> ["{,lat:26.3832456,distance:678.4075116373302,lon:120.4731951,...}",
# "{,lat:26.3830149,distance:622.2862561842148,lon:120.473753,...}",
# ...
# "{,lat:26.3750042,distance:893.4290785528082,lon:120.4808826,...}",
# ...
# "{,lat:26.3785345,distance:496.6253230490143,lon:120.4799081,}"]
str.split(r0).size
#=> 17
Since the record with the least distance is wanted, we then apply min_by to that array, where min_by's block returns the distance for each string, expressed as a float.
r1 = /
     (?<=,distance:) # match ",distance:" in a positive lookbehind
     \d+             # match one or more digits
     \.              # match a decimal point
     \d+             # match one or more digits
     /x              # free-spacing regex definition mode
arr.min_by { |s| s[r1].to_f }
#=> "{,lat:26.3823648,distance:223.60523114628876,lon:120.4821298,...}"
I've assumed that every string in the array contains a distance field. If some strings may not, the above expression would be changed to:
arr.min_by { |s| (s[r1] || Float::INFINITY).to_f }
One would also need to check whether the string returned contains a distance field.
We can put this together in a single expression.
str.split(/(?<=}),(?={)/).
  min_by { |s| (s[/(?<=,distance:)\d+\.\d+/] || Float::INFINITY).to_f }
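The check for a missing distance field mentioned above could be sketched as follows, assuming the goal is the smallest distance as stated in the question. The helper name `nearest_record`, the constant `DISTANCE`, and the shortened sample string are made up for illustration; records without a distance are dropped before comparing, so the result is `nil` when no record has one.

```ruby
# Hypothetical constant: a distance value, with an optional fractional part.
DISTANCE = /(?<=,distance:)\d+(?:\.\d+)?/

# Hypothetical helper: returns the record with the smallest distance,
# or nil when no record carries a distance field.
def nearest_record(str)
  str.split(/(?<=\}),(?=\{)/)            # one string per {...} record
     .select { |s| s.match?(DISTANCE) }  # keep only records with a distance
     .min_by { |s| s[DISTANCE].to_f }    # compare distances as floats
end

# Invented sample; the middle record has no distance field.
sample = "{,lat:1.0,distance:5.5,lon:2.0,},{,lat:1.1,lon:2.1,},{,lat:1.2,distance:3.25,lon:2.2,}"
puts nearest_record(sample)
#=> {,lat:1.2,distance:3.25,lon:2.2,}
```

Filtering first avoids the sentinel value altogether, at the cost of matching the regex twice per record.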
Upvotes: 1
Reputation: 9226
Not very elegant, but it seems to work:
s.scan(/\{[^{}]*\}/).min_by { |r| r =~ /distance:(.*),/; $1.to_f }
where s would be your initial data dump as a string.
scan extracts the records from the initial data into an array (anything between a pair of braces that is not itself a brace is considered part of a record). min_by loops through that array looking for the record with the minimum value returned by the block passed to it; in this case the block is just a regex match that pulls the distance value out of the record.
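As a rough sketch of how this might be applied to a dump with several rows, the same scan and min_by steps can be wrapped in a method and run once per line. The method name `closest` and the two short sample rows are invented for illustration, and the capture is narrowed to digits and dots; the sketch assumes every record has a distance field, since a record without one would be treated as distance 0.0.

```ruby
# Hypothetical helper: pick the record with the smallest distance in one row.
def closest(row)
  row.scan(/\{[^{}]*\}/)                                    # array of {...} records
     .min_by { |rec| rec[/distance:([\d.]+)/, 1].to_f }     # distance as a float
end

# Invented sample data: two rows, two records each.
rows = <<~DATA
  {,lat:1.0,distance:5.5,lon:2.0,},{,lat:1.2,distance:3.25,lon:2.2,}
  {,lat:3.0,distance:0.75,lon:4.0,},{,lat:3.1,distance:9.0,lon:4.1,}
DATA

rows.each_line(chomp: true) { |row| puts closest(row) }
# prints:
# {,lat:1.2,distance:3.25,lon:2.2,}
# {,lat:3.0,distance:0.75,lon:4.0,}
```

For a real file, `File.foreach` could take the place of `each_line` here, treating each line as one row.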
Upvotes: 2