Bruno Peixoto

Reputation: 99

How to scrape Wikipedia GPS latitude/longitude?

I have been wondering how to scrape information from Wikipedia. For example, I have a list of world cities and want to obtain their approximate latitude and longitude. Take Miami as an example. When I run curl https://en.wikipedia.org/wiki/Miami | grep -E '(latitude|longitude)', the HTML contains tags like the ones below.

<span class="latitude">25°46′31″N</span> <span class="longitude">80°12′31″W</span>

I know I can extract it with a regex, but my regex skills are very poor. Can anyone help me with this?

Upvotes: 0

Views: 188

Answers (2)

Gilles Quénot

Reputation: 185530

With xidel and XPath:

$ xidel -se '
    concat(
        (//span[@class="latitude"]/text())[1],
        " ",
        (//span[@class="longitude"]/text())[1]
    )
' 'https://en.wikipedia.org/wiki/Miami'

Output

25°46′31″N 80°12′31″W

Or

saxon-lint --html --xpath '<XPATH EXP>' <URL>

If you prefer more widely known tools:

curl -s 'https://en.wikipedia.org/wiki/Miami' > Miami.html
xmlstarlet format -H Miami.html 2>/dev/null | sponge Miami.html
xmlstarlet sel -t -v '<XPATH EXP>' Miami.html

Note that regexes are not the right tool for parsing HTML.

Upvotes: 1

Reino

Reputation: 3443

You can't parse HTML with regex. Please use an HTML parser like xidel instead:

$ xidel -s "https://en.wikipedia.org/wiki/Miami" -e '
  (//span[@class="geo-dms"])[1],
  (//span[@class="geo-dec"])[1],
  (//span[@class="geo"])[1],
  replace((//span[@class="geo"])[1],";",())
'
25°46′31″N 80°12′31″W
25.775163°N 80.208615°W
25.775163; -80.208615
25.775163 -80.208615

Take your pick.
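
Since the question mentions a whole list of cities, the same query could be wrapped in a loop. What follows is only a minimal sketch: the function names wiki_url and lookup_coords and the file cities.txt are hypothetical, and it assumes each city name is also the exact title of its English Wikipedia article (ambiguous names, or titles with characters that need percent-encoding, would have to be handled separately).

```shell
# Build the English Wikipedia URL for a city name.
# Assumption: the article title equals the city name, with spaces
# replaced by underscores; no percent-encoding is attempted.
wiki_url() {
    printf 'https://en.wikipedia.org/wiki/%s\n' "$(printf '%s' "$1" | tr ' ' '_')"
}

# Read one city name per line from stdin and print "city<TAB>coordinates".
# The XPath expression is the "geo" span query shown above.
lookup_coords() {
    while IFS= read -r city; do
        # Skip empty lines in the input list.
        [ -n "$city" ] || continue
        # </dev/null keeps xidel from touching the loop's stdin.
        coords=$(xidel -s "$(wiki_url "$city")" \
            -e '(//span[@class="geo"])[1]' </dev/null)
        printf '%s\t%s\n' "$city" "$coords"
    done
}

# Usage (cities.txt is a hypothetical file with one city per line):
#   lookup_coords < cities.txt
```

If you run this over a long list, consider adding a short sleep between requests so you don't hit Wikipedia too fast.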

Upvotes: 1
