Get time in HTML tags using curl and grep/sed/awk

Question

I'm trying to extract just the arrival times from this web page. I'm running this in Terminal on OS X 10.9.5.

http://www.flyokc.com/Arrivals.aspx

I've come as far as isolating just the tags:

curl 'www.flyokc.com/arrivals.aspx' | grep 'labelTime'

However, I'm terrible at regex, so I haven't figured out how to grab just the times from these tags. Any thoughts on how I can do that?

Eventually, I'd like to group them by the hour of the day and display the number of arrivals by hour, in descending order.

joepd · Accepted Answer

Parsing HTML/XML with regex is bad. That being said, this seems to work at the moment for your use case:

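gawk '
BEGIN{
    # gawk-only: walk array indices in ascending numeric order
    PROCINFO["sorted_in"]="@ind_num_asc"
    # split on tag delimiters, the HH:MM colon, and spaces
    FS="[<>: ]+"
}
/labelTime/&&/ContentPlaceHolderMain/{
    # $4 is the hour, $6 is AM/PM; %12 maps 12 AM to 0 and keeps 12 PM at 12
    if($6=="PM") a[$4%12+12]+=1
    else a[$4%12]+=1
}
END{
    for(h in a)
        print h, a[h]
}' <(curl 'www.flyokc.com/arrivals.aspx' 2>/dev/null)

Edit: An account of why this works:

Set the field separator to the HTML delimiters, spaces, and the HH:MM separator.
Then grab the fourth field (the hour) and the sixth field (AM or PM). This is a regex only in the sense that the field separator is one, so it is not quite what you asked for.
Take the hour modulo 12 so that 12 AM becomes 0 and 12 PM stays at 12 after the offset. If the sixth field is "PM", add 12 to the hour (you want to sort numerically in the end). Then add 1 to the count for that hour.
After all input has been processed, display the results. Because the array traversal order has been set to sort numerically on the keys, no external sort command is necessary.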
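
Two points from the question are not covered by the script above: the awk that ships with OS X is not gawk, so PROCINFO["sorted_in"] has no effect there, and the output is ordered by hour rather than by number of arrivals. The following is a minimal sketch of one way to handle both, using plain awk for the counting and sort for the ordering. The sample_html function and its markup are hypothetical stand-ins for the real page (the span ids are assumptions, chosen only so the same patterns match), and count_by_hour is just an illustrative name.

```shell
# Hypothetical sample input; the real page markup is an assumption.
sample_html() {
cat <<'EOF'
  <span id="ctl00_ContentPlaceHolderMain_grid_ctl02_labelTime">12:05 AM</span>
  <span id="ctl00_ContentPlaceHolderMain_grid_ctl03_labelTime">9:15 AM</span>
  <span id="ctl00_ContentPlaceHolderMain_grid_ctl04_labelTime">12:40 PM</span>
  <span id="ctl00_ContentPlaceHolderMain_grid_ctl05_labelTime">1:10 PM</span>
  <span id="ctl00_ContentPlaceHolderMain_grid_ctl06_labelTime">1:55 PM</span>
EOF
}

# Read HTML on stdin, print "hour count" lines, busiest hour first.
count_by_hour() {
  awk -F'[<>: ]+' '
    /labelTime/ && /ContentPlaceHolderMain/ {
      h = $4 % 12                 # 12 AM -> 0; 12 PM -> 0 before the offset
      if ($6 == "PM") h += 12     # convert to a 24-hour clock
      count[h]++
    }
    END { for (h in count) print h, count[h] }
  ' | sort -k2,2nr -k1,1n         # count descending, then hour ascending
}

sample_html | count_by_hour
```

Against the live page, the last line would become curl -s 'www.flyokc.com/arrivals.aspx' | count_by_hour, assuming the markup still puts the hour in the fourth field and AM/PM in the sixth.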
