Mark Cheek

Reputation: 265

Get time in HTML tags using curl and grep/sed/awk

I'm trying to extract just the arrival times from this web page. I'm running this in Terminal on OS X 10.9.5.

http://www.flyokc.com/Arrivals.aspx

I've gotten as far as isolating just the tags:

curl 'www.flyokc.com/arrivals.aspx' | grep 'labelTime'

However, I'm terrible at regex, so I haven't figured out how to grab just the times from these tags. Any thoughts on how I can do that?

Eventually, I'd like to group them by hour of the day and display the number of arrivals per hour, in descending order.

Upvotes: 0

Views: 847

Answers (2)

joepd

Reputation: 4841

Parsing HTML/XML with regex is bad. That being said, this seems to work at the moment for your use case:

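gawk '
BEGIN {
    # Iterate over array keys in ascending numeric order (needs gawk 4.0+).
    PROCINFO["sorted_in"] = "@ind_num_asc"
    # Split on tag delimiters, colons, and spaces.
    FS = "[<>: ]+"
}
/labelTime/ && /ContentPlaceHolderMain/ {
    # $4 is the hour and $6 is AM/PM; convert to a 24-hour value.
    h = $4 % 12
    if ($6 == "PM") h += 12
    a[h]++
}
END {
    for (h in a)
        print h, a[h]
}' <(curl 'www.flyokc.com/arrivals.aspx' 2>/dev/null)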

Edit: An account of what works and why:

  • Set the field separator to the HTML delimiters, spaces, and the HH:MM separator.

  • Then grab the fourth field (the hour). This is a regex only in the sense that the field separator is one, so it is not quite what you asked for.

  • If the sixth field is "PM", add 12 to the hour (you want to sort numerically in the end). Increment the count for that hour.

  • After all input has been processed, display the results. Because the array traversal order has been set to sort numerically on the keys, no external sort command is necessary.
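
To see how the field separator carves up a line, here is a minimal sketch run against a made-up sample. The `line` value and the `split_fields` helper are assumptions for illustration only; the real page's markup may differ, in which case the field numbers shift.

```shell
# Hypothetical sample line; the real page's markup may differ.
line='  <span id="ContentPlaceHolderMain_labelTime">12:47 PM</span>'

# Split on runs of <, >, :, and spaces, then print the fields of interest.
# $1 is empty because the line starts with separator characters.
split_fields() {
    printf '%s\n' "$1" |
    awk -F'[<>: ]+' '{ print "hour=" $4, "min=" $5, "ampm=" $6 }'
}

split_fields "$line"
# prints: hour=12 min=47 ampm=PM
```

If the id attribute ever contained a space or a colon, the hour would no longer be in the fourth field, which is one reason this approach is fragile.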

Upvotes: 2

l'L'l

Reputation: 47169

If you simply want to grab the arrival times (12:00 PM and so on), awk with curl should work:

curl -s 'http://flyokc.com/arrivals.aspx' | awk '/labelTime/{print substr($2,68,5),substr($3,1,2)}'

Output:

12:47 PM
...

How it works:

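curl silently fetches the page source (-s suppresses the progress output), then awk uses "labelTime" to pick out the lines that contain the arrival times. Because each matching line contains the entire <span>, substr($2,68,5) takes five characters starting at position 68 of the second whitespace-separated field (the time), and substr($3,1,2) takes the first two characters of the third field (AM or PM). The offset 68 is specific to the page's markup at the time of writing.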
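
A fixed offset breaks as soon as the markup changes, so a pattern-based variant may be more robust. The following is only a sketch: `count_by_hour` is a hypothetical helper, the sample lines are made up, and it assumes each time appears as "H:MM AM" or "HH:MM PM" on a line containing "labelTime". It also covers the second part of the question by counting arrivals per 24-hour value and listing the busiest hours first.

```shell
# Read HTML on stdin; print "hour count" pairs, busiest hour first.
count_by_hour() {
    grep 'labelTime' |
    grep -Eo '[0-9]{1,2}:[0-9]{2} [AP]M' |                 # keep only the time text
    awk -F'[: ]' '{ h = $1 % 12; if ($3 == "PM") h += 12; print h }' |
    sort -n | uniq -c |                                    # count per hour
    sort -k1,1nr -k2,2n |                                  # by count desc, then hour
    awk '{ print $2, $1 }'
}

# Made-up sample input, so the sketch runs without network access.
printf '%s\n' \
    '<span id="x_labelTime">12:47 PM</span>' \
    '<span id="x_labelTime">1:05 PM</span>' \
    '<span id="x_labelTime">1:30 PM</span>' \
    '<span id="x_labelTime">11:59 AM</span>' | count_by_hour
# prints:
# 13 2
# 11 1
# 12 1
```

Against the live page this would be run as curl -s 'http://www.flyokc.com/Arrivals.aspx' | count_by_hour, assuming the times are still rendered in that form.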

Upvotes: 2
