Reputation: 265
I'm trying to extract just the arrival times from the web page below. I'm running this in Terminal on OS X 10.9.5.
http://www.flyokc.com/Arrivals.aspx
I've gotten as far as isolating just the tags:
curl 'www.flyokc.com/arrivals.aspx' | grep 'labelTime'
However, I'm terrible at regex, so I haven't figured out how to grab just the times from these tags. Thoughts on how I can do that?
Eventually, I'd like to group them by the hour of the day and display the number of arrivals by hour, in descending order.
Upvotes: 0
Views: 847
Reputation: 4841
Parsing HTML/XML with regex is bad. That being said, this seems to work at the moment for your use case:
gawk '
BEGIN{
PROCINFO["sorted_in"]="@ind_num_asc"
FS="[<>: ]+"
}
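/labelTime/&&/ContentPlaceHolderMain/{
# $4 is the hour; % 12 maps 12 AM to 0 and keeps 12 PM at 12 after the shift
h=$4%12
# "==" compares; a single "=" would assign "PM" to $6 and always be true
if($6=="PM") h+=12
a[h]+=1
}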
END{
for(h in a)
print h, a[h]
}' <(curl 'www.flyokc.com/arrivals.aspx' 2>/dev/null)
Edit: An account of what works and why:
Set the field separator to the HTML delimiters, spaces, and the HH:MM separator.
Then grab the fourth field (the hour). This is a regex only in a loose sense, so it is not quite what you asked for.
If the sixth field is "PM", add 12 to the hour (you want to sort numerically in the end). Add 1 to the count for that hour.
After all input has been processed, display the results. Because the array traversal order has been set to sort numerically on the keys, no external sort command is necessary.
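For an awk that lacks gawk's PROCINFO["sorted_in"], the steps above can be sketched with plain awk and an external sort. This is only a sketch: the sample lines and the span ids in them are made up to stand in for the curl output, and the function names sample and count_by_hour are just for illustration. It also sorts by the number of arrivals in descending order, which is one reading of what the question asks for.

```shell
#!/bin/sh
# Hypothetical sample lines standing in for the curl output; the real
# page markup may differ.
sample() {
cat <<'EOF'
  <span id="ContentPlaceHolderMain_grid_labelTime_0">9:05 AM</span>
  <span id="ContentPlaceHolderMain_grid_labelTime_1">12:47 PM</span>
  <span id="ContentPlaceHolderMain_grid_labelTime_2">1:15 PM</span>
  <span id="ContentPlaceHolderMain_grid_labelTime_3">1:40 PM</span>
  <span id="ContentPlaceHolderMain_grid_labelTime_4">12:10 AM</span>
EOF
}

# Reads HTML lines on stdin, prints "hour count" pairs (24-hour clock).
count_by_hour() {
  awk -F'[<>: ]+' '
    /labelTime/ {
      h = $4 % 12                # hour field; 12 wraps to 0
      if ($6 == "PM") h += 12    # 1 PM becomes 13, 12 PM becomes 12
      count[h]++
    }
    END { for (h in count) print h, count[h] }
  ' | sort -k2,2nr -k1,1n        # most arrivals first, ties by hour
}

sample | count_by_hour
```

With the real page, `curl -s 'www.flyokc.com/arrivals.aspx' | count_by_hour` would replace the last line, assuming the markup still matches the field layout described above.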
Upvotes: 2
Reputation: 47169
If you're simply looking to grab the arrival times, such as 12:00 PM, awk with curl should work:
curl -s 'http://flyokc.com/arrivals.aspx' | awk '/labelTime/{print substr($2,68,5),substr($3,1,2)}'
Output:
12:47 PM
...
How it works:
curl silently grabs the source of the web page, then awk takes the output and uses "labelTime" to pick out the lines that contain the arrival times. Since awk sees the entire <span> where the string resides, substr is used to take 5 characters of the second field starting at position 68 (the time) and the first 2 characters of the third field (AM or PM); then the result is printed.
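The fixed offset of 68 depends on the exact length of the span's id, so it may break if the markup changes. A minimal sketch of an alternative that matches the time by pattern instead, assuming the times always look like H:MM or HH:MM followed by AM or PM (the sample lines and the function name extract_times are made up for illustration):

```shell
#!/bin/sh
# Reads HTML lines on stdin and prints only the time strings found on
# lines that mention labelTime. grep -o prints just the matched part.
extract_times() {
  grep 'labelTime' | grep -oE '[0-9]{1,2}:[0-9]{2} [AP]M'
}

# Hypothetical sample lines standing in for the curl output.
printf '%s\n' \
  '<span id="ContentPlaceHolderMain_grid_labelTime_0">12:47 PM</span>' \
  '<span id="ContentPlaceHolderMain_grid_labelStatus_0">On Time</span>' \
  | extract_times
```

Against the live page this would be `curl -s 'http://flyokc.com/arrivals.aspx' | extract_times`.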
Upvotes: 2