Reputation: 137
I am scraping the Newark Liberty International Airport's website to keep track of their daily schedules. Here is the piece of code I have developed:
library(rvest)
url <- read_html('https://www.airport-ewr.com/newark-departures-terminal-C?tp=6&day=tomorrow')
population <- url %>% html_nodes(xpath = '//*[@id="flight_detail"]') %>%
html_text() %>% gsub(pattern = '\\t|\\r|\\n', replacement = ' ') %>%
trimws() %>% gsub(pattern = '\\s+', replacement = " ")
The gsub() calls remove the tab, carriage-return, and newline characters and collapse any remaining runs of whitespace into single spaces. The code works well, and I have attached a snippet of the output:
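To make the cleanup concrete, here is the same three-step pipeline applied to a small made-up string (the sample text is invented for illustration, not taken from the site):

```r
# Illustrative cleanup on a fabricated raw string
raw <- "\n\tAustin (AUS)\r\n  United Airlines  "

step1 <- gsub('\\t|\\r|\\n', ' ', raw)  # replace tabs, CRs, and LFs with spaces
step2 <- trimws(step1)                  # drop leading/trailing whitespace
step3 <- gsub('\\s+', ' ', step2)       # collapse internal runs of spaces

step3
# "Austin (AUS) United Airlines"
```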
I want to convert this character vector into a data frame containing values as shown below:
Any help is appreciated!
Upvotes: 1
Views: 274
Reputation: 7312
Try this out:
library(rvest)
url <- read_html('https://www.airport-ewr.com/newark-departures-terminal-C?tp=6&day=tomorrow')
population <- url %>% html_nodes(xpath = '//*[@id="flight_detail"]') %>%
html_text()
First we read in the raw text rows. Each field within a row is separated by \n, but sometimes there is more than one, so we first gsub out the extra \n delimiters, then split each string on \n with strsplit, and rbind the output into a data.frame:
popDF <- as.data.frame(
  do.call('rbind',
          strsplit(gsub("\n+", "\n", population),  # collapse repeated \n delimiters
                   split = "\n", fixed = TRUE))    # then split each row on \n
)
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 Austin (AUS) United Airlines UA 2427 06:00 am Depart: 06:00 am C Term. C Scheduled - On-time [+]
2 Austin (AUS) SAS SK 6868 06:00 am Depart: 06:00 am C Term. C Scheduled - On-time [+]
3 Boston (BOS) United Airlines UA 1699 06:00 am Depart: 06:00 am C Term. C Scheduled - On-time [+]
4 Columbus (CMH) CommutAir C5 4973 06:00 am Depart: 06:00 am C Term. C Scheduled - On-time [+]
5 Columbus (CMH) United Airlines UA 4973 06:00 am Depart: 06:00 am C Term. C Scheduled - On-time [+]
6 Detroit (DTW) Republic Airlines YX 3482 06:00 am Depart: 06:00 am C Term. C Scheduled - On-time [+]
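If you want the result to be easier to work with, the same approach can be wrapped up with column names. A self-contained sketch follows; the 'population' vector here is fabricated to mimic the scraped strings, and the column names are my guesses at the fields based on the sample output, not anything defined by the site:

```r
# Stand-in for the scraped strings; note the stray double \n in the second row
population <- c(
  "Austin (AUS)\nUnited Airlines\nUA 2427\n06:00 am\nDepart: 06:00 am\nC\nTerm. C\nScheduled - On-time\n[+]",
  "Boston (BOS)\n\nUnited Airlines\nUA 1699\n06:00 am\nDepart: 06:00 am\nC\nTerm. C\nScheduled - On-time\n[+]"
)

popDF <- as.data.frame(
  do.call(rbind,
          strsplit(gsub("\n+", "\n", population),  # collapse repeated delimiters
                   split = "\n", fixed = TRUE)),
  stringsAsFactors = FALSE
)

# Hypothetical names inferred from the sample output columns V1..V9
names(popDF) <- c("destination", "airline", "flight", "scheduled",
                  "departs", "gate", "terminal", "status", "more")
```

Note that do.call(rbind, ...) silently recycles short rows, so this only works cleanly when every row splits into the same number of fields.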
Upvotes: 3