I am trying to read XML data from the following link:
https://www.sec.gov/Archives/edgar/data/1026081/000092189520001626/infotable.xml
I am using the rvest package, like this:
library(rvest)
url <- "https://www.sec.gov/Archives/edgar/data/1026081/000092189520001626/infotable.xml"
test <- url %>%
  read_xml() %>%
  xml_nodes("nameOfIssuer") %>%
  xml_text()
But this is not working: "test" comes back empty. I have also tried an XPath expression, and other variations such as:
test <- url %>%
  read_xml() %>%
  xml_nodes("infoTable") %>%
  xml_text()
I feel like I am missing something super basic. How would I go about scraping specific node information from this file?
Thanks in advance!
Yes, you're missing the fact that the nodes you are trying to scrape are inside an XML namespace, so an unqualified selector such as "nameOfIssuer" does not match them. Strip out the namespace with xml_ns_strip() and you're good to go:
url %>% read_xml() %>% xml_ns_strip() %>% xml_nodes("nameOfIssuer") %>% xml_text()
#> [1] "BANCORP 34 INC" "BANC OF CALIFORNIA INC"
#> [3] "BANKWELL FINL GROUP INC" "CBM BANCORP INC"
#> [5] "CARTER BK & TR MARTINSVILLE" "CITIZENS FINL GROUP"
#> [7] "CIVISTA BANCSHARES INC" "COLUMBIA FINL INC"
#> [9] "CONNECTONE BANCORP INC NEW" "FSB BANCORP INC"
#> [11] "FIRST UTD CORP" "HV BANCORP INC"
#> [13] "HARBORONE BANCORP INC NEW" "INVESTORS BANCORP INC NEW"
#> [15] "MSB FINL CORP NEW" "MALVERN BANCORP INC"
#> [17] "MID SOUTHERN BANCORP INC" "NORTHEAST BK LEWISTON ME"
#> [19] "PB BANCORP INC" "PEAPACK-GLADSTONE FINL CORP"
#> [21] "PIONEER BANCORP INC" "PROVIDENT BANCORP INC"
#> [23] "PRUDENTIAL BANCORP INC NEW" "RICHMOND MUT BANCORPORATIN I"
#> [25] "SELECT BANCORP INC NEW" "STERLING BANCORP DEL"
#> [27] "WATERSTONE FINL INC MD" "WINTRUST FINL CORP"
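If you would rather keep the namespace than strip it, a prefixed XPath query should also work. The following is a minimal sketch, not taken from the answer above: it uses the xml2 package directly, a tiny inline document instead of the SEC file, and a made-up namespace URI (`http://example.com/ns`) as a stand-in for the real one. It relies on xml2 exposing a default namespace under the prefix `d1`.

```r
library(xml2)

# Small inline document with a default namespace.
# The URI below is a hypothetical placeholder, not the one used by the SEC file.
doc <- read_xml('<informationTable xmlns="http://example.com/ns">
  <infoTable><nameOfIssuer>BANCORP 34 INC</nameOfIssuer></infoTable>
  <infoTable><nameOfIssuer>FSB BANCORP INC</nameOfIssuer></infoTable>
</informationTable>')

# An unprefixed XPath name only matches elements in no namespace,
# so this query finds nothing, which mirrors the empty result in the question.
plain <- xml_find_all(doc, "//nameOfIssuer")
n_plain <- length(plain)

# xml_ns() lists the namespaces in the document; the default one is named "d1".
ns <- xml_ns(doc)

# Qualifying the element name with that prefix makes the query match.
issuers <- xml_text(xml_find_all(doc, "//d1:nameOfIssuer", ns))

print(n_plain)
print(issuers)
```

Stripping the namespace is the shorter route when the document has only one namespace; the prefixed form may be preferable when several namespaces are mixed and element names could collide.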