Web scraping from Uniprot using R

Question

I want to scrape from a UniProt webpage like this one, http://www.uniprot.org/uniprot/Q4DQV8, the string that starts with "Tc00" (in this case "Tc00.1047053511911.60") using R. I've tried the following, but read_html() doesn't retrieve any data that I can extract that string from.

library(tidyverse)
library(rvest)

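url <- "http://www.uniprot.org/uniprot/Q4DQV8"
page <- read_html(url)

page_text <- page %>% html_text()

# In an R string literal the regex \S must be written as "\\S"
extracted_string <- str_extract(page_text, "Tc00\\S*")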

print(extracted_string)

Also,

page %>%
  html_nodes("body") %>% # Adjust the selector as needed
  html_text()

gives me

"[1] "UniProt website fallback messageIf you are not seeing anything on this page, it might be for multiple reasons:You might have JavaScript disabled: make sure to enable JavaScript on your browser, or use a browser that supports JavaScript.You might have an outdated browser: make sure that your browser is up to date as older versions might not work with the website.There might have been a network issue: ensure that your connectivity is stable and try to reload the page to see if it solves the issue. Reload this page// workaround for Safari 10.1 supporting module but ignoring nomodule // From https://gist.github.com/samthor/64b114e4a4f539915a95b91ffd340acc (function () { var d = document; var c = d.createElement('script'); if (!('noModule' in c) && 'onbeforeload' in c) { var s = false; d.addEventListener( 'beforeload', function (e) { if (e.target === c) { s = true; } else if (!e.target.hasAttribute('nomodule') || !s) { return; } e.preventDefault(); }, true ); c.type = 'module'; c.src = '.'; d.head.appendChild(c); c.remove(); } })();""

JavaScript is enabled on my browser.

Can anyone help me, please?

Till · Accepted Answer

The website uses JavaScript to display its contents, so the static HTML returned by rvest::read_html() contains only the fallback message. You can use rvest::read_html_live() instead of rvest::read_html() to work with websites like this. rvest::read_html_live() opens the webpage in a headless Chrome browser and renders it by executing its JavaScript. You can then query that rendered content with rvest.

Check out the documentation and example for rvest::read_html_live() to learn more.

library(rvest)

page <- read_html_live("https://www.uniprot.org/uniprotkb/Q4DQV8/entry")

page_text <- page |>
  html_node("body") |>
  html_text()

extracted_string <- stringr::str_extract(page_text, "Tc00\\S*")

print(extracted_string)
#> [1] "Tc00.1047053511911.60"
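
As a side note on the pattern itself, here is a minimal offline sketch of how "Tc00\\S*" behaves. The sample strings are made up for illustration and are not real UniProt output. html_text() can join the text of neighbouring elements without any whitespace, in which case \S* may run on into the following text; the narrower pattern "Tc00[0-9.]*" is an assumption about what these identifiers look like, based only on the single example in the question.

```r
library(stringr)

# Hypothetical stand-ins for rendered page text (not real UniProt output)
spaced_text <- "ORF names Tc00.1047053511911.60 more text"
fused_text  <- "ORF namesTc00.1047053511911.60more text"

# "\\S" in an R string literal is the regex \S (any non-whitespace character)
greedy_pattern <- "Tc00\\S*"

# Assumption: the identifier consists only of digits and dots after "Tc00"
narrow_pattern <- "Tc00[0-9.]*"

greedy_spaced <- str_extract(spaced_text, greedy_pattern)
greedy_fused  <- str_extract(fused_text, greedy_pattern)
narrow_fused  <- str_extract(fused_text, narrow_pattern)

print(greedy_spaced)
#> [1] "Tc00.1047053511911.60"
print(greedy_fused)
#> [1] "Tc00.1047053511911.60more"
print(narrow_fused)
#> [1] "Tc00.1047053511911.60"
```

If the text that follows the identifier happens to start with a digit or a dot, the narrower pattern would over-match as well, so selecting the specific element with html_element() before extracting is likely the more robust route.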
