Honorato

Reputation: 151

Web scraping from Uniprot using R

I want to scrape from a UniProt webpage such as http://www.uniprot.org/uniprot/Q4DQV8 the strings that start with "Tc00" (in this case "Tc00.1047053511911.60") using R. I've tried the following, but read_html() doesn't return any data I can use for that.

library(tidyverse)
library(rvest)

url <- "http://www.uniprot.org/uniprot/Q4DQV8"
page <- read_html(url)

page_text <- page %>% html_text()

extracted_string <- str_extract(page_text, "Tc00\\S*")

print(extracted_string)

Also, page %>% html_nodes("body") %>% html_text() (adjusting the selector as needed) gives me:

"[1] "UniProt website fallback messageIf you are not seeing anything on this page, it might be for multiple reasons:You might have JavaScript disabled: make sure to enable JavaScript on your browser, or use a browser that supports JavaScript.You might have an outdated browser: make sure that your browser is up to date as older versions might not work with the website.There might have been a network issue: ensure that your connectivity is stable and try to reload the page to see if it solves the issue. Reload this page// workaround for Safari 10.1 supporting module but ignoring nomodule\n // From https://gist.github.com/samthor/64b114e4a4f539915a95b91ffd340acc\n (function () {\n var d = document;\n var c = d.createElement('script');\n if (!('noModule' in c) && 'onbeforeload' in c) {\n var s = false;\n d.addEventListener(\n 'beforeload',\n function (e) {\n if (e.target === c) {\n s = true;\n } else if (!e.target.hasAttribute('nomodule') || !s) {\n return;\n }\n e.preventDefault();\n },\n true\n );\n\n c.type = 'module';\n c.src = '.';\n d.head.appendChild(c);\n c.remove();\n }\n })();""

JavaScript is enabled on my browser.

Can anyone help me, please?

Upvotes: 0

Views: 74

Answers (1)

Till

Reputation: 6663

The website uses JavaScript to display its contents, so the HTML that rvest::read_html() downloads contains only the fallback message and not the entry data. For websites like this, you can use rvest::read_html_live() instead. rvest::read_html_live() opens the webpage in a headless Chrome browser and renders it by executing its JavaScript. You can then query the rendered content with the usual rvest functions.
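
To illustrate why the static parser comes up empty, here is a small offline sketch. The HTML snippet and the app.js file name are made up for illustration and are not UniProt's actual markup; the only point is that rvest parses the markup as delivered and never runs the script.

```r
library(rvest)

# Hypothetical stand-in for a JavaScript-rendered page: the body holds only an
# empty container and a script tag. In a browser, the (nonexistent) app.js
# would fill the container; a static parser never runs it.
static_page <- minimal_html('
  <div id="root"></div>
  <script src="app.js"></script>
')

# Read the text of the container exactly as it was delivered.
root_text <- static_page |>
  html_element("#root") |>
  html_text()

# The container is still empty, so there is nothing to extract from it.
print(nchar(root_text))
```

A page parsed this way has no entry text to search, which matches the fallback message shown in the question.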

Check out the documentation and examples for rvest::read_html_live() to learn more.

library(rvest)

page <- read_html_live("https://www.uniprot.org/uniprotkb/Q4DQV8/entry")

page_text <- page |>
  html_element("body") |>
  html_text()

extracted_string <- stringr::str_extract(page_text, "Tc00\\S*")

print(extracted_string)
#> [1] "Tc00.1047053511911.60"
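
Note that stringr::str_extract() returns only the first match in each string. If a page could contain several different "Tc00" identifiers, something along these lines might help. This is only a sketch: extract_tc_ids() is a hypothetical helper (not part of rvest or stringr), and sample_text is made-up text standing in for the rendered page text.

```r
library(stringr)

# Hypothetical helper: return every distinct whitespace-delimited token
# that starts with "Tc00", using the same pattern as above.
extract_tc_ids <- function(text) {
  matches <- str_extract_all(text, "Tc00\\S*")[[1]]
  unique(matches)
}

# Made-up sample text in which the same identifier appears twice.
sample_text <- "Gene names: Tc00.1047053511911.60 ORF names: Tc00.1047053511911.60 other"

print(extract_tc_ids(sample_text))
```

With real page text you would pass page_text to the helper. Because "\\S*" keeps matching until the next whitespace, it can pick up trailing characters if the page text has no space after the identifier, so the pattern may need tightening.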

Upvotes: 2
