Jose David

Reputation: 139

Problems scraping content from a news website

I am trying to collect the headlines/titles and other elements from a news website. However, the CSS selectors I am using (which I found with SelectorGadget and by inspecting the page source) do not seem to work.

For the headlines I've tried the selectors '.article-h' and '.article-h-link' without any result. The same happens for the dates ('.date.right') and the leads ('.result-intro').

url_test <- read_html('https://www.semana.com/Buscador?query=proceso%20paz%20farc&post=semana&limit=10&offset=0&from=2012%2F08%2F26&to=2016%2F12%2F03')
titles <- html_text(html_nodes(url_test, '.article-h-link'))

I always get "character(0)". Interestingly, though, if I try to collect the information from the home page (www.semana.com), those same selectors work without a problem. What could be the problem?

Upvotes: 0

Views: 72

Answers (1)

QHarr

Reputation: 84465

The content is loaded dynamically by JavaScript running in the browser. rvest does not execute JavaScript, so the search results are never present in the HTML it downloads. You may need browser automation such as RSelenium, or you can do as below.

The page makes a POST request that you can mimic with httr.

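library(httr)
library(jsonlite)
library(magrittr)  # provides the %>% pipe

# Headers sent with the search request
headers <- c(
  'User-Agent' = 'Mozilla/5.0',
  'Content-Type' = 'application/json; charset=UTF-8'
)

# JSON body carrying the same parameters as the query string of the search URL
payload <- '{"request":{"param0":"query=proceso%20paz%20farc","param1":"post=semana","param7":"limit=10","param8":"offset=0", "param9":"from=2012/08/26", "param10":"to=2016/12/03"},"preview":false}'

res <- httr::POST(url = 'https://www.semana.com/ws/Buscador/ESPSearch', httr::add_headers(.headers = headers), body = payload)
httr::stop_for_status(res)  # stop early if the server returns an error status

# Parse the JSON response text into R lists / data frames
data <- httr::content(res, as = "text") %>% jsonlite::fromJSON()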

Some of the JSON values contain HTML. These will need re-parsing with an HTML parser. You can explore the articles with

df <- data$documents
print(df)
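As a rough sketch of the re-parsing step, a value containing HTML can be wrapped in a body tag and run through rvest to get its text. The sample string and the helper name html_to_text below are made up for illustration, not taken from the actual response, and the helper assumes a non-empty string.

```r
library(rvest)

# Hypothetical sample value; the real response may look different
fragment <- '<span class="highlight">proceso</span> de <b>paz</b>'

# Hypothetical helper: parse one HTML fragment and return only its text
html_to_text <- function(x) {
  doc <- read_html(paste0("<body>", x, "</body>"))
  html_text(doc)
}

clean_text <- html_to_text(fragment)
print(clean_text)
```

This removes every tag, not just span, which may or may not be what you want.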

Perhaps easier is to do a regex replace that strips the span tags themselves (everything from <span or </span up to the closing >), so you are left with just the text content within $highlights.

The basic regex, before escaping it for use in R, would be:

<\/?span[^>]*>

e.g.

df$highlights <- lapply(df$highlights, function(x) {gsub("<\\/?span[^>]*>", "", x)})
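To see what the replacement does without hitting the site, here is a small self-contained check in base R. The sample highlights list is invented for illustration (I am assuming each element is a character vector), and strip_spans is just a hypothetical helper name. The forward slash is not special in R regex, so it is left unescaped here.

```r
# Hypothetical helper: remove opening and closing span tags, keep the text
strip_spans <- function(x) {
  gsub("</?span[^>]*>", "", x)
}

# Made-up stand-in for df$highlights: a list of character vectors
highlights <- list(
  c('<span class="highlight">proceso</span> de <span>paz</span>', "sin etiquetas"),
  "<span class='hl'>farc</span>"
)

cleaned <- lapply(highlights, strip_spans)
print(cleaned)
```

Other tags (for example b or em) are left untouched, since the pattern only matches span.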

Upvotes: 1
