stats_noob

Reputation: 5907

Scraping Comments from a Reddit Post?

I found this reddit post here - https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/ .

I would like to use the API to get all the comments from this post.

I tried looking into the documentation of this API (e.g. https://github.com/pushshift/api), and this does not seem possible? If somehow I could get the LINK_ID pertaining to this reddit post, I think I would be able to do it then.

Is this possible to do?

UPDATE: Can someone please show me how to do this in R?

Thanks!

library(jsonlite)

part1 <- "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 <- "h&before="
part3 <- "h&size=500"

results <- list()
for (i in 1:10) {
  tryCatch({
    url_i <- paste0(part1, i + 1, part2, i, part3)
    r_i <- fromJSON(url_i)

    results[[i]] <- data.frame(r_i$data$body, r_i$data$id,
                               r_i$data$parent_id, r_i$data$link_id)

    print(i)
  }, error = function(e) {})
}

final <- do.call(rbind.data.frame, results)

Upvotes: 1

Views: 1414

Answers (2)

Nathan Bomshteyn

Reputation: 100

This is how you can do it in R:

# Import required library
library(jsonlite)

# Set API endpoint and parameters
part1 <- "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 <- "h&before="
part3 <- "h&size=500"

# Initialize empty list for storing results
results <- list()

# Loop through API requests
for (i in 1:10) {
  # Construct API request URL
  url_i <- paste0(part1, i+1, part2, i, part3)
  
  # Send GET request to the API
  r_i <- fromJSON(url_i)
  
  # Extract data from API response and store in list
  results[[i]] <- data.frame(body = r_i$data$body, 
                             id = r_i$data$id, 
                             parent_id = r_i$data$parent_id, 
                             link_id = r_i$data$link_id)
  
  # Print progress
  cat("Request", i, "complete\n")
}

# Combine list of results into a single data frame
final <- do.call(rbind.data.frame, results)

Refactor: You can also slightly refactor the code:

library(purrr)
library(httr)
library(jsonlite)

# Set API endpoint and parameters
endpoint <- "https://api.pushshift.io/reddit/search/comment/"
params <- list(q = "trump", size = 500)

# Function to fetch one page of data from the API
fetch_data <- function(after, before) {
  # The "h" suffix marks the offsets as hours, as in the original URL
  query <- list(after = paste0(after, "h"), before = paste0(before, "h"))
  response <- GET(url = endpoint, query = c(params, query))
  parsed <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
  parsed$data[, c("body", "id", "parent_id", "link_id")]
}

# Use map() to fetch data for multiple requests
results <- map(1:10, ~ fetch_data(.x+1, .x))

# Combine list of results into a single data frame
final <- do.call(rbind.data.frame, results)

In this code, we've used httr::GET() with a query list to build the request, and purrr::map() to fetch data for multiple requests instead of a for loop. These changes should make the code more concise and easier to read.

Upvotes: 1

Joacopaz

Reputation: 126

The link ID of the post is in the URL: https://www.reddit.com/r/obama/comments/xgsxy7 <-- the final segment, xgsxy7, is the ID.
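If you need to pull that ID out of a full permalink programmatically, a small helper does it. This is only a sketch; the extractPostId name and the regex are mine, assuming the standard /comments/&lt;id&gt;/ URL shape:

```javascript
// Hypothetical helper: extract the post (link) ID from a Reddit permalink.
// Assumes the usual https://www.reddit.com/r/<subreddit>/comments/<id>/... shape.
function extractPostId(url) {
  const match = url.match(/\/comments\/([a-z0-9]+)/i);
  return match ? match[1] : null;
}

console.log(extractPostId(
  "https://www.reddit.com/r/obama/comments/xgsxy7/donald_trump_and_barack_obama_are_among_the/"
)); // "xgsxy7"
```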

You could even open https://www.reddit.com/xgsxy7 to get the information.

If you fetch the endpoint https://www.reddit.com/xgsxy7.json you will get the JSON data; you can then traverse the object to find the comments.

JS example:

// Fetch the post's JSON representation (top-level await, e.g. in an ES module)
const response = await fetch("https://www.reddit.com/xgsxy7.json");
const data = await response.json();

// The second element of the returned array holds the comment tree
const comments = data[1].data.children.map(comment => comment.data.body); // to get the text body

And you can just analyze the JSON object and pull whatever data you want from it: whether a comment has nested replies, its creation time, its author, and so on.
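To collect those nested replies, a small recursive walk over the comment tree works. This is a sketch under my own naming (collectComments); it assumes the listing shape the .json endpoint returns, where a comment's data.replies field is either an empty string or another nested listing:

```javascript
// Recursively collect comments (body, author, creation time) from a
// Reddit comment listing. Comment children have kind "t1"; their
// data.replies field is "" when there are no replies, or a nested listing.
function collectComments(listing) {
  const out = [];
  for (const child of listing.data.children) {
    if (child.kind !== "t1") continue; // skip "more" stubs etc.
    out.push({
      body: child.data.body,
      author: child.data.author,
      created_utc: child.data.created_utc,
    });
    if (child.data.replies && child.data.replies.data) {
      out.push(...collectComments(child.data.replies));
    }
  }
  return out;
}
```

For the post above, after fetching the .json endpoint you would call collectComments(data[1]) to flatten every comment and reply into one array.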

Upvotes: 4
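Tying this back to the original Pushshift question: the pushshift/api README also lists a link_id parameter for comment search, so once you have the post ID you could build the query URL directly. A sketch (the pushshiftCommentsUrl helper is mine, and the Pushshift service's availability is not guaranteed):

```javascript
// Build a Pushshift comment-search URL for one specific post.
// The link_id parameter comes from the pushshift/api README; the
// helper name and default size here are illustrative only.
function pushshiftCommentsUrl(postId, size = 500) {
  return `https://api.pushshift.io/reddit/search/comment/?link_id=${postId}&size=${size}`;
}

console.log(pushshiftCommentsUrl("xgsxy7"));
// "https://api.pushshift.io/reddit/search/comment/?link_id=xgsxy7&size=500"
```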
