user53526356

Reputation: 968

Using Python to get Substack posts without scraping

I want to create a dataframe of Substack posts from all the newsletters I subscribe to. But feedparser with Substack's RSS feeds only seems to go back ~20 posts, even if a particular newsletter has hundreds of older posts.

Is there a way to use RSS to get all the old posts too? Or is there another method that returns the same data as the RSS feed without scraping/BeautifulSoup?

import feedparser
import pandas as pd

rawrss = ['https://heathercoxrichardson.substack.com/feed', 'https://marcstein.substack.com/feed']

posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary, post.summary_detail, post.content, post.published))
df = pd.DataFrame(posts, columns=['title', 'link', 'summary', 'summary_detail', 'content', 'published'])
print(df)

Upvotes: 1

Views: 2276

Answers (1)

shime

Reputation: 9018

There's an unofficial Substack API available for that. Here's a curl request that fetches the second page of the most recent posts:

curl https://ava.substack.com/api/v1/posts\?limit\=50\&offset\=50

Note that this is an unofficial API, so it can change at any time.
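To collect every post rather than a single page, the same limit/offset request can be repeated from Python until a page comes back empty. This is only a rough sketch: it assumes the endpoint returns a JSON list of post objects and an empty list once the offset runs past the last post, which I have not seen documented anywhere. The helper names `fetch_json` and `fetch_all_posts`, the User-Agent header, and the `max_pages` cap are my own additions for illustration.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url):
    """Fetch a URL and decode its JSON body."""
    # Assumption: a browser-like User-Agent may be needed; adjust or drop as required.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def fetch_all_posts(base_url, limit=50, max_pages=100, get_json=fetch_json):
    """Page through <base_url>/api/v1/posts using limit/offset.

    Assumes (unverified) that each response is a JSON list of posts and
    that an empty list means there is nothing left to fetch.
    """
    posts = []
    offset = 0
    for _ in range(max_pages):  # hard cap so a misbehaving endpoint cannot loop forever
        query = urllib.parse.urlencode({"limit": limit, "offset": offset})
        batch = get_json(f"{base_url}/api/v1/posts?{query}")
        if not batch:
            break
        posts.extend(batch)
        # Advance by what was actually returned, in case the server caps the page size.
        offset += len(batch)
    return posts
```

Calling `fetch_all_posts("https://ava.substack.com")` should then give a list of dicts that can be passed straight to `pd.DataFrame(...)`. The resulting columns are whatever keys the API happens to return, so they may not match the feedparser field names used in the question.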

Upvotes: 4
