Reputation: 968
I want to create a dataframe of Substack posts from all the newsletters I subscribe to. But using feedparser with Substack's RSS feeds only seems to go back about 20 posts, even if a particular newsletter has hundreds of older posts.
Is there a way to use RSS to get all the old posts too? Or is there another method to get the same data I get from the RSS feed that doesn't involve scraping with BeautifulSoup?
import feedparser
import pandas as pd

rawrss = ['https://heathercoxrichardson.substack.com/feed', 'https://marcstein.substack.com/feed']
posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary, post.summary_detail, post.content, post.published))
df = pd.DataFrame(posts, columns=['title', 'link', 'summary', 'summary_detail', 'content', 'published'])
print(df)
Upvotes: 1
Views: 2276
Reputation: 9018
There's an unofficial Substack API available for that. Here's a curl request that fetches the second page of the most recent posts:
curl https://ava.substack.com/api/v1/posts\?limit\=50\&offset\=50
Note that this is an unofficial API, so it can change at any time.
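For use from Python, the same request can be repeated with an increasing offset until nothing more comes back. The following is only a rough sketch under some assumptions: that the endpoint returns a JSON list of post objects, and that an empty list means the end of the archive. `fetch_page` and `fetch_all_posts` are names made up for this example, not part of any Substack library, and the `User-Agent` header is a guess at what the server might want.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_page(base_url, limit, offset):
    """Fetch one page from the unofficial /api/v1/posts endpoint.

    Assumption: the response body is a JSON list of post objects.
    """
    query = urlencode({"limit": limit, "offset": offset})
    request = Request(
        f"{base_url}/api/v1/posts?{query}",
        # Assumption: a browser-like agent is less likely to be rejected.
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def fetch_all_posts(base_url, limit=50, fetch=fetch_page, max_pages=1000):
    """Collect posts page by page until an empty page is returned."""
    posts = []
    offset = 0
    for _ in range(max_pages):  # hard cap so a misbehaving server cannot loop forever
        batch = fetch(base_url, limit, offset)
        if not batch:
            break
        posts.extend(batch)
        # Advance by what was actually received, in case the server
        # returns fewer items per page than requested.
        offset += len(batch)
    return posts


# Usage (makes real network requests):
# import pandas as pd
# df = pd.DataFrame(fetch_all_posts("https://ava.substack.com"))
```

The `fetch` parameter is there so the paging loop can be tried against a stub without hitting the network. Which fields each post object contains is not documented, so inspect `df.columns` before relying on any of them.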
Upvotes: 4