In Development
v0.1.0 Released

When Pushshift died in 2023, Reddit research got much harder overnight. Pushshift had quietly handled all the infrastructure work — parsing JSON, enabling bulk downloads, making temporal queries actually work. Researchers took it for granted until it was gone.
Now researchers face three options:
Here's what setup looks like. Takes about ten minutes. We timed it. Including the time to make coffee while Supabase spins up.
First, request Reddit API credentials at the Reddit support center (2 minutes of form-filling). They approve academic requests quickly.
Then spin up a free Supabase project at supabase.com.
Finally, install RedditHarbor and connect everything (8 minutes including database setup).
pip install RedditHarbor

import redditharbor.login as login
from redditharbor.dock.pipeline import collect

# One-time authentication setup
reddit_client = login.reddit(
    public_key="<your-reddit-public>",
    secret_key="<your-reddit-secret>",
    user_agent="YourUni:ProjectName (u/username)"
)
supabase_client = login.supabase(
    url="<your-supabase-url>",
    private_key="<your-service-role-key>"  # Not the anon key!
)

# Name your tables (we keep it simple)
db_config = {
    "user": "redditors",
    "submission": "submissions",
    "comment": "comments"
}

# Initialize collector
collect = collect(reddit_client, supabase_client, db_config)

# Start collecting
collect.subreddit_submission(["python", "MachineLearning"], ["hot", "top"], limit=100)
That's it. Your data flows into three clean tables. And you can later download it with redditharbor.utils.download however you want — CSV for the Excel holdouts, JSON for the web folks, even the actual images if you're doing multimodal work. We don't judge.
We spent a few months making RedditHarbor boring. This is a compliment.
Three Collection Strategies, All Legal: Subreddit-based for community studies — grab hot, top, new, controversial posts. Keyword search with actual boolean operators that work ("renewable energy NOT (fossil OR coal)" does what you'd expect). Database-driven expansion — found interesting posts? Fetch their comments. Discovered relevant users? Track their posting history. It's breadth-first search for social media.
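To make the keyword semantics concrete, here is a toy matcher showing what a query like "renewable energy NOT (fossil OR coal)" means. This is an illustration of the boolean logic only, not RedditHarbor's implementation — the real search is delegated to Reddit's API:

```python
def matches(text, must=(), none_of=()):
    """Toy boolean matcher: all `must` terms present, no `none_of` terms."""
    t = text.lower()
    return all(term in t for term in must) and not any(term in t for term in none_of)

# "renewable energy NOT (fossil OR coal)" roughly translates to:
q = dict(must=["renewable", "energy"], none_of=["fossil", "coal"])

print(matches("New renewable energy subsidies announced", **q))  # True
print(matches("Renewable energy vs fossil fuels", **q))          # False
```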
Export Without Drama: download.submission().to_csv() for spreadsheet people. .to_json() for API folks. .to_img() when you realise half of Reddit is actually memes and screenshots. Pick your columns, set your path, get your data. No proprietary formats. No vendor lock-in. Just your data in whatever shape you need it.
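What those export paths boil down to is ordinary files on disk. A stdlib sketch of the idea — the row data, column picks, and output directory here are made up for illustration; in practice you call download.submission(...).to_csv() and friends:

```python
import csv
import json
from pathlib import Path

# Hypothetical rows, standing in for what the download utils pull from Supabase
rows = [
    {"submission_id": "abc123", "title": "Show off your project", "score": 412},
    {"submission_id": "def456", "title": "Training curves look weird", "score": 88},
]
columns = ["submission_id", "score"]  # pick your columns

out = Path("export")                  # set your path
out.mkdir(exist_ok=True)

# CSV for the spreadsheet people
with open(out / "submissions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)

# JSON for the API folks
(out / "submissions.json").write_text(json.dumps(rows, indent=2))
```

Plain CSV and JSON is the whole point: no proprietary formats means any downstream tool can read the output.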
Built-in PII Protection: Microsoft's Presidio under the hood. One flag — mask_pii=True — and suddenly John from Seattle becomes <PERSON> from <LOCATION>. Covers 12+ entity types. Your IRB will sleep better. Though sometimes it's overeager: "The Fed raised rates by 0.25%" becomes "The Fed raised rates by <NUMBER>%". Privacy has trade-offs.
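The masking behavior, including the overeager case, looks like this. A toy stand-in with three hardcoded regexes — Presidio actually uses NER models and covers 12+ entity types, so treat this purely as a demonstration of the output format:

```python
import re

# Toy patterns standing in for Presidio's recognizers (illustrative only)
PATTERNS = {
    "PERSON": r"\bJohn\b",      # real recognizers detect arbitrary names
    "LOCATION": r"\bSeattle\b",
    "NUMBER": r"\b\d+(?:\.\d+)?\b",
}

def mask_pii(text):
    """Replace each matched entity with an angle-bracketed placeholder."""
    for entity, pattern in PATTERNS.items():
        text = re.sub(pattern, f"<{entity}>", text)
    return text

print(mask_pii("John from Seattle"))              # <PERSON> from <LOCATION>
print(mask_pii("The Fed raised rates by 0.25%"))  # The Fed raised rates by <NUMBER>%
```

The second line shows the trade-off from the text: the number recognizer cannot tell an interest rate from a phone number, so it masks both.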
Temporal Tracking: Everyone else gives you snapshots — a post frozen at some random Tuesday at 3:47 PM. RedditHarbor tracks evolution. Set update.schedule_task("submission", "24hr") and watch it unfold. Upvote ratios collapsing as brigade arrives. Comment counts exploding during controversies. Score trajectories that reveal community dynamics. The scheduler auto-adjusts based on your data volume — updates every 10 minutes for small datasets, daily for massive ones. All within Reddit's 100 QPM limit.
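The scheduler's auto-adjustment follows from simple arithmetic on the rate limit. This sketch shows only that constraint, not the scheduler's actual internal heuristics:

```python
QPM_LIMIT = 100  # Reddit's queries-per-minute cap

def min_update_interval_minutes(tracked_items, qpm_budget=QPM_LIMIT):
    """Minutes required to refresh every tracked item once at the given budget."""
    return tracked_items / qpm_budget

print(min_update_interval_minutes(1_000))    # 10.0   -> refresh every ~10 minutes
print(min_update_interval_minutes(144_000))  # 1440.0 -> once a day
```

At 1,000 tracked submissions the full set fits in a 10-minute cycle; at 144,000 it takes all 1,440 minutes of the day, which is why large datasets fall back to daily updates.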
This is infrastructure. Boring, reliable infrastructure that every Reddit researcher needs but nobody should have to build. We built it once, properly, so you can focus on finding insights instead of finding bugs.
Version 1 solved the technical problem. Version 2 solves the legal nightmare.
We learned that collecting Reddit data legally requires navigating a maze of overlapping and contradictory requirements. GDPR says one thing. Reddit's ToS says another. Copyright law wants something else entirely.
So we're building PETLP directly into RedditHarbor. Answer simple questions about your research. Get bulletproof documentation. Keep redditors, IRBs, and lawyers happy. Compliance without the law degree.
But for now, version 1 still works fine. It's ready. It's free. It's boring in all the right ways.
RedditHarbor v0.1 is available now at github.com/socius-org/RedditHarbor. Version 2.0 is in development, built on lessons from our PETLP framework.
Built with respect for boring infrastructure at socius: Experimental Intelligence Lab
Reddit scraping with compliance built-in, not bolted on.
Scrapers:
Subreddit — Collect data from specific subreddits, whether you're interested in submissions, comments or user information.
Keyword — Collect submissions based on specific keywords from your desired subreddits.
Database — Leverage your existing database to collect additional relevant data, such as comments from specific submissions.
