RedditHarbor

Pushshift died. Academic Torrents hosts data we can't legally touch. Reddit's API is free but makes you build everything from scratch. Every Reddit researcher knows this dance: request API access, set up PRAW, write collection scripts, debug at 3am, discover your CSV schema was wrong three weeks in, rebuild. Again.

We realised every Reddit researcher was independently reinventing the same broken wheel. So we built RedditHarbor — one working wheel for everyone. Ten minutes from install to collecting data. No engineering degree required. No 3am debugging sessions. Just boring infrastructure that actually works.
The Day the Music Died (May 2023)

When Pushshift died in 2023, Reddit research got much harder overnight. Pushshift had quietly handled all the infrastructure work — parsing JSON, enabling bulk downloads, making temporal queries actually work. Researchers took it for granted until it was gone.

Now researchers face three options:

  1. Official API with DIY Infrastructure: Reddit offers a free API, and PRAW makes it accessible. But you still need to build everything else. Three weeks later, you might have a working pipeline. Three weeks of engineering that had nothing to do with your actual research.
  2. Academic Torrents: Someone, somewhere, scraped years of Reddit data and uploaded it. It's sitting there, terabytes of exactly what you need. But it comes with a disclaimer that the data 'may be copyright protected'. The legal equivalent of 'maybe safe, maybe not'.

We built RedditHarbor to be option three: legal, practical, and boring in all the right ways.

Ten Minutes from Zero to Pipeline

Here's what setup looks like. Takes about ten minutes. We timed it. Including the time to make coffee while Supabase spins up.

First, request Reddit API credentials at the Reddit support center (2 minutes of form-filling). They approve academic requests quickly.

Then spin up a free Supabase project at supabase.com.

Finally, install RedditHarbor and connect everything (8 minutes including database setup).


pip install RedditHarbor

import redditharbor.login as login
from redditharbor.dock.pipeline import collect

# One-time authentication setup
reddit_client = login.reddit(
    public_key="<your-reddit-public>",
    secret_key="<your-reddit-secret>",
    user_agent="YourUni:ProjectName (u/username)"
)
supabase_client = login.supabase(
    url="<your-supabase-url>",
    private_key="<your-service-role-key>"  # Not the anon key!
)

# Name your tables (we keep it simple)
db_config = {
    "user": "redditors",
    "submission": "submissions",
    "comment": "comments"
}

# Initialize the collector (named to avoid shadowing the imported class)
collector = collect(reddit_client, supabase_client, db_config)

# Start collecting: 100 hot and 100 top posts from each subreddit
collector.subreddit_submission(["python", "MachineLearning"], ["hot", "top"], limit=100)

That's it. Your data flows into three clean tables. And you can later download it with redditharbor.utils.download however you want — CSV for the Excel holdouts, JSON for the web folks, even the actual images if you're doing multimodal work. We don't judge.

The Boring Infrastructure Under the Hood

We spent a few months making RedditHarbor boring. This is a compliment.

Three Collection Strategies, All Legal: Subreddit-based for community studies — grab hot, top, new, controversial posts. Keyword search with actual boolean operators that work ("renewable energy NOT (fossil OR coal)" does what you'd expect). Database-driven expansion — found interesting posts? Fetch their comments. Discovered relevant users? Track their posting history. It's breadth-first search for social media.
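To make the boolean semantics concrete, here is a toy predicate in plain Python. This is not RedditHarbor's search (which passes the query through to Reddit); it only illustrates what a query like "renewable energy NOT (fossil OR coal)" is supposed to match.

```python
def matches(text: str) -> bool:
    """Toy illustration of the logic behind the query
    'renewable energy NOT (fossil OR coal)': both search terms
    must appear, and neither excluded term may appear."""
    t = text.lower()
    has_terms = "renewable" in t and "energy" in t
    excluded = "fossil" in t or "coal" in t
    return has_terms and not excluded

print(matches("Renewable energy subsidies announced"))    # True
print(matches("Renewable energy vs coal: a comparison"))  # False
```

The real query runs server-side, but the NOT/OR semantics shown here are what you should expect back.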

Export Without Drama: download.submission().to_csv() for spreadsheet people. .to_json() for API folks. .to_img() when you realise half of Reddit is actually memes and screenshots. Pick your columns, set your path, get your data. No proprietary formats. No vendor lock-in. Just your data in whatever shape you need it.
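An export like to_csv() ultimately just writes rows to disk, which means you can always post-process outside the library too. A stdlib-only sketch, with hypothetical field names (not RedditHarbor's actual schema):

```python
import csv
import json

# Illustrative rows, as submissions data might look after collection.
# These field names are made up for the example.
rows = [
    {"submission_id": "abc123", "subreddit": "python",
     "title": "Ask anything", "score": 42},
    {"submission_id": "def456", "subreddit": "MachineLearning",
     "title": "Paper discussion thread", "score": 17},
]

# CSV for the spreadsheet people
with open("submissions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()      # column names first
    writer.writerows(rows)    # one line per submission

# JSON for the API folks -- same data, different shape
with open("submissions.json", "w") as f:
    json.dump(rows, f, indent=2)
```

No proprietary formats involved at any step: it is plain tabular data all the way down.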

Built-in PII Protection: Microsoft's Presidio under the hood. One flag — mask_pii=True — and suddenly John from Seattle becomes <PERSON> from <LOCATION>. Covers 12+ entity types. Your IRB will sleep better. Though sometimes it's overeager: "The Fed raised rates by 0.25%" becomes "The Fed raised rates by <NUMBER>%". Privacy has trade-offs.
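To show what masking does, including how it overshoots, here is a toy stand-in. Real Presidio uses NER models and recognizers, not the hardcoded lookups and number regex below; this sketch only mimics the input/output behavior described above.

```python
import re

# Toy stand-in for Presidio-style masking. A real analyzer detects
# entities with NER models; these lookup sets are purely illustrative.
PERSONS = {"John", "Alice"}
LOCATIONS = {"Seattle", "London"}

def mask_pii(text: str) -> str:
    for name in PERSONS:
        text = text.replace(name, "<PERSON>")
    for place in LOCATIONS:
        text = text.replace(place, "<LOCATION>")
    # Deliberately over-eager, like the example above: all numbers
    # get masked, even ones that are not PII.
    text = re.sub(r"\d+(\.\d+)?", "<NUMBER>", text)
    return text

print(mask_pii("John from Seattle says the Fed raised rates by 0.25%"))
# -> <PERSON> from <LOCATION> says the Fed raised rates by <NUMBER>%
```

The over-masking of "0.25" is the trade-off mentioned above: higher recall on PII means occasional false positives on harmless numbers.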

Temporal Tracking: Everyone else gives you snapshots — a post frozen at some random Tuesday at 3:47 PM. RedditHarbor tracks evolution. Set update.schedule_task("submission", "24hr") and watch it unfold. Upvote ratios collapsing as a brigade arrives. Comment counts exploding during controversies. Score trajectories that reveal community dynamics. The scheduler auto-adjusts based on your data volume — updates every 10 minutes for small datasets, daily for massive ones. All within Reddit's 100 QPM limit.
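The auto-adjusting cadence might look something like the sketch below. To be clear, the tiers, the thresholds, and the one-query-per-row assumption are our illustration, not RedditHarbor's actual scheduler internals; the sketch only shows how an interval can be stretched to stay under a 100 QPM budget.

```python
def update_interval_minutes(n_rows: int, qpm_limit: int = 100) -> int:
    """Hypothetical auto-adjusting refresh cadence: small datasets
    refresh every 10 minutes, huge ones daily, and the interval is
    stretched further whenever a full refresh would not fit inside
    the rate limit (assuming roughly one query per tracked row)."""
    if n_rows <= 1_000:
        interval = 10        # small: every 10 minutes
    elif n_rows <= 100_000:
        interval = 60        # medium: hourly
    else:
        interval = 24 * 60   # massive: daily
    # Minutes needed to refresh every row at qpm_limit queries/minute
    # (ceiling division), so one pass never exceeds the budget.
    min_interval = -(-n_rows // qpm_limit)
    return max(interval, min_interval)

print(update_interval_minutes(500))        # 10
print(update_interval_minutes(1_000_000))  # 10000 -> stretched past daily
```

The key design point is the max(): volume-based tiers set the default, but the rate limit always wins.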

This is infrastructure. Boring, reliable infrastructure that every Reddit researcher needs but nobody should have to build. We built it once, properly, so you can focus on finding insights instead of finding bugs.

What Comes Next: PETLP-Powered Compliance (v2.0)

Version 1 solved the technical problem. Version 2 solves the legal nightmare.

We learned that collecting Reddit data legally requires navigating a maze of overlapping and contradictory requirements. GDPR says one thing. Reddit's ToS says another. Copyright law wants something else entirely.

So we're building PETLP directly into RedditHarbor. Answer simple questions about your research. Get bulletproof documentation. Keep redditors, IRBs, and lawyers happy. Compliance without the law degree.

But for now, version 1 still works fine. It's ready. It's free. It's boring in all the right ways.

RedditHarbor v0.1 is available now at github.com/socius-org/RedditHarbor. Version 2.0 is in development, built on lessons from our PETLP framework.

Built with respect for boring infrastructure at socius: Experimental Intelligence Lab