Reddit is one of my favorite sites. It has a fantastic community that generates a wealth of interesting data.
Spidering image posts to Reddit
I wanted to play around with a slice of this data (image posts). Getting all these images was kind of annoying, particularly because Reddit’s API only returns 1000 search results, and Reddit rate limits you to one request every two seconds. So I wrote a cron script to suck down all new image posts every hour.
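Here is a minimal sketch of how that two-second spacing between requests could be enforced. It is not the original cron script: the `RateLimiter` class and its parameters are hypothetical names chosen for illustration, and the clock and sleep functions are injectable only so the behavior is easy to check.

```python
import time


class RateLimiter:
    """Spaces calls at least `min_interval` seconds apart (hypothetical helper)."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous request, if any

    def wait(self):
        """Block until it is safe to make the next request."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

An hourly job would create one `RateLimiter()` and call `wait()` right before each request, so a long run never exceeds one request every two seconds.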
But I also wanted accurate scores and comment counts. So I had another cron script re-download information on every post 1 week after it went live (to allow time to gather up/down votes and comments).
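The one-week delay boils down to a simple age check. The sketch below assumes each stored post is a dict with a `created` timestamp and an optional `refreshed` flag; both field names are made up for this example and are not taken from the real scripts.

```python
from datetime import datetime, timedelta, timezone

# Wait this long after posting before re-downloading scores and comment counts.
REFRESH_DELAY = timedelta(weeks=1)


def posts_due_for_refresh(posts, now):
    """Return posts that are at least a week old and not yet re-downloaded.

    `created` and `refreshed` are hypothetical field names.
    """
    return [
        post
        for post in posts
        if not post.get("refreshed") and now - post["created"] >= REFRESH_DELAY
    ]
```

A second cron job could call this with `datetime.now(timezone.utc)`, re-fetch each returned post, and then mark it as refreshed.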
I wish Reddit would distribute an occasional database snapshot. That would be a great data source to play with. But, alas, they don’t, so I’ve had these scripts running for about 5 months.
If you’d like to play with the image links I spidered, you can use my snapshot (230,248 images, 19 megs compressed, JSON output from SOLR):
The structure of each entry looks like:
"title":"the single-most sad moment of my childhood",
I wanted to build something that surfaced interesting new photos with minimal effort. Reddit is great first thing in the morning when it’s full of fresh content. But what if I want to goof off and it’s only been 15 minutes since I last browsed Reddit?
I can keep paginating further and further away from the front page. But then I’m constantly scanning links to see if I’ve visited them. And there’s no way to tell when new content is on the front page.
What I really wanted was a single button that I could click as much as I wanted, each time getting a new photo. I don’t really care to see the most recent photo — just give me a reasonably good one, and make it easy to keep seeing more.
Every time you want a new image, just click the “NEXT” button. To make that work:
- I threw the content into a SOLR index, which allows filtering by subreddit as well as random sorting.
- To improve quality, the default view filters out images with a score under 100 (which eliminates the bottom 80%).
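A rough sketch of what such a query might look like, written as Solr request parameters. It assumes the schema has `score` and `subreddit` fields plus a `random_*` dynamic field of type `RandomSortField` (Solr's standard way to get random ordering); those field names and the function names are my assumptions, not the actual index layout.

```python
import random
from urllib.parse import urlencode


def build_random_image_params(subreddit=None, min_score=100, seed=None):
    """Build Solr params for one random image at or above `min_score`.

    Assumes hypothetical `score` and `subreddit` fields and a `random_*`
    dynamic field backed by RandomSortField.
    """
    if seed is None:
        seed = random.randint(0, 2**31 - 1)  # new seed gives a new ordering
    params = [
        ("q", "*:*"),
        ("fq", f"score:[{min_score} TO *]"),  # drop low-scoring images
        ("sort", f"random_{seed} asc"),       # random order for this seed
        ("rows", "1"),
        ("wt", "json"),
    ]
    if subreddit:
        params.append(("fq", f'subreddit:"{subreddit}"'))
    return params


def to_query_string(params):
    """URL-encode the parameter list for a Solr /select request."""
    return urlencode(params)
```

Each click on “NEXT” would use a fresh seed, so the same filters return a different image each time.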
Synchronized Viewing Experience
Perhaps the most novel feature is “synchronized viewing”. Everyone who visits the image-viewing page gets a distinct channel, identified in the page’s URL. If you share this URL with a friend, you’ll both see the same images: each time one of you clicks “NEXT”, the same new image appears for both of you.
To build this feature, I used PubNub, which is a great service for publishing and subscribing to channels. It’s really easy to set up.
Every time someone visits the page, I give them a random channel (if they don’t already have one). And then every time they view another image, I publish that image’s id on the channel.
The page also subscribes to this channel, so everyone else on the same channel will receive the image id and display it.
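To make the flow concrete, here is a small in-memory stand-in for the publish/subscribe pattern. It does not use the PubNub API at all: `ChannelHub` and `ensure_channel` are hypothetical names, and the hub only mimics the idea of channels within a single process.

```python
import secrets


class ChannelHub:
    """In-memory stand-in for a pub/sub service (not the PubNub API)."""

    def __init__(self):
        self._subscribers = {}  # channel name -> list of callbacks

    def subscribe(self, channel, callback):
        """Register a callback for every message published on `channel`."""
        self._subscribers.setdefault(channel, []).append(callback)

    def publish(self, channel, message):
        """Deliver `message` to everyone subscribed to `channel`."""
        for callback in list(self._subscribers.get(channel, [])):
            callback(message)


def ensure_channel(existing=None):
    """Keep the visitor's channel if they have one, else make a random one."""
    return existing or secrets.token_urlsafe(8)
```

Two visitors who share a channel would each subscribe a callback that displays an image. When either one clicks “NEXT” and publishes an image id, both callbacks fire with the same id, while visitors on other channels see nothing.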
I found the image explorer most fun when the images had immediate impact (like a beautiful landscape photo rather than a rage comic). I also discovered lots of subreddits I never knew existed.
Some images I like:
A bullet going through some M&Ms
Just a baby hippo taking a bath
Sunrise reflected in a bubble
In addition to PubNub, I’d also like to thank imgur.com for hosting all these images for free.