This is a simple data exploration of the /r/india subreddit to identify access patterns, just for fun. As this is a Jupyter notebook, you will see code interspersed with images and comments. This is intentional: it gives you an idea of how the data is processed and extracted, so you can be sure of the numbers, and it is also instructional if you are interested in doing data exploration like this. If you are not a coder, simply skip the code sections. :)
If you are not interested in how this data analysis was done, feel free to skip this section and head straight to the pretty graphs.
Data Extraction Process
The data set covers 09/2016 to 09/2017, i.e. a year's worth of data, downloaded from the Reddit public dataset on Google BigQuery at https://bigquery.cloud.google.com. I ran the following query to download submissions:
SELECT
id, created_utc, author, num_comments,
score, ups, downs, title, selftext, link_flair_text
FROM
`fh-bigquery.reddit_posts.201*`
WHERE
subreddit="india" and _TABLE_SUFFIX BETWEEN '6_09' and '7_09' and
num_comments > 1 and selftext != '[deleted]' and
selftext != '[removed]' and score > 2
And the following query to download comments:
SELECT
id, created_utc, author, body, score, ups, downs, parent_id
FROM
`fh-bigquery.reddit_comments.201*`
WHERE
subreddit="india" and _TABLE_SUFFIX BETWEEN '6_09' and '7_09' and
score > 0
As you can see, I did some filtering to reduce the data size, so I may be understating some numbers in this analysis. I had to export the query results to a custom table and then export that table to a Google Cloud Storage bucket; once the data is there, downloading it is relatively straightforward. This approach is much faster than using praw to query Reddit, because you get the data in bulk without making thousands of requests. Unfortunately, this public BigQuery dataset doesn't have user data such as when a user joined or their karma, which required me to make bulk queries with praw anyway. That code is included at the bottom of this post, if you are interested.
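For reference, here is a minimal sketch of pulling the exported files down from the bucket with the google-cloud-storage client; the bucket and file names are placeholders, not my actual export paths. The same thing can be done from the command line with gsutil cp.
# Sketch: download the exported Avro files from Google Cloud Storage.
# 'my-reddit-export-bucket' and the file names below are placeholders.
from google.cloud import storage
client = storage.Client()  # picks up your default GCP credentials
bucket = client.bucket('my-reddit-export-bucket')
for name in ['submissions.avro', 'comments.avro']:
    bucket.blob(name).download_to_filename('data/' + name)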
When exporting data from BigQuery, I used the "avro" export format as it carries the schema with the data, which makes it easy to parse in Python. Alternatively, one can use the csv or json formats; if you choose either of those, keep in mind that submission titles and comments may contain commas and other punctuation that can break parsing.
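As a quick illustration of how the Avro files are consumed, iterating a DataFileReader yields each record as a plain dict with the original field names. This is the same avro package used in the loader code at the bottom; the file path here is just an example.
# Sketch: peek at the first record of an exported Avro file.
from avro.datafile import DataFileReader
from avro.io import DatumReader
reader = DataFileReader(open('data/submissions.avro', 'rb'), DatumReader())
for record in reader:
    print record['title'], record['score']  # each record is a plain dict
    break
reader.close()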
Local Database Schema
Once I downloaded the data, I loaded it into a local sqlite database. Overall, the data isn't large enough to necessitate a heavyweight database, and performance isn't a concern since I have a reasonably powerful machine to load and process it. In the sqlite database I created three tables: submissions, comments and users. No awards for guessing what they hold. Here are the exact SQL statements I used to create those tables:
CREATE TABLE users
(user_id text PRIMARY KEY, user_since timestamp,
link_karma real, comment_karma real)
CREATE TABLE submissions
(submission_id text PRIMARY KEY, title text, content text,
link_flair_text text, posted_at timestamp, posted_by text,
upvotes int, downvotes int, score int, num_comments int)
CREATE TABLE comments
(comment_id text PRIMARY KEY, submission_id text, parent_comment_id text,
content text, posted_at timestamp, posted_by text, upvotes int,
downvotes int, score int)
Unfortunately, the BigQuery public dataset for comments doesn't contain any submission identifier for a comment other than parent_id, so it takes some preprocessing to fill in the submission id, sketched below.
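The parent_id field is a Reddit "fullname": a t3_ prefix points directly at a submission, while a t1_ prefix points at another comment, so top-level comments resolve in one step and nested ones have to walk up the chain. Here is a rough sketch of that preprocessing, assuming the comments table above is already populated; parents that were filtered out by the score cutoff simply stay unresolved.
# Rough sketch: fill in comments.submission_id by following parent_id
# fullnames (t3_ = submission, t1_ = comment).
import sqlite3
conn = sqlite3.connect('reddit.db')
cur = conn.cursor()
# Map every comment id to its parent fullname.
parents = dict(cur.execute(
    "select comment_id, parent_comment_id from comments").fetchall())
def resolve_submission(comment_id):
    parent = parents.get(comment_id)
    while parent is not None and parent.startswith('t1_'):
        parent = parents.get(parent[3:])  # strip the prefix, walk up one level
    if parent is not None and parent.startswith('t3_'):
        return parent[3:]  # bare submission id
    return None  # parent comment was filtered out of the dataset
for comment_id in list(parents):
    submission_id = resolve_submission(comment_id)
    if submission_id:
        cur.execute("update comments set submission_id=? where comment_id=?",
                    (submission_id, comment_id))
conn.commit()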
I gathered details of all the users who posted at least one submission or comment during the time period by making queries to the Reddit API, and stored those details in the users table.
In the sections below, you will see that most of the data is extracted with simple SQL queries against the above tables. I will have a separate post that does more post-processing on this data, but for now this should be a good start.
%matplotlib inline
import matplotlib.pyplot as plt
import sqlite3
import tabulate
from datetime import datetime
from IPython.display import HTML, display
%config InlineBackend.figure_format = 'retina'
conn=sqlite3.connect('reddit.db')
cursor=conn.cursor()
def plot_data(xdata, ydata, xlabel, ylabel, title,
kind='bar', color='blue', width=15, imwidth=12, imheight=4):
fig = plt.figure(figsize=(imwidth, imheight), dpi=80)
fig.suptitle(title, fontsize=20)
plt.xlabel(xlabel, fontsize=18)
plt.ylabel(ylabel, fontsize=16)
plt.bar(xdata, ydata, width=width, color=color)
Distribution of User Activity By Month
This subreddit has been seeing a lot more activity these days, so I wanted to see how dramatic the increase has been over the last year. It turns out there was a massive spike around November 2016 that has mostly persisted. I thought the spike might have been caused by demonetization, but I didn't know whether it came from new users or from existing users participating more, so I looked at users who joined during September-December 2016, including time before and after the event to see whether many more people joined in that window. It turns out that wasn't the case: there was no spike of new users, just existing users participating more.
# Monthly stats
submissions_monthly = cursor.execute(
"select date(posted_at, 'start of month') as month, count(*) from submissions group by month").fetchall()
comments_monthly = cursor.execute(
"select date(posted_at, 'start of month') as month, count(*) from comments group by month").fetchall()
users_monthly = cursor.execute(
"select date(user_since, 'start of month') as month, count(*) from users group by month").fetchall()
new_users_sept_dec = cursor.execute(
"""select date(user_since, 'weekday 1') as day, count(*) as count
from users where day>'2016-09-01' and day<='2016-12-31' group by day""").fetchall()
plot_data([datetime.strptime(x, '%Y-%m-%d') for x,y in submissions_monthly],
[y for x,y in submissions_monthly],
'Month', 'Submissions', 'Submissions Per Month', color='blue')
plot_data([datetime.strptime(x, '%Y-%m-%d') for x,y in comments_monthly],
[y for x,y in comments_monthly],
'Month', 'Comments', 'Comments Per Month', color='red')
plot_data([datetime.strptime(x, '%Y-%m-%d') for x,y in users_monthly],
[y for x,y in users_monthly],
'Month', 'Users', 'Users Joined Per Month', color='green')
plot_data([datetime.strptime(x, '%Y-%m-%d') for x,y in new_users_sept_dec],
[y for x,y in new_users_sept_dec],
'Day', 'Users', 'Users Joined Per Day Before And After Demonetization',
color='green', width=3, imwidth=12)
Distribution of User Activity across different days of the Week
I wanted to know if we are more active on any particular day of the week. There is minor variation, but nothing notable. It truly seems to be a daily pastime for everyone on the subreddit.
# Weekly stats
submissions_weekday = cursor.execute(
"select weekday, avg(count) from (select strftime('%w', posted_at) as weekday, strftime('%W', posted_at) as week, count(*) as count from submissions group by week, weekday) group by weekday").fetchall()
comments_weekday = cursor.execute(
"select weekday, avg(count) from (select strftime('%w', posted_at) as weekday, strftime('%W', posted_at) as week, count(*) as count from comments group by week, weekday) group by weekday").fetchall()
users_weekday = cursor.execute(
"select weekday, avg(count) from (select strftime('%w', user_since) as weekday, strftime('%W', user_since) as week, count(*) as count from users group by week, weekday) group by weekday").fetchall()
plot_data([int(x) for x,y in submissions_weekday], [int(y) for x,y in submissions_weekday],
'Weekday', 'Average Submissions', 'Average Submissions on a Weekday',
color='blue', width=0.5)
plot_data([int(x) for x,y in comments_weekday], [int(y) for x,y in comments_weekday],
'Weekday', 'Average Comments', 'Average Comments on a Weekday',
color='red', width=0.5)
plot_data([int(x) for x,y in users_weekday], [int(y) for x,y in users_weekday],
'Weekday', 'Average Users', 'Average Users Joined on a Weekday',
color='green', width=0.5)
Distribution of User Activity across different times of the Day
It looks like the first thing people do after getting to the office is open Reddit. I guess one could use this information to gain traction when posting submissions or comments. Other than that, nothing unexpected here. (The queries below add 5.5 hours to the UTC timestamps to shift them to IST.)
# Hourly stats
submissions_hourly = cursor.execute(
"select (5.5+hour) as hour, avg(count) from (select strftime('%H', posted_at) as hour, date(posted_at) as date, count(*) as count from submissions group by date, hour) group by hour").fetchall()
comments_hourly = cursor.execute(
"select (5.5+hour) as hour, avg(count) from (select strftime('%H', posted_at) as hour, date(posted_at) as date, count(*) as count from comments group by date, hour) group by hour").fetchall()
users_hourly = cursor.execute(
"select (5.5+hour), sum(count) from (select strftime('%H', user_since) as hour, date(user_since) as date, count(*) as count from users group by date, hour) group by hour").fetchall()
plot_data([int(x) for x,y in submissions_hourly], [int(y) for x,y in submissions_hourly],
'Hours in Day', 'Average Submissions', 'Average Submissions at different times of the day',
color='blue', width=0.5)
plot_data([int(x) for x,y in comments_hourly], [int(y) for x,y in comments_hourly],
'Hours in Day', 'Average Comments', 'Average Comments at different times of the day',
color='red', width=0.5)
plot_data([int(x) for x,y in users_hourly], [int(y) for x,y in users_hourly],
'Hours in Day', 'Total Users Joined', 'Total Users Joined at different times of the day',
color='green', width=0.5)
User Submission and Comment Metrics
I was wondering whether users who joined during a particular month are more or less likely to submit or comment. There are some spikes, but I haven't had time to fully investigate.
In a similar vein, I was wondering whether users who joined in a certain month get voted higher for some reason, and there are indeed some spikes. I think this may be because of a few highly active users who joined during that time; see the sketch after the plots below for one way to check. I will investigate it later.
# Number of submissions by authors age
user_submissions = cursor.execute(
"select date(user_since, 'start of month') as month, avg(count) from users join (select posted_by, count(*) as count from submissions group by posted_by) on users.user_id==posted_by group by month").fetchall()
user_comments = cursor.execute(
"select date(user_since, 'start of month') as month, avg(count) from users join (select posted_by, count(*) as count from comments group by posted_by) on users.user_id==posted_by group by month").fetchall()
plot_data([datetime.strptime(x, '%Y-%m-%d') for x,y in user_submissions],
[int(y) for x,y in user_submissions], 'Date', 'User Submissions',
'Average User Submissions Based on User Joining Date', color='blue', width=15)
plot_data([datetime.strptime(x, '%Y-%m-%d') for x,y in user_comments],
[y for x,y in user_comments], 'Date', 'User Comments',
'Average User Comments Based on User Joining Date', color='red', width=15)
# Score of submissions by authors age
user_submission_scores = cursor.execute(
"select date(user_since, 'start of month') as month, avg(score) from users join (select posted_by, sum(score) as score from submissions group by posted_by) on users.user_id==posted_by group by month").fetchall()
user_comment_scores = cursor.execute(
"select date(user_since, 'start of month') as month, avg(score) from users join (select posted_by, sum(score) as score from comments group by posted_by) on users.user_id==posted_by group by month").fetchall()
plot_data([datetime.strptime(x, '%Y-%m-%d') for x,y in user_submission_scores],
[y for x,y in user_submission_scores], 'Date', 'Average Score',
'Average Score of Submissions Based on User Age', color='blue', width=15)
plot_data([datetime.strptime(x, '%Y-%m-%d') for x,y in user_comment_scores],
[y for x,y in user_comment_scores], 'Date', 'Average Score',
'Average Score of Comments Based on User Age', color='red', width=15)
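As a starting point for that follow-up, a sketch like the one below lists the heaviest contributors among users who joined in a given month, to see whether a handful of accounts explain a spike; the month value here is only an example.
# Sketch: top commenters among users who joined in a given (example) month.
spike_month = '2013-05-01'
top_from_month = cursor.execute(
    """select posted_by, count(*) as count from comments
       join users on users.user_id == posted_by
       where date(user_since, 'start of month') = ?
       group by posted_by order by count desc limit 10""",
    (spike_month,)).fetchall()
display(HTML(tabulate.tabulate(top_from_month, headers=['Username', 'Comments'], tablefmt='html')))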
Most Active Users
These are the users who made the most submissions or the most comments. Anyone who worries about AI taking away jobs has one more data point here: instead of recruiting hundreds of people, we have bots doing quite a lot of the grunt work. I am pretty sure all those humans currently recruited to post large numbers of messages against other political parties and companies will be next. You know who you are.
# Most frequent submission authors
data = cursor.execute(
"select posted_by, count(*) as count from submissions group by posted_by order by count desc").fetchmany(11)[1:]
display(HTML(tabulate.tabulate(data, headers=['Username', 'Submissions'], tablefmt='html')))
Username | Submissions |
---|---|
gcs8 | 643 |
pannagasamir | 610 |
swacchreddit | 556 |
aryaninvader | 481 |
modiusoperandi | 458 |
TemptNotTheBlade | 454 |
hipporama | 439 |
HornOK | 323 |
ppatra | 322 |
darklordind | 316 |
# Most frequent comment authors
data = cursor.execute("select posted_by, count(*) as count from comments group by posted_by order by count desc").fetchmany(11)[1:]
display(HTML(tabulate.tabulate(data, headers=['Username', 'Comments'], tablefmt='html')))
Username | Comments |
---|---|
samacharbot2 | 33221 |
AwkwardMod | 26021 |
TemptNotTheBlade | 11680 |
anti_anti_adblock | 9350 |
gcs8 | 6639 |
bhiliyam | 6562 |
donoteatthatfrog | 5686 |
swacchreddit | 5569 |
ppatra | 5558 |
charavaka | 5297 |
Most Popular Users
These are the users who received the highest total scores for their submissions and comments. It looks like these highly active users warrant some investigation.
# Most popular (based on score) submission authors
data = cursor.execute("select posted_by, sum(score) as score from submissions group by posted_by order by score desc").fetchmany(11)[1:]
display(HTML(tabulate.tabulate(data, headers=['Username', 'Submission Score'], tablefmt='html')))
Username | Submission Score |
---|---|
hipporama | 43363 |
pannagasamir | 34396 |
wordswithmagic | 28209 |
aryaninvader | 26022 |
TemptNotTheBlade | 22450 |
gcs8 | 22383 |
ironypatrol | 21829 |
modiusoperandi | 21447 |
spikyraccoon | 19153 |
ab0mI | 17806 |
# Most popular (based on score) comment authors
data = cursor.execute("select posted_by, sum(score) as score from comments group by posted_by order by score desc").fetchmany(11)[1:]
display(HTML(tabulate.tabulate(data, headers=['Username', 'Comment Score'], tablefmt='html')))
Username | Comment Score |
---|---|
TemptNotTheBlade | 61177 |
Not_a_kulcha | 59741 |
ribiy | 46236 |
samacharbot2 | 39752 |
swacchreddit | 37643 |
AwkwardMod | 37398 |
modiusoperandi | 35069 |
charavaka | 33366 |
bhiliyam | 32679 |
GoldPisseR | 32073 |
Submissions by Flair
I think politics takes up quite a lot of space in the Indian mind, and it shows in the data: ~24% of posts were tagged with the Politics flair. I am actually surprised that we had more Non-Political posts than Politics posts.
With respect to submission score, taking a direct average is useless because of the many custom flairs. If I filter out the flairs with fewer than 10 posts, we get a much more representative sample. From that, excluding subreddit-management flairs, Politics ranks as only the sixth highest-rated flair, with AMA, Sports, Reddiquette, Non-Political and Entertainment taking the top spots. That sounds reasonable. If submission score is taken as a proxy for user interest, people are more interested in other things than politics, yet more political posts get made than people seem interested in. I wonder how that happens.
# Post count by flair
data = cursor.execute("select link_flair_text, (count*100.0/(select count(*) from submissions)) as percent from (select link_flair_text, count(*) as count from submissions group by link_flair_text) order by percent desc").fetchmany(15)
display(HTML(tabulate.tabulate(data, headers=['Flair', 'Submission Percentage'], tablefmt='html')))
Flair | Submission Percentage |
---|---|
Non-Political | 28.5036 |
Politics | 23.8635 |
AskIndia | 13.3852 |
Policy/Economy | 6.13779 |
Demonetization | 3.90912 |
Business/Finance | 3.07831 |
[R]eddiquette | 2.77312 |
Science/Technology | 2.6469 |
Entertainment | 2.39822 |
Policy & Economy | 1.91406 |
Sports | 1.09079 |
Science & Technology | 1.07195 |
Policy | 1.00224 |
Repost. | 0.855297 |
Not about India. | 0.817618 |
# Post score by flair
data = cursor.execute("select link_flair_text, score from (select link_flair_text, avg(score) as score, count(*) as count from submissions group by link_flair_text) where count > 10 order by score desc").fetchmany(15)
display(HTML(tabulate.tabulate(data, headers=['Flair', 'Submission Score'], tablefmt='html')))
Flair | Submission Score |
---|---|
Misleading | 167.048 |
Moderated | 161.8 |
AMA | 124.183 |
Unverified | 112.259 |
Sports | 91.9378 |
Casual AMA | 80.9167 |
TIL Rule Violation. | 76.617 |
[R]eddiquette | 73.7887 |
Non-Political | 66.3399 |
Entertainment | 61.1406 |
Politics | 59.652 |
Unverified Content. | 59.6022 |
Science/Technology | 57.1181 |
Meta. | 56.6875 |
Policy/Economy | 55.9699 |
Word Cloud of Submissions
What data analysis is complete without great-looking word clouds? So here they are. I think the sizes of the words should surprise no one. I was expecting a much bigger "Modi", but I guess other, less important words took up too much space.
import base64
import operator
from wordcloud import WordCloud
# Submission title word cloud
submission_title = ''
for b64text in cursor.execute("select title from submissions").fetchall():
text = base64.b64decode(b64text[0])
submission_title += text + ' '
title_wordcloud = WordCloud(width=1200, height=600, background_color='white').generate(submission_title)
fig = plt.figure(figsize=(12,6), dpi=80)
plt.imshow(title_wordcloud, interpolation='bilinear')
plt.title('Wordcloud of Submission Titles', fontsize=20)
plt.axis("off")
# Submission text word cloud
submission_text = ''
for b64text in cursor.execute("select content from submissions").fetchall():
text = base64.b64decode(b64text[0])
submission_text += text + ' '
text_wordcloud = WordCloud(width=1200, height=600, background_color='white').generate(submission_text)
fig = plt.figure(figsize=(12,6), dpi=80)
plt.imshow(text_wordcloud, interpolation='bilinear')
plt.title('Wordcloud of Submission Content', fontsize=20)
plt.axis("off")
# Comments text word cloud
comment_text = ''
for b64text in cursor.execute("select content from comments where score > 5").fetchall():
text = base64.b64decode(b64text[0])
comment_text += text + ' '
comment_wordcloud = WordCloud(width=1200, height=600, background_color='white').generate(comment_text)
fig = plt.figure(figsize=(12,6), dpi=80)
plt.imshow(comment_wordcloud, interpolation='bilinear')
plt.title('Wordcloud of Comment Content', fontsize=20)
plt.axis("off")
Code and Extras
The code for loading the data into the database is below.
import avro
import base64
import io
import os
import praw
import sqlite3
import sys
from avro.datafile import DataFileReader
from avro.io import DatumReader
from datetime import datetime
from multiprocessing.dummy import Pool as ThreadPool
epoch = datetime(1970,1,1)
def safe_int(val):
if val != None:
return val
else:
return 0
class RedditDataLoader(object):
def __init__(self):
self.conn = sqlite3.connect('reddit.db')
def get_author_info(self, reddit, author_name):
try:
author = reddit.redditor(author_name)
user_since = datetime.utcfromtimestamp(author.created_utc)
return author_name, "'%s', '%s', %f, %f" % (author_name, user_since, author.link_karma, author.comment_karma)
except Exception, e:
print 'Failed with exception', e
return author_name, None
def fetch_author_info(self):
reddit = praw.Reddit(client_id='YOUR_REDDIT_CLIENT_ID', client_secret='SECRET',
username='USERNAME', password='PASSWORD', user_agent='script')
cursor = self.conn.cursor()
submission_authors = [item[0] for item in cursor.execute('select distinct(posted_by) from submissions').fetchall()]
comment_authors = [item[0] for item in cursor.execute('select distinct(posted_by) from comments').fetchall()]
all_authors = sorted(list(set(submission_authors + comment_authors)))
print 'Found %d authors' % len(all_authors)
pool = ThreadPool(12)
all_authors_data = pool.map(lambda author: self.get_author_info(reddit, author), all_authors)
pool.close()
pool.join()
print 'Done fetching data'
for author_name, author_details in all_authors_data:
if author_details==None: continue
self._add_author_details(cursor, author_name, author_details)
def read_data(self, submission_file_name, comment_file_name):
cursor = self.conn.cursor()
# read and load submissions
print 'Reading and loading submissions data'
submission_file = DataFileReader(open(submission_file_name, 'rb'), DatumReader())
for submission in submission_file:
self._add_submission_details(cursor, submission)
# read and load comments
print 'Reading and loading comments data'
comment_file = DataFileReader(open(comment_file_name, 'rb'), DatumReader())
for comment in comment_file:
self._add_comment_details(cursor, comment)
def create_tables(self):
c = self.conn.cursor()
# If some tables exist, we don't need to create new ones
c.execute("SELECT count(*) FROM sqlite_master where type='table'")
if c.fetchone()[0] > 0:
return
c.execute('''CREATE TABLE users
(user_id text PRIMARY KEY, user_since timestamp,
link_karma real, comment_karma real)''')
c.execute('''CREATE TABLE submissions
(submission_id text PRIMARY KEY, title text, content text,
link_flair_text text, posted_at timestamp, posted_by text,
upvotes int, downvotes int, score int, num_comments int)''')
c.execute('''CREATE TABLE comments
(comment_id text PRIMARY KEY, submission_id text, parent_comment_id text,
content text, posted_at timestamp, posted_by text,
upvotes int, downvotes int, score int)''')
self.conn.commit()
def _add_submission_details(self, cursor, submission):
try:
created_at = datetime.utcfromtimestamp(int(submission['created_utc']))
cursor.execute("SELECT * FROM submissions WHERE submission_id='%s'" % submission['id'])
if cursor.fetchone() != None:
return
values = "'%s', '%s', '%s', '%s', '%s', '%s', %d, %d, %d, %d" % (
submission['id'], base64.b64encode(submission['title'].encode('utf-8')),
base64.b64encode(submission['selftext'].encode('utf-8')), submission['link_flair_text'],
created_at, submission['author'], safe_int(submission['ups']), safe_int(submission['downs']),
safe_int(submission['score']), safe_int(submission['num_comments']))
cursor.execute("INSERT INTO submissions VALUES (%s)" % values)
self.conn.commit()
except ValueError,e:
pass
except Exception, e:
print submission, e
def _add_comment_details(self, cursor, comment):
created_at = datetime.utcfromtimestamp(int(comment['created_utc']))
cursor.execute("SELECT * FROM comments WHERE comment_id='%s'" % comment['id'])
if cursor.fetchone() != None:
return
values = "'%s', '', '%s', '%s', '%s', '%s', %d, %d, %d" % (
comment['id'], comment['parent_id'],
base64.b64encode(comment['body'].encode('utf-8')),
created_at, comment['author'], safe_int(comment['ups']), safe_int(comment['downs']), safe_int(comment['score']))
cursor.execute("INSERT INTO comments VALUES (%s)" % values)
self.conn.commit()
def _add_author_details(self, cursor, author_name, author_details):
try:
cursor.execute("SELECT * FROM users WHERE user_id='%s'" % author_name)
if cursor.fetchone() != None:
return
cursor.execute('INSERT INTO users VALUES (%s)' % author_details)
self.conn.commit()
except Exception, e:
print e
if __name__=='__main__':
data_loader = RedditDataLoader()
data_loader.create_tables()
data_loader.read_data('data/submissions.avro', 'data/comments.avro')
data_loader.fetch_author_info()
print 'Done loading'