PySpark with Reddit Data

Using PySpark, my team and I analyzed Reddit data to answer a set of research questions. The primary purpose of the assignment was to build experience with PySpark, so to test our limits we set an ambitious goal: work with the first 10 million rows of Reddit comments from May 2015 (obtained from Kaggle). We had three research questions:

  1. Can we predict sentiment using NLP of Reddit comments?

  2. Can we predict which subreddit a comment came from based on its text and some metadata features?

  3. Can we help predict post removal using tagged user flairs?
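To give a sense of the scale involved, here is a minimal sketch of how a dump like this can be loaded and capped in PySpark. The file name and read options are illustrative assumptions, not our exact job configuration.

```python
from pyspark.sql import SparkSession

# Spin up a session and pull in the Kaggle dump of May 2015 comments.
spark = SparkSession.builder.appName("reddit-comments").getOrCreate()

comments = (
    spark.read
    .option("header", True)
    .csv("reddit_comments_may_2015.csv")  # hypothetical file name
    .limit(10_000_000)                    # cap at the first 10 million rows
)
```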

For my portion of the project, I focused on the second question. After pulling the raw data from Kaggle, I used Spark to remove irrelevant fields and perform exploratory data analysis (EDA). The data had significant problems, including rows whose columns were misaligned, so I ran diagnostics to identify the affected rows and removed them.
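The sketch below shows the kind of field selection and row-level diagnostics this involved. The kept columns and the misalignment check (text spilling into numeric fields) are assumptions about the schema rather than my exact checks.

```python
from pyspark.sql import functions as F

# Keep only the fields relevant to the subreddit-prediction question.
kept = comments.select("id", "subreddit", "body", "score", "created_utc", "gilded")

# Misaligned rows put text where numeric fields belonged, so casting those
# fields to numeric types turns the bad values into nulls we can drop.
clean = (
    kept
    .withColumn("score", F.col("score").cast("int"))
    .withColumn("created_utc", F.col("created_utc").cast("long"))
    .withColumn("gilded", F.col("gilded").cast("int"))
    .dropna(subset=["score", "created_utc", "gilded", "body", "subreddit"])
)
```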

Once I was confident the data was clean, I used Spark to analyze it and select the top 100 subreddits (the point just before comment counts dropped off sharply). I then built a Spark ML pipeline that used Word2Vec to embed the comment text, combined those embeddings with the metadata features already present, and fed the result into a random forest model.
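A condensed sketch of that pipeline is below. The metadata columns, vector size, and tree count are illustrative assumptions; the overall shape (top-100 subreddit filter, Word2Vec on the comment text, assembled features into a random forest) follows the approach described above.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, Tokenizer, VectorAssembler, Word2Vec
from pyspark.sql import functions as F

# Keep only comments from the 100 most active subreddits.
top_subreddits = [
    row["subreddit"]
    for row in clean.groupBy("subreddit")
                    .count()
                    .orderBy(F.desc("count"))
                    .limit(100)
                    .collect()
]
subset = clean.filter(F.col("subreddit").isin(top_subreddits))

# Text -> tokens -> word2vec embedding, combined with numeric metadata,
# then fed into a multiclass random forest over the subreddit label.
tokenizer = Tokenizer(inputCol="body", outputCol="tokens")
word2vec = Word2Vec(inputCol="tokens", outputCol="text_vec", vectorSize=100)
label_indexer = StringIndexer(inputCol="subreddit", outputCol="label")
assembler = VectorAssembler(
    inputCols=["text_vec", "score", "gilded"], outputCol="features"
)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)

pipeline = Pipeline(stages=[tokenizer, word2vec, label_indexer, assembler, rf])
model = pipeline.fit(subset)
```

Because every step lives in a single Pipeline object, the same code scales from a small sample to the full dataset with no structural changes.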

It was immediately clear we didn't have nearly enough memory assigned to us in our high-performance cloud computing environment to fully pursue this research question (if the question is even answerable at this scale). However, the pipeline we created was ready to handle any amount of data we had the resources to throw at it.
