Flair Detector for Reddit Data

May 1, 2020

Code

Flair Detector is a web application deployed on Heroku. The application can be seen here. Given a post's URL, it scrapes the post and then uses a Random Forest model to predict the post's flair.

The project contains code for the following steps:

1. Scraping posts from the Indian subreddit using two methods: the hottest posts, and posts distributed across flairs.
2. Exploratory Data Analysis: contains bar graphs and pie charts to analyse the data distributions of the attributes collected in step 1.
3. Textual Pre-Processing: pre-processes the textual attributes, such as the title and the content of each post.
4. Building and contrasting different models: builds, trains and validates four ML models (Naive Bayes, Random Forest, Logistic Regression and Multi-layer Perceptron) using different features, then selects the best-performing feature-model pair.
5. Building the Web Application: contains the code for a basic web application that takes a post URL as input and displays the predicted flair of the post.
6. Deployment: hosts the app on Heroku.
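
The textual pre-processing step could look roughly like the sketch below. The exact cleaning steps used in the project are not listed in this post, so the lowercasing, URL stripping, punctuation removal and the tiny stop-word set are assumptions made for illustration, and `clean_text` is a hypothetical helper rather than a function from the project code.

```python
import re
import string

# Small illustrative stop-word set (an assumption; a real pipeline would
# probably use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on", "for"}

URL_RE = re.compile(r"https?://\S+")
# Map every punctuation character to a space so words stay separated.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text):
    """Lowercase, drop URLs and punctuation, and remove stop words.

    Missing values (None) are mapped to an empty string.
    """
    if text is None:
        return ""
    text = URL_RE.sub(" ", text.lower())
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)
```

The same function can be applied to both the title and the content of a post, so posts with no content simply contribute an empty string.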

My experimental log of how I designed the project, along with the project code and documentation on how to run it on your local machine, can be found here.

Good things about the project:

Detailed Documentation
Reliable predictions for Photography posts (in general)

Scope for improvement:

Prediction. I realised a little late that the 71% accuracy obtained using the title as the feature (with Random Forest as the learning algorithm) was achieved because the title contained the flair in most cases. This was odd and is discussed in more detail in the Experimental log.ipynb file. Given more time, I would have liked to:
- incorporate comments and content together with the title (handling NaNs appropriately)
- use better learning algorithms
- collect more data
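
One way to quantify this leak is to count how often a post's flair appears verbatim in its title, and one way to build the combined feature is to join the text fields while treating missing values as absent. The sketch below assumes each post is a dict with the hypothetical keys `title`, `content`, `comments` and `flair`; neither function is taken from the project code.

```python
def flair_in_title_rate(posts):
    """Fraction of posts whose title contains the flair (case-insensitive)."""
    if not posts:
        return 0.0
    hits = sum(
        1 for p in posts
        if p["flair"].lower() in (p.get("title") or "").lower()
    )
    return hits / len(posts)


def combine_fields(post, fields=("title", "content", "comments")):
    """Join the text fields into one string, skipping missing values."""
    parts = []
    for name in fields:
        value = post.get(name)
        # Treat None and float NaN (NaN is the only value unequal to
        # itself) as missing.
        if value is None or value != value:
            continue
        parts.append(str(value))
    return " ".join(parts)
```

A high value from `flair_in_title_rate` would suggest that a title-only model is largely reading the label off the input, in which case stripping the flair from titles before training gives a more honest accuracy figure.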

Future work:

In data exploration:

finding the number of posts in each flair for which content != None
- finding the correlation between the per-flair confusion matrix obtained using content only and the number of samples found above
- verifying whether the unequal distribution is one of the reasons behind the low flair accuracy. If so, checking whether increasing the number of samples per flair has any effect on prediction scores. Content might then turn out not to be as useless after all.
contrasting performance with unsupervised algorithms (like K-Means)
finding flair-wise accuracy
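
Two of the items above, counting the posts per flair that have content and computing flair-wise accuracy, can be sketched in a few lines. This is only an illustration under assumptions: posts are taken to be dicts with the hypothetical keys `flair` and `content`, and "flair-wise accuracy" is interpreted as the share of posts with a given true flair that were predicted correctly (per-class recall).

```python
from collections import Counter, defaultdict


def content_counts_by_flair(posts):
    """Count, per flair, the posts whose content is not None."""
    return Counter(p["flair"] for p in posts if p.get("content") is not None)


def flairwise_accuracy(y_true, y_pred):
    """Per-flair accuracy: correct predictions / posts with that true flair."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {flair: correct[flair] / total[flair] for flair in total}
```

Comparing the two outputs flair by flair is a simple first check on whether flairs with few content-bearing posts are also the ones predicted poorly.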

Shreya Gupta

Research Associate

I aim to understand the world behind the three lines of code (import, train, test), challenge conventional approaches, and build more efficient and applicable algorithms.