EDA with NLP - Sherlock Holmes or Not?

I set out to answer a question: can various NLP methods differentiate between Arthur Conan Doyle’s Sherlock Holmes stories and his other work? Arthur Conan Doyle was a prolific author, writing not only other adventure stories similar to Sherlock Holmes, but also poems, songs, historical novels, and other genres. My goal was to test these NLP methods and become more familiar with their limits. Can they pick apart nuances in the words used across these different works, or will the works be indistinguishable?

To create the corpus, I used a third-party database to get a comprehensive list of Arthur Conan Doyle’s work, then pulled those works from the Project Gutenberg library. Next, I structured the corpus in OHCO (Ordered Hierarchy of Content Objects) format, eliminated foreign-language translations and duplicate works, removed stop words, and otherwise cleaned the data as necessary.
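To make the OHCO idea concrete, here is a minimal sketch of flattening raw text into rows keyed by work, paragraph, sentence, and token position, with stop words dropped. The function name `to_ohco`, the tiny stop-word list, the regex-based splitting, and the sample texts are all assumptions for illustration; the actual project may use different levels (for example chapters) and different tooling.

```python
import re

# Tiny illustrative stop-word list (an assumption; the project's real list is not shown).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "was", "is"}

def to_ohco(works):
    """Flatten {work_id: raw_text} into (work, paragraph, sentence, token_pos, token) rows."""
    rows = []
    for work_id, text in works.items():
        # Paragraphs are separated by blank lines.
        paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
        for para_num, para in enumerate(paragraphs):
            # Naive sentence split on whitespace that follows ., ! or ?
            sentences = [s for s in re.split(r"(?<=[.!?])\s+", para.strip()) if s]
            for sent_num, sent in enumerate(sentences):
                tokens = re.findall(r"[a-z']+", sent.lower())
                for token_num, token in enumerate(tokens):
                    if token in STOP_WORDS:
                        continue  # keep original positions, skip stop words
                    rows.append((work_id, para_num, sent_num, token_num, token))
    return rows

# Made-up sample texts, not taken from the corpus.
sample = {
    "holmes_demo": "Holmes lit his pipe. Watson waited.\n\nThe game was afoot!",
    "other_demo": "The ship sailed north.",
}
rows = to_ohco(sample)
for row in rows[:4]:
    print(row)
```

Because every token carries its full position in the hierarchy, the same table can be regrouped at the work, paragraph, or sentence level depending on what a given model needs.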

I then ran the corpus through the following models and assessed their results.

  • Principal Component Analysis (do Sherlock Holmes themes gravitate toward one component?)

  • Word2Vec (can I find groupings of Sherlock Holmes theme words in a Word2Vec word cloud?)

  • tree-based models (utilizing TF-IDF to categorize different works)

  • topic modeling (are there topics specific to the Sherlock Holmes texts?)

  • Syuzhet model (plotting the progression of the plot to catch patterns)

Most of these methods proved ineffective. While the format and high-level topic differed across Arthur Conan Doyle’s works, the voice, tone, and word usage were similar across all genres. I did find that tree-based models were effective. These models are built on TF-IDF, which measures word importance rather than word similarity (as Word2Vec and PCA do, for example), and with them I was able to effectively separate these works. Take a look at the code and more comprehensive results in the GitHub link below.
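As a rough illustration of why TF-IDF helps here, the sketch below computes one common variant (term frequency times log of inverse document frequency) by hand on two made-up token lists. The function name `tf_idf`, the exact weighting formula, and the sample documents are assumptions; the project itself may rely on a library implementation with different smoothing. Words shared by every work get a weight of zero, while words concentrated in one work stand out, which is the kind of signal a tree-based classifier can split on.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: weight}} with tf = count / len(doc), idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights

# Made-up token lists, not taken from the corpus.
docs = {
    "holmes_demo": ["holmes", "watson", "pipe", "street", "holmes"],
    "other_demo": ["ship", "captain", "street", "sea"],
}
weights = tf_idf(docs)

# The term with the highest weight in the first document.
top_term = max(weights["holmes_demo"], key=weights["holmes_demo"].get)
print(top_term, weights["holmes_demo"]["street"])
```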

View the code and in-depth report on GitHub.
