What is the Data Behind Your Sentiment Model?

Phillip Durst

SkimAI, November 3, 2019

Part 1: Sentiment Models

Sentiment models are a type of natural language processing (NLP) algorithm that determines the polarity of a piece of text. That is, a sentiment model predicts whether the opinion given in a piece of text is positive, negative, or neutral. These models provide a powerful tool for gaining insights into large sets of opinion-based data, such as social media posts and product reviews. For example, a seller on the Amazon marketplace could use a sentiment model to quickly assess thousands of reviews and gauge customer satisfaction with their goods and services. Sentiment models can also be used to predict the reviews for a new product by comparing its metadata to that of similar products.

Like all supervised machine learning algorithms, sentiment models require large sets of labeled training data to develop and tune. The first step in model development requires tens of thousands of statements that are already labeled as positive, negative, or neutral. Assembling this training data is difficult because a human expert must determine and label the polarity of each statement. Having a ready-made training dataset that is already labeled greatly reduces the time and effort needed to develop a sentiment model. Two such sentiment datasets frequently used for training are the Internet Movie Database (IMDB) and Amazon review databases.
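To make the role of labeled data concrete, here is a minimal sketch of a bag-of-words naive Bayes classifier trained on a handful of labeled statements. It is an illustration only: the function names (tokenize, train, predict) and the six toy examples are assumptions made for this sketch, and a production sentiment model would use far more data and a more capable algorithm.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set; real models need tens of thousands of rows.
examples = [
    ("great product works perfectly", "positive"),
    ("love it excellent quality", "positive"),
    ("terrible broke after one day", "negative"),
    ("awful quality do not buy", "negative"),
    ("it is okay nothing special", "neutral"),
    ("average product does the job", "neutral"),
]
model = train(examples)
print(predict(model, "excellent product love it"))
```

Every prediction here depends entirely on the words and labels seen during training, which is why the size and quality of the labeled dataset matter so much.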

Part 2: The IMDB and Amazon Review Databases

The IMDB and Amazon review databases are almost ideal for training sentiment models (more on their limitations to follow), as they are ready-made datasets of easily labeled sentiments. The polarity of these reviews can be determined by segmenting reviews by score. For the IMDB database, reviews of 1-3 stars are typically considered negative, 4-6 stars neutral, and 7-10 stars positive. Similarly, for Amazon reviews, 1-2 stars is negative, 3 stars is neutral, and 4-5 stars is positive. However, the Amazon review database is not as popular: its 1-to-5 rating does not have the fidelity of a 1-to-10 system, and the dataset is more complex and therefore more challenging to use.
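The score-based labeling described above can be sketched in a few lines of Python. The function names (imdb_polarity, amazon_polarity) and the sample reviews are hypothetical; only the score cutoffs come from the text.

```python
def imdb_polarity(stars):
    """Map an IMDB score (10-point scale) to a polarity label."""
    if stars <= 3:
        return "negative"
    if stars <= 6:
        return "neutral"
    return "positive"


def amazon_polarity(stars):
    """Map an Amazon score (5-point scale) to a polarity label."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Hypothetical (text, stars) pairs standing in for real Amazon reviews.
reviews = [
    ("Works exactly as described", 5),
    ("Stopped working after a week", 1),
    ("Does the job, nothing more", 3),
]
labeled = [(text, amazon_polarity(stars)) for text, stars in reviews]
print(labeled)
```

Note that the coarser 5-point scale leaves only a single score for the neutral class, which is one way to see the loss of fidelity compared with a 10-point scale.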

The IMDB database has been used in a wealth of academic studies, tutorials, and open-source code. The standard IMDB dataset contains 50,000 reviews, with an even number of positive and negative reviews. In general, the IMDB database is more popular than the Amazon database, as it provides a smaller and easier-to-manipulate dataset. The IMDB dataset is a powerful tool for building the skills needed to develop more advanced sentiment models.

The Amazon review dataset has the advantages of size and complexity. Amazon has compiled reviews for over 20 years and offers a dataset of over 130 million labeled sentiments. The Amazon dataset also offers the additional benefit of containing reviews in multiple languages. The Amazon dataset further provides labeled fake or biased reviews. Due to its size and complexity, the Amazon dataset provides for the development of more sophisticated sentiment models. The Amazon dataset additionally offers more utility, given that predicting product performance via sentiment modeling is a critical component of modern product release.

Part 3: Limitations in Applicability

As much time and effort as these databases save for training sentiment models, they are not without limitations. Because reviews are quantitative in nature, applying models trained on these databases to qualitative opinions, such as tweets, leads to losses in accuracy. Also, for the IMDB database, reviews depend heavily on the viewer’s preferences, which can skew results. Similarly, for the Amazon database, biased or fake reviews are common. A further complication with any sentiment database is the model's innate inability to recognize sarcasm, which can be common among reviews.

Furthermore, the keywords (features) found during the training process are limited when working with reviews. Reviews often tend to be repetitive, containing a limited subset of key terms. Moreover, reviews contain some terms that are uncommon in regular opinion statements, such as "weak soundtrack". Because some key terms are unique to reviews and key term diversity is low, applying sentiment models trained on these databases to other kinds of text can lead to sub-optimal results. For example, if a company wants to use a sentiment model to predict the reaction to a change in policy, a model trained on a review database would struggle with this prediction, given that the reaction will not be a quantitative assessment of a product.
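One rough way to gauge this vocabulary mismatch before applying a review-trained model to a new domain is to measure how many words in the target text never appear in the training reviews. The sketch below is only an illustration of that idea, not a method from any particular library; the function names (vocabulary, oov_rate) and the sample sentences are assumptions.

```python
def vocabulary(texts):
    """Collect the set of lowercase, whitespace-separated tokens in texts."""
    return {tok for text in texts for tok in text.lower().split()}


def oov_rate(train_texts, target_texts):
    """Fraction of target tokens that never occur in the training texts."""
    vocab = vocabulary(train_texts)
    tokens = [tok for text in target_texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    unseen = sum(1 for tok in tokens if tok not in vocab)
    return unseen / len(tokens)


# Hypothetical movie reviews (training domain) and policy reactions (target).
review_texts = [
    "weak soundtrack but great acting",
    "great plot and great acting",
]
policy_texts = [
    "the new remote work policy is unfair",
    "great news about the policy",
]
rate = oov_rate(review_texts, policy_texts)
print(f"{rate:.0%} of target tokens were never seen in the reviews")
```

A high out-of-vocabulary rate suggests that most of the features the model learned from reviews will simply not fire on the new text, so its predictions there should be treated with caution.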

In summary, sentiment models are a powerful tool for modern businesses, and these models require large datasets for training. The IMDB and Amazon review databases are two common, readily accessible sentiment databases that are popular for training sentiment models. While providing a useful tool for sentiment model training, these datasets come with caveats that must be taken into account.

Phillip J. Durst