Anthony Gentile

An introduction to sentiment classification

December 20, 2013, 8:26pm

As the amount of information stored on computers throughout the world grows, many want to put this mass of data to use in different applications. Buzzwords like big data, machine learning, and NLP are becoming more prevalent. I'd like to talk specifically about one use case where computational linguistics/NLP comes into play: determining whether the attitude people express about a subject is positive or negative. The outcome of this sentiment classification can be useful to many, such as companies that wish to know what people are saying about a particular product they make, or election candidates wanting to know how a particular campaign is being received by the public. I'd like to spend some time explaining the inner workings of sentiment classification, highlight some current approaches to the problem, run through an example to see it in action, and play with some experiments to see how it may be improved by analyzing sentence structure.


Determining Sentiment

How does one go about determining whether a sentence, a paragraph, or an entire text is negative or positive? One might start by looking at the text and checking whether it contains words with a positive or negative connotation. This connotation is determined by a particular society within a language group. In English, some example positive and negative words are fantastic, awesome, awful, and terrible. However, the word brilliant may have a different connotation depending on whether the speaker is from the United States or England. In the United States, an English speaker would often use the word brilliant to describe something or someone of great intelligence, whereas in England, brilliant is common slang for something that is outstanding. Once it has been determined what the positive and negative words should be for the text in question, one could then count these words into separate negative and positive tallies. If the negative tally is greater, the sentence could be classified as negative, and likewise for positive; if the tallies are equal, we could consider the sentence to be of neutral polarity.
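The tallying idea can be sketched in a few lines of Python. This is only an illustration: the two word lists are tiny hypothetical lexicons, and the function name `tally_sentiment` is made up for this example.

```python
import re

# Hypothetical toy lexicons; a real system would use a curated or learned list.
POSITIVE_WORDS = {"fantastic", "awesome", "great", "loved", "brilliant"}
NEGATIVE_WORDS = {"awful", "terrible", "horrible", "hate", "upset"}


def tally_sentiment(text):
    """Label text 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negative = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


print(tally_sentiment("What an awesome, fantastic movie"))
print(tally_sentiment("The plot was awful"))
print(tally_sentiment("I saw a movie"))
```

Because the function only counts words, any sentence containing great and no negative word is labeled positive, whatever the speaker actually meant.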

This would be a good start, and sentiment classifier models have been built on exactly this premise, employing various machine learning methods such as Naive Bayes, maximum entropy, and Support Vector Machines (more about these methods soon). The accuracy of these classifiers is fairly good, but they still allow for several false positive cases. Take for example the following movie review: "If they would've focused more on the plot, that would have been a great movie!". Although this example contains the word great, we know it to be of negative sentiment. There are many other phrases one could utter that use positive or negative words in abundance but actually convey the opposite sentiment. Sometimes this can be detected by looking for negation words such as not and by analyzing sentence trees, but overall it can be a difficult task.
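One simple heuristic for negation words, shown here as a sketch and not as the method used later in this post, is to mark every token that follows a negation word, up to the next punctuation mark, so that a classifier sees "beautiful" and "NOT_beautiful" as different features. The negation list and the `NOT_` prefix are assumptions chosen for illustration.

```python
import re

# Hypothetical, deliberately short list of negation words.
NEGATION_WORDS = {"not", "no", "never", "don't", "isn't", "wasn't"}
PUNCTUATION = set(".,!?;:")


def mark_negation(text):
    """Prefix tokens between a negation word and the next punctuation with NOT_."""
    tokens = re.findall(r"[a-z']+|[.,!?;:]", text.lower())
    marked = []
    negating = False
    for token in tokens:
        if token in PUNCTUATION:
            # Punctuation ends the scope of the negation.
            negating = False
            marked.append(token)
        elif negating:
            marked.append("NOT_" + token)
        else:
            marked.append(token)
            if token in NEGATION_WORDS:
                negating = True
    return marked


print(mark_negation("It is not beautiful."))
```

This does nothing for the movie review above, which contains no negation word at all; that is part of why the problem is hard.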

The problem becomes even more complicated when looking beyond the sentence level into paragraphs. Some sentences might convey positive sentiment while others convey negative sentiment. How is the sentiment of the paragraph as a whole determined in this case? Do particular sentences and their classified sentiment outweigh other sentences in the same paragraph? Do speakers tend to convey overall sentiment near the end of a text? Is this language specific? For example, take another movie review: "I don't know why the theater was out of popcorn, but my friend was really upset about it. This put us in a bad mood and then the people in front of us kept talking. Despite all that, the crude humor and being slow at times, I loved it!". We can see by this example that if we are just counting positive and negative words on a sentence level, this text would likely be classified as negative, but in reality the last part of the last sentence, "I loved it!", indicates that the movie itself received positive sentiment.

This also brings into question the target of sentiment in a text. In the previous example, positive sentiment was expressed toward a particular movie, while negative sentiment was expressed about the theater experience. Whether you are the director of the movie in question or the manager of the theater would determine which target of sentiment you care about. Another difficulty with identifying sentiment is determining the role of parts of speech and how they affect the sentiment in a sentence. For example, there are different uses for the word hate in "This movie is about a hate crime" versus "I hate this movie". The first does not convey negative sentiment, but a part-of-speech (POS) tagger and sentiment classifier may have trouble correctly tagging and classifying it. To address this problem, one could use a method called word sense disambiguation, in which POS tags are added to word features during training and testing of a classifier.

In addition to specific word features, we may also look for key phrases in the context which we are talking about, or the domain. If we are talking about the movie reviews domain then phrases such as "two thumbs up" and "left the theater" give further indication as to the sentiment of the text. If, however, we saw these phrases in a different domain, such as pet supplies, they may not provide the same boost in accuracy when classifying the sentiment.

We see that language allows us to be very expressive and flexible in the way we convey ideas. The movie review example helps show how it can be difficult and increasingly complex to implement machine learning techniques to classify sentiment with accuracy comparable to that of a native speaker of the language.

Classification Methods

Currently, this problem is approached by training models using various machine learning methods, which I will briefly describe below with some math that can largely be ignored unless you are particularly curious. We are starting to see a shift where more artificial intelligence methods are being used and combined with existing methods. Of particular note is the use of deep learning techniques, which have shown considerable promise for sentiment classification; cf. Glorot et al. (2011) and Socher et al. (2013).

Naive Bayes

Naive Bayes is a probabilistic classification method that leverages Bayes' theorem, shown below, which computes the probability of A given B:

$$\huge P(A|B) = \frac{P(B | A)\, P(A)}{P(B)} $$

The main distinguishing feature of this classification method is that it treats each feature as independent of the other features given the class. This method has been met with both criticism and support in the machine learning community, because the independence assumption is often inaccurate and yet the method still tends to perform well. It definitely has a place as a contender for the sentiment classification problem. To compute the probability of a class using Naive Bayes, the formula shown below can be used, which expresses the probability of class C given a set of features ($F_1,\dots,F_n$), where Z is a scaling constant.

$$\huge p(C \vert F_1,\dots,F_n) = \frac{1}{Z} p(C) \prod_{i=1}^n p(F_i \vert C)$$
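To make the formula concrete, here is a minimal sketch of a Naive Bayes classifier over binarized features. It works in log space and skips Z, since Z is the same for every class and does not change which class scores highest. The add-one smoothing, the function names, and the toy training data are assumptions for illustration; this is not how MALLET implements it.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """documents: list of (set_of_features, label) pairs."""
    class_sizes = Counter(label for _, label in documents)
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in documents:
        feature_counts[label].update(features)
        vocabulary.update(features)
    return class_sizes, feature_counts, vocabulary


def classify(features, class_sizes, feature_counts, vocabulary):
    """Return the class C maximizing p(C) * prod_i p(F_i | C)."""
    total = sum(class_sizes.values())
    scores = {}
    for label, size in class_sizes.items():
        score = math.log(size / total)  # log p(C)
        for feature in features:
            if feature not in vocabulary:
                continue  # ignore features never seen in training
            # log p(F_i | C) with add-one smoothing (an assumption here)
            score += math.log((feature_counts[label][feature] + 1) / (size + 2))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set.
training = [
    ({"great", "movie"}, "pos"),
    ({"love", "it"}, "pos"),
    ({"horrible", "movie"}, "neg"),
    ({"hate", "it"}, "neg"),
]
model = train_naive_bayes(training)
print(classify({"great", "it"}, *model))
```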

Maximum Entropy

Maximum entropy classification, which is commonly viewed as a form of Occam's razor (when given multiple hypotheses, choose the one with the fewest assumptions), differs from Naive Bayes in that it allows for a large number of features but does not assume the features are independent. Of the distributions consistent with the training data and the features provided, the one with the maximum entropy is selected; this distribution makes the fewest assumptions. The formula below gives the probability of class y given instance x, where $\lambda_j$ are weights for the corresponding features, $f_j$ are binary feature functions indicating the occurrence of a feature in instance x with label y, $\lambda_0(y)$ is a per-class bias weight, and Z is a normalization term.

$$\huge P(y|x) = \frac{e^{\lambda_0(y)+\sum_j\lambda_jf_j(x,y)}}{Z}$$
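The formula can be evaluated directly once the weights are known. The sketch below only computes P(y|x) from given weights; it does not learn them, and the dictionary layout, function name, and example weights are hypothetical.

```python
import math


def maxent_probabilities(active_features, labels, weights, biases):
    """Compute P(y | x) for every label y.

    weights maps (feature, label) -> lambda_j; biases maps label -> lambda_0(y).
    A binary feature function f_j(x, y) is 1 when the feature occurs in x.
    """
    scores = {
        label: biases.get(label, 0.0)
        + sum(weights.get((feature, label), 0.0) for feature in active_features)
        for label in labels
    }
    z = sum(math.exp(score) for score in scores.values())  # normalization term Z
    return {label: math.exp(score) / z for label, score in scores.items()}


# Hypothetical weights for a single feature.
example_weights = {("great", "pos"): 1.0, ("great", "neg"): -1.0}
probabilities = maxent_probabilities({"great"}, ["pos", "neg"], example_weights, {})
print(probabilities)
```

Dividing by Z is what makes the values sum to one across the labels.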

Support Vector Machines

Support vector machines are models for detecting patterns that do not use probabilistic measures. Instead, instances of different classes are separated by a geometric hyperplane chosen to produce the largest achievable margin on either side of it. The larger the margin between the two classes, the better the expected generalization of the classifier on new data. SVMs leverage kernels, which are mathematical functions that allow data that is not linearly separable to be mapped into a higher-dimensional space where it can be separated linearly. SVMs are often implemented with a soft margin, allowing some instances to fall on the wrong side of the separating hyperplane. This is a means of handling outliers in the data so that a workable hyperplane can exist.
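The geometry can be illustrated without any training code. The sketch below assumes a linear SVM whose weight vector and bias have already been learned (the numbers are made up), and shows the decision rule, the margin width 2/||w||, and the hinge loss commonly used to express a soft margin.

```python
import math


def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign says which side of the hyperplane x is on."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def predict(weights, bias, point):
    return 1 if decision_value(weights, bias, point) >= 0 else -1


def margin_width(weights):
    """Distance between the hyperplanes w . x + b = +1 and w . x + b = -1."""
    return 2.0 / math.sqrt(sum(w * w for w in weights))


def hinge_loss(weights, bias, point, label):
    """Soft-margin penalty: zero when the point is correctly classified outside the margin."""
    return max(0.0, 1.0 - label * decision_value(weights, bias, point))


# Hypothetical learned parameters for a two-feature problem.
w, b = (1.0, 1.0), -3.0
print(predict(w, b, (3.0, 2.0)), predict(w, b, (1.0, 1.0)))
print(margin_width(w))
print(hinge_loss(w, b, (2.0, 1.5), 1))
```

A point inside the margin, such as the last one, is still classified correctly but contributes a nonzero penalty, which is how a soft margin tolerates outliers.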

Experimenting with Sentence Structure

To better understand the current approaches to sentiment classification and domain adaptation using models, let's run through an experiment using the Amazon Product Reviews dataset. We will go through actual product reviews in different domains (books, DVDs, electronics, kitchen, etc.) and train models with different machine learning approaches (Naive Bayes, MaxEnt, and SVMs) to see how they compare. We will use unigrams, bigrams, and unigrams + bigrams as features to see how those distributions may affect accuracy. We will then try adding another feature to these n-gram distributions to see if sentence structure tells us anything about sentiment classification. So let's break it down a bit.

The Dataset

The Amazon product reviews dataset contains 38,548 reviews for products spanning across 25 different domains, including books, DVDs, electronics, and video games. By using this data set, we are able to leverage a large amount of labeled text categorized into separate domains. This helps us tremendously as we already have reviews segmented into positive and negative files. We can then slice these up into 80/20 splits for training and testing.
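An 80/20 split can be sketched as follows. The shuffle, the fixed seed, and the truncating cutoff are assumptions for illustration; the exact procedure used in the experiment (for example, whether positive and negative files were split separately) is not described here, so counts may differ by one from the tables further down.

```python
import random


def split_train_test(reviews, train_fraction=0.8, seed=0):
    """Shuffle the reviews reproducibly and cut them into train and test lists."""
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


# Stand-in for 2,000 labeled reviews from a single domain.
train, test = split_train_test(range(2000))
print(len(train), len(test))
```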

Parts of Speech

To determine whether sentence structures can be beneficial to sentiment classification, we must first obtain sentence structures to use as features. One might go the route of working with tree structures that contain parent-child nodes, as produced by tools like the Stanford Parser. For our experiment, however, let's use shallow representations that are computationally cheaper to produce and work with. We will use the Stanford Part-Of-Speech Tagger to obtain POS representations for sentences. Applying the Stanford POS Tagger to the sentence "This movie is great!" would produce the corresponding POS tag pairs: "This/DT movie/NN is/VBZ great/JJ". These tags identify the following parts of speech:

DT (determiner), NN (noun, singular or mass), VBZ (verb, 3rd person singular present), and JJ (adjective). Here is the full list of Penn Treebank tags:

Tag Description
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb

Once the POS tags are obtained for each sentence in the reviews, we can then form a simple sentence representation by concatenating these tags together with underscores. Using the previous "This/DT movie/NN is/VBZ great/JJ" example, the sentence representation would be DT_NN_VBZ_JJ. These representations can be very long, depending on how long the sentence is. For this experiment, the representations are not truncated in any way, except that the symbolic POS tags are dropped: :, ',', ., ``, $, '', #, ), (, --, SYM.
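Building the representation from tagger output might look like the sketch below. It assumes the tagger output has already been captured as a string of whitespace-separated word/TAG pairs, as in the example above; the function name is made up, and the set of dropped tags is copied from the list in the previous paragraph.

```python
# Symbolic tags dropped from the representation, as listed above.
SYMBOLIC_TAGS = {":", ",", ".", "``", "$", "''", "#", ")", "(", "--", "SYM"}


def pos_representation(tagged_sentence):
    """Turn 'This/DT movie/NN is/VBZ great/JJ' into 'DT_NN_VBZ_JJ'."""
    tags = []
    for pair in tagged_sentence.split():
        # Split on the last slash so that words containing '/' still work.
        _word, _, tag = pair.rpartition("/")
        if tag not in SYMBOLIC_TAGS:
            tags.append(tag)
    return "_".join(tags)


print(pos_representation("This/DT movie/NN is/VBZ great/JJ !/."))
```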

Classification Details

MALLET (MAchine Learning for LanguagE Toolkit) was used for classification with Naive Bayes and Maximum Entropy. MaxEnt classification was performed using a Gaussian prior variance of 1.0.

LIBSVM was used for classification with support vector machines, run with a linear kernel.

Word features were binarized as opposed to using occurrence counts. The resulting vector files were likewise binarized, with positive polarity indicated by 1 and negative polarity by 0. Unigrams, bigrams, or unigrams+bigrams were added as features, with no frequency cutoff applied. Additional classifications were then performed with the POS sentence representation features added. Classification in cross-domain scenarios is achieved by training the classifiers, in their different variations, against a particular source domain and then applying those trained classifiers to the test set of a different target domain.
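As a rough sketch of what binarized features look like, the code below collects unigrams and bigrams into a set (so repeated words count once) and writes one line in LIBSVM's sparse `label index:value` text format. The helper names, the `POS=` prefix, and the way feature ids are assigned are assumptions for illustration and not the scripts used for the experiment.

```python
def ngram_features(tokens, use_unigrams=True, use_bigrams=True):
    """Collect binarized n-gram features: presence only, no counts."""
    features = set()
    if use_unigrams:
        features.update(tokens)
    if use_bigrams:
        features.update(f"{a} {b}" for a, b in zip(tokens, tokens[1:]))
    return features


def to_libsvm_line(label, features, feature_index):
    """Format one instance as 'label id:1 id:1 ...' with ascending ids.

    feature_index maps feature -> 1-based id and grows as new features appear.
    """
    ids = []
    for feature in sorted(features):
        ids.append(feature_index.setdefault(feature, len(feature_index) + 1))
    return " ".join([str(label)] + [f"{i}:1" for i in sorted(ids)])


index = {}
features = ngram_features(["this", "movie", "is", "great"])
features.add("POS=DT_NN_VBZ_JJ")  # hypothetical prefix for the POS representation
line = to_libsvm_line(1, features, index)
print(line)
```

The same `index` dictionary would need to be reused for the test set so that feature ids line up between training and testing.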

Sentence Structure Counts

If we obtain POS sentence representations from each sentence in every review, we can then easily count them up and see which ones tend to come up the most. Here are the top 20 in positive reviews and top 20 in negative reviews:

POS Rep. # Positive # Negative Positive Review Example Negative Review Example
JJ NN 414 169 Great product. Big mistake.
DT NN VBZ JJ 258 99 This movie is great. This game is horrible.
PRP VBP DT NN 224 38 I love this product! I hate this product.
RB VBN 176 50 Highly recommended! Very disappointed.
DT VBZ DT JJ NN 165 42 This is an excellent magazine. This is a terrible product.
RB JJ 157 198 Very satisfied. Very disappointing.
PRP VBP PRP 156 11 I love it! I hate it.
PRP VBZ JJ 123 45 It is amazing. It's pathetic.
DT NN VBZ RB JJ 118 83 The quality is very good. This game is really bad.
DT JJ NN 104 36 An irresistible combination. A true nightmare.
VB PRP 102 27 Thank you. Forget it!
PRP VBZ RB JJ 91 56 It smells so good. It seems somewhat deceptive.
PRP VBZ DT JJ NN 86 21 It is a definite winner. It's no big deal.
PRP RB VB DT NN 83 1 I highly recommend this product. They totally misrepresent this product.
PRP VBP RB JJ IN DT NN 80 14 I am very pleased with this product. I am very disappointed in this camera.
NNP NNP 80 75 Fiestaware ROCKS! BIG Dissapointment.
NNP NNP NNP 78 21 ~The Rebecca Review Amazon Reviewer AKNapa
RB JJ NN 73 40 Very light weight. Very poor service.
PRP RB VB PRP 70 0 I highly recommend it.
DT NNS VBP JJ 67 22 These burgers are unbelievable. These knives are cheap.

In PRP_VBP_DT_NN, PRP_VBP_PRP, and JJ_NN, the example sentences "I love this product", "I love it", and "Excellent product" suggest that swapping in a VBP or JJ of negative polarity would produce equally likely sentences such as "I hate this product", "I hate it", and "Horrible product". While this does occur, it is interesting that the positive occurrences of these structures substantially outnumber the negative ones (224 vs. 38, 156 vs. 11, and 414 vs. 169, respectively).

POS Rep. # Negative # Positive Negative Review Example Positive Review Example
RB JJ 198 157 Very disappointing. Very comfortable.
JJ NN 169 414 Big mistake! Excellent product.
DT NN VBZ JJ 99 258 This product is worthless. This product is great!
WP DT NN 92 27 What a disappointment. What a timesaver.
PRP VBD RB JJ 86 30 I was very disappointed. I was very impressed!
DT NN VBZ RB JJ 83 118 The viewfinder is very dim. The case is very stylish.
VBP RB VB PRP NN 82 0 Don't waste your money
NNP NNP 75 80 BIG Dissapointment. Fiestaware ROCKS!
DT NN 74 56 No response. No brainer.
VB PRP NN 59 13 Save your money. Do your research.
VBP RB VB DT NN 59 3 Do not buy this product. Do not miss this book!
PRP VBZ RB JJ 56 91 It is not beautiful. It is very sturdy.
NN NN 56 53 Caveat emptor -Michael Cran
NNP NN 52 48 Buyer beware Grade: A-
PRP MD RB VB DT NN 51 44 I would not recommend this product. I would highly recommend this product.
RB VBN 50 176 Very disappointed. Highly recommended.
PRP VBD RB JJ IN DT NN 47 8 I was very unhappy with this product. I was very pleased with this purchase.
PRP VBD JJ 47 38 It was awful! I was impressed.
PRP VBP RB JJ 46 66 I am so disappointed. I am very pleased.
PRP VBZ JJ 45 123 It is defective. It is great!

In VBP_RB_VB_PRP_NN and VBP_RB_VB_DT_NN, the corresponding example sentences "Don't waste your money" and "Do not buy this product" are imperative sentences (they give a request or command). The numbers suggest that imperative sentences are much more likely to appear in a negative context than in a positive one, where an example would be VB_DT_NN, "Buy this product!".
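Counts like those in the two tables above could be tallied along these lines. The sketch assumes each sentence has already been reduced to its POS representation string, and the function name and sample data are made up.

```python
from collections import Counter


def top_representations(positive_reps, negative_reps, top_n=20):
    """Rank representations by positive-review count, with the negative count alongside."""
    positive_counts = Counter(positive_reps)
    negative_counts = Counter(negative_reps)
    return [
        (rep, count, negative_counts[rep])
        for rep, count in positive_counts.most_common(top_n)
    ]


# Tiny made-up sample of sentence representations.
positive = ["JJ_NN", "JJ_NN", "JJ_NN", "RB_JJ"]
negative = ["RB_JJ", "RB_JJ", "JJ_NN"]
for row in top_representations(positive, negative, top_n=2):
    print(row)
```

Swapping the two arguments gives the ranking by negative-review count instead.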

POS Tagging Discrepancies

There are cases where the Stanford POS tagger tags particular words inaccurately, and others where the tags reveal interesting scenarios. For example, with the sentence structure NNP NNP NNP we likely have a name or signature of the reviewer. It is interesting to see that for this structure it is more likely that the reviews will be positive. Does this indicate that those who leave their name are more likely to give positive reviews? Could it suggest people are being paid or otherwise influenced to produce more positive reviews than negative? With NNP NNP we see that the POS tagger mistakenly tags ROCKS and BIG as proper nouns. We must not forget that we will also find a positive example for a representation in a negative review and vice versa. This is seen with the structure RB VBN for the sentence "Very disappointed" occurring in a positive review. This sentence, while having a negative polarity, is outweighed by the positive polarity of the entire text.

Classification Results

For Naive Bayes classification, the unigrams distribution achieves very few top accuracies. The best increase from adding POS representations is 1.75 percentage points, in the Music domain, when POS representations are added to unigrams. Most improvements and top accuracies are seen in the bigrams and unigrams+bigrams distributions, with results varying once POS representations are added.

Domain Train Test Unigrams Bigrams Uni+Bi Uni+POS Bi+POS Uni+Bi+POS
Apparel 1600 400 80.25% 81.75% 83.25% 79.75% 81.00% 83.00%
Automotive 589 147 80.27% 81.63% 79.59% 79.59% 81.63% 79.59%
Baby 1520 380 76.58% 82.37% 81.05% 77.37% 83.42% 81.32%
Beauty 1194 299 74.92% 78.60% 73.91% 72.24% 78.93% 73.24%
Books 1600 400 79.75% 79.75% 82.75% 80.00% 80.25% 82.75%
Camera & Photo 1599 400 84.75% 86.50% 86.00% 85.00% 86.00% 86.00%
Cell Phones & Service 818 205 80.49% 80.00% 77.07% 76.59% 80.00% 76.10%
Computer & Video Games 1166 292 94.12% 96.92% 96.92% 95.55% 96.92% 96.92%
DVD 1600 400 74.25% 78.75% 77.75% 74.75% 78.25% 78.50%
Electronics 1600 400 80.75% 79.75% 82.25% 81.25% 79.75% 82.25%
Gourmet Food 966 242 83.06% 85.95% 85.12% 83.88% 86.36% 85.12%
Grocery 1082 270 79.63% 80.00% 78.52% 77.78% 78.52% 78.89%
Health & Personal Care 1600 400 84.00% 83.50% 85.25% 84.25% 83.00% 85.25%
Jewelry & Watches 1034 258 78.68% 80.62% 78.29% 77.13% 80.23% 77.91%
Kitchen & Housewares 1600 400 83.75% 84.25% 85.25% 84.00% 84.50% 85.00%
Magazines 1576 394 81.22% 79.95% 81.22% 81.47% 80.46% 81.47%
Music 1600 400 79.00% 78.25% 80.00% 80.75% 78.25% 79.75%
Musical Instruments 265 67 85.07% 85.07% 85.07% 85.07% 85.07% 85.07%
Office Products 345 86 82.56% 84.88% 84.88% 83.72% 84.88% 84.88%
Outdoor Living 1062 265 76.98% 76.98% 76.60% 76.60% 77.36% 76.60%
Software 1532 383 83.03% 82.25% 83.81% 82.25% 82.25% 84.33%
Sports & Outdoors 1600 400 80.50% 80.25% 83.25% 82.00% 81.25% 82.50%
Tools & Hardware 89 23 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%
Toys & Games 1600 400 80.75% 84.75% 83.00% 80.75% 85.00% 82.50%
Video 1600 400 81.75% 79.75% 81.25% 81.00% 79.50% 81.25%
Overall Accuracy 80.96% 82.00% 82.27% 80.92% 82.02% 82.20%

With MaxEnt we see the opposite of Naive Bayes: there is a strong showing in unigrams and unigrams+bigrams, but no top accuracies in bigrams (except in the outlying Tools & Hardware domain). The best increase is 1.68 percentage points in the Beauty domain when POS representations are added to unigrams, which yields the top accuracy for the domain with MaxEnt.

Domain Train Test Unigrams Bigrams Uni+Bi Uni+POS Bi+POS Uni+Bi+POS
Apparel 1600 400 79.50% 80.25% 82.00% 79.50% 80.00% 82.50%
Automotive 589 147 83.67% 80.95% 86.39% 84.35% 80.95% 86.39%
Baby 1520 380 76.05% 77.63% 78.42% 76.32% 78.95% 78.16%
Beauty 1194 299 79.26% 78.60% 80.60% 80.94% 78.26% 80.94%
Books 1600 400 76.50% 73.00% 76.00% 76.00% 73.00% 75.75%
Camera & Photo 1599 400 83.50% 81.25% 84.00% 83.75% 81.25% 83.75%
Cell Phones & Service 818 205 79.02% 75.12% 83.41% 78.54% 75.12% 82.93%
Computer & Video Games 1166 292 97.26% 97.26% 97.60% 97.26% 97.26% 97.60%
DVD 1600 400 79.25% 74.50% 78.00% 79.00% 75.00% 78.25%
Electronics 1600 400 79.75% 74.75% 78.75% 79.50% 75.00% 78.75%
Gourmet Food 966 242 87.19% 85.12% 86.78% 87.19% 85.12% 86.78%
Grocery 1082 270 85.93% 79.63% 84.44% 85.56% 80.00% 84.07%
Health & Personal Care 1600 400 82.00% 80.50% 84.50% 81.50% 81.25% 84.50%
Jewelry & Watches 1034 258 86.43% 84.50% 87.60% 85.66% 84.11% 87.21%
Kitchen & Housewares 1600 400 83.25% 81.00% 85.50% 82.75% 81.00% 84.75%
Magazines 1576 394 80.20% 81.22% 82.49% 80.20% 80.71% 82.23%
Music 1600 400 80.25% 74.75% 79.75% 80.00% 74.00% 79.25%
Musical Instruments 265 67 86.57% 85.07% 85.07% 86.57% 85.07% 85.07%
Office Products 345 86 86.05% 84.88% 84.88% 86.05% 84.88% 84.88%
Outdoor Living 1062 265 81.89% 78.49% 81.51% 81.89% 77.74% 81.51%
Software 1532 383 82.51% 78.33% 83.03% 81.98% 77.55% 82.77%
Sports & Outdoors 1600 400 78.50% 79.00% 81.75% 78.00% 80.00% 81.50%
Tools & Hardware 89 23 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%
Toys & Games 1600 400 79.25% 80.25% 82.50% 80.25% 80.25% 82.50%
Video 1600 400 81.50% 80.50% 83.00% 81.25% 80.25% 83.25%
Overall Accuracy 81.67% 79.70% 82.75% 81.60% 79.73% 82.63%

For SVMs, top accuracies for domains mostly occur in the unigrams or unigrams+bigrams distributions; bigrams alone achieve only a couple of top accuracies. The best increase is 2.00 percentage points in the Sports & Outdoors domain when POS representations are added to bigrams, which yields the top accuracy for the domain with SVM.

Domain Train Test Unigrams Bigrams Uni+Bi Uni+POS Bi+POS Uni+Bi+POS
Apparel 1600 400 79.75% 78.25% 78.50% 79.75% 78.50% 78.50%
Automotive 589 147 80.95% 80.27% 80.95% 80.95% 80.95% 81.63%
Baby 1520 380 76.84% 76.84% 74.74% 77.37% 76.58% 73.95%
Beauty 1194 299 78.26% 80.60% 78.60% 78.26% 79.60% 78.93%
Books 1600 400 74.25% 72.50% 76.00% 75.25% 73.00% 76.25%
Camera & Photo 1599 400 80.75% 80.75% 81.75% 82.00% 80.50% 81.75%
Cell Phones & Service 818 205 76.10% 74.15% 76.10% 77.56% 73.17% 76.10%
Computer & Video Games 1166 292 95.55% 96.58% 96.92% 95.55% 96.92% 97.26%
DVD 1600 400 78.50% 72.25% 76.75% 78.50% 73.50% 76.75%
Electronics 1600 400 77.00% 74.50% 77.75% 76.50% 74.25% 78.00%
Gourmet Food 966 242 86.78% 84.30% 85.54% 87.19% 84.71% 85.95%
Grocery 1082 270 85.19% 82.22% 84.81% 86.30% 81.11% 84.81%
Health & Personal Care 1600 400 79.85% 80.25% 81.75% 80.00% 80.50% 82.25%
Jewelry & Watches 1034 258 84.88% 84.50% 87.21% 84.88% 84.50% 86.43%
Kitchen & Housewares 1600 400 82.25% 81.00% 82.50% 82.25% 81.25% 82.50%
Magazines 1576 394 81.22% 80.20% 79.70% 80.96% 80.96% 79.70%
Music 1600 400 77.50% 73.00% 77.50% 77.00% 73.50% 77.25%
Musical Instruments 265 67 89.55% 85.07% 85.07% 88.06% 85.07% 85.07%
Office Products 345 86 86.05% 86.05% 86.05% 86.05% 86.05% 86.05%
Outdoor Living 1062 265 81.13% 79.62% 81.89% 81.51% 78.87% 81.89%
Software 1532 383 81.46% 74.67% 79.63% 81.46% 74.93% 79.37%
Sports & Outdoors 1600 400 77.75% 78.75% 77.50% 77.25% 80.75% 77.25%
Tools & Hardware 89 23 100.00% 100.00% 100.00% 100.00% 100.00% 100.00%
Toys & Games 1600 400 78.00% 82.50% 80.25% 79.25% 82.25% 80.50%
Video 1600 400 79.75% 81.50% 82.00% 79.75% 81.75% 81.75%
Overall Accuracy 80.46% 79.33% 80.62% 80.67% 79.51% 80.62%

The following table serves to provide a reference of the best and worst performing classifier and n-gram distribution variations. It shows that there is no clearly dominant classifier and that it seems to be very domain specific.

Domain Highest Classifier Variant Lowest Classifier Variant
Apparel 83.25% NB (Uni+Bi) 78.25% SVM (Bigrams)
Automotive 86.39% MaxEnt (Uni+Bi, Uni+Bi+POS) 79.59% NB (Uni+Bi, Uni+POS, Uni+Bi+POS)
Baby 83.42% NB (Bi+POS) 73.95% SVM (Uni+Bi+POS)
Beauty 80.94% MaxEnt (Uni+POS, Uni+Bi+POS) 72.24% NB (Uni+POS)
Books 82.75% NB (Uni+Bi, Uni+Bi+POS) 72.50% SVM (Bigrams)
Camera & Photo 86.50% NB (Bigrams) 80.50% SVM (Bi+POS)
Cell Phones & Service 83.41% MaxEnt (Uni+Bi) 73.17% SVM (Bi+POS)
Computer & Video Games 97.60% MaxEnt (Uni+Bi, Uni+Bi+POS) 94.12% NB (Uni)
DVD 79.25% MaxEnt (Uni) 72.25% SVM (Bigrams)
Electronics 82.25% NB (Uni+Bi, Uni+Bi+POS) 74.25% SVM (Bi+POS)
Gourmet Food 87.19% SVM (Uni+POS), MaxEnt (Uni, Uni+POS) 83.06% NB (Uni)
Grocery 86.30% SVM (Uni+POS) 77.78% NB (Uni+POS)
Health & Personal Care 85.25% NB (Uni+Bi, Uni+Bi+POS) 79.85% SVM (Uni)
Jewelry & Watches 87.60% MaxEnt (Uni+Bi) 77.13% NB (Uni+POS)
Kitchen & Housewares 85.50% MaxEnt (Uni+Bi) 81.00% MaxEnt (Bigrams, Bi+POS), SVM (Bigrams)
Magazines 82.49% MaxEnt (Uni+Bi) 79.70% SVM (Uni+Bi, Uni+Bi+POS)
Music 80.75% NB (Uni+POS) 73.00% SVM (Bigrams)
Musical Instruments 89.55% SVM (Uni) 85.07% NB, MaxEnt, SVM (Numerous Variants)
Office Products 86.05% SVM, MaxEnt (Numerous Variants) 82.56% NB (Uni)
Outdoor Living 81.89% SVM (Uni+Bi, Uni+Bi+POS), MaxEnt (Uni, Uni+POS) 76.60% NB (Uni+POS, Uni+Bi, Uni+Bi+POS)
Software 84.33% NB (Uni+Bi+POS) 74.67% SVM (Bigrams)
Sports & Outdoors 83.25% NB (Uni+Bi) 77.25% SVM (Uni+POS, Uni+Bi+POS)
Tools & Hardware 100.00% All Variants 100.00% All Variants
Toys & Games 85.00% NB (Bi+POS) 78.00% SVM (Uni)
Video 83.25% MaxEnt (Uni+Bi+POS) 79.50% NB (Bi+POS)
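A row of this summary can be derived mechanically from the three result tables. The sketch below does so for the Apparel domain, with the accuracies copied from the Naive Bayes, MaxEnt, and SVM tables above; the function name and data layout are made up for illustration.

```python
# Apparel accuracies (percent) copied from the three result tables above.
VARIANTS = ["Unigrams", "Bigrams", "Uni+Bi", "Uni+POS", "Bi+POS", "Uni+Bi+POS"]
APPAREL = {
    "NB": [80.25, 81.75, 83.25, 79.75, 81.00, 83.00],
    "MaxEnt": [79.50, 80.25, 82.00, 79.50, 80.00, 82.50],
    "SVM": [79.75, 78.25, 78.50, 79.75, 78.50, 78.50],
}


def best_and_worst(accuracies, variants):
    """Return (highest, lowest), each as (accuracy, [(classifier, variant), ...])."""
    scored = [
        (accuracy, classifier, variant)
        for classifier, row in accuracies.items()
        for accuracy, variant in zip(row, variants)
    ]
    highest = max(score for score, _, _ in scored)
    lowest = min(score for score, _, _ in scored)

    def holders(target):
        # Every classifier/variant pair that ties at the target accuracy.
        return [(c, v) for score, c, v in scored if score == target]

    return (highest, holders(highest)), (lowest, holders(lowest))


best, worst = best_and_worst(APPAREL, VARIANTS)
print(best)
print(worst)
```

Returning every tied variant matters here, since several domains have more than one classifier sharing the highest or lowest accuracy.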

Outcome of POS representations

The results of the experiment reveal some interesting numbers. Across all classifier methods, we see that the addition of POS representations not only accounts for many top-ranking accuracies for the domains, but will often increase the accuracy of a distribution even when it is not the top accuracy for that domain. While the percentage point increases and decreases may seem subtle, in the field of computational linguistics single percentage points can be quite substantial. Given very large datasets, a 2.00 percentage point increase in accuracy could mean a large number of additional documents being correctly classified.
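As a concrete reading of the tables above, the 2.00-point gain for SVM bigrams in Sports & Outdoors (78.75% to 80.75% on a test set of 400 reviews) corresponds to eight more reviews classified correctly:

$$0.8075 \times 400 - 0.7875 \times 400 = 323 - 315 = 8$$

The same gain on a hypothetical collection of one million documents, a size chosen here only to illustrate scale, would be:

$$0.02 \times 1{,}000{,}000 = 20{,}000$$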

It is my belief that the domain and data will be strong indicators of which classification method is best. It would seem, however, that the easiest way to determine which classifier method is appropriate for your domain and data is to run an experiment like the one we have done here.

Outlying Domains

The Tools & Hardware domain shows an accuracy of 100% in every variant of the classifiers. This is due to the small sample size of the training and test data (89 and 23 reviews) and to the training data sufficiently covering the n-grams that appear in the test data. It does not contribute to the findings, but is included to show the entirety of the Amazon product reviews dataset. The Computer & Video Games domain also stands out in that its accuracies are substantially higher than those of other domains, with the lowest being 94.12%. After manually reviewing many of the training and test reviews, I encountered several similar to the following: "Its patheticic. You cant do anything, graphics are horrble", "This is without a doubt the worste game I have ever purchased.", and "This game is a joke! It is immpossible for a child to get past level 3.". These examples show misspellings and unambiguous sentiment. While the misspellings may contribute to some classification errors, overall I believe that the misspellings, coupled with the amount of profanity seen in the reviews, may indicate that the reviewers in this domain skew younger. It is my theory that this contributes to the number of sentences that have numerous occurrences of one-sided polar adjectives and are less likely to be ambiguous regarding sentiment in the way the classifiers are looking to detect it. More review is needed to confirm this theory.


I believe that our experiment helps show the importance of sentence structure as a component in a sentiment classifier. It showed how shallow sentence representations created from POS tags improve accuracies in several domains across differing classifier methods and n-gram distributions. It is interesting to see that there are indeed cases where particular sentiment is conveyed in a specific grammatical way. It is also important to mention how much the domain comes into play in regard to the best n-gram set for a given classifier. With some domains, unigrams yielded the best accuracies, while in others it was the combination of unigrams+bigrams. The experiment suggests that word features do indeed tell us a lot in regard to sentiment analysis, but I believe that sentence structure is a significant piece of the puzzle for solving the remaining edge cases and the nuances we see in language. The better we solve these edge cases, the better the applications we can produce to mimic human understanding of language.