Anthony Gentile

ML Algorithms: Mutual Information

February 23, 2014, 12:09am

Introduction

Mutual Information (MI) and Chi Squared are statistical measures that help determine how informative a feature or data distribution really is. It is easy to start writing code and get excited when you see a significant accuracy increase after adding a feature. Properly using precision, recall, and measures such as MI and Chi Squared will help you determine the true value of the change and how the data may be misleading you.

The Algorithm

Mutual Information is a measure of the mutual dependence between two random variables. Mathematically, it is stated as follows:

$$\huge I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log{ \left(\frac{p(x,y)}{p(x)\,p(y)} \right) }$$

Of course, for those of us who are easily confused by mathematical syntax (including myself), this may not make a whole lot of sense. So let us work through an example and offer up some code that will show us what the benefit of this algorithm is.
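In the document-classification setting used below, X and Y become binary indicators: the term occurs in a document or not, and the document belongs to the class or not. Writing N11, N10, N01, N00 for the four cell counts of the resulting 2x2 table (term seen/not seen crossed with in class/outside class), N for their total, estimating each probability by its count ratio, and using log base 2 so the score is in bits, the double sum expands to four terms:

$$I(T;C) = \frac{N_{11}}{N}\log_2{\frac{N N_{11}}{(N_{11}+N_{10})(N_{11}+N_{01})}} + \frac{N_{01}}{N}\log_2{\frac{N N_{01}}{(N_{01}+N_{00})(N_{11}+N_{01})}} + \frac{N_{10}}{N}\log_2{\frac{N N_{10}}{(N_{11}+N_{10})(N_{10}+N_{00})}} + \frac{N_{00}}{N}\log_2{\frac{N N_{00}}{(N_{01}+N_{00})(N_{10}+N_{00})}}$$

This expanded form is exactly the four-line sum computed in the implementation below.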

Example

Using a pre-existing example for MI, let's discuss how it can be a useful measurement.

Suppose we were to categorize newspaper articles into events, obituaries, and classifieds. We then ask ourselves how much the term 'orange' contributes to those categories, or classes. MI provides us a measure of this information for a term given a class. The MI score is at its maximum when 'orange' appears only in obituaries, the class we are scoring. It is 0 if, however, the distribution of the term in the class is the same as the distribution of the term in the collection of classes as a whole (if the term 'orange' showed up four times in every category).

Consider the following matrix:

                         In class Obituaries    Outside of class Obituaries
Term 'orange' seen                        49                         27,652
Term 'orange' not seen                   141                        774,106

Using the Mutual Information formula, we can determine the MI score for 'orange' in Obituaries:

MI Score: 0.0001105355861

What this tells us is that there isn't a strong relationship at all between the term 'orange' and the class Obituaries.

Now consider the term 'heart' in the class Obituaries:

                         In class Obituaries    Outside of class Obituaries
Term 'heart' seen                        180                         12,578
Term 'heart' not seen                     10                        789,180

MI Score: 0.0012729579439

That is quite an increase, and it tells us that 'heart' is more dependent on the Obituaries class than 'orange' is.

What you may immediately notice from our data tables is that we don't have a large number of obituary articles to begin with: 190 obituaries versus 801,758 articles concerning classifieds and events. Perhaps people rarely die where this newspaper is circulated, and the people there use their long life spans to constantly put things up for sale in the classifieds and promote events. Whatever the case, let's adjust these counts and see what happens to our MI score.

                         In class Obituaries    Outside of class Obituaries
Term 'heart' seen                      1,800                          1,258
Term 'heart' not seen                    100                         78,918

MI Score: 0.1088809859617

Now we can start to see the impact the distribution of our data has when determining the dependence of a term on a class. In this example, the term 'heart' has a stronger dependence on Obituaries than before because it is now much more likely that we see the term in Obituaries than in the other classes and, also of importance, much less likely that we see it outside of Obituaries.
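As a sanity check on those two scores, here is a short, self-contained snippet (separate from the full implementation in the next section) that applies the same four-term sum to both versions of the 'heart' table:

```python
from math import log

def mi_2x2(n11, n10, n01, n00):
    # Sum the four cell contributions of a 2x2 term/class
    # contingency table, using log base 2 (bits).
    n = float(n11 + n10 + n01 + n00)
    total = 0.0
    # Each tuple: (cell count, row marginal, column marginal)
    for nij, row, col in [(n11, n11 + n10, n11 + n01),
                          (n10, n11 + n10, n10 + n00),
                          (n01, n01 + n00, n11 + n01),
                          (n00, n01 + n00, n10 + n00)]:
        total += (nij / n) * log(n * nij / (row * col), 2)
    return total

# Original 'heart' counts, then the adjusted counts
print("%1.13f" % mi_2x2(180, 12578, 10, 789180))  # approximately 0.0012729579439
print("%1.13f" % mi_2x2(1800, 1258, 100, 78918))  # approximately 0.1088809859617
```

Note that this version assumes every cell count is non-zero; a zero count would make one of the log arguments zero.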

Now let's look at an example implementation in Python.

Implementation

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from math import log

# Calculate MI for 2x2 matrix
def calc_mi(matrix):
    # Convert our matrix cells to floats so the
    # divisions below are carried out in floating point
    n11 = float(matrix[0])
    n10 = float(matrix[1])
    n01 = float(matrix[2])
    n00 = float(matrix[3])

    # Store the total count of all cells
    n = n11 + n10 + n01 + n00

    # Calculate the MI of the term and the class by summing
    # the contribution of each matrix cell (using log base 2).
    line1 = (n11 / n) * log((n * n11) / ((n11 + n10) * (n11 + n01)), 2)
    line2 = (n01 / n) * log((n * n01) / ((n01 + n00) * (n11 + n01)), 2)
    line3 = (n10 / n) * log((n * n10) / ((n11 + n10) * (n10 + n00)), 2)
    line4 = (n00 / n) * log((n * n00) / ((n01 + n00) * (n10 + n00)), 2)

    return line1 + line2 + line3 + line4

if __name__ == '__main__':
    # For term 'orange' in class Obituaries, we have the following
    # 2x2 matrix (the same counts as the first table above):

    # 'orange' seen, in Obituaries (n11)          = 49
    # 'orange' seen, outside Obituaries (n10)     = 27652
    # 'orange' not seen, in Obituaries (n01)      = 141
    # 'orange' not seen, outside Obituaries (n00) = 774106
    matrix = [49, 27652, 141, 774106]

    # Calculate MI given our matrix
    print "%1.13f" % calc_mi(matrix)
    #0.0001105355861

Use Cases

MI is important because it backs your findings with a statistical measurement. One example would be determining how particular features affect a classifier: measure the individual MI score of each feature to better understand the impact each one has on a given data set.
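As a sketch of that use case, suppose we have per-feature 2x2 counts (feature present/absent crossed with in class/not in class) and want to rank the features. The counts for 'sale' below are made up for illustration; the other two rows reuse the counts from the examples above. The tiny epsilon is an assumption on my part to guard against zero cells, which the article's implementation does not handle:

```python
from math import log

def mi_2x2(n11, n10, n01, n00):
    # MI over a 2x2 feature/class contingency table (log base 2).
    # A tiny epsilon keeps the log argument positive when a cell is zero;
    # the zero cell's weight (nij / n) is 0, so its contribution vanishes.
    n = float(n11 + n10 + n01 + n00)
    eps = 1e-12
    total = 0.0
    for nij, row, col in [(n11, n11 + n10, n11 + n01),
                          (n10, n11 + n10, n10 + n00),
                          (n01, n01 + n00, n11 + n01),
                          (n00, n01 + n00, n10 + n00)]:
        total += (nij / n) * log((n * nij + eps) / (row * col + eps), 2)
    return total

# Hypothetical per-feature counts for the class Obituaries:
# (seen in class, seen outside, not seen in class, not seen outside)
features = {
    'heart':  (180, 12578, 10, 789180),
    'orange': (49, 27652, 141, 774106),
    'sale':   (2, 51003, 188, 750755),
}

# Rank features by MI score, strongest dependence first
ranked = sorted(features.items(), key=lambda kv: mi_2x2(*kv[1]), reverse=True)
for name, counts in ranked:
    print("%-8s %1.13f" % (name, mi_2x2(*counts)))
# 'heart' should rank first here
```

Low-scoring features are candidates for removal, since they contribute little information about the class beyond what the overall distribution already provides.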