Rollins School of Public Health at Emory University
Instructor: Dr. Anthony G. Francis, Jr.
Lecture 12: Data Mining and Machine Learning
Modern computer systems enable the collection of vast amounts of data,
far more than there are human experts to analyze and interpret.
Data mining is the semiautomated extraction of patterns of knowledge
from large amounts of raw data using techniques from machine learning,
pattern recognition, statistics, linguistics, databases and scientific
modeling. Data mining is used in science, business and computing to
describe, classify, cluster, estimate, predict and associate data,
but it is not a cure-all: there are no purely automated tools which
can just "mine" messy data sets to intuitively produce answers to
our deepest questions. Effective data mining is often performed as
an intensive, iterative process of understanding, preparing, analyzing
and modeling data. Artificial intelligence techniques used in the
modeling phase of data mining include neural networks, decision trees,
self-organizing maps, and nearest neighbor algorithms.
Outline
- What is Data Mining?
- What Data Mining Can Do
- What Data Mining Can't Do
- Steps in Successful Data Mining
- Machine Learning
- Case Study: Learning Decision Trees
- Other Machine Learning Algorithms
- Success Stories in Data Mining
Readings:
- Artificial Intelligence: Chapters 19 and 20
- Machines Who Think: Afterword and Timeline
What is Data Mining?
What is the problem?
- Computers enable the collection of massive amounts of data
- Computer Security - Thousands of records per machine per day
- Medical Records - Hundreds of thousands of cases of even rare diseases
- Law Enforcement - Hundreds of thousands of crimes per jurisdiction per year
- Bioinformatics - Gigabytes of gene sequence data
- Space Research - Terabytes of satellite imagery
- Business Data - Terabyte-sized data warehouses
- Not enough trained human experts exist to analyze or interpret the data
- Some problems are so large that human expertise can't be applied
- Hundreds of columns (features)
- Hundreds of thousands of rows (records)
- Gigabytes of data (possibly distributed)
- One potential solution is data mining
Data Mining: Extracting Knowledge from Data
Data mining is the process of discovering meaningful correlations,
patterns and trends from large repositories of raw data. Data
mining exploits domain knowledge, databases, statistics, artificial
intelligence, and the scientific method.
- Domain Knowledge: Define questions (business, science, medical)
- Databases: Collect, maintain and prepare vast amounts of data
- Statistics: Analyze data to find candidate subsets/techniques
- Artificial Intelligence: Extract knowledge from the data
- Scientific Method: Analysis of results, feedback to earlier stages
- Showmanship: Publish, share or exploit learned knowledge
Real-World Examples of Data Mining
Each of these examples is from a deployed system.
- Learning that electrolyte content in sweat may predict cystic fibrosis prognosis
- Identifying a series of crimes as being related to a known set of offenders
- Determining that certain auto design features lead to more electrical problems
Strengths and Weaknesses of Data Mining
Capabilities of Data Mining
- Describe: Illustrate patterns and trends within data
- Cluster: Identify groups of similar data
- Classify: Group data into predefined classes
- Estimate: Label data with numerical attributes
- Predict: Classify/estimate for data points in the future
- Associate: Extract rules from the data set
Limitations of Data Mining
Data mining tools:
- Cannot automatically process data repositories to answer questions
- Cannot operate without human oversight
- Do not pay for themselves overnight
- Are not easy to use
- Will not find the causes behind problems
- Do not automatically clean up messy data
Data Mining is easy to do badly!
- Not a silver bullet
- Not completely automated
- Easy to get wrong
- No guarantee that answers exist for you to mine!
- But can occasionally provide insights
Steps in Successful Data Mining
Successful data mining usually involves a process model that
applies structure to the exploration of the data and the
extraction of knowledge from it.
Data Mining and Knowledge Discovery Process Models
- CRISP-DM: Cross Industry Standard Process for Data Mining
- Fayyad 9-stage model: more detail, same basic outline
The Six-Stage CRISP Model
- Understanding the Problem Domain
- List objectives
- Define problem
- Outline strategy
- Understanding the Data
- Exploratory data analysis
- Evaluate data quality
- Identify relevant subsets
- Preparing the Data
- Extract relevant subset from raw data
- Select cases and variables for analysis
- Clean and transform attributes
- Modeling the Data
- Select and apply models
- Calibrate and optimize models
- Return to data preparation if needed
- Evaluating the Models
- Evaluate models with respect to objectives
- Determine whether additional objectives need to be met
- Decide whether to continue modeling or to deploy results
- Deploying the Knowledge
- Generate reports on knowledge collected
- Apply knowledge to affect outcomes
- Continue or terminate data mining
Data Modeling with Artificial Intelligence
Typically, data modeling techniques are drawn from artificial intelligence,
though data mining draws liberally upon statistics, pattern recognition,
and information visualization. Typical AI techniques for data mining
include:
- Machine Learning: Extract knowledge from the features
- Pattern Recognition: Extract knowledge from the features
- Text Summarization: Extract additional features from text
- Language Understanding: Extract knowledge directly from text
- Vision: Extract additional features from images
While text summarization, language understanding and vision are now
starting to be used, the primary technique used in data mining is
machine learning.
Machine Learning
There are many different kinds of machine learning, from rote
memorization to scientific discovery.
- Memorization: Learning by rote
- Induction: Generalizing over examples
- Deduction: Extracting knowledge from knowledge
- Discovery: Self-guided exploration
The primary kind used in data mining is
inductive learning, or learning from examples.
Kinds of Inductive Learning
Inductive learning problems can be categorized by the feedback available
to the learning mechanism.
- Supervised Learning: Learning a function from input to output
- Unsupervised Learning: Learning unspecified patterns in data
- Reinforcement Learning: Learning from rewards (or lack thereof)
A Model of Learning from Examples
Supervised learning can be viewed as the process of learning a
function f from a set of inputs to a set of outputs
given a set of examples which have the output provided:
- E: the set of all possible examples < X, Y >
- X: representation of a given example
- Input attributes: X = {x1, x2 ... xn}
- Boolean Attributes: xi in {True, False}
- Categorical Attributes: xi in Category
- Continuous Attributes: xi in Real
- Y: representation of the desired output (if available)
- Output attributes: Y = {y1, y2 ... yn}
- Boolean Classification: Y = {y1}, y1 in {True, False}
- Multiattribute Classification: Y = {y1}, y1 in Category
- Unsupervised Learning: Y = {} = empty set
- H: the space of possible hypotheses for the function f
- Decision Trees: tree of if-then tests on inputs with outputs at leaves
- Neural Networks: networks that map features to output
- Nearest Neighbor: store previously seen examples and interpolate
- Mathematical Expressions: compute best formula using regression
- T: training set: the subset of examples used to train the algorithm
- D: the distribution from which the training set is drawn
Goal: find a hypothesis that generalizes
- predicts correctly for examples not in the training set
(a small illustration of this notation follows).
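As a concrete illustration of this notation, here is a minimal sketch in Python (the attribute names and values are invented): a tiny training set T of examples <X, Y> and one candidate hypothesis h from a simple rule-based hypothesis space H.

# A minimal sketch of the notation above; attribute names and values are invented.
# Each example <X, Y> pairs an input attribute dictionary X with a boolean output Y.
training_set = [
    ({"smoker": True,  "age": 62.0, "region": "urban"}, True),   # positive example
    ({"smoker": False, "age": 35.0, "region": "rural"}, False),  # negative example
    ({"smoker": True,  "age": 41.0, "region": "urban"}, False),
]

# One hypothesis h from a simple rule-based hypothesis space H:
# classify an example as positive if the person smokes and is over 50.
def hypothesis(x):
    return x["smoker"] and x["age"] > 50.0

# Training error: the fraction of training examples the hypothesis misclassifies.
errors = sum(1 for x, y in training_set if hypothesis(x) != y)
print("training error:", errors / len(training_set))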
Why we need to prepare the data
- Missing attribute values - e.g., missing name or age field.
Difficult to handle correctly:
- Throw out example?
- Use a default value?
- Use the mean of the training set?
- Draw randomly from the training set?
- Find the most likely value for the other attributes?
- Bad attribute values - e.g., incorrect name or age field
Even more difficult to handle correctly!
- Example: Zip codes like 30318, 90210, J2S7K7, 6269, 99999
- Can't just throw out everything that's not a 5-digit number:
- J2S7K7: postal code for Saint-Hyacinthe, Quebec, Canada
- 6269: zip code for Storrs, Connecticut (06269)
- 99999: probably an end-of-field marker, not a zip code
- Redundant attributes - e.g., cell phone usage and cell phone charges.
Can skew learning algorithms:
- Two attributes measuring the same feature of reality
- Correlated attributes can skew the importance of an association
- Hidden attributes - e.g., stress level w.r.t. heart disease
Examples may not have the right data to identify the pattern
- Hidden attributes are also called hidden or lurking variables
- Variables might not have a high enough resolution
- Bad attributes - invalid, spurious, or simply multivalued
Can enable learning algorithms to find spurious correlations
- Invalid attributes - e.g., a test field never filled in
- Spurious attributes - e.g., a marker not correlated with the disease
- Multivalued attributes - e.g., a name field unique to each example
- Outliers - e.g., Bill Gates's income, Yao Ming's height
- Individual "bad" examples can skew statistical functions
- Various methods exist to deal with outliers (see the sketch after this list)
- Min-Max Normalization: scale by the overall range
- Z-Score Standardization: scale by the standard deviation
- Interquartile Range: scale by range between 25% and 75% of data
- Bad training data - distribution of examples can be skewed
- A skewed series of examples will lead to a skewed hypothesis
- Focus on wrong or irrelevant data
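The following minimal sketch (plain Python; the field values are invented) illustrates mean imputation for a missing value plus two of the rescaling methods listed above, min-max normalization and z-score standardization; it is one possible approach rather than a prescription.

import statistics

# Invented sample: incomes with one missing value (None) and one extreme outlier.
incomes = [32000.0, 45000.0, None, 51000.0, 38000.0, 1000000.0]

# Missing values: one option from the list above is to substitute the mean
# of the observed values (alternatives: drop the record, use a default, etc.).
observed = [v for v in incomes if v is not None]
filled = [v if v is not None else statistics.mean(observed) for v in incomes]

# Min-max normalization: scale each value by the overall range into [0, 1].
lo, hi = min(filled), max(filled)
min_max = [(v - lo) / (hi - lo) for v in filled]

# Z-score standardization: scale by the standard deviation around the mean;
# the outlier lands several standard deviations away from zero.
mu, sigma = statistics.mean(filled), statistics.pstdev(filled)
z_scores = [(v - mu) / sigma for v in filled]

print(min_max)
print(z_scores)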
Why we need data analysis
We want to avoid overfitting --- choosing a hypothesis so powerful
that it effectively memorizes the training data, noise and all.
- Error: probability that the hypothesis will misclassify the input
- Underfitting: having too weak a hypothesis space to account for the data
- E.g., a linear function can't accurately represent a polynomial
- E.g., a polynomial can't accurately represent a sine wave
- Overfitting: learning spurious relationships in the data
  - E.g., an nth-order polynomial may wiggle to fit the noise in your data
    (a small demonstration appears at the end of this section)
  - Particularly dangerous for lurking variables and multivalued attributes
- Ockham's Razor: try to find the simplest possible hypothesis
- PAC learning: find probably approximately correct hypotheses
  faster by focusing on smaller spaces of hypotheses
- Learning Bias: knowledge used to restrict the set of hypotheses
For instance, with n boolean attributes there are 2^2^n
possible boolean decision trees, but only 2^n possible examples.
So you can't do any better than a lookup table unless you have some
knowledge that enables you to restrict the set of trees you build.
However, restricting the set of hypotheses can exclude what you want to learn!
- Confirmation Bias: human tendency to seek evidence that
confirms the hypothesis we have, rather than evidence that contradicts it
- Peeking: using knowledge of test data to improve the algorithm
This combination of factors makes problem and data understanding both crucial and hard:
we have to restrict our search for hypotheses in order to find anything, but
we can just as easily preclude ourselves from finding out anything new by doing so!
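As a small demonstration of the polynomial example above (a sketch assuming numpy is available; the data are synthetic and the polynomial degrees are arbitrary choices), fitting a high-degree polynomial to a few noisy points from a simple underlying line tends to track the noise rather than the trend:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a straight line y = 2x with a little noise added.
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

# Underfit, fit, overfit: a constant, a line, and a degree-9 polynomial.
fits = {deg: np.polyfit(x, y, deg) for deg in (0, 1, 9)}

# Fresh points from the same underlying line: the degree-9 fit, which wiggles
# through the training noise, typically generalizes worse than the line.
x_test = np.linspace(0.05, 0.95, 50)
y_test = 2.0 * x_test
for deg, coeffs in fits.items():
    err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print("degree", deg, "test mean squared error:", round(float(err), 4))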
Case Study: Learning Decision Trees
What are Decision Trees?
A decision tree is a hierarchical set of if-then rules that partitions
the space of examples based on functions of the attributes.
- Decision Trees: partition based on boolean functions of attributes
- Decision Lists: use a simple list of tests, not a tree
- Continuous Inputs: use split-point functions rather than boolean tests
- Continuous Outputs: use regression functions at each leaf of tree
Decision Tree Learning
Learning decision trees involves finding the best set of questions to ask
to completely partition the examples. An algorithm for learning decision trees:
- Given a set of examples, a set of attributes, and a default classification:
- If there are no examples, return the default classification
- If all the examples are in the same classification, return that
- Select the category that best fits all the examples (the majority classification)
- If there are no attributes left to split on, return that best-fitting category
- Select the attribute that best splits the examples into categories
- Create a new tree rooted on the splitting attribute
- For each possible value of the splitting attribute,
create a branch of the tree by calling this algorithm recursively with:
- Examples: all examples with that value of the attribute
- Attributes: all attributes except the splitting attribute
- Default: the best-fitting category of the current examples (a code sketch of this algorithm follows)
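A minimal sketch of this algorithm in Python (assuming examples are (attribute-dictionary, classification) pairs and all attributes are categorical; there is no pruning or handling of continuous values):

from collections import Counter
from math import log2

def majority(examples):
    # The category that best fits the examples: the most common classification.
    return Counter(y for _, y in examples).most_common(1)[0][0]

def entropy(examples):
    counts = Counter(y for _, y in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(examples, attr):
    # Information gain of splitting on attr (see the next section).
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def learn_tree(examples, attributes, default):
    if not examples:
        return default                      # no examples: return the default
    classes = {y for _, y in examples}
    if len(classes) == 1:
        return classes.pop()                # all examples agree: return that class
    best_fit = majority(examples)           # category that best fits the examples
    if not attributes:
        return best_fit                     # no attributes left to split on
    split = max(attributes, key=lambda a: gain(examples, a))
    tree = {"split": split, "branches": {}}
    for v in {x[split] for x, _ in examples}:   # each value seen for the attribute
        subset = [(x, y) for x, y in examples if x[split] == v]
        rest = [a for a in attributes if a != split]
        tree["branches"][v] = learn_tree(subset, rest, best_fit)
    return tree

# Tiny invented example: decide whether to play outside.
data = [({"outlook": "sunny", "windy": False}, True),
        ({"outlook": "sunny", "windy": True},  True),
        ({"outlook": "rainy", "windy": False}, False),
        ({"outlook": "rainy", "windy": True},  False)]
print(learn_tree(data, ["outlook", "windy"], default=True))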
How to select the best attribute
Choose the attribute that provides the most information gain
- intuitively, the most bang for the buck in splitting the remaining
examples.
- Information: how much a piece of data helps us answer questions
- Bit: the smallest amount of information - the answer to one yes/no question
- Bits needed to represent n choices: log2(n)
- Information provided by finding an answer that occurs m chances out of n:
- Bits for all the choices - bits for the choices consistent with the answer
= log2(n) - log2(m)
= -log2(m/n)
- But m/n is just the probability p of the choice!
p = m/n, so the information in the answer is -log2(p)
- Furthermore, finding out one answer means we didn't get all other answers!
- So, the information provided by an answer given a set of choices is
the sum of the information content of each answer weighted by its probability
(this formula is checked in code after the examples below):
- I( A | C1 ... CN )
= I( P(C1) ... P(CN) )
= ∑ -P(Ci) log2(P(Ci))
- So the information provided by finding out that a coin is Heads is:
- I( H | {H,T} )
= I( P(H), P(T) )
= -P(H) log2 P(H) - P(T) log2(P(T))
= -½ log2(½) - ½ log2(½)
= ½ + ½
= 1 bit
- And the information on the roll of a four-sided die:
- I( 1 | {1 ... 4} )
= I( P(1) ... P(4) )
= ∑ -P(i) log2(P(i))
= ∑ -¼ log2(¼)
= 4(-¼(-2))
= 2 bits
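A minimal sketch of this information formula in Python, checking the coin and die examples above (probabilities are passed as a list):

from math import log2

def information(probabilities):
    # I( P(C1) ... P(CN) ) = sum over the choices of -P(Ci) * log2(P(Ci)).
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(information([0.5, 0.5]))                # fair coin: 1.0 bit
print(information([0.25, 0.25, 0.25, 0.25]))  # four-sided die: 2.0 bits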
So to apply this to decision tree learning, we need to find the
attribute that provides us the most information. For boolean
classification, that's the attribute that comes closest to answering
the question "is this a positive example or not?"
- Total positive examples: p
- Total negative examples: n
- Total examples: p+n
- Total information in training set:
- I( classification | examples )
= I( P(positive), P(negative) )
= -P(positive) log2 P(positive) - P(negative) log2 P(negative)
= - p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n))
- Remainder: Information still needed to classify examples after we split on an attribute
- Sum up the information in each subset of examples with a given value of the attribute:
- ∑ P(example has attribute value i) I( P(positive if attribute=i), P(negative if attribute=i) )
= ∑ (pi+ni)/(p+n) I( pi/(pi+ni), ni/(pi+ni) )
= ∑ (pi+ni)/(p+n) [ - pi/(pi+ni) log2(pi/(pi+ni)) - ni/(pi+ni) log2(ni/(pi+ni)) ]
- Information Gain: Information we get from splitting on the attribute
- Information Gain = Total Information - Remainder
- For example, an attribute that completely splits the examples
into positive and negative sets requires no additional information:
its remainder is zero, so its gain equals the total information
(a worked sketch follows).
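A minimal sketch of the total information, remainder, and gain calculation for boolean classification (plain Python; the example counts are invented):

from math import log2

def info(p, n):
    # Information in a set of p positive and n negative examples.
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c > 0)

def information_gain(p, n, splits):
    # splits: (pi, ni) counts for each value of the candidate attribute.
    remainder = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in splits)
    return info(p, n) - remainder

# Invented counts: 6 positive and 6 negative examples overall.
# A clean split has zero remainder, so its gain equals the total information
# (1 bit); a useless 50/50 split gains nothing.
print(information_gain(6, 6, [(6, 0), (0, 6)]))  # 1.0
print(information_gain(6, 6, [(3, 3), (3, 3)]))  # 0.0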
One way to test the algorithm is to divide the examples into a test
set and a training set to see whether or not the tree learned on the
training set can accurately categorize the test set.
Other Machine Learning Algorithms
Instance-Based Learning
Instance-based learning stores a population of examples and
estimates values by returning results from one or more stored
examples "nearest" to the input.
- Memory-based reasoning: Store large numbers of examples, use "nearest" answer
- k-Nearest-Neighbor: Find k nearest answers and interpolate (or use majority)
- Case-based reasoning: Find most similar answer and "adapt" it
Instance-based learning mechanisms need a similarity function or distance metric
to compute which stored examples are closest to the new test input. If the
number of stored examples is very large, a memory retrieval system is required
to make access to the nearest examples efficient.
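A minimal sketch of k-nearest-neighbor classification (plain Python, Euclidean distance, majority vote; the stored examples and the query are invented):

from collections import Counter
from math import dist

# Stored examples: (feature vector, label) pairs; the values are invented.
stored = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
          ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]

def knn_classify(query, examples, k=3):
    # Distance metric: Euclidean distance between feature vectors.
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    # Majority vote among the k nearest stored examples.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((1.1, 1.0), stored))  # expected: "A"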
Clustering Algorithms
Clustering algorithms take a population of examples and
divide them into groups.
- Hierarchical clustering: Group examples together, then group clusters recursively
- k-Means Clustering: Select k examples as initial cluster centers, then repeatedly assign each example to the nearest center and recompute the centers (a sketch follows)
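A minimal sketch of k-means clustering on one-dimensional data (plain Python; the data, k, and the iteration count are invented, and the first k points stand in for the initial centers):

from statistics import mean

def k_means(points, k, iterations=10):
    centers = points[:k]              # start clusters: the first k examples
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
print(k_means(data, k=2))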
Neural Networks
Neural networks act as function approximators.
- Graphs of nodes connected by links with varying weights
- Feedforward networks are directed acyclic graphs (no reverse links)
- Recurrent networks allow back links
- Feedforward neural networks can be broken down into two classes
- Perceptrons have no hidden layers and compute linearly separable functions
- Hidden layer networks have an intervening layer between input and output
and can compute more complex functions
- Backpropagation enables hidden layer feedforward networks to learn (a sketch follows this list)
- Use the weights of the network to combine the inputs into the outputs
- Compute the error (difference) between the expected and actual output
- Use error to adjust weights recursively closer to producing the right answer
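A minimal sketch of backpropagation for a one-hidden-layer feedforward network, trained here on XOR (assumes numpy; the hidden-layer size, learning rate, and iteration count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR is not linearly separable, so a perceptron cannot learn it,
# but a feedforward network with one hidden layer can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

ones = np.ones((4, 1))                 # bias inputs
W1 = rng.normal(size=(3, 4))           # (inputs + bias) -> hidden weights
W2 = rng.normal(size=(5, 1))           # (hidden + bias) -> output weights
rate = 0.5

for _ in range(20000):
    # Forward pass: combine the inputs through the weights into the output.
    h_in = np.hstack([X, ones])
    hidden = sigmoid(h_in @ W1)
    o_in = np.hstack([hidden, ones])
    output = sigmoid(o_in @ W2)
    # Backward pass: compute the output error, then use it to adjust the
    # weights of both layers (squared-error gradient with sigmoid units).
    d_out = (output - Y) * output * (1 - output)
    d_hid = (d_out @ W2[:4].T) * hidden * (1 - hidden)
    W2 -= rate * o_in.T @ d_out
    W1 -= rate * h_in.T @ d_hid

print(np.round(output, 2))   # should approach [[0], [1], [1], [0]]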
Self-Organizing Maps
Self-organizing maps use competitive learning.
- Competition
- Each attribute of the input vector is fed to every node in the network
- Each node computes a response to the input based on its weights
- A scoring function evaluates the responses
- The node with the best score is the "winning" node
- Cooperation
- Nodes which are close to the winning node receive partial activation
- Winning nodes are thus in a neighborhood of active neurons
- Adaptation
- Active nodes are adjusted to reduce their error
- The more active a node is, the more it can adjust its weights
The result is that a self-organizing map divides into a set of regions, each
of which maps to a particular sub-population of examples. This is useful for
categorization, data visualization, and dimensionality reduction.
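A minimal sketch of the competition and adaptation steps for a one-dimensional self-organizing map (plain Python; the map size, data, learning rate, and fixed neighborhood are invented, and the usual shrinking of the neighborhood over time is omitted):

import random
from math import dist

random.seed(0)

# A 1-D map of 5 nodes, each holding a 2-D weight vector.
nodes = [[random.random(), random.random()] for _ in range(5)]
data = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.95], [0.85, 0.9]]
rate, radius = 0.5, 1      # learning rate and neighborhood radius

for _ in range(100):
    x = random.choice(data)
    # Competition: the node whose weights best match the input wins.
    winner = min(range(len(nodes)), key=lambda i: dist(nodes[i], x))
    # Cooperation and adaptation: the winner and its neighbors on the map
    # move their weights toward the input; the winner moves the most.
    for i, node in enumerate(nodes):
        if abs(i - winner) <= radius:
            influence = 1.0 if i == winner else 0.5
            for d in range(len(node)):
                node[d] += rate * influence * (x[d] - node[d])

print([[round(w, 2) for w in node] for node in nodes])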
Bayesian Network Learning
Bayesian networks model probabilistic relationships between variables.
A greedy Bayesian network learning algorithm (roughly sketched after this list):
- Start with an empty network with no associations
- Find the pair of variables with the most correlation
- Add a link to the network
- Adjust the weights of the network
- Repeat until all variance accounted for
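A very rough sketch of the greedy structure step described above (assumes numpy; plain correlation stands in for the association measure, the stopping threshold is arbitrary, and link directions and the conditional probability tables, i.e. the "weights", are not estimated here):

import numpy as np

rng = np.random.default_rng(0)

# Invented data: three variables, where b and c both depend on a.
a = rng.normal(size=200)
b = a + rng.normal(scale=0.5, size=200)
c = a + rng.normal(scale=0.5, size=200)
data = np.column_stack([a, b, c])
names = ["a", "b", "c"]

edges = []
remaining = {(i, j) for i in range(3) for j in range(i + 1, 3)}
corr = np.abs(np.corrcoef(data, rowvar=False))

# Greedily add the most strongly associated remaining pair as a link
# until no pair exceeds the (arbitrary) correlation threshold.
while remaining:
    i, j = max(remaining, key=lambda pair: corr[pair])
    if corr[i, j] < 0.5:
        break
    edges.append((names[i], names[j]))
    remaining.discard((i, j))

print(edges)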
Other Learning Algorithms
- A Priori Association Rule Learning (frequent-itemset step sketched after this list)
- Find all "frequent itemsets" (e.g., smoking, high-fat diet, no exercise)
- Use a threshold of minimum frequency
- a priori property:
if one set of items is not frequent, supersets containing it won't be either
- For each frequent itemset, generate the set of statistically probable rules
- Find all subsets of the itemset
- Split each subset into candidate rules by partitioning it into IF and THEN parts
- Keep all the rules that are statistically probable
- Version Space Learning
- Graph search through the space of possible hypotheses
- Works over conjunctions of features
- most specific hypothesis: all examples output false
- most general hypothesis: all examples output true
- Keep the set of most general hypotheses and the single most specific hypothesis
- For each example:
- If it is a positive example, generalize the most specific hypothesis to cover it
- If it is a negative example, specialize the set of most general hypotheses to exclude it
- Support Vector Machines
- Problem: Difficult to learn complicated nonlinear functions
- Solution: Re-represent the problem in higher dimensional space
- Result: Simple linearly separable learning problem (most of the time)
- "Support vectors" are examples on the boundaries that define separation
- Genetic Algorithms
- Problem: Don't have a good handle on the right hypothesis space
- Solution: Allow the system to evolve the representation
- Requires: A fitness function that determines goodness of a result
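A minimal sketch of the frequent-itemset step of a priori rule learning (plain Python; the transactions and minimum-frequency threshold are invented, and generating the IF/THEN rules from each itemset is omitted):

from itertools import combinations

# Invented transactions, e.g. risk factors observed together per record.
transactions = [
    {"smoking", "high-fat diet", "no exercise"},
    {"smoking", "no exercise"},
    {"high-fat diet", "no exercise"},
    {"smoking", "high-fat diet", "no exercise"},
]
min_count = 3   # minimum frequency threshold

def frequent_itemsets(transactions, min_count):
    items = {item for t in transactions for item in t}
    frequent, size = [], 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Keep only candidates that occur at least min_count times.
        level = [c for c in candidates
                 if sum(c <= t for t in transactions) >= min_count]
        frequent.extend(level)
        size += 1
        # A priori property: a larger itemset can only be frequent if it is
        # built from itemsets that were themselves frequent, so candidates
        # are generated only from the surviving sets of the previous level.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
    return frequent

print(frequent_itemsets(transactions, min_count))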
"Success" Stories in Data Mining
Businesses do not in general publicize their data mining results
--- they use them for competitive advantage. The following examples
from healthcare and law enforcement are systems that were tested with
real data in the field, but the learned results do not appear to
have been applied to practice yet.
Analyzing Cystic Fibrosis
- Goal: Predict what factors indicate prognosis in cystic fibrosis patients
- Data Source: 800+ Cystic Fibrosis Patients at U of Colorado
- General patient information
- FEV1% - forced expiratory volume in one second % predicted
- Genotypes
- Prognosis information
- Data Preparation:
- Correcting missing and nonsensical values
- Removing bad attributes
- Discretizing continuous variables
- Learning Mechanism: "Data Squeezer" production rule learner
- Learned Results:
- Evaluated by experts from 1 (trivial) to 4 (interesting and novel)
- Found 9 previously known useful results
- Only 1 unknown result - sweat electrolytes indicate prognosis
Tracking Offender Networks
- Goal: Identify crimes that may have been committed by a group of perpetrators
- Data Source: 48,000+ burglaries in the West Midlands Police Department
- Crime date and location
- Modus operandi checklist
- Free text case narratives
- Data Preparation:
- Extracting modus operandi information from free text
- Omitting records with key missing data
- Temporal and spatial breakdowns
- Learning Mechanism: Self-Organizing Map
- Learned Results:
- Could prepare lists for interview in 5 minutes (as opposed to 1-2 hours)
- Thresholded lists only contained relevant crimes (as opposed to 5-10% accurate)
Resources
- Data Mining: ????
- Machine Learning: ????
- Neural Networks: ????
- Decision Trees: ????
- Clustering: ????
- Self-Organizing Maps: ????
- Nearest Neighbor: ????
- Medical Data Mining: ????
- Law Enforcement Data Mining: ????