Rollins School of Public Health at Emory University
Instructor: Dr. Anthony G. Francis, Jr.
Lecture 12: Data Mining and Machine Learning
Modern computer systems enable the collection of vast amounts of data,
far more than there are human experts to analyze and interpret.
Data mining is the semiautomated extraction of patterns of knowledge
from large amounts of raw data using techniques from machine learning,
pattern recognition, statistics, linguistics, databases and scientific
modeling. Data mining is used in science, business and computing to
describe, classify, cluster, estimate, predict and associate data,
but it is not a cure-all: there are no purely automated tools which
can just "mine" messy data sets to intuitively produce answers to
our deepest questions. Effective data mining is often performed as
an intensive, iterative process of understanding, preparing, analyzing
and modeling data. Artificial intelligence techniques used in the
modeling phase of data mining include neural networks, decision trees,
self-organizing maps, and nearest neighbor algorithms.
Outline
- What is Data Mining?
- What Data Mining Can Do
- What Data Mining Can't Do
- Steps in Successful Data Mining
- Machine Learning
- Case Study: Learning Decision Trees
- Other Machine Learning Algorithms
- Success Stories in Data Mining
Readings:
- Artificial Intelligence: Chapters 19 and 20
- Machines Who Think: Afterword and Timeline
What is Data Mining?
What is the problem?
- Computers enable the collection of massive amounts of data
- Computer Security - Thousands of records per machine per day
- Medical Records - Hundreds of thousands of cases of even rare diseases
- Law Enforcement - Hundreds of thousands of crimes per jurisdiction per year
- Bioinformatics - Gigabytes of gene sequence data
- Space Research - Terabytes of satellite imagery
- Business Data - Terabyte-sized data warehouses
- Not enough trained human experts exist to analyze or interpret the data
- Some problems are so large that human expertise can't be applied
- Hundreds of columns (features)
- Hundreds of thousands of rows (records)
- Gigabytes of data (possibly distributed)
- One potential solution is data mining
Data Mining: Extracting Knowledge from Data
Data mining is the process of discovering meaningful correlations,
patterns and trends from large repositories of raw data. Data
mining exploits domain knowledge, databases, statistics, artificial
intelligence, and the scientific method.
- Domain Knowledge: Define questions (business, science, medical)
- Databases: Collect, maintain and prepare vast amounts of data
- Statistics: Analyze data to find candidate subsets/techniques
- Artificial Intelligence: Extract knowledge from the data
- Scientific Method: Analysis of results, feedback to earlier stages
- Showmanship: Publish, share or exploit learned knowledge
Real-World Examples of Data Mining
Each of these examples is from a deployed system.
- Learning that electrolyte content in sweat may predict cystic fibrosis prognosis
- Identifying a series of crimes as being related to a known set of offenders
- Determining that certain auto design features lead to more electrical problems
Strengths and Weaknesses of Data Mining
Capabilities of Data Mining
- Describe: Illustrate patterns and trends within data
- Cluster: Identify groups of similar data
- Classify: Group data into predefined classes
- Estimate: Label data with numerical attributes
- Predict: Classify/estimate for data points in the future
- Associate: Extract rules from the data set
Limitations of Data Mining
Data mining tools:
- Cannot automatically process data repositories to answer questions
- Cannot operate without human oversight
- Do not pay for themselves overnight
- Are not easy to use
- Will not find the causes behind problems
- Do not automatically clean up messy data
Data Mining is easy to do badly!
- Not a silver bullet
- Not completely automated
- Easy to get wrong
- No guarantee that answers exist for you to mine!
- But can occasionally provide insights
Steps in Successful Data Mining
Successful data mining usually involves a process model that
applies structure to the exploration of the data and the
extraction of knowledge from it.
Data Mining and Knowledge Discovery Process Models
- CRISP-DM: Cross Industry Standard Process for Data Mining
- Fayyad 9-stage model: more detail, same basic outline
The Six-Stage CRISP Model
- Understanding the Problem Domain
- List objectives
- Define problem
- Outline strategy
- Understanding the Data
- Exploratory data analysis
- Evaluate data quality
- Identify relevant subsets
- Preparing the Data
- Extract relevant subset from raw data
- Select cases and variables for analysis
- Clean and transform attributes
- Modeling the Data
- Select and apply models
- Calibrate and optimize models
- Return to data preparation if needed
- Evaluating the Models
- Evaluate models with respect to objectives
- Determine whether additional objectives need to be met
- Decide whether to continue modeling or to deploy results
- Deploying the Knowledge
- Generate reports on knowledge collected
- Apply knowledge to affect outcomes
- Continue or terminate data mining
Data Modeling with Artificial Intelligence
Typically, data modeling techniques are drawn from artificial intelligence,
though data mining draws liberally upon statistics, pattern recognition,
and information visualization. Typical AI techniques for data mining
include:
- Machine Learning: Extract knowledge from the features
- Pattern Recognition: Extract knowledge from the features
- Text Summarization: Extract additional features from text
- Language Understanding: Extract knowledge directly from text
- Vision: Extract additional features from images
While text summarization, language understanding and vision are now
starting to be used, the primary technique used in data mining is
machine learning.
Machine Learning
There are many different kinds of machine learning, from rote
memorization to scientific discovery.
- Memorization: Learning by rote
- Induction: Generalizing over examples
- Deduction: Extracting knowledge from knowledge
- Discovery: Self-guided exploration
The primary kind used in data mining is
inductive learning, or learning from examples.
Kinds of Inductive Learning
Inductive learning problems can be categorized by the feedback available
to the learning mechanism.
- Supervised Learning: Learning a function from input to output
- Unsupervised Learning: Learning unspecified patterns in data
- Reinforcement Learning: Learning from rewards (or lack thereof)
A Model of Learning from Examples
Supervised learning can be viewed as the process of learning a
function f from a set of inputs to a set of outputs
given a set of examples which have the output provided:
- E: the set of all possible examples < X, Y >
- X: representation of a given example
- Input attributes: X = {x1, x2 ... xn}
- Boolean Attributes: xi in {True, False}
- Categorical Attributes: xi in Category
- Continuous Attributes: xi in Real
- Y: representation of the desired output (if available)
- Output attributes: Y = {y1, y2 ... yn}
- Boolean Classification: Y = {y1}, y1 in {True, False}
- Multiattribute Classification: Y = {y1}, y1 in Category
- Unsupervised Learning: Y = {} = empty set
- H: the space of possible hypotheses for the function f
- Decision Trees: tree of if-then tests on inputs with outputs at leaves
- Neural Networks: networks that map features to output
- Nearest Neighbor: store previously seen examples and interpolate
- Mathematical Expressions: compute best formula using regression
- T: training set: the subset of examples used to train the algorithm
- D: the distribution from which the training set is drawn
Goal: find a hypothesis that generalizes
- predicts correctly for examples not in the training set
(a small illustration of this notation follows).
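As a concrete illustration of this notation, here is a minimal sketch in Python (the attribute names and values are invented): a tiny training set T of examples <X, Y> and one candidate hypothesis h from a simple rule-based hypothesis space H.

# A minimal sketch of the notation above; attribute names and values are invented.
# Each example <X, Y> pairs an input attribute dictionary X with a boolean output Y.
training_set = [
    ({"smoker": True,  "age": 62.0, "region": "urban"}, True),   # positive example
    ({"smoker": False, "age": 35.0, "region": "rural"}, False),  # negative example
    ({"smoker": True,  "age": 41.0, "region": "urban"}, False),
]

# One hypothesis h from a simple rule-based hypothesis space H:
# classify an example as positive if the person smokes and is over 50.
def hypothesis(x):
    return x["smoker"] and x["age"] > 50.0

# Training error: the fraction of training examples the hypothesis misclassifies.
errors = sum(1 for x, y in training_set if hypothesis(x) != y)
print("training error:", errors / len(training_set))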
Why we need to prepare the data
- Missing attribute values - e.g., missing name or age field.
Difficult to handle correctly:
- Throw out example?
- Use a default value?
- Use the mean of the training set?
- Draw randomly from the training set?
- Find the most likely value for the other attributes?
- Bad attribute values - e.g., incorrect name or age field
Even more difficult to handle correctly!
- Example: Zip codes like 30318, 90210, J2S7K7, 6269, 99999
- Can't just throw out everything that's not a 5-digit number:
- J2S7K7: postal code for Saint-Hyacinthe, Quebec, Canada
- 6269: zip code for Storrs, Connecticut (06269)
- 99999: probably an end-of-field marker, not a zip code
- Redundant attributes - e.g., cell phone usage and cell phone charges.
Can skew learning algorithms:
- Two attributes measuring the same feature of reality
- Correlated attributes can skew the importance of an association
- Hidden attributes - e.g., stress level w.r.t. heart disease
Examples may not have the right data to identify the pattern
- Hidden attributes are also called hidden or lurking variables
- Variables might not have a high enough resolution
- Bad attributes - invalid, spurious, or simply multivalued
Can enable learning algorithms to find spurious correlations
- Invalid attributes - e.g., a test field never filled in
- Spurious attributes - e.g., a marker not correlated with the disease
- Multivalued attributes - e.g., a name field unique to each example
- Outliers - e.g., Bill Gates's income, Yao Ming's height
- Individual "bad" examples can skew statistical functions
- Various methods exist to deal with outliers (see the sketch after this list)
- Min-Max Normalization: scale by the overall range
- Z-Score Standardization: scale by the standard deviation
- Interquartile Range: scale by range between 25% and 75% of data
- Bad training data - distribution of examples can be skewed
- A skewed series of examples will lead to a skewed hypothesis
- Focus on wrong or irrelevant data
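The following minimal sketch (plain Python; the field values are invented) illustrates mean imputation for a missing value plus two of the rescaling methods listed above, min-max normalization and z-score standardization; it is one possible approach rather than a prescription.

import statistics

# Invented sample: incomes with one missing value (None) and one extreme outlier.
incomes = [32000.0, 45000.0, None, 51000.0, 38000.0, 1000000.0]

# Missing values: one option from the list above is to substitute the mean
# of the observed values (alternatives: drop the record, use a default, etc.).
observed = [v for v in incomes if v is not None]
filled = [v if v is not None else statistics.mean(observed) for v in incomes]

# Min-max normalization: scale each value by the overall range into [0, 1].
lo, hi = min(filled), max(filled)
min_max = [(v - lo) / (hi - lo) for v in filled]

# Z-score standardization: scale by the standard deviation around the mean;
# the outlier lands several standard deviations away from zero.
mu, sigma = statistics.mean(filled), statistics.pstdev(filled)
z_scores = [(v - mu) / sigma for v in filled]

print(min_max)
print(z_scores)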
Why we need data analysis
We want to avoid overfitting --- choosing a hypothesis so powerful
that it effectively memorizes the training data, noise and all.
- Error: probability that the hypothesis will misclassify the input
- Underfitting: having too weak a hypothesis space to account for the data
- E.g., a linear function can't accurately represent a polynomial
- E.g., a polynomial can't accurately represent a sine wave
- Overfitting: learning spurious relationships in the data
  - E.g., an nth-order polynomial may wiggle to fit the noise in your data
    (a small demonstration appears at the end of this section)
  - Particularly dangerous for lurking variables and multivalued attributes
- Ockham's Razor: try to find the simplest possible hypothesis
- PAC learning: find probably approximately correct hypotheses
  faster by focusing on smaller spaces of hypotheses
- Learning Bias: knowledge used to restrict the set of hypotheses
For instance, with n boolean attributes there are 2^2^n
possible boolean decision trees, but only 2^n possible examples.
So you can't do any better than a lookup table unless you have some
knowledge that enables you to restrict the set of trees you build.
However, restricting the set of hypotheses can exclude what you want to learn!
- Confirmation Bias: human tendency to seek evidence that
confirms the hypothesis we have, rather than evidence that contradicts it
- Peeking: using knowledge of test data to improve the algorithm
This combination of factors makes problem and data understanding both crucial and hard:
we have to restrict our search for hypotheses in order to find anything, but
we can just as easily preclude ourselves from finding out anything new by doing so!
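As a small demonstration of the polynomial example above (a sketch assuming numpy is available; the data are synthetic and the polynomial degrees are arbitrary choices), fitting a high-degree polynomial to a few noisy points from a simple underlying line tends to track the noise rather than the trend:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a straight line y = 2x with a little noise added.
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

# Underfit, fit, overfit: a constant, a line, and a degree-9 polynomial.
fits = {deg: np.polyfit(x, y, deg) for deg in (0, 1, 9)}

# Fresh points from the same underlying line: the degree-9 fit, which wiggles
# through the training noise, typically generalizes worse than the line.
x_test = np.linspace(0.05, 0.95, 50)
y_test = 2.0 * x_test
for deg, coeffs in fits.items():
    err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print("degree", deg, "test mean squared error:", round(float(err), 4))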
Case Study: Learning Decision Trees
What are Decision Trees?
A decision tree is a hierarchical set of if-then rules that partitions
the space of examples based on functions of the attributes.
- Decision Trees: partition based on boolean functions of attributes
- Decision Lists: use a simple list of tests, not a tree
- Continuous Inputs: use split-point functions rather than boolean tests
- Continuous Outputs: use regression functions at each leaf of tree
Decision Tree Learning
Learning decision trees involves finding the best set of questions to ask
to completely partition the examples. An algorithm for learning decision trees:
- Given a set of examples, a set of attributes, and a default classification:
- If there are no examples, return the default classification
- If all the examples are in the same classification, return that
- Select the category that best fits all the examples (the majority classification)
- If there are no attributes left to split on, return that best-fitting category
- Select the attribute that best splits the examples into categories
- Create a new tree rooted on the splitting attribute
- For each possible value of the splitting attribute,
create a branch of the tree by calling this algorithm recursively with:
- Examples: all examples with that value of the attribute
- Attributes: all attributes except the splitting attribute
- Default: the best-fitting category of the current examples (a code sketch of this algorithm follows)
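A minimal sketch of this algorithm in Python (assuming examples are (attribute-dictionary, classification) pairs and all attributes are categorical; there is no pruning or handling of continuous values):

from collections import Counter
from math import log2

def majority(examples):
    # The category that best fits the examples: the most common classification.
    return Counter(y for _, y in examples).most_common(1)[0][0]

def entropy(examples):
    counts = Counter(y for _, y in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(examples, attr):
    # Information gain of splitting on attr (see the next section).
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def learn_tree(examples, attributes, default):
    if not examples:
        return default                      # no examples: return the default
    classes = {y for _, y in examples}
    if len(classes) == 1:
        return classes.pop()                # all examples agree: return that class
    best_fit = majority(examples)           # category that best fits the examples
    if not attributes:
        return best_fit                     # no attributes left to split on
    split = max(attributes, key=lambda a: gain(examples, a))
    tree = {"split": split, "branches": {}}
    for v in {x[split] for x, _ in examples}:   # each value seen for the attribute
        subset = [(x, y) for x, y in examples if x[split] == v]
        rest = [a for a in attributes if a != split]
        tree["branches"][v] = learn_tree(subset, rest, best_fit)
    return tree

# Tiny invented example: decide whether to play outside.
data = [({"outlook": "sunny", "windy": False}, True),
        ({"outlook": "sunny", "windy": True},  True),
        ({"outlook": "rainy", "windy": False}, False),
        ({"outlook": "rainy", "windy": True},  False)]
print(learn_tree(data, ["outlook", "windy"], default=True))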
How to select the best attribute
Choose the attribute that provides the most information gain
- intuitively, the most bang for the buck in splitting the remaining
examples.
- Information: how much a piece of data helps us answer questions
- Bit: the smallest amount of information - the answer to one yes/no question
- Bits needed to represent n choices: log2(n)
- Information provided by finding an answer that occurs m chances out of n:
- Bits for all the choices - bits for the choices consistent with the answer
= log2(n) - log2(m)
= -log2(m/n)
- But m/n is just the probability p of the choice!
p = m/n, so the information in the answer is -log2(p)
- Furthermore, finding out one answer means we didn't get all other answers!
- So, the information provided by an answer given a set of choices is
the sum of the information content of each answer weighted by its probability
(this formula is checked in code after the examples below):
- I( A | C1 ... CN )
= I( P(C1) ... P(CN) )
= ∑ -P(Ci) log2(P(Ci))
- So the information provided by finding out that a coin is Heads is:
- I( H | {H,T} )
= I( P(H), P(T) )
= -P(H) log2 P(H) - P(T) log2(P(T))
= -½ log2(½) - ½ log2(½)
= ½ + ½
= 1 bit
- And the information on the roll of a four-sided die:
- I( 1 | {1 ... 4} )
= I( P(1) ... P(4) )
= ∑ -P(i) log2(P(i))
= ∑ -¼ log2(¼)
= 4(-¼(-2))
= 2 bits
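A minimal sketch of this information formula in Python, checking the coin and die examples above (probabilities are passed as a list):

from math import log2

def information(probabilities):
    # I( P(C1) ... P(CN) ) = sum over the choices of -P(Ci) * log2(P(Ci)).
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(information([0.5, 0.5]))                # fair coin: 1.0 bit
print(information([0.25, 0.25, 0.25, 0.25]))  # four-sided die: 2.0 bits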
So to apply this to decision tree learning, we need to find the
attribute that provides us the most information. For boolean
classification, that's the attribute that comes closest to answering
the question "is this a positive example or not?"
- Total positive examples: p
- Total negative examples: n
- Total examples: p+n
- Total information in training set:
- I( classification | examples )
= I( P(positive), P(negative) )
= -P(positive) log2 P(positive) - P(negative) log2 P(negative)
= - p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n))
- Remainder: Information still needed to classify examples after we split on an attribute
- Sum up the information in each subset of examples with a given value of the attribute:
- ∑ P(example has attribute value i) I( P(positive if attribute=i), P(negative if attribute=i) )
= ∑ (pi+ni)/(p+n) I( pi/(pi+ni), ni/(pi+ni) )
= ∑ (pi+ni)/(p+n) [ - pi/(pi+ni) log2(pi/(pi+ni)) - ni/(pi+ni) log2(ni/(pi+ni)) ]
- Information Gain: Information we get from splitting on the attribute
- Information Gain = Total Information - Remainder
- For example, an attribute that completely splits the examples
into positive and negative sets requires no additional information:
its remainder is zero, so its gain equals the total information
(a worked sketch follows).
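A minimal sketch of the total information, remainder, and gain calculation for boolean classification (plain Python; the example counts are invented):

from math import log2

def info(p, n):
    # Information in a set of p positive and n negative examples.
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c > 0)

def information_gain(p, n, splits):
    # splits: (pi, ni) counts for each value of the candidate attribute.
    remainder = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in splits)
    return info(p, n) - remainder

# Invented counts: 6 positive and 6 negative examples overall.
# A clean split has zero remainder, so its gain equals the total information
# (1 bit); a useless 50/50 split gains nothing.
print(information_gain(6, 6, [(6, 0), (0, 6)]))  # 1.0
print(information_gain(6, 6, [(3, 3), (3, 3)]))  # 0.0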
One way to test the algorithm is to divide the examples into a test
set and a training set to see whether or not the tree learned on the
training set can accurately categorize the test set.
Other Machine Learning Algorithms
Instance-Based Learning
Instance-based learning stores a population of examples and
estimates values by returning results from one or more stored
examples "nearest" to the input.
- Memory-based reasoning: Store large numbers of examples, use "nearest" answer
- k-Nearest-Neighbor: Find k nearest answers and interpolate (or use majority)
- Case-based reasoning: Find most similar answer and "adapt" it
Instance-based learning mechanisms need a similarity function or distance metric
to compute which stored examples are closest to the new test input. If the
number of stored examples is very large, a memory retrieval system is required
to make access to the nearest examples efficient.
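A minimal sketch of k-nearest-neighbor classification (plain Python, Euclidean distance, majority vote; the stored examples and the query are invented):

from collections import Counter
from math import dist

# Stored examples: (feature vector, label) pairs; the values are invented.
stored = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
          ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]

def knn_classify(query, examples, k=3):
    # Distance metric: Euclidean distance between feature vectors.
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    # Majority vote among the k nearest stored examples.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((1.1, 1.0), stored))  # expected: "A"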
Clustering Algorithms
Clustering algorithms take a population of examples and
divide them into groups.
- Hierarchical clustering: Group examples together, then group clusters recursively
- k-Means Clustering: Select k examples as initial cluster centers, then repeatedly assign each example to the nearest center and recompute the centers (a sketch follows)
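A minimal sketch of k-means clustering on one-dimensional data (plain Python; the data, k, and the iteration count are invented, and the first k points stand in for the initial centers):

from statistics import mean

def k_means(points, k, iterations=10):
    centers = points[:k]              # start clusters: the first k examples
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
print(k_means(data, k=2))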
Neural Networks
Neural networks act as function approximators.
- Graphs of nodes connected by links with varying weights
- Feedforward networks are directed acyclic graphs (no reverse links)
- Recurrent networks allow back links
- Feedforward neural networks can be broken down into two classes
- Perceptrons have no hidden layers and compute linearly separable functions
- Hidden layer networks have an intervening layer between input and output
and can compute more complex functions
- Backpropagation enables hidden layer feedforward networks to learn (a sketch follows this list)
- Use the weights of the network to combine the inputs into the outputs
- Compute the error (difference) between the expected and actual output
- Use error to adjust weights recursively closer to producing the right answer
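A minimal sketch of backpropagation for a one-hidden-layer feedforward network, trained here on XOR (assumes numpy; the hidden-layer size, learning rate, and iteration count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR is not linearly separable, so a perceptron cannot learn it,
# but a feedforward network with one hidden layer can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

ones = np.ones((4, 1))                 # bias inputs
W1 = rng.normal(size=(3, 4))           # (inputs + bias) -> hidden weights
W2 = rng.normal(size=(5, 1))           # (hidden + bias) -> output weights
rate = 0.5

for _ in range(20000):
    # Forward pass: combine the inputs through the weights into the output.
    h_in = np.hstack([X, ones])
    hidden = sigmoid(h_in @ W1)
    o_in = np.hstack([hidden, ones])
    output = sigmoid(o_in @ W2)
    # Backward pass: compute the output error, then use it to adjust the
    # weights of both layers (squared-error gradient with sigmoid units).
    d_out = (output - Y) * output * (1 - output)
    d_hid = (d_out @ W2[:4].T) * hidden * (1 - hidden)
    W2 -= rate * o_in.T @ d_out
    W1 -= rate * h_in.T @ d_hid

print(np.round(output, 2))   # should approach [[0], [1], [1], [0]]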
Self-Organizing Maps
Self-organizing maps use competitive learning.
- Competition
- Each attribute of the input vector is fed to every node in the network
- Each node computes a response to the input based on its weights
- A scoring function evaluates the responses
- The node with the best score is the "winning" node
- Cooperation
- Nodes which are close to the winning node receive partial activation
- Winning nodes are thus in a neighborhood of active neurons
- Adaptation
- Active nodes are adjusted to reduce their error
- The more active a node is, the more it can adjust its weights
The result is that a self-organizing map divides into a set of regions, each
of which maps to a particular sub-population of examples. This is useful for
categorization, data visualization, and dimensionality reduction.
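A minimal sketch of the competition and adaptation steps for a one-dimensional self-organizing map (plain Python; the map size, data, learning rate, and fixed neighborhood are invented, and the usual shrinking of the neighborhood over time is omitted):

import random
from math import dist

random.seed(0)

# A 1-D map of 5 nodes, each holding a 2-D weight vector.
nodes = [[random.random(), random.random()] for _ in range(5)]
data = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.95], [0.85, 0.9]]
rate, radius = 0.5, 1      # learning rate and neighborhood radius

for _ in range(100):
    x = random.choice(data)
    # Competition: the node whose weights best match the input wins.
    winner = min(range(len(nodes)), key=lambda i: dist(nodes[i], x))
    # Cooperation and adaptation: the winner and its neighbors on the map
    # move their weights toward the input; the winner moves the most.
    for i, node in enumerate(nodes):
        if abs(i - winner) <= radius:
            influence = 1.0 if i == winner else 0.5
            for d in range(len(node)):
                node[d] += rate * influence * (x[d] - node[d])

print([[round(w, 2) for w in node] for node in nodes])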
Bayesian Network Learning
Bayesian networks model probabilistic relationships between variables.
A greedy Bayesian network learning algorithm (roughly sketched after this list):
- Start with an empty network with no associations
- Find the pair of variables with the most correlation
- Add a link to the network
- Adjust the weights of the network
- Repeat until all variance accounted for
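A very rough sketch of the greedy structure step described above (assumes numpy; plain correlation stands in for the association measure, the stopping threshold is arbitrary, and link directions and the conditional probability tables, i.e. the "weights", are not estimated here):

import numpy as np

rng = np.random.default_rng(0)

# Invented data: three variables, where b and c both depend on a.
a = rng.normal(size=200)
b = a + rng.normal(scale=0.5, size=200)
c = a + rng.normal(scale=0.5, size=200)
data = np.column_stack([a, b, c])
names = ["a", "b", "c"]

edges = []
remaining = {(i, j) for i in range(3) for j in range(i + 1, 3)}
corr = np.abs(np.corrcoef(data, rowvar=False))

# Greedily add the most strongly associated remaining pair as a link
# until no pair exceeds the (arbitrary) correlation threshold.
while remaining:
    i, j = max(remaining, key=lambda pair: corr[pair])
    if corr[i, j] < 0.5:
        break
    edges.append((names[i], names[j]))
    remaining.discard((i, j))

print(edges)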
Other Learning Algorithms
- A Priori Association Rule Learning (frequent-itemset step sketched after this list)
- Find all "frequent itemsets" (e.g., smoking, high-fat diet, no exercise)
- Use a threshold of minimum frequency
- a priori property:
if one set of items is not frequent, supersets containing it won't be either
- For each frequent itemset, generate the set of statistically probable rules
- Find all subsets of the itemset
- Split each subset into candidate rules by partitioning it into IF and THEN parts
- Keep all the rules that are statistically probable
- Version Space Learning
- Graph search through the space of possible hypotheses
- Works over conjunctions of features
- most specific hypothesis: all examples output false
- most general hypothesis: all examples output true
- Keep the set of most general hypotheses and the single most specific hypothesis
- For each example:
- If it is a positive example, generalize the most specific hypothesis to cover it
- If it is a negative example, specialize the set of most general hypotheses to exclude it
- Support Vector Machines
- Problem: Difficult to learn complicated nonlinear functions
- Solution: Re-represent the problem in higher dimensional space
- Result: Simple linearly separable learning problem (most of the time)
- "Support vectors" are examples on the boundaries that define separation
- Genetic Algorithms
- Problem: Don't have a good handle on the right hypothesis space
- Solution: Allow the system to evolve the representation
- Requires: A fitness function that determines goodness of a result
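A minimal sketch of the frequent-itemset step of a priori rule learning (plain Python; the transactions and minimum-frequency threshold are invented, and generating the IF/THEN rules from each itemset is omitted):

from itertools import combinations

# Invented transactions, e.g. risk factors observed together per record.
transactions = [
    {"smoking", "high-fat diet", "no exercise"},
    {"smoking", "no exercise"},
    {"high-fat diet", "no exercise"},
    {"smoking", "high-fat diet", "no exercise"},
]
min_count = 3   # minimum frequency threshold

def frequent_itemsets(transactions, min_count):
    items = {item for t in transactions for item in t}
    frequent, size = [], 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Keep only candidates that occur at least min_count times.
        level = [c for c in candidates
                 if sum(c <= t for t in transactions) >= min_count]
        frequent.extend(level)
        size += 1
        # A priori property: a larger itemset can only be frequent if it is
        # built from itemsets that were themselves frequent, so candidates
        # are generated only from the surviving sets of the previous level.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
    return frequent

print(frequent_itemsets(transactions, min_count))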
"Success" Stories in Data Mining
Businesses do not in general publicize their data mining results
--- they use them for competitive advantage. The following examples
from healthcare and law enforcement are systems that were tested with
real data in the field, but the learned results do not appear to
have been applied to practice yet.
Analyzing Cystic Fibrosis
- Goal: Predict what factors indicate prognosis in cystic fibrosis patients
- Data Source: 800+ Cystic Fibrosis Patients at U of Colorado
- General patient information
- FEV1% - forced expiratory volume in one second % predicted
- Genotypes
- Prognosis information
- Data Preparation:
- Correcting missing and nonsensical values
- Removing bad attributes
- Discretizing continuous variables
- Learning Mechanism: "Data Squeezer" production rule learner
- Learned Results:
- Evaluated by experts from 1 (trivial) to 4 (interesting and novel)
- Found 9 previously known useful results
- Only 1 unknown result - sweat electrolytes indicate prognosis
Tracking Offender Networks
- Goal: Identify crimes that may have been committed by a group of perpetrators
- Data Source: 48,000+ burglaries in the West Midlands Police Department
- Crime date and location
- Modus operandi checklist
- Free text case narratives
- Data Preparation:
- Extracting modus operandi information from free text
- Omitting records with key missing data
- Temporal and spatial breakdowns
- Learning Mechanism: Self-Organizing Map
- Learned Results:
- Could prepare lists for interview in 5 minutes (as opposed to 1-2 hours)
- Thresholded lists only contained relevant crimes (as opposed to 5-10% accurate)
Resources
- Data Mining: ????
- Machine Learning: ????
- Neural Networks: ????
- Decision Trees: ????
- Clustering: ????
- Self-Organizing Maps: ????
- Nearest Neighbor: ????
- Medical Data Mining: ????
- Law Enforcement Data Mining: ????