Foreword
Preface
Chapter 1 Introduction
1.1 What Motivated Data Mining? Why Is It Important?
1.2 So, What Is Data Mining?
1.3 Data Mining-On What Kind of Data?
1.3.1 Relational Databases
1.3.2 Data Warehouses
1.3.3 Transactional Databases
1.3.4 Advanced Data and Information Systems and Advanced Applications
1.4 Data Mining Functionalities-What Kinds of Patterns Can Be Mined?
1.4.1 Concept/Class Description: Characterization and Discrimination
1.4.2 Mining Frequent Patterns, Associations, and Correlations
1.4.3 Classification and Prediction
1.4.4 Cluster Analysis
1.4.5 Outlier Analysis
1.4.6 Evolution Analysis
1.5 Are All of the Patterns Interesting?
1.6 Classification of Data Mining Systems
1.7 Data Mining Task Primitives
1.8 Integration of a Data Mining System with a Database or Data Warehouse System
1.9 Major Issues in Data Mining
1.10 Summary
Exercises
Bibliographic Notes
Chapter 2 Data Preprocessing
2.1 Why Preprocess the Data?
2.2 Descriptive Data Summarization
2.2.1 Measuring the Central Tendency
2.2.2 Measuring the Dispersion of Data
2.2.3 Graphic Displays of Basic Descriptive Data Summaries
2.3 Data Cleaning
2.3.1 Missing Values
2.3.2 Noisy Data
2.3.3 Data Cleaning as a Process
2.4 Data Integration and Transformation
2.4.1 Data Integration
2.4.2 Data Transformation
2.5 Data Reduction
2.5.1 Data Cube Aggregation
2.5.2 Attribute Subset Selection
2.5.3 Dimensionality Reduction
2.5.4 Numerosity Reduction
2.6 Data Discretization and Concept Hierarchy Generation
2.6.1 Discretization and Concept Hierarchy Generation for Numerical Data
2.6.2 Concept Hierarchy Generation for Categorical Data
2.7 Summary
Exercises
Bibliographic Notes
Chapter 3 Data Warehouse and OLAP Technology: An Overview
3.1 What Is a Data Warehouse?
3.1.1 Differences between Operational Database Systems and Data Warehouses
3.1.2 But, Why Have a Separate Data Warehouse?
3.2 A Multidimensional Data Model
3.2.1 From Tables and Spreadsheets to Data Cubes
3.2.2 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
3.2.3 Examples for Defining Star, Snowflake, and Fact Constellation Schemas
3.2.4 Measures: Their Categorization and Computation
3.2.5 Concept Hierarchies
3.2.6 OLAP Operations in the Multidimensional Data Model
3.2.7 A Starnet Query Model for Querying Multidimensional Databases
3.3 Data Warehouse Architecture
3.3.1 Steps for the Design and Construction of Data Warehouses
3.3.2 A Three-Tier Data Warehouse Architecture
3.3.3 Data Warehouse Back-End Tools and Utilities
3.3.4 Metadata Repository
3.3.5 Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP
3.4 Data Warehouse Implementation
3.4.1 Efficient Computation of Data Cubes
3.4.2 Indexing OLAP Data
3.4.3 Efficient Processing of OLAP Queries
3.5 From Data Warehousing to Data Mining
3.5.1 Data Warehouse Usage
3.5.2 From On-Line Analytical Processing to On-Line Analytical Mining
3.6 Summary
Exercises
Bibliographic Notes
Chapter 4 Data Cube Computation and Data Generalization
4.1 Efficient Methods for Data Cube Computation
4.1.1 A Road Map for the Materialization of Different Kinds of Cubes
4.1.2 Multiway Array Aggregation for Full Cube Computation
4.1.3 BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
4.1.4 Star-cubing: Computing Iceberg Cubes Using a Dynamic Star-tree Structure
4.1.5 Precomputing Shell Fragments for Fast High-Dimensional OLAP
4.1.6 Computing Cubes with Complex Iceberg Conditions
4.2 Further Development of Data Cube and OLAP
4.3 Attribute-Oriented Induction-An Alternative Method for Data Generalization and Concept Description
4.3.1 Attribute-Oriented Induction for Data Characterization
4.3.2 Efficient Implementation of Attribute-Oriented Induction
4.3.3 Presentation of the Derived Generalization
4.3.4 Mining Class Comparisons: Discriminating between Different Classes
4.3.5 Class Description: Presentation of Both Characterization and Comparison
4.4 Summary
Exercises
Bibliographic Notes
Chapter 5 Mining Frequent Patterns, Associations, and Correlations
5.1 Basic Concepts and a Road Map
5.1.1 Market Basket Analysis: A Motivating Example
5.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules
5.1.3 Frequent Pattern Mining: A Road Map
5.2 Efficient and Scalable Frequent Itemset Mining Methods
5.2.1 The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
5.2.2 Generating Association Rules from Frequent Itemsets
5.2.3 Improving the Efficiency of Apriori
5.2.4 Mining Frequent Itemsets without Candidate Generation
5.2.5 Mining Frequent Itemsets Using Vertical Data Format
5.2.6 Mining Closed Frequent Itemsets
5.3 Mining Various Kinds of Association Rules
5.3.1 Mining Multilevel Association Rules
5.3.2 Mining Multidimensional Association Rules from Relational Databases and Data Warehouses
5.4 From Association Mining to Correlation Analysis
5.4.1 Strong Rules Are Not Necessarily Interesting: An Example
5.4.2 From Association Analysis to Correlation Analysis
5.5 Constraint-Based Association Mining
5.5.1 Metarule-Guided Mining of Association Rules
5.5.2 Constraint Pushing: Mining Guided by Rule Constraints
5.6 Summary
Exercises
Bibliographic Notes
Chapter 6 Classification and Prediction
6.1 What Is Classification? What Is Prediction?
6.2 Issues Regarding Classification and Prediction
6.2.1 Preparing the Data for Classification and Prediction
6.2.2 Comparing Classification and Prediction Methods
6.3 Classification by Decision Tree Induction
6.3.1 Decision Tree Induction
6.3.2 Attribute Selection Measures
6.3.3 Tree Pruning
6.3.4 Scalability and Decision Tree Induction
6.4 Bayesian Classification
6.4.1 Bayes' Theorem
6.4.2 Naive Bayesian Classification
6.4.3 Bayesian Belief Networks
6.4.4 Training Bayesian Belief Networks
6.5 Rule-Based Classification
6.5.1 Using IF-THEN Rules for Classification
6.5.2 Rule Extraction from a Decision Tree
6.5.3 Rule Induction Using a Sequential Covering Algorithm
6.6 Classification by Backpropagation
6.6.1 A Multilayer Feed-Forward Neural Network
6.6.2 Defining a Network Topology
6.6.3 Backpropagation
6.6.4 Inside the Black Box: Backpropagation and Interpretability
6.7 Support Vector Machines
6.7.1 The Case When the Data Are Linearly Separable
6.7.2 The Case When the Data Are Linearly Inseparable
6.8 Associative Classification: Classification by Association Rule Analysis
6.9 Lazy Learners (or Learning from Your Neighbors)
6.9.1 k-Nearest-Neighbor Classifiers
6.9.2 Case-Based Reasoning
6.10 Other Classification Methods
6.10.1 Genetic Algorithms
6.10.2 Rough Set Approach
6.10.3 Fuzzy Set Approaches
6.11 Prediction
6.11.1 Linear Regression
6.11.2 Nonlinear Regression
6.11.3 Other Regression-Based Methods
7.6.2 OPTICS: Ordering Points to Identify the Clustering Structure
7.6.3 DENCLUE: Clustering Based on Density Distribution Functions
7.7 Grid-Based Methods
7.7.1 STING: STatistical INformation Grid
7.7.2 WaveCluster: Clustering Using Wavelet Transformation
7.8 Model-Based Clustering Methods
7.8.1 Expectation-Maximization
7.8.2 Conceptual Clustering
7.8.3 Neural Network Approach
7.9 Clustering High-Dimensional Data
7.9.1 CLIQUE: A Dimension-Growth Subspace Clustering Method
7.9.2 PROCLUS: A Dimension-Reduction Subspace Clustering Method
7.9.3 Frequent Pattern-Based Clustering Methods
7.10 Constraint-Based Cluster Analysis
7.10.1 Clustering with Obstacle Objects
7.10.2 User-Constrained Cluster Analysis
7.10.3 Semi-Supervised Cluster Analysis
7.11 Outlier Analysis
7.11.1 Statistical Distribution-Based Outlier Detection
7.11.2 Distance-Based Outlier Detection
7.11.3 Density-Based Local Outlier Detection
7.11.4 Deviation-Based Outlier Detection
7.12 Summary
Exercises
Bibliographic Notes
Chapter 8 Mining Stream, Time-Series, and Sequence Data
8.1 Mining Data Streams
8.1.1 Methodologies for Stream Data Processing and Stream Data Systems
8.1.2 Stream OLAP and Stream Data Cubes
8.1.3 Frequent-Pattern Mining in Data Streams
8.1.4 Classification of Dynamic Data Streams
8.1.5 Clustering Evolving Data Streams
8.2 Mining Time-Series Data
8.2.1 Trend Analysis
8.2.2 Similarity Search in Time-Series Analysis
8.3 Mining Sequence Patterns in Transactional Databases
8.3.1 Sequential Pattern Mining: Concepts and Primitives
8.3.2 Scalable Methods for Mining Sequential Patterns
8.3.3 Constraint-Based Mining of Sequential Patterns
8.3.4 Periodicity Analysis for Time-Related Sequence Data
8.4 Mining Sequence Patterns in Biological Data
8.4.1 Alignment of Biological Sequences
8.4.2 Hidden Markov Model for Biological Sequence Analysis
8.5 Summary
Exercises
Bibliographic Notes
Chapter 9 Graph Mining, Social Network Analysis, and Multirelational Data Mining
9.1 Graph Mining
9.1.1 Methods for Mining Frequent Subgraphs
9.1.2 Mining Variant and Constrained Substructure Patterns
9.1.3 Applications: Graph Indexing, Similarity Search, Classification, and Clustering
9.2 Social Network Analysis
9.2.1 What Is a Social Network?
9.2.2 Characteristics of Social Networks
9.2.3 Link Mining: Tasks and Challenges
9.2.4 Mining on Social Networks
9.3 Multirelational Data Mining
9.3.1 What Is Multirelational Data Mining?
9.3.2 ILP Approach to Multirelational Classification
9.3.3 Tuple ID Propagation
9.3.4 Multirelational Classification Using Tuple ID Propagation
9.3.5 Multirelational Clustering with User Guidance
9.4 Summary
Exercises
Bibliographic Notes
Chapter 10 Mining Object, Spatial, Multimedia, Text, and Web Data
10.1 Multidimensional Analysis and Descriptive Mining of Complex Data Objects
10.1.1 Generalization of Structured Data
10.1.2 Aggregation and Approximation in Spatial and Multimedia Data Generalization
10.1.3 Generalization of Object Identifiers and Class/Subclass Hierarchies
10.1.4 Generalization of Class Composition Hierarchies
10.1.5 Construction and Mining of Object Cubes
10.1.6 Generalization-Based Mining of Plan Databases by Divide-and-Conquer
10.2 Spatial Data Mining
10.2.1 Spatial Data Cube Construction and Spatial OLAP
10.2.2 Mining Spatial Association and Co-location Patterns
10.2.3 Spatial Clustering Methods
10.2.4 Spatial Classification and Spatial Trend Analysis
10.2.5 Mining Raster Databases
10.3 Multimedia Data Mining
10.3.1 Similarity Search in Multimedia Data
10.3.2 Multidimensional Analysis of Multimedia Data
10.3.3 Classification and Prediction Analysis of Multimedia Data
10.3.4 Mining Associations in Multimedia Data
10.3.5 Audio and Video Data Mining
10.4 Text Mining
10.4.1 Text Data Analysis and Information Retrieval
10.4.2 Dimensionality Reduction for Text
10.4.3 Text Mining Approaches
10.5 Mining the World Wide Web
10.5.1 Mining the Web Page Layout Structure
10.5.2 Mining the Web's Link Structures to Identify Authoritative Web Pages
10.5.3 Mining Multimedia Data on the Web
10.5.4 Automatic Classification of Web Documents
10.5.5 Web Usage Mining
10.6 Summary
Exercises
Bibliographic Notes
Chapter 11 Applications and Trends in Data Mining
11.1 Data Mining Applications
11.1.1 Data Mining for Financial Data Analysis
11.1.2 Data Mining for the Retail Industry
11.1.3 Data Mining for the Telecommunication Industry
11.1.4 Data Mining for Biological Data Analysis
11.1.5 Data Mining in Other Scientific Applications
11.1.6 Data Mining for Intrusion Detection
11.2 Data Mining System Products and Research Prototypes
11.2.1 How to Choose a Data Mining System
11.2.2 Examples of Commercial Data Mining Systems
11.3 Additional Themes on Data Mining
11.3.1 Theoretical Foundations of Data Mining
11.3.2 Statistical Data Mining
11.3.3 Visual and Audio Data Mining
11.3.4 Data Mining and Collaborative Filtering
11.4 Social Impacts of Data Mining
11.4.1 Ubiquitous and Invisible Data Mining
11.4.2 Data Mining, Privacy, and Data Security
11.5 Trends in Data Mining
11.6 Summary
Exercises
Bibliographic Notes
Appendix An Introduction to Microsoft's OLE DB for Data Mining
A.1 Model Creation
A.2 Model Training
A.3 Model Prediction and Browsing
Bibliography
Index