Mining the Web: Discovering Knowledge from Hypertext Data (English edition)

Price: ￥59.00

Author:	Soumen Chakrabarti (India)
Publisher:	Posts & Telecom Press
Series:	Turing Original Edition Computer Science Series
Tags:	Data Warehousing and Data Mining

ISBN:	9787115194046	Publication date:	2009-02-01	Binding:	Paperback
Format:	16mo	Pages:	344	Word count:

About the Book

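This book is a classic in the field of information retrieval. It explains in depth the techniques for extracting and producing knowledge from large volumes of unstructured Web data. It first covers the foundations of the Web (crawling, indexing, and keyword-based or similarity-based search), then systematically describes the fundamentals of Web mining, with emphasis on hypertext-based machine learning and data mining methods such as clustering, collaborative filtering, supervised learning, and semisupervised learning, and finally discusses how these principles are applied in Web mining. The book gives readers a solid technical background and up-to-date knowledge. It is an ideal reference for professionals engaged in data mining research and development, and is also suitable as a textbook for graduate students in computer science and related disciplines.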

About the Author

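Soumen Chakrabarti is a well-known expert in Web search and mining and an associate editor of ACM Transactions on the Web. He received his PhD from the University of California, Berkeley, and is currently an associate professor in the Department of Computer Science and Engineering at the Indian Institute of Technology. He previously worked at the IBM Almaden Research Center on hypertext databases and data mining. He has extensive hands-on project experience, has built several Web mining systems, and holds a number of US patents.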

Table of Contents

1 INTRODUCTION
1.1 Crawling and Indexing
1.2 Topic Directories
1.3 Clustering and Classification
1.4 Hyperlink Analysis
1.5 Resource Discovery and Vertical Portals
1.6 Structured vs. Unstructured Data Mining
1.7 Bibliographic Notes
PART Ⅰ INFRASTRUCTURE
2 CRAWLING THE WEB
2.1 HTML and HTTP Basics
2.2 Crawling Basics
2.3 Engineering Large-Scale Crawlers
2.3.1 DNS Caching, Prefetching, and Resolution
2.3.2 Multiple Concurrent Fetches
2.3.3 Link Extraction and Normalization
2.3.4 Robot Exclusion
2.3.5 Eliminating Already-Visited URLs
2.3.6 Spider Traps
2.3.7 Avoiding Repeated Expansion of Links on Duplicate Pages
2.3.8 Load Monitor and Manager
2.3.9 Per-Server Work-Queues
2.3.10 Text Repository
2.3.11 Refreshing Crawled Pages
2.4 Putting Together a Crawler
2.4.1 Design of the Core Components
2.4.2 Case Study: Using w3c-libwww
2.5 Bibliographic Notes
3 WEB SEARCH AND INFORMATION RETRIEVAL
3.1 Boolean Queries and the Inverted Index
3.1.1 Stopwords and Stemming
3.1.2 Batch Indexing and Updates
3.1.3 Index Compression Techniques
3.2 Relevance Ranking
3.2.1 Recall and Precision
3.2.2 The Vector-Space Model
3.2.3 Relevance Feedback and Rocchio's Method
3.2.4 Probabilistic Relevance Feedback Models
3.2.5 Advanced Issues
3.3 Similarity Search
3.3.1 Handling "Find-Similar" Queries
3.3.2 Eliminating Near Duplicates via Shingling
3.3.3 Detecting Locally Similar Subgraphs of the Web
3.4 Bibliographic Notes
PART Ⅱ LEARNING
4 SIMILARITY AND CLUSTERING
4.1 Formulations and Approaches
4.1.1 Partitioning Approaches
4.1.2 Geometric Embedding Approaches
4.1.3 Generative Models and Probabilistic Approaches
4.2 Bottom-Up and Top-Down Partitioning Paradigms
4.2.1 Agglomerative Clustering
4.2.2 The k-Means Algorithm
4.3 Clustering and Visualization via Embeddings
4.3.1 Self-Organizing Maps (SOMs)
4.3.2 Multidimensional Scaling (MDS) and FastMap
4.3.3 Projections and Subspaces
4.3.4 Latent Semantic Indexing (LSI)
4.4 Probabilistic Approaches to Clustering
4.4.1 Generative Distributions for Documents
4.4.2 Mixture Models and Expectation Maximization (EM)
4.4.3 Multiple Cause Mixture Model (MCMM)
4.4.4 Aspect Models and Probabilistic LSI
4.4.5 Model and Feature Selection
4.5 Collaborative Filtering
4.5.1 Probabilistic Models
4.5.2 Combining Content-Based and Collaborative Features
4.6 Bibliographic Notes
5 SUPERVISED LEARNING
5.1 The Supervised Learning Scenario
5.2 Overview of Classification Strategies
5.3 Evaluating Text Classifiers
5.3.1 Benchmarks
5.3.2 Measures of Accuracy
5.4 Nearest Neighbor Learners
5.4.1 Pros and Cons
5.4.2 Is TFIDF Appropriate?
5.5 Feature Selection
5.5.1 Greedy Inclusion Algorithms
5.5.2 Truncation Algorithms
5.5.3 Comparison and Discussion
5.6 Bayesian Learners
5.6.1 Naive Bayes Learners
5.6.2 Small-Degree Bayesian Networks
5.7 Exploiting Hierarchy among Topics
5.7.1 Feature Selection
5.7.2 Enhanced Parameter Estimation
5.7.3 Training and Search Strategies
5.8 Maximum Entropy Learners
5.9 Discriminative Classification
5.9.1 Linear Least-Square Regression
5.9.2 Support Vector Machines
5.10 Hypertext Classification
5.10.1 Representing Hypertext for Supervised Learning
5.10.2 Rule Induction
5.11 Bibliographic Notes
6 SEMISUPERVISED LEARNING
6.1 Expectation Maximization
6.1.1 Experimental Results
6.1.2 Reducing the Belief in Unlabeled Documents
6.1.3 Modeling Labels Using Many Mixture Components
……
PART Ⅲ APPLICATIONS