Preface

Chapter 1: Getting Started with Hadoop v2
  Introduction
  Setting up Hadoop v2 on your local machine
  Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
  Adding a combiner step to the WordCount MapReduce program
  Setting up HDFS
  Setting up Hadoop YARN in a distributed cluster environment using Hadoop v2
  Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution
  HDFS command-line file operations
  Running the WordCount program in a distributed cluster environment
  Benchmarking HDFS using DFSIO
  Benchmarking Hadoop MapReduce using TeraSort

Chapter 2: Cloud Deployments - Using Hadoop YARN on Cloud Environments
  Introduction
  Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
  Saving money using Amazon EC2 Spot Instances to execute EMR job flows
  Executing a Pig script using EMR
  Executing a Hive script using EMR
  Creating an Amazon EMR job flow using the AWS Command Line Interface
  Deploying an Apache HBase cluster on Amazon EC2 using EMR
  Using EMR bootstrap actions to configure VMs for the Amazon EMR jobs
  Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment

Chapter 3: Hadoop Essentials - Configurations, Unit Tests, and Other APIs
  Introduction
  Optimizing Hadoop YARN and MapReduce configurations for cluster deployments
  Shared user Hadoop clusters - using Fair and Capacity schedulers
  Setting classpath precedence to user-provided JARs
  Speculative execution of straggling tasks
  Unit testing Hadoop MapReduce applications using MRUnit
  Integration testing Hadoop MapReduce applications using MiniYarnCluster
  Adding a new DataNode
  Decommissioning DataNodes
  Using multiple disks/volumes and limiting HDFS disk usage
  Setting the HDFS block size
  Setting the file replication factor
  Using the HDFS Java API

Chapter 4: Developing Complex Hadoop MapReduce Applications
  Introduction
  Choosing appropriate Hadoop data types
  Implementing a custom Hadoop Writable data type
  Implementing a custom Hadoop key type
  Emitting data of different value types from a Mapper
  Choosing a suitable Hadoop InputFormat for your input data format
  Adding support for new input data formats - implementing a custom InputFormat
  Formatting the results of MapReduce computations - using Hadoop OutputFormats
  Writing multiple outputs from a MapReduce computation
  Hadoop intermediate data partitioning
  Secondary sorting - sorting Reduce input values
  Broadcasting and distributing shared resources to tasks in a MapReduce job - Hadoop DistributedCache
  Using Hadoop with legacy applications - Hadoop streaming
  Adding dependencies between MapReduce jobs
  Hadoop counters to report custom metrics

Chapter 5: Analytics
  Introduction
  Simple analytics using MapReduce
  Performing GROUP BY using MapReduce
  Calculating frequency distributions and sorting using MapReduce
  Plotting the Hadoop MapReduce results using gnuplot
  Calculating histograms using MapReduce
  Calculating scatter plots using MapReduce
  Parsing a complex dataset with Hadoop
  Joining two datasets using MapReduce

Chapter 6: Hadoop Ecosystem - Apache Hive
  Introduction
  Getting started with Apache Hive
  Creating databases and tables using Hive CLI
  Simple SQL-style data querying using Apache Hive
  Creating and populating Hive tables and views using Hive query results
  Utilizing different storage formats in Hive - storing table data using ORC files
  Using Hive built-in functions
  Hive batch mode - using a query file
  Performing a join with Hive
  Creating partitioned Hive tables
  Writing Hive User-defined Functions (UDF)
  HCatalog - performing Java MapReduce computations on data mapped to Hive tables
  HCatalog - writing data to Hive tables from Java MapReduce computations

Chapter 7: Hadoop Ecosystem II - Pig, HBase, Mahout, and Sqoop
  Introduction
  Getting started with Apache Pig
  Joining two datasets using Pig
  Accessing a Hive table data in Pig using HCatalog
  Getting started with Apache HBase
  Data random access using Java client APIs
  Running MapReduce jobs on HBase
  Using Hive to insert data into HBase tables
  Getting started with Apache Mahout
  Running K-means with Mahout
  Importing data to HDFS from a relational database using Apache Sqoop
  Exporting data from HDFS to a relational database using Apache Sqoop

Chapter 8: Searching and Indexing
  Introduction
  Generating an inverted index using Hadoop MapReduce
  Intradomain web crawling using Apache Nutch
  Indexing and searching web documents using Apache Solr
  Configuring Apache HBase as the backend data store for Apache Nutch
  Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
  Elasticsearch for indexing and searching
  Generating the in-links graph for crawled web pages

Chapter 9: Classifications, Recommendations, and Finding Relationships
  Introduction
  Performing content-based recommendations
  Classification using the naive Bayes classifier
  Assigning advertisements to keywords using the Adwords balance algorithm

Chapter 10: Mass Text Data Processing
  Introduction
  Data preprocessing using Hadoop streaming and Python
  De-duplicating data using Hadoop streaming
  Loading large datasets to an Apache HBase data store - importtsv and bulkload
  Creating TF and TF-IDF vectors for the text data
  Clustering text data using Apache Mahout
  Topic discovery using Latent Dirichlet Allocation (LDA)
  Document classification using Mahout Naive Bayes Classifier

Index