Hadoop Streaming API

Hadoop MapReduce Programming with Streaming API

Course Objectives:

  • How MapReduce and the Hadoop Distributed File System work

  • How to write MapReduce code in Java or other programming languages

  • What issues to consider when developing MapReduce jobs

  • How to implement common algorithms in Hadoop

  • Best practices for Hadoop development and debugging

  • How to leverage other projects such as Apache Hive, Apache Pig, Sqoop, and Oozie

  • Advanced Hadoop API topics required for real-world data analysis

Prerequisites:

This course is designed for developers with some programming experience (preferably Java). Existing knowledge of Hadoop is not required.

Course Outline

Introduction

The Motivation For Hadoop

  • Problems with traditional large-scale systems

  • Requirements for a new approach

Hadoop: Basic Concepts

  • An Overview of Hadoop

  • The Hadoop Distributed File System

  • Hands-On Exercise

  • How MapReduce Works

  • Hands-On Exercise

  • Anatomy of a Hadoop Cluster

  • Other Hadoop Ecosystem Components

Writing a MapReduce Program

  • The MapReduce Flow

  • Examining a Sample MapReduce Program

  • Basic MapReduce API Concepts

  • The Driver Code

  • The Mapper

  • The Reducer

  • Hadoop’s Streaming API

  • Using Eclipse for Rapid Development

  • Hands-On Exercise

  • The New MapReduce API

Integrating Hadoop Into The Workflow

  • Relational Database Management Systems

  • Storage Systems

  • Importing Data from RDBMSs With Sqoop

  • Hands-On Exercise

  • Importing Real-Time Data with Flume

  • Accessing HDFS Using FuseDFS and Hoop

Delving Deeper Into The Hadoop API

  • More about ToolRunner

  • Testing with MRUnit

  • Reducing Intermediate Data With Combiners

  • The configure and close methods for Map/Reduce Setup and Teardown

  • Writing Partitioners for Better Load Balancing

  • Hands-On Exercise

  • Directly Accessing HDFS

  • Using the Distributed Cache

  • Hands-On Exercise

Common MapReduce Algorithms

  • Sorting and Searching

  • Indexing

  • Machine Learning With Mahout

  • Term Frequency – Inverse Document Frequency

  • Word Co-Occurrence

  • Hands-On Exercise

Using Hive and Pig

  • Hive Basics

  • Pig Basics

  • Hands-On Exercise

Practical Development Tips and Techniques

  • Debugging MapReduce Code

  • Using LocalJobRunner Mode For Easier Debugging

  • Retrieving Job Information with Counters

  • Logging

  • Splittable File Formats

  • Determining the Optimal Number of Reducers

  • Map-Only MapReduce Jobs

  • Hands-On Exercise

More Advanced MapReduce Programming

  • Custom Writables and WritableComparables

  • Saving Binary Data using SequenceFiles and Avro Files

  • Creating InputFormats and OutputFormats

  • Hands-On Exercise

Joining Data Sets in MapReduce

  • Map-Side Joins

  • The Secondary Sort

  • Reduce-Side Joins

Graph Manipulation in Hadoop

  • Introduction to graph techniques

  • Representing graphs in Hadoop

  • Implementing a sample algorithm: Single Source Shortest Path

Creating Workflows With Oozie

  • The Motivation for Oozie

  • Oozie’s Workflow Definition Format

  • Hands-On Exercise
