The Ultimate Hands-On Hadoop: Tame your Big Data!

Hadoop tutorial with MapReduce, HDFS, Spark, Flink, Hive, HBase, MongoDB, Cassandra, Kafka + more! Over 25 technologies. Includes 14.5 hours of on-demand video and a certificate of completion.

Buy This Course


Learn at your own pace! Lifetime access to all course videos and materials for this course, with a one-time payment.

Course Information



The world of Hadoop and “Big Data” can be intimidating – hundreds of different technologies with cryptic names form the Hadoop ecosystem. With this Hadoop tutorial, you’ll not only understand what those systems are and how they fit together – but you’ll go hands-on and learn how to use them to solve real business problems!

Learn and master the most popular big data technologies in this comprehensive course, taught by a former engineer and senior manager from Amazon and IMDb. We’ll go way beyond Hadoop itself, and dive into all sorts of distributed systems you may need to integrate with.

  • Install and work with a real Hadoop installation right on your desktop with Hortonworks (now part of Cloudera) and the Ambari UI
  • Manage big data on a cluster with HDFS and MapReduce
  • Write programs to analyze data on Hadoop with Pig and Spark
  • Store and query your data with Sqoop, Hive, MySQL, HBase, Cassandra, MongoDB, Drill, Phoenix, and Presto
  • Design real-world systems using the Hadoop ecosystem
  • Learn how your cluster is managed with YARN, Mesos, ZooKeeper, Oozie, Zeppelin, and Hue
  • Handle streaming data in real time with Kafka, Flume, Spark Streaming, Flink, and Storm

Understanding Hadoop is a highly valuable skill for anyone working at companies with large amounts of data.

Almost every large company you might want to work at uses Hadoop in some way, including Amazon, eBay, Facebook, Google, LinkedIn, IBM, Spotify, Twitter, and Yahoo! And it’s not just technology companies that need Hadoop; even the New York Times uses Hadoop for processing images.

This course is comprehensive, covering over 25 different technologies in over 14 hours of video lectures. It’s filled with hands-on activities and exercises, so you get some real experience in using Hadoop – it’s not just theory.

You’ll find a range of activities in this course for people at every level. If you’re a project manager who just wants to learn the buzzwords, there are web UIs for many of the activities in the course that require no programming knowledge. If you’re comfortable with command lines, we’ll show you how to work with them too. And if you’re a programmer, I’ll challenge you with writing real scripts on a Hadoop system using Scala, Pig Latin, and Python.

You’ll walk away from this course with a real, deep understanding of Hadoop and its associated distributed systems, and you can apply Hadoop to real-world problems. Plus a valuable completion certificate is waiting for you at the end! 

Please note that the focus of this course is on application development, not Hadoop administration, although you will pick up some administration skills along the way.

Knowing how to wrangle “big data” is an incredibly valuable skill for today’s top tech employers. Don’t be left behind – enroll now!

  • “The Ultimate Hands-On Hadoop… was a crucial discovery for me. I supplemented your course with a bunch of literature and conferences until I managed to land an interview. I can proudly say that I landed a job as a Big Data Engineer around a year after I started your course. Thanks so much for all the great content you have generated and the crystal clear explanations.” – Aldo Serrano
  • “I honestly wouldn’t be where I am now without this course. Frank makes the complex simple by helping you through the process every step of the way. Highly recommended and worth your time, especially the Spark environment. This course helped me achieve a far greater understanding of the environment and its capabilities.” – Tyler Buck

Course Instructor

Frank Kane, Author

Frank spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers, all the time. Frank holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.

Learn all the buzzwords! And install the Hortonworks Data Platform Sandbox.

Using Hadoop’s Core: HDFS and MapReduce

Programming Hadoop with Pig

Programming Hadoop with Spark

Using relational data stores with Hadoop

Using non-relational data stores with Hadoop

Querying your Data Interactively

Managing your Cluster

Feeding Data to your Cluster

Analyzing Streams of Data

Designing Real-World Systems

Learning More

7 thoughts on “The Ultimate Hands-On Hadoop: Tame your Big Data!”

  1. tiwariamit85 says:

    This course was really good in terms of covering the topics from a breadth perspective. This is the first time I have subscribed to a Sundog Education course, and it is because of Frank. I have been following a few of Frank’s courses for the last 3 years on other platforms like Udemy. This course is a must for beginners who are trying their luck in the Big Data space.

  2. kambleatulm says:

    Dear Frank,

    I am following your course “The Ultimate Hands-On Hadoop: Tame your Big Data!” on O’Reilly. I have installed HDP 2.6.5 version. As you had mentioned, I have followed the steps to install “mrjob” but I am getting an error as below:

    ################## Error Message Start ##################
    [root@sandbox-hdp maria_dev]# pip install mrjob==0.5.11
    Collecting mrjob==0.5.11
    Using cached
    Collecting google-api-python-client>=1.5.0 (from mrjob==0.5.11)
    Using cached
    Collecting filechunkio (from mrjob==0.5.11)
    Using cached
    Collecting PyYAML>=3.08 (from mrjob==0.5.11)
    Using cached
    Complete output from command python egg_info:
    Traceback (most recent call last):
    File “”, line 1, in
    File “/tmp/pip-build-lIGSu6/PyYAML/”, line 67, in
    import sys, os, os.path, pathlib, platform, shutil, tempfile, warnings
    ImportError: No module named pathlib

    Command “python egg_info” failed with error code 1 in /tmp/pip-build-lIGSu6/PyYAML/
    You are using pip version 8.1.2, however version 21.3.1 is available.
    You should consider upgrading via the 'pip install --upgrade pip' command.

    ################## Error Message Ends ##################

    Things I have tried:
    1. pip install google-api-python-client==1.6.4 (Error: error in httplib2 setup command: ‘install_requires’ must be a string or list of strings containing valid project/version requirement specifiers. Command “python egg_info” failed with error code 1 in /tmp/pip-build-r6iFbW/httplib2/)
    2. pip install --upgrade setuptools (Error: AttributeError: find_module. Command “python egg_info” failed with error code 1 in /tmp/pip-build-2JTceI/setuptools/)
    3. Tried installing Python 3 but it complicates the entire setup and doesn’t work.

    Could you please help me with it?

    1. Frank Kane says:

      You need to install pathlib, and downgrade PyYAML.

      pip install pathlib
      pip install pyyaml==3.10

      Seems the course videos on O’Reilly are out of date. I’ll see about getting them updated.

  3. ikoreal2005 says:

    I have bought this course online from Udemy, where am I supposed to get access to the course materials from?

    1. Frank Kane says:

      If you bought it on Udemy, you should access the course from Udemy. The first few lectures walk you through getting set up and downloading any materials you need. Generally scripts etc. are just downloaded within each individual activity as needed.

  4. raheelcse says:

    I come from a C# and Visual Studio background. Is there any kind of debugging tool or IDE we can use to write Spark Python code? For example, I want to set a breakpoint and inspect what is in an object at each line.
    How do you debug your own code? If you write some Python Spark code and it breaks when run over PuTTY, how do you debug it line by line to find where the actual problem is, given that the error messages are sometimes not very helpful?

    1. Frank Kane says:

      Well… there isn’t one, really. Spark works very differently from C# programs. Things don’t execute sequentially; you’re just building up a queue of operations that Spark will later build a directed acyclic graph from, and then distribute the processing across the machines in your cluster when an operation finally requires a result of some sort. So for most commands, nothing really happens, and you can’t debug Spark programs the way you would lower-level systems, by stepping into individual commands and seeing what they do. What you can do is force your Spark driver scripts to output intermediate results and print them out for debugging purposes… and pay attention to any error messages at runtime. With Datasets, Spark can detect more errors at compile time to make life a little easier.

      Further complicating matters, your Python Spark code is ultimately converted and run using a Java Virtual Machine, which makes debugging harder still. And there is the problem of debugging both on the driver side and on the executor side (which may be running someplace else entirely).

      That said, it’s not *impossible*, just really, really hard and only worth the effort when you’re really stuck. More details are at
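
The lazy-execution behavior described in the reply above can be illustrated without a cluster. The sketch below is a plain-Python analogy, not the Spark API: `LazyDataset`, `parse`, and `calls` are hypothetical names invented for this example. It only aims to show why nothing "happens" when a transformation is declared, and how forcing an intermediate result gives you something to print while debugging. In PySpark the corresponding RDD methods are `map`, `filter`, `take`, and `collect`.

```python
# Plain-Python analogy for lazy evaluation; this is NOT the Spark API.
# LazyDataset, parse, and calls are hypothetical names used only for illustration.

class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, data, ops=None):
        self._data = list(data)
        self._ops = ops or []  # queued (kind, function) pairs: the "plan"

    def map(self, fn):
        # Transformation: nothing runs here, we only extend the plan.
        return LazyDataset(self._data, self._ops + [("map", fn)])

    def filter(self, fn):
        # Transformation: again, only the plan grows.
        return LazyDataset(self._data, self._ops + [("filter", fn)])

    def collect(self):
        # Action: the queued plan finally executes, step by step.
        rows = self._data
        for kind, fn in self._ops:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def take(self, n):
        # Action that is handy for debugging: materialize a small sample.
        return self.collect()[:n]


calls = []  # records every time the user function really runs


def parse(line):
    calls.append(line)
    return int(line)


raw = LazyDataset(["1", "2", "3", "4"])
parsed = raw.map(parse)            # no parsing has happened yet
calls_before_action = len(calls)   # so this is still 0

evens = parsed.filter(lambda x: x % 2 == 0)

# Debugging step: force an intermediate result and print it.
sample = parsed.take(2)
print("parsed sample:", sample)

result = evens.collect()
print("final result:", result)
```

Note that `parse` runs again when `evens.collect()` is called, because this toy class keeps no cached results between actions. That is a simplification chosen for the sketch; it is one reason printing intermediate results can be expensive on large datasets.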
