Apache Spark Tutorial for Beginners Part 1 – Installing Spark

Apache Spark is arguably the hottest technology in the field of big data right now. It allows you to process and extract meaning from massive data sets on a cluster, whether it is a Hadoop cluster you administer or a cloud-based deployment.

In this series of 8 videos, we will walk through installing Spark on a Hortonworks sandbox running right on your own PC, and we will talk about how Spark works and its architecture. We will then dive hands-on into the origins of Spark by working directly with RDDs (Resilient Distributed Datasets), and then move on to the modern Spark 2.0 way of programming with Datasets.

You will get hands-on practice writing a few simple Spark applications using the Python programming language, and then we will actually build a movie recommendation engine using real movie ratings data and Spark's machine learning library, MLlib. We will end with an exercise you can try yourself for practice, along with my solution to it.

In this video, we will focus on getting VirtualBox, a Hortonworks Data Platform (HDP) sandbox, and the MovieLens data set installed for use in the rest of the series. Your instructor is Frank Kane, who spent nine years at Amazon.com and IMDb.com as a senior engineer and senior manager, wrangling their massive data sets.
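As a taste of the RDD-style programming the series covers, here is a minimal sketch in plain Python (so it runs without any Spark installation) of the map / filter / reduce chain that RDD code follows. The toy data and names here are made up for illustration; in real PySpark you would call the same-named methods on an RDD created from a SparkContext.

```python
from functools import reduce

# A toy "dataset" of (user_id, movie_id, rating) tuples,
# standing in for the MovieLens data used later in the series.
ratings = [
    (1, 101, 5.0),
    (1, 102, 3.0),
    (2, 101, 4.0),
    (2, 103, 2.0),
]

# RDD-style chain expressed with plain Python built-ins:
good = filter(lambda r: r[2] >= 3.0, ratings)   # like rdd.filter(...)
scores = map(lambda r: r[2], good)              # like rdd.map(...)
total = reduce(lambda a, b: a + b, scores)      # like rdd.reduce(...)

print(total)  # sum of all ratings of 3.0 or higher -> 12.0
```

In actual Spark the same chain is lazy and distributed across the cluster; nothing runs until the final reduce (an "action") triggers the job.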

Explore the full course on Udemy

Apache Spark Tutorial for Beginners Part 2 – Introduction to Spark

Explore the full course on Udemy

Apache Spark Tutorial for Beginners Part 3 – Resilient Distributed Dataset

Explore the full course on Udemy

Apache Spark Tutorial for Beginners Part 4 – Using RDDs in Spark

Explore the full course on Udemy

Apache Spark Tutorial for Beginners Part 5 – Spark SQL

Explore the full course on Udemy

Apache Spark Tutorial for Beginners Part 6 – Using DataSets

Explore the full course on Udemy

Apache Spark Tutorial for Beginners Part 7 – Using MLLib

Explore the full course on Udemy

Apache Spark Tutorial for Beginners Part 8 – Project Solution

Explore the full course on Udemy

How to Choose the Right Database? – MongoDB, Cassandra, MySQL, HBase

Choosing the right database for your application is no easy task.

You have a wide variety of options: relational databases such as MySQL, or distributed NoSQL solutions such as MongoDB, Cassandra, and HBase. "NoSQL" has come to mean "not only SQL," as many distributed database systems do in fact support SQL-style queries (as long as you are not doing complex join operations), and this further blurs the lines between these systems.

We will talk about how to analyze the requirements of your system in terms of consistency, availability, and partition tolerance, and how to apply the CAP theorem to guide your choice after showing you where different database technologies fall on the sides of the CAP triangle. We will also talk about more practical considerations, such as your budget, your need for professional support, and the ease of integration with the systems already in place in your organization. Maybe you don’t even need a distributed storage solution at all! Choosing the right technology for your data storage will save you a lot of pain as your application grows and evolves; making the wrong choice can lead to all sorts of maintenance problems and wasted work.
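To make the CAP discussion concrete, here is a small sketch using the classifications these systems are most commonly given. Real systems are tunable (Cassandra can be made more consistent, MongoDB more available), so treat this mapping as a rough starting point for narrowing candidates, not a verdict:

```python
# Commonly cited CAP leanings: C = consistency, A = availability,
# P = partition tolerance. A single-node relational database is
# usually described as CA; distributed stores must keep P and so
# give up some C or some A during a network partition.
CAP_LEANINGS = {
    "MySQL":     {"C", "A"},
    "MongoDB":   {"C", "P"},
    "HBase":     {"C", "P"},
    "Cassandra": {"A", "P"},
}

def candidates(required):
    """Return databases whose usual CAP leaning covers the
    properties your application cannot give up."""
    return sorted(db for db, props in CAP_LEANINGS.items()
                  if required <= props)

print(candidates({"A", "P"}))  # stores that favor staying available
print(candidates({"C", "P"}))  # stores that favor consistency
```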

Your instructor is Frank Kane of Sundog Education, bringing nine years of experience as a senior engineer and senior manager at Amazon.com and IMDb.com, where his job involved extracting meaning from their massive data sets, and processing that data in a highly distributed manner.

Explore the full course on Udemy

Kafka Tutorial for Beginners

Learn to stream big data with Kafka, starting from scratch.

Kafka is a powerful data streaming technology and a very hot technical skill to have right now. With Kafka, you can publish streams of data from web logs, sensors, or whatever else you can imagine to systems that manipulate, analyze, and store that data, all in real time. Kafka brings a reliable publish/subscribe mechanism that is resilient and allows clients to pick up where they left off in the event of an outage.

In this tutorial, you will set up a free Hortonworks sandbox environment within a virtual Linux machine running right on your own desktop PC, learn about how data streaming and Kafka work, set up Kafka, and use it to publish real web logs on a Kafka topic and receive them in real time. Kafka is sometimes billed as a Hadoop killer due to its power, but really it is an integral piece of the larger Hadoop ecosystem that has emerged.
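The "pick up where you left off" behavior described above can be illustrated with a toy in-memory log. This is not the Kafka API (a real client talks to a broker over the network); it just shows the core idea that a topic is an append-only log and each consumer tracks its own offset into it:

```python
class ToyTopic:
    """An append-only log: the core abstraction behind a Kafka topic."""
    def __init__(self):
        self.log = []

    def publish(self, message):
        self.log.append(message)

class ToyConsumer:
    """Reads from a ToyTopic, remembering its offset so it can resume
    after an 'outage' without re-reading or losing messages."""
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0  # in real Kafka, a committed consumer offset

    def poll(self):
        new = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return new

topic = ToyTopic()
consumer = ToyConsumer(topic)

topic.publish("GET /index.html")
topic.publish("GET /about.html")
print(consumer.poll())              # both messages so far

topic.publish("GET /contact.html")  # arrives while the consumer is "down"
print(consumer.poll())              # only what was published since last poll
```

Because the offset lives with the consumer rather than the log, many independent consumers can read the same topic at their own pace, which is what makes the publish/subscribe model resilient.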

Your instructor is Frank Kane of Sundog Education, bringing nine years of experience as a senior engineer and senior manager at Amazon.com and IMDb.com, where his job involved extracting meaning from their massive data sets, and processing that data in a highly distributed manner.

Explore the full course on Udemy