Apache Spark Tutorial for Beginners Part 3 – Resilient Distributed Dataset

Apache Spark is arguably the hottest technology in the field of big data right now. It allows you to process and extract meaning from massive data sets on a cluster, whether it is a Hadoop cluster you administer or a cloud-based deployment.

In this series of eight videos, we will walk through installing Spark on a Hortonworks sandbox running right on your own PC, and we will talk about how Spark works and its architecture. We will then dive hands-on into the origins of Spark by working directly with RDDs (Resilient Distributed Datasets), and then move on to the modern Spark 2.0 way of programming with Datasets.

You will get hands-on practice writing a few simple Spark applications using the Python programming language, and then we will build a movie recommendation engine using real movie ratings data and Spark's machine learning library, MLlib. We will end with an exercise you can try yourself for practice, along with my solution to it.

In this video, we will focus on getting VirtualBox, a Hortonworks Data Platform (HDP) sandbox, and the MovieLens data set installed for use in the rest of the series. Your instructor is Frank Kane, who spent nine years at Amazon.com and IMDb.com as a senior engineer and senior manager, wrangling their massive data sets.

Explore the full course on Udemy
