Apache Spark Tutorial for Beginners Part 7 – Using MLLib

Apache Spark is arguably the hottest technology in the field of big data right now. It allows you to process and extract meaning from massive data sets on a cluster, whether it is a Hadoop cluster you administer or a cloud-based deployment.

In this series of eight videos, we will walk through installing Spark on a Hortonworks sandbox running right on your own PC, and we will talk about how Spark works and its architecture. We will then dive hands-on into the origins of Spark by working directly with RDDs (Resilient Distributed Datasets), and then move on to the modern Spark 2.0 way of programming with Datasets.

You will get hands-on practice writing a few simple Spark applications in the Python programming language, and then we will build a movie recommendation engine using real movie ratings data and Spark's machine learning library, MLLib. We will end with an exercise you can try yourself for practice, along with my solution to it.

In this video, we will focus on getting VirtualBox, a Hortonworks Data Platform (HDP) sandbox, and the MovieLens data set installed for use in the rest of the series. Your instructor is Frank Kane, who spent nine years at Amazon.com and IMDb.com as a senior engineer and senior manager, wrangling their massive data sets.

Explore the full course on Udemy

Published by

Frank Kane

Our courses are led by Frank Kane, a former Amazon and IMDb developer with extensive experience in machine learning and data science. With 26 issued patents and 9 years of experience at the forefront of recommendation systems, Frank brings real-world expertise to his teaching. His ability to explain complex concepts in accessible terms has helped over one million students worldwide gain valuable skills in machine learning, data engineering, and AI development.
