Taming Big Data with Apache Spark and Python – Getting Started

(We have discontinued our Facebook group due to abuse.)

Installing Apache Spark and Python

Windows: (keep scrolling for MacOS and Linux)

  1. Install a JDK (Java Development Kit) from http://www.oracle.com/technetwork/java/javase/downloads/index.html . Spark is only compatible with Java 8, 11, or 17 at this time – do not install Java 21 or newer.
  2. Download a pre-built version of Apache Spark 3 from https://spark.apache.org/downloads.html
  3. If necessary, download and install WinRAR so you can extract the .tgz file you downloaded. http://www.rarlab.com/download.htm
  4. Extract the Spark archive, and copy its contents into C:\spark after creating that directory. You should end up with directories like c:\spark\bin, c:\spark\conf, etc.
  5. Open the c:\spark\conf folder, and make sure “File Name Extensions” is checked on the “View” tab of Windows Explorer. Rename the log4j2.properties.template file to log4j2.properties. Edit this file (using Notepad, WordPad, or something similar) and change the log level from “info” to “error” on the rootLogger.level line.
  6. Right-click your Windows menu, select Control Panel, System and Security, and then System. Click on “Advanced System Settings” and then the “Environment Variables” button.
  7. Add the following new USER variables:
    1. SPARK_HOME c:\spark
    2. PYSPARK_PYTHON python
  8. Add the following path to your PATH user variable:

%SPARK_HOME%\bin

  9. Close the environment variable screen and the Control Panel.
  10. Install the latest Anaconda for Python 3 from anaconda.com.  If you already use some other Python environment, that’s OK – you can use it instead, as long as it is a Python 3 environment.
  11. Test it out!
    1. Open up your Start menu and select “Anaconda Prompt” from the Anaconda3 menu.
    2. Enter cd c:\spark and then dir to get a directory listing.
    3. Look for a text file we can play with, like README.md or CHANGES.txt
    4. Enter pyspark
    5. At this point you should have a >>> prompt. If not, double check the steps above.
    6. Enter rdd = sc.textFile("README.md") (or whatever text file you’ve found), then enter rdd.count()
    7. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
    8. Enter quit() to exit the spark shell, and close the console window
    9. You’ve got everything set up! Hooray!
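If pyspark doesn’t launch, a quick way to narrow down which setup step went wrong is to check your environment from Python. Here’s a small diagnostic sketch – the check_spark_env helper is our own illustration, not part of Spark:

```python
import os
import shutil

def check_spark_env():
    """Return a list of likely setup problems (an empty list means the basics look OK)."""
    problems = []
    if not os.environ.get("SPARK_HOME"):
        problems.append("SPARK_HOME is not set")
    if shutil.which("pyspark") is None:
        problems.append("pyspark is not on your PATH (did you add %SPARK_HOME%\\bin to PATH?)")
    if shutil.which("java") is None:
        problems.append("java is not on your PATH (is the JDK installed?)")
    return problems

# Print any problems found; no output means the basics are in place.
for problem in check_spark_env():
    print(problem)
```

Run this from the same Anaconda Prompt you’ll use for pyspark, since environment variables are read per-session.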

MacOS

Step 1: Install Apache Spark

Using Homebrew

  1. Install Homebrew if you don’t have it already by entering this from a terminal prompt:
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  2. Enter:
    brew install openjdk@17
    brew install scala
    brew install apache-spark
  3. Create a log4j2.properties file via
    1. cd /opt/homebrew/Cellar/apache-spark/3.5.2/libexec/conf  (substitute the version you actually installed for 3.5.2 – the path may be slightly different on your system.)
    2. cp log4j2.properties.template log4j2.properties
  4. Edit the log4j2.properties file and change the log level from “info” to “error” on the rootLogger.level line.
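For reference, the edited lines in log4j2.properties should look roughly like this – the property names below match recent Spark 3.x templates, so check your own file if it differs:

```properties
# Log only at the "error" level so the pyspark shell
# isn't flooded with INFO messages.
rootLogger.level = error
rootLogger.appenderRef.stdout.ref = console
```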

Step 2: Install Anaconda

Install the latest Anaconda for Python 3 from anaconda.com, if you don’t already have Python installed.

Step 3: Test it out!

  1. Open up a terminal
  2. cd into the directory where you installed Spark, and then ls to get a directory listing.
  3. Look for a text file we can play with, like README.md or CHANGES.txt
  4. Enter pyspark
  5. At this point you should have a >>> prompt. If not, double check the steps above.
  6. Enter rdd = sc.textFile("README.md") (or whatever text file you’ve found), then enter rdd.count()
  7. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
  8. Enter quit() to exit the spark shell, and close the terminal window
  9. You’ve got everything set up! Hooray!

Linux

  1. Install Java, Scala, and Spark according to the particulars of your specific OS. A good starting point is https://sparkbyexamples.com/spark/spark-installation-on-linux-ubuntu/
  2. Install the latest Anaconda for Python 3 from anaconda.com
  3. Test it out!
    1. Open up a terminal
    2. cd into the directory where you installed Spark, and do an ls to see what’s in there.
    3. Look for a text file we can play with, like README.md or CHANGES.txt
    4. Enter pyspark
    5. At this point you should have a >>> prompt. If not, double check the steps above.
    6. Enter rdd = sc.textFile("README.md") (or whatever text file you’ve found), then enter rdd.count()
    7. You should get a count of the number of lines in that file! Congratulations, you just ran your first Spark program!
    8. Enter quit() to exit the spark shell, and close the console window
    9. You’ve got everything set up! Hooray!
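As a sanity check on any of the walkthroughs above, the number rdd.count() reports should match a plain-Python line count of the same file. Here’s a small sketch – count_lines is a hypothetical helper of ours, not part of Spark:

```python
def count_lines(path):
    """Count newline-delimited lines in a file; for typical text files
    this matches what sc.textFile(path).count() reports."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

# Example: compare against the number pyspark reported.
# print(count_lines("README.md"))
```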

Course Materials

On Udemy, you’ll find the materials attached to each lecture as resources. If you’d like to get them all at once, you can grab them from http://media.sundog-soft.com/Udemy/SparkCourse.zip

Optional: Join Our List

Join our low-frequency mailing list to stay informed about new courses and promotions from Sundog Education. As a thank-you, we’ll send you a free course on Deep Learning and Neural Networks with Python, plus discounts on all of Sundog Education’s other courses!