NOTE: The activity in this lecture requires Spark 4.0 preview2 or newer.
When running the pandas-conversion.py script, I got the error:
AttributeError: `np.NaN` was removed in the NumPy 2.0 release. Use `np.nan` instead.
I tried downgrading NumPy, but that created a host of other errors. Eventually I left the current NumPy release as is and modified the script to include a compatibility layer that started with just the attribute error above and kept growing, as shown below:
import numpy as np

# Compatibility layer for aliases removed in NumPy 2.0
if not hasattr(np, 'NaN'):
    np.NaN = np.nan              # np.NaN -> np.nan
if not hasattr(np, 'string_'):
    np.string_ = np.bytes_       # np.string_ -> np.bytes_
if not hasattr(np, 'float_'):
    np.float_ = np.float64       # np.float_ -> np.float64
if not hasattr(np, 'int_'):
    np.int_ = np.int64           # np.int_ -> np.int64
if not hasattr(np, 'bool8'):
    np.bool8 = np.bool_          # np.bool8 was removed; np.bool_ still exists
if not hasattr(np, 'object0'):
    np.object0 = np.object_      # np.object0 was removed; np.object_ still exists
if not hasattr(np, 'complex_'):
    np.complex_ = np.complex128  # np.complex_ -> np.complex128
if not hasattr(np, 'unicode_'):
    np.unicode_ = np.str_        # np.unicode_ -> np.str_
if not hasattr(np, 'long'):
    np.long = np.int64           # np.long -> np.int64
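For the record, the patch only helps if it runs before whatever code still references the removed names, so it has to sit at the very top of the script. A rough sketch of that ordering (the pandas and PySpark imports are just stand-ins for whatever pandas-conversion.py actually pulls in):

# The patch has to run before anything touches the removed aliases such as np.NaN.
import numpy as np

if not hasattr(np, 'NaN'):
    np.NaN = np.nan                      # restore the removed alias first

import pandas as pd                      # stand-in: whichever package still uses np.NaN
from pyspark.sql import SparkSession     # the DataFrame conversion code runs after the patch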
Is there a way to resolve this without going through all these modifications? I tried some other solutions and none of them worked.
Hm, it is working fine for me using NumPy 2.2.3 (together with Python 3.10 and the Spark 4.0 preview2 release).
It sounds like a conflict between packages in your environment, where some older package is assuming an older NumPy and needs to be updated. Which one it is might be in the details of the error messages.
Are you using something other than Anaconda? Setting up a new environment might help.
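One quick way to narrow down which package it is: print the versions of the libraries that sit between Spark and NumPy. pandas and PyArrow are the usual suspects for the removed np.NaN alias (that's a guess on my part; the full traceback will say for sure):

# Print the versions of the packages that commonly sit between Spark and NumPy.
# Anything that predates NumPy 2.0 support is a likely culprit.
import importlib

for name in ("numpy", "pandas", "pyarrow", "pyspark"):
    try:
        module = importlib.import_module(name)
        print(f"{name}: {module.__version__}")
    except ImportError:
        print(f"{name}: not installed")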
I’m using Anaconda.
I already tried deleting and recreating the py310 environment, but I got the same results.
I have also upgraded my NumPy and pyspark packages.
I am currently using NumPy 2.2.4 together with Spark 3.5.5.
Well, it's probably some other package sitting between Spark and NumPy that is the problem.
Try running "conda update --all" from within your Python 3.10 environment as a first step.
If that doesn’t work, see if switching to the Spark 4.0 preview2 release helps, as that’s what I used when recording this lecture. It’s available at https://archive.apache.org/dist/spark/spark-4.0.0-preview2/
And if that doesn’t do it, please paste in the complete error message – there might be more context in it that’s important.
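If you do switch builds, here's a quick sanity check that PySpark is actually picking up the preview2 release (a minimal sketch; it assumes SPARK_HOME already points at the unpacked preview2 directory):

# Confirm which Spark build PySpark is actually running against.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VersionCheck").getOrCreate()
print(spark.version)   # should report 4.0.0-preview2 after the switch
spark.stop()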
I tried the "conda update --all" suggestion but still got errors.
Then I switched to Spark 4.0 preview2 and it worked without errors.
Thanks!