NOTE: The activity in this lecture requires Spark 4.0 preview2 or newer.
When running the pandas-conversion.py script, I got the error:
AttributeError: `np.NaN` was removed in the NumPy 2.0 release. Use `np.nan` instead.
I tried downgrading NumPy, but that created a host of other errors. Eventually I left the current NumPy release as is and modified the script to include a compatibility layer that started with just the attribute error above and kept growing, as shown below:
import numpy as np

# Compatibility layer for aliases removed in NumPy 2.0
if not hasattr(np, 'NaN'):
    np.NaN = np.nan              # np.NaN -> np.nan
if not hasattr(np, 'string_'):
    np.string_ = np.bytes_       # np.string_ -> np.bytes_
if not hasattr(np, 'float_'):
    np.float_ = np.float64       # np.float_ -> np.float64
if not hasattr(np, 'int_'):
    np.int_ = np.int64           # np.int_ -> np.int64
if not hasattr(np, 'bool8'):
    np.bool8 = np.bool_          # np.bool8 was removed; np.bool_ still exists
if not hasattr(np, 'object0'):
    np.object0 = np.object_      # np.object0 was removed; np.object_ still exists
if not hasattr(np, 'complex_'):
    np.complex_ = np.complex128  # np.complex_ -> np.complex128
if not hasattr(np, 'unicode_'):
    np.unicode_ = np.str_        # np.unicode_ -> np.str_
if not hasattr(np, 'long'):
    np.long = np.int64           # np.long -> np.int64
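For the record, the patch only helps if it runs before whatever code still references the removed names, so it has to sit at the very top of the script. A rough sketch of that ordering (the pandas and PySpark imports are just stand-ins for whatever pandas-conversion.py actually pulls in):

# The patch has to run before anything touches the removed aliases such as np.NaN.
import numpy as np

if not hasattr(np, 'NaN'):
    np.NaN = np.nan                      # restore the removed alias first

import pandas as pd                      # stand-in: whichever package still uses np.NaN
from pyspark.sql import SparkSession     # the DataFrame conversion code runs after the patch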
Is there a way to resolve this without going through all these modifications? I tried some other solutions and none of them worked.
Hm, it is working fine for me using NumPy 2.2.3 (together with Python 3.10 and the Spark 4.0 preview2 release).
It sounds like a conflict between packages in your environment, where some older package is assuming an older NumPy and needs to be updated. Which one it is might be in the details of the error messages.
Are you using something other than Anaconda? Setting up a new environment might help.
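One quick way to narrow down which package it is: print the versions of the libraries that sit between Spark and NumPy. pandas and PyArrow are the usual suspects for the removed np.NaN alias (that's a guess on my part; the full traceback will say for sure):

# Print the versions of the packages that commonly sit between Spark and NumPy.
# Anything that predates NumPy 2.0 support is a likely culprit.
import importlib

for name in ("numpy", "pandas", "pyarrow", "pyspark"):
    try:
        module = importlib.import_module(name)
        print(f"{name}: {module.__version__}")
    except ImportError:
        print(f"{name}: not installed")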
I’m using Anaconda.
I already tried deleting and recreating the py310 environment, but I got the same results.
I have also upgraded my NumPy and pyspark packages.
I am currently using NumPy 2.2.4 together with Spark 3.5.5.
Well, it's probably some other package sitting between Spark and NumPy that is the problem.
Try running "conda update --all" from within your Python 3.10 environment as a first step.
If that doesn’t work, see if switching to the Spark 4.0 preview2 release helps, as that’s what I used when recording this lecture. It’s available at https://archive.apache.org/dist/spark/spark-4.0.0-preview2/
And if that doesn’t do it, please paste in the complete error message – there might be more context in it that’s important.
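If you do switch builds, here's a quick sanity check that PySpark is actually picking up the preview2 release (a minimal sketch; it assumes SPARK_HOME already points at the unpacked preview2 directory):

# Confirm which Spark build PySpark is actually running against.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VersionCheck").getOrCreate()
print(spark.version)   # should report 4.0.0-preview2 after the switch
spark.stop()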
I tried the "conda update --all" suggestion but still got errors.
Then I switched to Spark 4.0 preview2 and it worked without errors.
Thanks!