Reputation: 840
I am running Spark 2.4.2 locally through pyspark for an ML project in NLP. Part of the pre-processing in the Pipeline involves pandas_udf functions optimized through pyarrow. Each time I operate on the pre-processed Spark dataframe, the following warning appears:
UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
I tried updating pyarrow but the warning did not go away. My pyarrow version is 0.14. What are the implications of this warning, and has anybody found a solution for it? Thank you very much in advance.
Spark session details:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf(). \
    setAppName('map'). \
    setMaster('local[*]'). \
    set('spark.yarn.appMasterEnv.PYSPARK_PYTHON', '~/anaconda3/bin/python'). \
    set('spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON', '~/anaconda3/bin/python'). \
    set('executor.memory', '8g'). \
    set('spark.executor.memoryOverhead', '16g'). \
    set('spark.sql.codegen', 'true'). \
    set('spark.yarn.executor.memory', '16g'). \
    set('yarn.scheduler.minimum-allocation-mb', '500m'). \
    set('spark.dynamicAllocation.maxExecutors', '3'). \
    set('spark.driver.maxResultSize', '0'). \
    set("spark.sql.execution.arrow.enabled", "true"). \
    set("spark.debug.maxToStringFields", '100')

spark = SparkSession.builder. \
    appName("map"). \
    config(conf=conf). \
    getOrCreate()
Upvotes: 4
Views: 5167
Reputation: 139132
This warning comes from your version of pyspark, which calls a deprecated pyarrow function (pyarrow.open_stream instead of pyarrow.ipc.open_stream).
Everything still works, so you can either ignore the warning for now or update your pyspark version (the latest version no longer uses the deprecated pyarrow function).
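If you choose to ignore the warning, one option is to filter it with Python's standard warnings module. The following is only a minimal sketch: the helper name silence_pyarrow_open_stream_warning is made up for illustration, and the pattern is taken from the message quoted in the question. The filter applies only to the Python process in which it runs, so warnings raised inside Spark's Python worker processes may still be printed.

```python
import warnings


def silence_pyarrow_open_stream_warning():
    """Ignore only the pyarrow.open_stream deprecation message (hypothetical helper)."""
    warnings.filterwarnings(
        "ignore",
        # The pattern is a regex matched against the start of the warning text.
        message=r"pyarrow\.open_stream is deprecated",
        category=UserWarning,
    )


# Call this before running the code that triggers the warning.
silence_pyarrow_open_stream_warning()
```

Matching on the message text rather than ignoring all UserWarning instances keeps unrelated warnings visible.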
Upvotes: 3
Reputation: 600
I have the same problem in PyCharm; when using JupyterLab it seems to work fine.
Upvotes: 0