Frederic Bastiat
Frederic Bastiat

Reputation: 693

Casting string to int null issue

I have a spark dataframe, results, that has two string columns I would like to cast to numeric:

>>> results.show()
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             "43"|                    "20"|
|"BAYLOR MEDICAL C...|             "32"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"GOOD SHEPHERD ME...|             "25"|                    "20"|
|"MASONIC HOME AND...|  "Not Available"|         "Not Available"|
|"ST HELENA HOSPITAL"|             "41"|                    "20"|
|   "TOURO INFIRMARY"|             "15"|                    "18"|
|"WAHIAWA GENERAL ...|             "17"|                    "10"|
|"ANNA JAQUES HOSP...|             "27"|                    "18"|
|    "CMC-BLUE RIDGE"|             "31"|                    "18"|
|"EVANSTON REGIONA...|             "15"|                    "15"|
|"OKLAHOMA SPINE H...|             "79"|                    "20"|
|"PICKENS COUNTY M...|  "Not Available"|         "Not Available"|
|"PORTNEUF MEDICAL...|             "11"|                    "17"|
|"PRESENCE SAINT J...|             "20"|                    "17"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"RIVERSIDE MEDICA...|             "39"|                    "20"|
|"SOUTH GEORGIA ME...|    "3 out of 10"|                    "24"|
|"TAMPA GENERAL HO...|             "23"|                    "16"|
+--------------------+-----------------+------------------------+

Attempting such gives me a table of null values:

>>> results2 = results.select( results["Hospital Name"], results["HCAHPS Base Score"].cast(pe()).alias("HCAHPS Base Score"), results["HCAHPS Consistency Score"].cast(IntegerType()).aHPS Consistency Score") )
>>> results2.show()
+--------------------+-----------------+------------------------+
|       Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+--------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC...|             null|                    null|
|"BAYLOR MEDICAL C...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"GOOD SHEPHERD ME...|             null|                    null|
|"MASONIC HOME AND...|             null|                    null|
|"ST HELENA HOSPITAL"|             null|                    null|
|   "TOURO INFIRMARY"|             null|                    null|
|"WAHIAWA GENERAL ...|             null|                    null|
|"ANNA JAQUES HOSP...|             null|                    null|
|    "CMC-BLUE RIDGE"|             null|                    null|
|"EVANSTON REGIONA...|             null|                    null|
|"OKLAHOMA SPINE H...|             null|                    null|
|"PICKENS COUNTY M...|             null|                    null|
|"PORTNEUF MEDICAL...|             null|                    null|
|"PRESENCE SAINT J...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"RIVERSIDE MEDICA...|             null|                    null|
|"SOUTH GEORGIA ME...|             null|                    null|
|"TAMPA GENERAL HO...|             null|                    null|
+--------------------+-----------------+------------------------+

only showing top 20 rows

Is it not possible to cast string columns to integer in pyspark?

Upvotes: 4

Views: 13540

Answers (1)

KiranM
KiranM

Reputation: 1323

First you better need to strip off double quotes, then you should be able to convert to IntegerType. You can use below udf to accomplish it.

>>> def stripDQ(string):
...  return string.replace('"', "")
... 
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType, IntegerType
>>> udf_stripDQ = udf(stripDQ, StringType())

We will use it..

Your actual DataFrame:

>>> results.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|             "43"|                    "20"|
|"BAYLOR MEDICAL C"|             "32"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"GOOD SHEPHERD ME"|             "25"|                    "20"|
|"MASONIC HOME AND"|  "Not Available"|         "Not Available"|
+------------------+-----------------+------------------------+

Now, we will use our udf to strip off double-quotes from both columns.

>>> results1 = results.withColumn("HCAHPS Base Score", udf_stripDQ(results["HCAHPS Base Score"]) ).withColumn("HCAHPS Consistency Score", udf_stripDQ(results["HCAHPS Consistency Score"]) )
>>> results1.show()
+------------------+-----------------+------------------------+
|     Hospital Name|HCAHPS Base Score|HCAHPS Consistency Score|
+------------------+-----------------+------------------------+
|"ADIRONDACK MEDIC"|               43|                      20|
|"BAYLOR MEDICAL C"|               32|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"GOOD SHEPHERD ME"|               25|                      20|
|"MASONIC HOME AND"|    Not Available|           Not Available|
+------------------+-----------------+------------------------+

Now cast to integer:

>>> results2 = results1.select( results1["Hospital Name"], results1["HCAHPS Base Score"].cast(IntegerType()).alias("HCAHPS Base Score"), results1["HCAHPS Consistency Score"].cast(IntegerType()).alias("HPS Consistency Score") )
>>> results2.show()
+------------------+-----------------+---------------------+
|     Hospital Name|HCAHPS Base Score|HPS Consistency Score|
+------------------+-----------------+---------------------+
|"ADIRONDACK MEDIC"|               43|                   20|
|"BAYLOR MEDICAL C"|               32|                   20|
|"GOOD SHEPHERD ME"|               25|                   20|
|"GOOD SHEPHERD ME"|               25|                   20|
|"MASONIC HOME AND"|             null|                 null|
+------------------+-----------------+---------------------+

Upvotes: 7

Related Questions