iPrince

Reputation: 125

How to convert a string to a dict with pysparkSQL

In pysparkSQL, I have a DataFrame called bmd2 like this:

DataFrame[genres: string, id: int, tagline: string, title: string, vote_average: double, vote_count: int]

And the data bmd2['genres'] goes like this:

bmd2.select('genres').show():
+--------------------+
|              genres|
+--------------------+
|[{'id': 16, 'name...|
|[{'id': 12, 'name...|
|[{'id': 10749, 'n...|
|[{'id': 35, 'name...|
|[{'id': 35, 'name...|
|[{'id': 28, 'name...|
|[{'id': 35, 'name...|
|[{'id': 28, 'name...|
|[{'id': 28, 'name...|
|[{'id': 12, 'name...|
|[{'id': 35, 'name...|
|[{'id': 35, 'name...|
|[{'id': 10751, 'n...|
|[{'id': 36, 'name...|
|[{'id': 28, 'name...|
|[{'id': 18, 'name...|
|[{'id': 18, 'name...|
|[{'id': 80, 'name...|
|[{'id': 80, 'name...|
|[{'id': 28, 'name...|
+--------------------+
only showing top 20 rows

The values in column 'genres' are strings, but each one could be converted to a list of dicts with Python's eval() function. So how should I apply eval() here to turn the string in every row into a list? I tried several ways:

  1. bmd2.select('genres'.astype('list')): AttributeError: 'str' object has no attribute 'astype'
  2. bmd2.select(eval('genres')): NameError: name 'genres' is not defined
  3. bmd2.withColumn('genres', eval('genres')): NameError: name 'genres' is not defined
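(For context on the errors above: eval('genres') evaluates the literal string 'genres' as a Python expression, and no variable with that name exists, hence the NameError. On a single value pulled out of the column, plain Python parsing does work; ast.literal_eval is the safer choice for literal strings like these. A minimal sketch with a made-up sample value:)

```python
import ast

# A sample value from the 'genres' column -- a plain Python string:
s = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# ast.literal_eval parses literal expressions like eval(), but without
# executing arbitrary code
genres = ast.literal_eval(s)
print(type(genres).__name__)        # list
print([g['name'] for g in genres])  # ['Animation', 'Comedy']
```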

Upvotes: 1

Views: 2752

Answers (2)

iPrince

Reputation: 125

I solved my question by using a UDF, i.e. a user-defined function.

First, import it:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

Then define your UDF, much like an anonymous function:

getdirector = udf(lambda x: [i['name'] for i in x if i['job'] == 'Director'], ArrayType(StringType()))

(The lambda returns a list of names, so the declared return type is ArrayType(StringType()) rather than a plain StringType().)

You should declare the return type here, so the UDF yields a value of the expected type. Then you can call this UDF in your code like any other function.

# getcharacter is another UDF, defined the same way as getdirector
cres2 = cres1.select('id', getcharacter('cast').alias('cast'), getdirector('crew').alias('crew'))

For this problem, I can adapt the UDF to return whatever type I need.
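The same pattern applies to the genres column from the question: the parsing logic is ordinary Python, so it can be written and tested on its own and then wrapped in a udf. A minimal sketch, assuming the genres strings are well-formed Python literals (the Spark wiring is shown in comments and not run here):

```python
import ast

def parse_genres(s):
    # Turn "[{'id': 16, 'name': 'Animation'}]" into a list of dicts
    return ast.literal_eval(s) if s else []

# Wrapping it as a UDF would look roughly like this:
# from pyspark.sql.functions import udf
# from pyspark.sql.types import ArrayType, StructType, IntegerType, StringType
# schema = ArrayType(StructType().add('id', IntegerType()).add('name', StringType()))
# parse_udf = udf(parse_genres, schema)
# bmd2 = bmd2.withColumn('genres', parse_udf('genres'))

print(parse_genres("[{'id': 12, 'name': 'Adventure'}]"))
```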

Upvotes: 2

fatmali

Reputation: 122

I'm writing this as an answer as I can't find the comment option. I would suggest you take a look at from_json in pyspark.sql.functions. For example, this is how you would use it:

# given a row that looks like:

+---------------------------+
|          genres           |
+---------------------------+
| [{ id:1, name:"hiphop"}]  |
+---------------------------+

# define a schema
from pyspark.sql.types import ArrayType, StructType, IntegerType, StringType

schema = ArrayType(StructType().add("id", IntegerType()) \
                               .add("name", StringType()))

# transform
from pyspark.sql.functions import from_json

new_df = df.select(from_json("genres", schema).alias("genres_dict"))

# display
new_df.printSchema()
new_df.show()

There is one more way to achieve this, using a function called regexp_extract, but the above is my personal preference. Also, if you want to switch back to the original string, you can use the to_json function. Hope this helps.
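As a mental model, from_json and to_json behave like Python's json module applied per row (this is an analogy, not Spark code). One caveat worth knowing: the question's strings use single quotes, which strict JSON parsers reject, though Spark's JSON reader is more lenient by default (see its allowSingleQuotes option):

```python
import json

# from_json-style: JSON string -> structured data
s = '[{"id": 16, "name": "Animation"}]'
genres = json.loads(s)
print(genres[0]['name'])  # Animation

# to_json-style: structured data -> JSON string
print(json.dumps(genres, separators=(',', ':')))  # [{"id":16,"name":"Animation"}]

# Strict JSON requires double quotes, so a single-quoted string fails here:
try:
    json.loads("[{'id': 16}]")
except json.JSONDecodeError:
    print('not valid JSON')
```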

Upvotes: 1
