Reputation: 904
I am currently working with a Spark dataframe (using PySpark) that represents a large collection of tweets and has the following (trimmed) schema:
root
|-- allProperties: struct (nullable = true)
| |-- content: string (nullable = true)
| |-- contributors: string (nullable = true)
| |-- coordinates.coordinates: string (nullable = true)
| ...
| |-- quoted_status.contributors: string (nullable = true)
| |-- quoted_status.coordinates: string (nullable = true)
| |-- quoted_status.coordinates.coordinates: string (nullable = true)
| ...
| |-- retweeted_status.contributors: string (nullable = true)
| |-- retweeted_status.coordinates: string (nullable = true)
| |-- retweeted_status.coordinates.coordinates: string (nullable = true)
| ...
|-- datasetid: string (nullable = true)
|-- routingkey: string (nullable = true)
|-- topic: string (nullable = true)
I want to create a new dataframe by selecting a number of columns from the original dataframe, for example allProperties.text and allProperties.entities.hashtags. However, I also want to select the same fields for tweets that are retweets or quoted tweets, denoted by the prefixes allProperties.retweeted_status and allProperties.quoted_status respectively.
Is there a way in which I can select all these columns without a number of redundant lines in which I specify each column with its prefix? For example, by supplying some sort of regex which matches allProperties.text, allProperties.retweeted_status.text, and allProperties.quoted_status.text?
As a side note, I should say that I do want to keep the dataframe at the top level, as I also want to include topic in the new dataframe.
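For what it's worth, the repeated column names could also be generated programmatically rather than matched by regex; a minimal sketch (the feature list here is illustrative, not the full schema):

```python
# Sketch: build the fully-qualified column names from a cross-product of
# prefixes and feature names instead of listing every combination by hand.
tweet_features = ["text", "entities.hashtags", "entities.media", "entities.urls"]
prefixes = ["", "retweeted_status.", "quoted_status."]

cols = [f"allProperties.{prefix}{feature}"
        for prefix in prefixes
        for feature in tweet_features]
cols.append("topic")

print(cols)  # 13 names, e.g. 'allProperties.retweeted_status.text'
```

The resulting list could then be passed straight to df.select(*cols), assuming every name exists in the schema.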
So far I've managed to write a regex which matches the required columns:
import re

def _keep_columns(self):
    def _regex_filter(x):
        tweet_features = [
            'text',
            'entities.hashtags',
            'entities.media',
            'entities.urls',
        ]
        r = ('(^allProperties.(retweeted_status.|quoted_status.)'
             '?(' + "|".join(tweet_features) + ')$)'
             '|(^topic$)')
        return bool(re.match(r, x))

    df = self.tweets.select(*filter(_regex_filter, self.tweets.columns))
However, self.tweets.columns only returns the top-level columns and can, therefore, not find the nested columns under allProperties. How do I search in a nested manner?
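Incidentally, the pattern itself can be sanity-checked against a plain list of hypothetical flattened column names; note that escaping the dots (an unescaped . matches any character) makes the match strict:

```python
import re

# Same feature list as above; re.escape makes '.' match a literal dot.
tweet_features = ["text", "entities.hashtags", "entities.media", "entities.urls"]
features_re = "|".join(re.escape(f) for f in tweet_features)
pattern = re.compile(
    r"(^allProperties\.(retweeted_status\.|quoted_status\.)?"
    r"(" + features_re + r")$)|(^topic$)"
)

# Hypothetical flattened column names to test against.
sample = [
    "allProperties.text",
    "allProperties.retweeted_status.text",
    "allProperties.quoted_status.entities.hashtags",
    "allProperties.contributors",
    "topic",
    "datasetid",
]
matched = [c for c in sample if pattern.match(c)]
print(matched)  # the first three names plus 'topic'
```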
Upvotes: 1
Views: 656
Reputation: 31550
You can flatten the struct columns using df.selectExpr("allProperties.*", "topic", ...) (or some other dynamic way), then createOrReplaceTempView on the dataframe and select the regex-matching columns from the temp view. Example:
#sample dataframe after flattening
df=spark.createDataFrame([("a","1","b","c")],["allProperties.text","allProperties.quoted_status.text","allProperties.retweeted_status.text","sample"])
df.show()
#+------------------+--------------------------------+-----------------------------------+------+
#|allProperties.text|allProperties.quoted_status.text|allProperties.retweeted_status.text|sample|
#+------------------+--------------------------------+-----------------------------------+------+
#|                 a|                               1|                                  b|     c|
#+------------------+--------------------------------+-----------------------------------+------+
df.createOrReplaceTempView("tmp")
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true").show()
#the regex (allProperties(..*|).text) matches allProperties.text or allProperties.<anything>.text
#(the unescaped dots match any character)
spark.sql("select `(allProperties(..*|).text)` from tmp").show()
#+------------------+--------------------------------+-----------------------------------+
#|allProperties.text|allProperties.quoted_status.text|allProperties.retweeted_status.text|
#+------------------+--------------------------------+-----------------------------------+
#|                 a|                               1|                                  b|
#+------------------+--------------------------------+-----------------------------------+
Upvotes: 1