Reputation: 1271
I have a table with ~300 columns filled with characters (stored as String):
valuesDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| U | C | ...
| U | E | ...
| I | B | ...
| C | U | ...
| ... | ... | ...
I have a Data Summary, which maps the characters onto their actual meaning. It is in this form:
summaryDF:
| Field | Value | ValueDesc |
|------------------|-------|---------------|
| FavouriteBeer | U | Unknown |
| FavouriteBeer | C | Carlsberg |
| FavouriteBeer | I | InnisAndGunn |
| FavouriteBeer | D | DoomBar |
| FavouriteCheese | C | Cheddar |
| FavouriteCheese | E | Emmental |
| FavouriteCheese | B | Brie |
| FavouriteCheese | U | Unknown |
| ... | ... | ... |
I want to programmatically replace the character values of each column in valuesDF
with the Value Descriptions from summaryDF
. This is the result I'm looking for:
finalDF:
| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| Unknown | Cheddar | ...
| Unknown | Emmental | ...
| InnisAndGunn | Brie | ...
| Carlsberg | Unknown | ...
| ... | ... | ...
As there are ~300 columns, I'm not keen to type out withColumn
methods for each one.
Unfortunately I'm a bit of a novice when it comes to programming for Spark, although I've picked up enough to get by over the last 2 months.
What I'm pretty sure I need to do is something along the lines of:
valuesDF.columns.foreach { col => ...... }
to iterate over each columnsummaryDF
on Field
using col
String valuesummaryDF
onto valuesDF
based on current columnwithColumn
to replace the original character code column from valuesDF
with new description columnvar
However, trying this gave me Cartesian product error (I made sure to define the join as "left"
).
I tried and failed to pivot summaryDF
(as there are no aggregations to do??) then join both dataframes together.
This is the sort of thing I've tried, and always getting a NullPointerException
. I know this is really not the right way to do this, and can see why I'm getting Null Pointer... but I'm really stuck and reverting back to old, silly & bad Python habits in desperation.
var valuesDF = sourceDF
// I converted summaryDF to a broadcasted RDD
// because its small and a "constant" lookup table
summaryBroadcast
.value
.foreach{ x =>
// searchValue = Value (e.g. `U`),
// replaceValue = ValueDescription (e.g. `Unknown`),
val field = x(0).toString
val searchValue = x(1).toString
val replaceValue = x(2).toString
// error catching as summary data does not exactly mapping onto field names
// the joys of business people working in Excel...
try {
// I'm using regexp_replace because I'm lazy
valuesDF = valuesDF
.withColumn( attribute, regexp_replace(col(attribute), searchValue, replaceValue ))
}
catch {case _: Exception =>
null
}
}
Any ideas? Advice? Thanks.
Upvotes: 0
Views: 3977
Reputation: 3964
First, we'll need a function that executes a join of valuesDf
with summaryDf
by Value
and the respective pair of Favourite*
and Field
:
private def joinByColumn(colName: String, sourceDf: DataFrame): DataFrame = {
sourceDf.as("src") // alias it to help selecting appropriate columns in the result
// the join
.join(summaryDf, $"Value" === col(colName) && $"Field" === colName, "left")
// we do not need the original `Favourite*` column, so drop it
.drop(colName)
// select all previous columns, plus the one that contains the match
.select("src.*", "ValueDesc")
// rename the resulting column to have the name of the source one
.withColumnRenamed("ValueDesc", colName)
}
Now, to produce the target result we can iterate on the names of the columns to match:
val result = Seq("FavouriteBeer",
"FavouriteCheese").foldLeft(valuesDF) {
case(df, colName) => joinByColumn(colName, df)
}
result.show()
+-------------+---------------+
|FavouriteBeer|FavouriteCheese|
+-------------+---------------+
| Unknown| Cheddar|
| Unknown| Emmental|
| InnisAndGunn| Brie|
| Carlsberg| Unknown|
+-------------+---------------+
In case a value from valuesDf
does not match with anything in summaryDf
, the resulting cell in this solution will contain null
. If you want just to replace it with Unknown
value, instead of .select
and .withColumnRenamed
lines above use:
.withColumn(colName, when($"ValueDesc".isNotNull, $"ValueDesc").otherwise(lit("Unknown")))
.select("src.*", colName)
Upvotes: 1