Atharv Thakur

Reputation: 701

How to append a constant to all column names in the header in Spark Scala

For example, here is my existing header:

DataPartition|^|TimeStamp|^|Source.organizationId|^|Source.sourceId|^|FilingDateTime|^|SourceTypeCode|^|DocumentId|^|Dcn|^|DocFormat|^|StatementDate|^|IsFilingDateTimeEstimated|^|ContainsPreliminaryData|^|CapitalChangeAdjustmentDate|^|CumulativeAdjustmentFactor|^|ContainsRestatement|^|FilingDateTimeUTCOffset|^|ThirdPartySourceCode|^|ThirdPartySourcePriority|^|SourceTypeId|^|ThirdPartySourceCodeId|^|FFAction|!|

I want to create a header like the one below:

DataPartition_1|^|TimeStamp|^|Source.organizationId|^|Source.sourceId|^|FilingDateTime_1|^|SourceTypeCode_1|^|DocumentId_1|^|Dcn_1|^|DocFormat_1|^|StatementDate_1|^|IsFilingDateTimeEstimated_1|^|ContainsPreliminaryData_1|^|CapitalChangeAdjustmentDate_1|^|CumulativeAdjustmentFactor_1|^|ContainsRestatement_1|^|FilingDateTimeUTCOffset_1|^|ThirdPartySourceCode_1|^|ThirdPartySourcePriority_1|^|SourceTypeId_1|^|ThirdPartySourceCodeId_1|^|FFAction_1

I want to append _1 to every header column except TimeStamp, Source.organizationId and Source.sourceId.

I have done it using withColumn, but that way I have to do it for every column individually.

Is there an easier way to do it, such as using foldLeft?

Upvotes: 1

Views: 149

Answers (1)

SCouto

Reputation: 7928

First, you need to define a list of the columns you want to skip:

val columnsToAvoid = List("TimeStamp","Source.organizationId","Source.sourceId")

Then you can foldLeft over the DataFrame's column list (given by df.columns), renaming each column that is not in columnsToAvoid and returning the DataFrame unchanged otherwise.

df.columns.foldLeft(df) { (acc, elem) =>
  if (columnsToAvoid.contains(elem)) acc
  else acc.withColumnRenamed(elem, elem + "_1")
}

A quick example here:

Original DF

+-----+------+-----------+
| word| value|  TimeStamp|
+-----+------+-----------+
|wordA|valueA|45435345435|
|wordB|valueB|  454244345|
|wordC|valueC|32425425435|
+-----+------+-----------+

Operation:

df.columns.foldLeft(df)((acc, elem) => if (columnsToAvoid.contains(elem)) acc else acc.withColumnRenamed(elem, elem+"_1")).show

Result:

+------+-------+-----------+
|word_1|value_1|  TimeStamp|
+------+-------+-----------+
| wordA| valueA|45435345435|
| wordB| valueB|  454244345|
| wordC| valueC|32425425435|
+------+-------+-----------+
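
As a possible alternative to folding, the new names could be computed first as a plain list and then applied in a single call. The sketch below shows only the name computation in plain Scala, so it runs without Spark. The object HeaderSuffix and the helper withSuffix are hypothetical names introduced for illustration, and the column list is a shortened version of the header from the question.

```scala
object HeaderSuffix {
  // Hypothetical helper: appends `suffix` to every name that is not in `skip`.
  def withSuffix(columns: Seq[String], skip: Set[String], suffix: String): Seq[String] =
    columns.map(c => if (skip.contains(c)) c else c + suffix)

  def main(args: Array[String]): Unit = {
    val columnsToAvoid = Set("TimeStamp", "Source.organizationId", "Source.sourceId")
    // Shortened header from the question, already split into column names.
    val columns = Seq(
      "DataPartition", "TimeStamp", "Source.organizationId",
      "Source.sourceId", "FilingDateTime", "FFAction"
    )

    val renamed = withSuffix(columns, columnsToAvoid, "_1")
    println(renamed.mkString("|^|"))

    // Assumption: with a Spark DataFrame `df` in scope, the same list could be
    // applied in one step through toDF, which takes the new names as varargs:
    // val result = df.toDF(withSuffix(df.columns.toSeq, columnsToAvoid, "_1"): _*)
  }
}
```

With many columns this keeps the renaming to one toDF call instead of one withColumnRenamed call per column. For a header of this size either approach should work.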

Upvotes: 1
