Reputation: 157
I have a Spark DataFrame
called nfe
, which contains a column called NFE_CNPJ_EMITENTE
that is currently formatted as a string (although it is completely numeric). This column should have entries that are either 11 or 14 characters in length, but there are entries of 9, 10, 12, and 13 character length for which I need to add leading zeros.
However, I cannot use the sprintf
function as I do in R:
nfe$CNPJ_EMITENTE <- ifelse(nfe$length_emit == 9, sprintf("%00s", nfe$NFE_CNPJ_EMITENTE), nfe$NFE_CNPJ_EMITENTE)
# Error in sprintf("%00s", nfe$NFE_CNPJ_EMITENTE) : unsupported type
Is there a simple way to add 2 leading zeros for 9 and 12 length entries and 1 leading zero for 11 and 13 length entries?
Thanks!
Upvotes: 0
Views: 388
Reputation: 1987
R's ifelse
has trouble with sparkR s4 objects. An easy solution (for this and a lot of other troublesome data manipulations) is to make it into a SQL query:
#Setup SQL context
sqlContext <- sparkRSQL.init(sc)
sqlContext <- SQLContext(sc)
# Register your dataframe as a table
registerTempTable(df, 'df')
#Query your dataframe
sql(sqlContext, "SELECT [other variables],
case when LENGTH(NFE_CNPJ_EMITENTE) in (9,12) then concat('00',NFE_CNPJ_EMITENTE)
when LENGTH(NFE_CNPJ_EMITENTE) in (10,13) then concat('0',NFE_CNPJ_EMITENTE)
else NFE_CNPJ_EMITENTE end as NFE_CNPJ_EMITENTE
FROM df")
To select all the variablles in the dataframe, the query would look like this:
sql(sqlContext, "SELECT df.*,
case when LENGTH(NFE_CNPJ_EMITENTE) in (9,12) then concat('00',NFE_CNPJ_EMITENTE)
when LENGTH(NFE_CNPJ_EMITENTE) in (10,13) then concat('0',NFE_CNPJ_EMITENTE)
else NFE_CNPJ_EMITENTE end as NFE_CNPJ_EMITENTE_RECODED
FROM df")
Upvotes: 1