user8414391

Reputation: 152

Spark writing output as fixed width

Reading a fixed-width file into Spark is easy, and there are several ways to do it. However, I could not find a way to write fixed-width output from Spark (2.3.1). Would converting a DataFrame to an RDD help? I am currently using PySpark, but a solution in any language is welcome. Can anyone suggest an approach?

Upvotes: 2

Views: 6476

Answers (1)

pault

Reputation: 43544

Here is an example of what I described in the comments.

You can use pyspark.sql.functions.format_string() to format each column to a fixed width and then use pyspark.sql.functions.concat() to combine them all into one string.

For example, suppose you had the following DataFrame:

data = [
    (1, "one", "2016-01-01"),
    (2, "two", "2016-02-01"),
    (3, "three", "2016-03-01")
]

df = spark.createDataFrame(data, ["id", "value", "date"])
df.show()
#+---+-----+----------+
#| id|value|      date|
#+---+-----+----------+
#|  1|  one|2016-01-01|
#|  2|  two|2016-02-01|
#|  3|three|2016-03-01|
#+---+-----+----------+

Let's say you wanted to write out the data left-justified, with each column padded to a fixed width of 10:

from pyspark.sql.functions import concat, format_string

fixed_width = 10
ljust = r"%-{width}s".format(width=fixed_width)

df.select(
    concat(*[format_string(ljust, c) for c in df.columns]).alias("fixedWidth")
).show(truncate=False)
#+------------------------------+
#|fixedWidth                    |
#+------------------------------+
#|1         one       2016-01-01|
#|2         two       2016-02-01|
#|3         three     2016-03-01|
#+------------------------------+

Here we use the printf-style format %-10s to specify a left-justified field of width 10.

If instead you wanted to right-justify your strings, remove the negative sign:

rjust = r"%{width}s".format(width=fixed_width)

df.select(
    concat(*[format_string(rjust, c) for c in df.columns]).alias("fixedWidth")
).show(truncate=False)
#+------------------------------+
#|fixedWidth                    |
#+------------------------------+
#|         1       one2016-01-01|
#|         2       two2016-02-01|
#|         3     three2016-03-01|
#+------------------------------+
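One caveat with the formats above: %-10s and %10s pad short values but do not truncate long ones, so a value longer than 10 characters would push the following columns out of alignment. Adding a precision, as in %-10.10s, caps the field at 10 characters. The sketch below is one possible way to handle this together with per-column widths; the helper names fixed_width_spec and to_fixed_width and the (name, width) list are hypothetical, not part of the original answer.

```python
def fixed_width_spec(width, left=True, truncate=True):
    """Build a printf-style %s spec for one fixed-width field (hypothetical helper)."""
    flag = "-" if left else ""                              # "-" means left-justify
    precision = ".{0}".format(width) if truncate else ""    # ".N" caps the length at N
    return "%{0}{1}{2}s".format(flag, width, precision)


def to_fixed_width(df, widths, left=True, truncate=True):
    """Return a one-column DataFrame of fixed-width lines.

    widths is a list of (column_name, width) pairs, in output order.
    """
    # Imported here so the spec helper above can be used without Spark installed.
    from pyspark.sql.functions import col, concat, format_string

    parts = [
        format_string(fixed_width_spec(w, left, truncate), col(name))
        for name, w in widths
    ]
    return df.select(concat(*parts).alias("fixedWidth"))


# Python's % operator treats these %s specs the same way, which makes
# the padding and truncation easy to check locally:
print(repr(fixed_width_spec(10) % "one"))           # 'one       '
print(repr(fixed_width_spec(5) % "three hundred"))  # 'three'
```

With the example DataFrame above, a call such as to_fixed_width(df, [("id", 5), ("value", 10), ("date", 10)]).show(truncate=False) would give each column its own width. Whether silent truncation is acceptable depends on the consumer of the file; pass truncate=False to keep the original pad-only behavior.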

Now you can write out only the fixedWidth column to your output file.
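For the write step itself, one option is the DataFrame text writer, which accepts a DataFrame with a single string column and writes one line per row. The following is a minimal sketch, assuming left-justified columns of equal width; the function name write_fixed_width and the output_path argument are hypothetical.

```python
def write_fixed_width(df, output_path, width=10):
    """Concatenate all columns into fixed-width lines and write them as text."""
    # Imported here so the function can be defined without Spark installed.
    from pyspark.sql.functions import concat, format_string

    spec = "%-{0}s".format(width)  # left-justified, padded to `width`
    line = concat(*[format_string(spec, c) for c in df.columns]).alias("fixedWidth")

    # The text writer requires exactly one string column.
    # mode("overwrite") replaces any existing output at output_path.
    df.select(line).write.mode("overwrite").text(output_path)
```

Note that Spark writes a directory of part files at output_path, one per partition, rather than a single file.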

Upvotes: 4
