yoel

Reputation: 249

How to fetch the value and type of each column of each row in a dataframe?

How can I convert a dataframe to a tuple that includes the datatype for each column?

I have a number of dataframes with varying sizes and types. I need to be able to determine the type and value of each column and row of a given dataframe so I can perform some actions that are type-dependent.

For example, say I have a dataframe that looks like:

+-------+-------+
|  foo  |  bar  |
+-------+-------+
| 12345 | fnord |
|    42 |   baz |
+-------+-------+

I need to get

Seq(
  (("12345", "Integer"), ("fnord", "String")),
  (("42", "Integer"), ("baz", "String"))
)

or something similarly simple to iterate over and work with programmatically.

Thanks in advance and sorry for what is, I'm sure, a very noobish question.

Upvotes: 4

Views: 2231

Answers (1)

Ramesh Maharjan

Reputation: 41957

If I understand your question correctly, the following should work for you.

  val df = Seq(
    (12345, "fnord"),
    (42, "baz"))
    .toDF("foo", "bar")

This creates the dataframe that you already have.

+-----+-----+
|  foo|  bar|
+-----+-----+
|12345|fnord|
|   42|  baz|
+-----+-----+

The next step is to extract the dataType of each column from the schema of the dataframe into a list.

val fieldTypesList = df.schema.map(struct => struct.dataType)
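
If the column names are useful alongside the types, the schema can provide both at once. This is only a minimal sketch: it assumes a local SparkSession (created here for illustration, not part of the answer above) and the same two-column dataframe, and `nameAndType` is a name made up for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

// Assumption: a local session for illustration; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()
import spark.implicits._

val df = Seq((12345, "fnord"), (42, "baz")).toDF("foo", "bar")

// Each StructField in the schema carries the column name and its DataType
val nameAndType: Seq[(String, DataType)] =
  df.schema.fields.toSeq.map(field => (field.name, field.dataType))
```

Because the schema is ordered the same way as the columns, position `i` in this list describes column `i` of every row.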

Next, convert the dataframe to an rdd and pair each value in a row with the corresponding dataType from the list created above

  // Pair values with types by position. Parsing row.toString and using indexOf
  // breaks on values containing "," or "[" and on duplicate values within a row.
  val tuples = df.rdd.map(row => row.toSeq.zip(fieldTypesList).toList)

Now if we print it

tuples.foreach(println)

It would give

List((12345,IntegerType), (fnord,StringType))
List((42,IntegerType), (baz,StringType))

You can then iterate over these and work with them programmatically.
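
For the type-dependent actions mentioned in the question, one possible approach is to pattern match on the dataType of each pair. The sketch below is an illustration under assumptions, not the only way to do it: it creates its own local SparkSession, `describe` is a hypothetical helper, and it uses `collect()`, which is only reasonable for dataframes small enough to fit on the driver.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

// Assumption: a local session for illustration; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("typed-cells-sketch").getOrCreate()
import spark.implicits._

val df = Seq((12345, "fnord"), (42, "baz")).toDF("foo", "bar")
val fieldTypes: Seq[DataType] = df.schema.map(_.dataType)

// Hypothetical helper: choose an action based on the column's DataType
def describe(value: Any, dataType: DataType): String = dataType match {
  case IntegerType => s"int:$value"
  case StringType  => s"string:$value"
  case other       => s"${other.simpleString}:$value"
}

// collect() brings all rows to the driver, so the matching runs locally
val described: Seq[Seq[String]] = df.collect().toSeq.map { row =>
  row.toSeq.zip(fieldTypes).map { case (value, dataType) => describe(value, dataType) }
}

described.foreach(println)
```

Matching on the dataType objects rather than on their printed names keeps the branches checked by the compiler and avoids string comparisons.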

Upvotes: 3
