Mr.UNOwen

Reputation: 183

Apache Spark DataFrame column names are inconsistent; why does this happen?

Doing something similar to the SQL programming guide on the Apache Spark site, the column names produced from my Java bean class aren't consistent in their capitalization. Some start with a capitalized first letter and others don't, with no apparent pattern.
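A minimal bean along these lines (the names are made up to mirror mine, not my actual code, and I'm using the SparkSession API here just to keep the example self-contained) shows what I mean:

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ColumnNameDemo {

    // Hypothetical bean, just to illustrate the naming behavior.
    public static class Record implements java.io.Serializable {
        private String iTrN;
        private long startTime;

        public String getITrN() { return iTrN; }
        public void setITrN(String iTrN) { this.iTrN = iTrN; }

        public long getStartTime() { return startTime; }
        public void setStartTime(long startTime) { this.startTime = startTime; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("column-name-demo")
                .master("local[*]")
                .getOrCreate();

        Record r = new Record();
        r.setITrN("abc");
        r.setStartTime(42L);

        Dataset<Row> df = spark.createDataFrame(Arrays.asList(r), Record.class);

        // One column comes out as "ITrN" (first letter kept upper case) and
        // the other as "startTime" (first letter lower-cased).
        df.printSchema();

        spark.stop();
    }
}
```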

There are some things I have done differently from that guide.

So, my questions are:

  1. Via reflection, how exactly is Spark getting the names? Does it use the get/set method names and just strip off the get and set prefixes?
  2. Is there a way to disable case sensitivity?

As for why I'm not showing any of my code: it's for work, so I want to avoid sharing anything I shouldn't.

--UPDATE-------------------------

So it looks like the name is based on the get and set methods: changing set/getStartTime to set/getStartTimee resulted in the column startTime becoming startTimee. However, I still see that a column like ITrN (from get/setITrN) keeps its upper case first letter, while a column like startTime doesn't.

--UPDATE #2-------------------------

After playing around with the names, it looks like the deciding factor is whether Spark thinks the name is an acronym (all caps), a word, or a single letter. If it starts with a word or a single letter, the first letter gets lower-cased. As a workaround I just started every name with "_". Anyway, if anyone knows how to disable case sensitivity when querying, let me know.
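The closest thing I've found so far is the spark.sql.caseSensitive setting, though I haven't verified it; something like this might work (untested, and it may depend on the Spark version):

```java
// Continuing from the sketch above (same spark / df variables).
// Untested idea: spark.sql.caseSensitive is supposed to control whether
// column names are resolved case-sensitively during analysis.
spark.conf().set("spark.sql.caseSensitive", "false");

df.createOrReplaceTempView("records");

// If case sensitivity is really off, this should resolve both columns even
// though the schema has them as "ITrN" and "startTime".
spark.sql("SELECT itrn, starttime FROM records").show();
```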

Upvotes: 2

Views: 1315

Answers (1)

Mr.UNOwen

Reputation: 183

Based on what I've thrown at it, a name with a capitalized first letter becomes lower case only if that first letter isn't followed by another upper case letter. So MMark stays MMark, while Mark turns into mark. I'm guessing this is to account for camel case in the get/set methods: someone who wants a variable called 'name' writes setName and getName for readability, and the first letter gets lowered back when the property name is derived.
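For what it's worth, this matches the JavaBeans decapitalization rule in java.beans.Introspector, which I believe Spark's bean reflection goes through: the first character is lower-cased unless the first two characters are both upper case. A quick standalone check (not my work code):

```java
import java.beans.Introspector;

public class DecapitalizeCheck {
    public static void main(String[] args) {
        // Introspector.decapitalize lower-cases the first character of a
        // property name unless the first two characters are both upper case.
        System.out.println(Introspector.decapitalize("Mark"));      // mark
        System.out.println(Introspector.decapitalize("MMark"));     // MMark
        System.out.println(Introspector.decapitalize("StartTime")); // startTime
        System.out.println(Introspector.decapitalize("ITrN"));      // ITrN
    }
}
```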

Upvotes: 2
