Reputation: 63
I have a column of month numbers in a dataframe and want to change it to month names, so I used this:
df['monthName'] = df['monthNumber'].apply(lambda x: calendar.month_name[x])
But it throws me the following error:
TypeError: 'Column' object is not callable
Please suggest ways to resolve this. I am new to Python and Spark.
Edit 1: I am using Spark 2.1.1 and Python 2.7.6.
This is my code for Airline data Analysis.
df_withDelay = df_mappedCarrierNames.filter(df_mappedCarrierNames.ArrDelay > 0)
sqlContext.registerDataFrameAsTable(df_withDelay,"SFO_ArrDelayAnalysisTable")
df_SFOArrDelay = sqlContext.sql("""
    select sfo.Month, sum(sfo.ArrDelay) as TotalArrivalDelay
    from SFO_ArrDelayAnalysisTable sfo
    where sfo.Dest = 'SFO'
    group by sfo.Month
""")
I am trying to plot a graph of Month vs. ArrDelay. The code above gives Month as a number, so I tried the option below:
udf = UserDefinedFunction(lambda x: calendar.month_abbr[int(x)], StringType())
new_df_mappedCarrierNames = df_mappedCarrierNames.select(*[udf(column).alias(name) if column == name else column for column in df_mappedCarrierNames.columns])
It works, but in my graph the months are not in sorted order, whereas if I use the month numbers they are. My issue is finding how to map month numbers to month names while keeping them sorted from Jan to Dec. Thanks in advance for the replies.
Upvotes: 2
Views: 12453
Reputation: 21
from pyspark.sql import functions as F
If the 'date' column has a date type:
.withColumn("month", F.date_format(F.col("date"), 'MMMM'))
Upvotes: 2
Reputation: 39
You can do these steps. Input: May 20 2012
Process:
from pyspark.sql.functions import from_unixtime, unix_timestamp
df_train = df_train.withColumn("dates", from_unixtime(unix_timestamp(df_train.dates, 'MMMM dd yyyy')))
Output: 2012-05-20 00:00:00
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
month_udf = udf(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S').strftime('%B'), returnType=StringType())
dftest = df_train.withColumn("monthname", month_udf(df_train.dates))
Output : May
Upvotes: 0
Reputation: 49
I would avoid using UDFs if possible (they don't scale well). Try the combination of to_date(), date_format() and a cast to integer:
from pyspark.sql.functions import col, date_format, to_date
df = df.withColumn('monthNumber', date_format(to_date(col('monthName'), 'MMMM'), 'MM').cast('int'))
Details of date formatting codes: http://tutorials.jenkov.com/java-internationalization/simpledateformat.html
Upvotes: 4
Reputation: 311
Here is a clean solution:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
#1
month_lst = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
#2
df = sqlContext.createDataFrame( [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
name = 'x1'
#3
udf = UserDefinedFunction(lambda x: month_lst[int(x%12) - 1], StringType())
new_df = df.select(*[udf(column).alias(name) if column == name else column for column in df.columns])
1. Build a list of month names.
2. Define a dataframe and select the column.
3. month_lst[int(x % 12) - 1] is the important part, where I only assume the input is a float and return a value from the list (the % 12 wraps month 12 around to the last entry, December).
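The index arithmetic can be checked in plain Python before wrapping it in a UDF; here the stdlib calendar module supplies the names so the list itself cannot have typos:

```python
import calendar

# calendar.month_name[1:] is ['January', ..., 'December'].
month_lst = list(calendar.month_name)[1:]

def month_name(x):
    # x % 12 maps 12 to 0, and subtracting 1 then wraps to the last
    # entry of the list, December; months 1-11 index normally.
    return month_lst[int(x % 12) - 1]

print([month_name(m) for m in (1, 6, 12)])  # ['January', 'June', 'December']
```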
If you need more help, please add a comment.
Upvotes: 2