anaga

Reputation: 63

I want to change a column of month numbers in a dataframe to month names (PySpark)

I have a column of month numbers in a dataframe and want to change it to month names, so I used this:

df['monthName'] = df['monthNumber'].apply(lambda x: calendar.month_name[x]) 

But it throws me the following error:

TypeError: 'Column' object is not callable

Please suggest ways to resolve this; I am new to Python and Spark.

Edit 1: I'm using Spark 2.1.1 and Python 2.7.6.

This is my code for Airline data Analysis.

df_withDelay = df_mappedCarrierNames.filter(df_mappedCarrierNames.ArrDelay > 0)
sqlContext.registerDataFrameAsTable(df_withDelay,"SFO_ArrDelayAnalysisTable")
df_SFOArrDelay = sqlContext.sql \
                      ("select sfo.Month, sum(sfo.ArrDelay) as TotalArrivalDelay \
                      from SFO_ArrDelayAnalysisTable sfo \
                      where (sfo.Dest = 'SFO') \
                      group by sfo.Month")

I am trying to plot a graph of Month vs ArrDelay. The above code returns Month as a number, so I tried the following:

udf = UserDefinedFunction(lambda x: calendar.month_abbr[int(x)], StringType())
new_df_mappedCarrierNames = df_mappedCarrierNames.select(*[udf(column).alias(name) if column == name else column for column in df_mappedCarrierNames.columns])

It works, but in my graph the months are not in sorted order, whereas if I use the month numbers they are. My issue is how to map month numbers to month names while keeping the sorted order from Jan to Dec. Thanks in advance for the replies.
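To show the mapping I'm after, here is a plain-Python sketch using only the standard calendar module (no Spark; the sample rows are made up): sorting by the month number first and mapping to names only for display keeps the Jan-to-Dec order.

```python
import calendar

# hypothetical (Month, TotalArrivalDelay) rows, out of order
rows = [(3, 120.0), (1, 95.5), (12, 210.0)]
rows.sort(key=lambda r: r[0])                       # sort by month number first
labels = [calendar.month_abbr[m] for m, _ in rows]  # map to names only for display
print(labels)  # ['Jan', 'Mar', 'Dec']
```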

Upvotes: 2

Views: 12453

Answers (4)

Vincent

Reputation: 21

from pyspark.sql import functions as F

# if the 'date' column holds a date, format it as the full month name
df = df.withColumn("month", F.date_format(F.col("date"), 'MMMM'))
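The 'MMMM' pattern is the full month name; its plain-Python counterpart (illustrative only, assuming the default English locale) is strftime('%B'):

```python
from datetime import date

# '%B' in plain Python gives the full month name, like Spark's 'MMMM' pattern
print(date(2012, 5, 20).strftime('%B'))  # May
```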

Upvotes: 2

Aaka sh

Reputation: 39

You can do these steps. Input: May 20 2012

Process:

from pyspark.sql.functions import from_unixtime, unix_timestamp

df_train = df_train.withColumn("dates", from_unixtime(unix_timestamp(df_train.dates, 'MMMM dd yyyy')))

Output : 2012-05-20 00:00:00

from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

month_udf = udf(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S').strftime("%B"), returnType=StringType())
dftest = df_train.withColumn("monthname", month_udf(df_train.dates))

Output : May
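The UDF's lambda can be checked in plain Python before wrapping it for Spark (a sketch, assuming the timestamp string format shown above):

```python
from datetime import datetime

# same logic as the UDF body: parse the timestamp, format the full month name
to_month = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S').strftime('%B')
print(to_month('2012-05-20 00:00:00'))  # May
```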

Upvotes: 0

Krzysztof Przysowa

Reputation: 49

I would avoid using UDFs if possible (as they don't scale well). Try the combination of to_date(), date_format() and casting to integer:

from pyspark.sql.functions import col, to_date, date_format

df = df.withColumn('monthNumber', date_format(to_date(col('monthName'), 'MMMM'), 'MM').cast('int'))

Details of date formatting codes: http://tutorials.jenkov.com/java-internationalization/simpledateformat.html
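The same name-to-number conversion can be sanity-checked in plain Python with strptime (illustrative only; in Spark the parsing is done by the Java pattern above):

```python
from datetime import datetime

# '%B' parses a full English month name; .month extracts the number
print(datetime.strptime('January', '%B').month)  # 1
```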

Upvotes: 4

András Nagy

Reputation: 311

Here is a clean solution:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

#1
month_lst = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
#2
df = sqlContext.createDataFrame( [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))

name = 'x1'
#3
udf = UserDefinedFunction(lambda x: month_lst[int(x%12) - 1], StringType())
new_df = df.select(*[udf(column).alias(name) if column == name else column for column in df.columns])

1. You build a list of month names.

2. Define a dataframe and select the column.

3. month_lst[int(x % 12) - 1] is the important part: it assumes the input is a number and returns the corresponding name from the list.

If you need more help, please add a comment.
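The index arithmetic in step 3 can be verified locally without Spark: 1 maps to index 0 ('January'), and for 12 the expression 12 % 12 gives 0, minus 1 gives -1, which is the last element ('December').

```python
# a local sketch of the UDF's index arithmetic (no Spark needed)
month_lst = ['January', 'February', 'March', 'April', 'May', 'June',
             'July', 'August', 'September', 'October', 'November', 'December']

def to_name(x):
    # 1 -> index 0 ('January'); 12 % 12 == 0, minus 1 is -1 -> 'December'
    return month_lst[int(x % 12) - 1]

print(to_name(1), to_name(12))  # January December
```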

Upvotes: 2
