buhtz
buhtz

Reputation: 12200

Extract year/month from a Pandas Timestamp column and store it in two new columns

I have a DataFrame like this

                        DATE
0 2021-06-15 14:22:02.424126
1 2021-06-15 14:22:02.424126

I want new columns with the extracted YEAR and MONTH.

df['YEAR'] = df.DATE.apply(lambda x: x.year)
df['MONTH'] = df.DATE.apply(lambda x: x.month)

This gives me the expected result

                    DATE  YEAR  MONTH
0 2021-06-15 14:23:16.315543  2021      6
1 2021-06-15 14:23:16.315543  2021      6 

But I want to do this in one line and one "iteration". I assume this would be faster.

# what I wish
df[['A', 'B']] = df.DATE.apply(lambda x: (x.year, x.month))

But the result confuses me.

                        DATE  YEAR  MONTH     A     B
0 2021-06-15 14:23:16.315543  2021      6  2021  2021
1 2021-06-15 14:23:16.315543  2021      6     6     6

And I am not able to add a axis=1 to apply().

That is my full code

import pandas as pd

# That is the initial DataFrame
df = pd.DataFrame(data = {'DATE': [pd.Timestamp.now(),
                                   pd.Timestamp.now()]})

# works fine
df['YEAR'] = df.DATE.apply(lambda x: x.year)
df['MONTH'] = df.DATE.apply(lambda x: x.month)

# what I wish
df[['A', 'B']] = df.DATE.apply(lambda x: (x.year, x.month))

print(df)

Upvotes: 0

Views: 961

Answers (1)

jezrael
jezrael

Reputation: 863651

Faster is use native pandas functions, which are vectorized, not apply solutions (because apply are loops under the hood) with accessor .dt for apply function for Series/ column:

df['YEAR'] = df.DATE.dt.year
df['MONTH'] = df.DATE.dt.month

But it is possible by converting to list:

df[['A', 'B']] = df.DATE.apply(lambda x: (x.year, x.month)).tolist()

# That is the initial DataFrame with 10k rows
df = pd.DataFrame(data = {'DATE': [pd.Timestamp.now()] * 10000})


In [168]: %%timeit
     ...: df['YEAR'] = df.DATE.apply(lambda x: x.year)
     ...: df['MONTH'] = df.DATE.apply(lambda x: x.month)
     ...: 
     ...: 
70.4 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [169]: %%timeit
     ...: df[['A', 'B']] = df.DATE.apply(lambda x: (x.year, x.month)).tolist()
     ...: 
     ...: 
64.5 ms ± 7.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [170]: %%timeit
     ...: df['C'] = df.DATE.dt.year
     ...: df['D'] = df.DATE.dt.month
     ...: 
     ...: 
2.34 ms ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 1

Related Questions