Reputation: 238
I have a column of strings which represent quarters (format '%Y%q') which I want to convert to PeriodIndex. How to do this? Nothing is working, can't find a strptime function that takes a format string and deals with Quarterly.
out[0]
0 200001
1 200002
2 200003
3 200004
4 200101
...
94 202303
95 202304
96 202401
97 202402
98 202403
Name: 0, Length: 99, dtype: int64
Want:
PeriodIndex(['2000Q1',... '2024Q3'], dtype='period[Q-DEC]')
Upvotes: 0
Views: 45
Reputation: 14369
You can use pd.PeriodIndex.from_fields
.
With integers (you say you have strings, but your sample has dtype: int64
), use floor division for the year (//
), and modulo for the quarter (%
):
import pandas as pd
data = [202401, 202402, 202403, 202404]
df = pd.DataFrame({'Dates': data})
df['Quarters'] = pd.PeriodIndex.from_fields(year=df['Dates'] // 100,
quarter=df['Dates'] % 10,
freq='Q')
Output:
df['Quarters']
0 2024Q1
1 2024Q2
2 2024Q3
3 2024Q4
Name: Quarters, dtype: period[Q-DEC]
With strings, you can use Series.astype
to convert to integers and do the same as above, or slice via Series.str
and convert afterwards:
df = pd.DataFrame({'Dates': data}, dtype=str)
df['Quarters'] = pd.PeriodIndex.from_fields(year=df['Dates'].str[:4].astype(int),
quarter=df['Dates'].str[5:].astype(int),
freq='Q')
Note that the use of pd.PeriodIndex
in this way is deprecated since version 2.2.0 (e.g. pd.PeriodIndex(year=[2024], quarter=[1])
).
Performance comparison
Adding the pd.PeriodIndex
method used in the answer by @wjandrea:
s = pd.Series(['202401', '202402', '202403', '202404'])
s = pd.concat([s]*250)
%timeit pd.PeriodIndex.from_fields(year=s.str[:4].astype(int), quarter=s.str[5:].astype(int), freq='Q')
1.65 ms ± 273 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit pd.PeriodIndex(s.str.replace(r'(\d{4})0(\d)', r'\1Q\2', regex=True), freq='Q')
21.9 ms ± 833 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# alternative for replace with regex
%timeit pd.PeriodIndex(s.str[:4] + 'Q' + s.str[5:], freq='Q')
21.3 ms ± 651 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using pd.PeriodIndex.from_fields
will be much faster. Used directly on integers it will be faster still:
s = pd.Series([202401, 202402, 202403, 202404])
s = pd.concat([s]*250)
%timeit pd.PeriodIndex.from_fields(year=s // 100, quarter=s % 10, freq='Q')
1.05 ms ± 142 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Upvotes: 0
Reputation: 33145
Firstly, what you're showing isn't string data, it's dtype: int64
. I'm going to assume you're reading from a CSV or something and you can go back and fix that.
You can simply use regex to convert to the right format. Note that %q
doesn't have a leading zero (per docs for pandas.Period.strftime).
# Example setup
s = pd.Series(['202304', '202401', '202402', '202403'])
pd.PeriodIndex(
s.str.replace(r'(\d{4})0(\d)', r'\1Q\2', regex=True),
freq='Q')
PeriodIndex(['2023Q4', '2024Q1', '2024Q2', '2024Q3'], dtype='period[Q-DEC]')
Regex breakdown:
\d
A digit{4}
Repeated four times(...)
A group\1
Reference back to group 1Upvotes: 0