Reputation: 55
I have a Pandas Dataframe created from a dictionary with the following code:
import pandas as pd
pd.set_option('max_colwidth', 150)
df = pd.DataFrame.from_dict(data, orient= 'index', columns = ['text'])
df
The output is as follows:
text
./form/2003Q4/0001041379_2003-12-15.html \n10-K\n1\ng86024e10vk.htm\nAFC ENTERPRISES\n\n\n\nAFC ENTERPRISES\n\n\n\nTable of Contents\n\n\n\n\n\n\n\nUNITED STATES SECURITIES AND EXCHANGE\n...
./form/2007Q2/0001303804_2007-04-17.html \n10-K\n1\na07-6053_210k.htm\nANNUAL REPORT PURSUANT TO SECTION 13 AND 15(D)\n\n\n\n\n\n\n \nUNITED\nSTATES\nSECURITIES AND EXCHANGE\nCOMMISSION...
./form/2007Q2/0001349848_2007-04-02.html \n10-K\n1\nff060310k.txt\n\n UNITED STATES\n SECURITIES AND EXCHANGE COMMISSION\n ...
./form/2014Q1/0001141807_2014-03-31.html \n10-K\n1\nf32414010k.htm\nFOR THE FISCAL YEAR ENDED DECEMBER 31, 2013\n\n\n\nf32414010k.htm\n\n\n\n\n\n\n\n\n\n\nUNITED STATES\nSECURITIES AND EX...
./form/2007Q2/0001341853_2007-04-02.html \n10-K\n1\na07-9697_110k.htm\n10-K\n\n\n\n\n\n\n \n \nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\n \nFORM 10-K\n ...
I need to split the first column (the index) into three separate columns, Year & Qtr, CIK, Filing Data. So the values in these columns from the first row would be: 2003Q4, 0001041379, 2003-12-15.
I think that if this was in a proper column that I could do this using code similar to Example #2 found here:
https://www.geeksforgeeks.org/python-pandas-split-strings-into-two-list-columns-using-str-split/
However I am thrown by the fact that it is the index that I need to split, and not a named column.
Is there a way to separate the index or do I need to somehow save this as another column, and is this possible?
I'd appreciate any help. I am a newbie, so I don't always understand the more difficult solutions. Thanks in advance.
Upvotes: 3
Views: 1560
Reputation: 93161
The fact that the column is the index makes no difference when extracting components from it but you need to be careful when assigning those components back to the original dataframe.
# Extract the components from the index
# pandas allowed us to name the columns via named captured groups
pattern = r'(?P<Quarter>\d{4}Q\d)\/(?P<CIK>\d+)_(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})'
tmp = df.index.str.extract(pattern) \
.assign(Date=lambda x: pd.to_datetime(x[['Year', 'Month', 'Day']]))
# Since `df` and `tmp` are both dataframe, assignments between them
# will be based row label. We want them to align by position (i.e.
# line 1 to line 1) so we have to convert the right hand side to
# numpy array
cols = ['Quarter', 'CIK', 'Date']
df[cols] = tmp[cols].values
Upvotes: 1