Need to reformat row-organized dates into a date column, where data between date rows are assigned the appropriate date

Question

I am working with some poorly organized data that records a date in a single row (year, month, day), followed by N rows of values for that date (ID, value1, value2). The pattern then repeats for the next date.

For example:

    column1  column2 column3
    2018     10      2
    ID01     a       1
    ID02     b       4
    ID03     c       2
    2018     10      3
    ID01     d       7
    ID02     e       6
    ID03     f       1

I want to parse the dates from the date rows into datetime format (2018-10-2) and move them into a new column instead.

What I want it to look like:

    column1  column2  column3  date
    ID01     a        1        2018-10-2
    ID02     b        4        2018-10-2
    ID03     c        2        2018-10-2
    ID01     d        7        2018-10-3    
    ID02     e        6        2018-10-3
    ID03     f        1        2018-10-3

One of the approaches that I attempted is working backwards from the end of the file, detecting a date row (by checking if column 1 contains a year) and splitting the dataframe at that row into a new file named after a parsed datetime format. Then, in this file, I could read the filename (which is already parsed) and create a new column with just this data. I would then combine all of the files together afterwards into a cohesive dataset.

The issue that I ran into is that I am not able to figure out how to tell pandas to both work backwards and only split at a certain row. One idea I had was to write the index# of the last date to a variable, and split the dataframe at that row number to a new file, but I don't know how to iterate it from the bottom. Additionally, I know this approach is very messy and I'm sure there is a more efficient way to achieve this.

What I have so far for identifying date rows and parsing the dates:

years = [2015,2016,2017,2018]
df['year?'] = df['column1'].isin(years).astype(int) # Detect date row

if df['year?'] == 1: # I receive errors here
    # Parse dates here
    datedf = (df['column1'].astype(str))+(df['column2'].astype(str))+(df['column3'].astype(str))
    df['date'] = pd.to_datetime(datedf, format='%Y%m%d') # Write date to 'date' column

Erfan · Accepted Answer

We can solve this problem in several steps:

First, we extract the date rows.
We create a date from the three columns.
We concatenate the original dataframe with the new dataframe, sort by index to restore the row order, and forward-fill the dates.
We remove the date rows.

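import pandas as pd

# Date rows start with a four-digit year; .copy() avoids SettingWithCopyWarning
df2 = df[df['column1'].str.match(r'\d{4}')].copy()

# Join with '-' so single-digit months and days parse unambiguously
df2['date'] = pd.to_datetime(df2['column1'].astype(str) + '-' + df2['column2'].astype(str) + '-' + df2['column3'].astype(str), format='%Y-%m-%d')

# Recombine, restore the original row order, and forward-fill the dates
df_final = pd.concat([df[~df['column1'].str.match(r'\d{4}')], df2]).sort_index().ffill()

# Drop the date rows
df_final = df_final[~df_final['column1'].str.match(r'\d{4}')]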

  column1 column2  column3       date
1    ID01       a        1 2018-10-02
2    ID02       b        4 2018-10-02
3    ID03       c        2 2018-10-02
5    ID01       d        7 2018-10-03
6    ID02       e        6 2018-10-03
7    ID03       f        1 2018-10-03

Note

I used column1, column2 and column3 as column names. The code assumes column1 holds strings, since .str.match only works on string data.
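
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea that uses a boolean mask instead of concat. It assumes every cell was read as a string and that the file starts with a date row. The names is_date, parts, dates and result are only illustrative. Passing a frame with year, month and day columns to pd.to_datetime avoids building date strings by hand.

```python
import pandas as pd

# Sample data from the question; all cells are strings (an assumption about how the file was read).
df = pd.DataFrame(
    {
        "column1": ["2018", "ID01", "ID02", "ID03", "2018", "ID01", "ID02", "ID03"],
        "column2": ["10", "a", "b", "c", "10", "d", "e", "f"],
        "column3": ["2", "1", "4", "2", "3", "7", "6", "1"],
    }
)

# True on rows whose first column is exactly a four-digit year.
is_date = df["column1"].str.fullmatch(r"\d{4}")

# Convert the date rows to integers and name them so to_datetime can assemble them.
parts = df.loc[is_date, ["column1", "column2", "column3"]].astype(int)
parts.columns = ["year", "month", "day"]
dates = pd.to_datetime(parts)

# Align the dates to the full index, forward-fill them, then drop the date rows.
result = df.assign(date=dates.reindex(df.index).ffill())[~is_date].reset_index(drop=True)

print(result)
```

If the data does not start with a date row, the rows before the first date are left as NaT, because forward-fill has nothing to copy from.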
