Reputation: 55
I am working with some poorly organized data in which a date is recorded in its own row (year, month, day) and the next N rows hold the values for that date (ID, value1, value2).
For example:
column1 column2 column3
2018 10 2
ID01 a 1
ID02 b 4
ID03 c 2
2018 10 3
ID01 d 7
ID02 e 6
ID03 f 1
I want to parse the date rows into datetime format (2018-10-2) and move them into a new column instead.
What I want it to look like:
column1 column2 column3 date
ID01 a 1 2018-10-2
ID02 b 4 2018-10-2
ID03 c 2 2018-10-2
ID01 d 7 2018-10-3
ID02 e 6 2018-10-3
ID03 f 1 2018-10-3
One of the approaches that I attempted is working backwards from the end of the file, detecting a date row (by checking if column 1 contains a year) and splitting the dataframe at that row into a new file named after a parsed datetime format. Then, in this file, I could read the filename (which is already parsed) and create a new column with just this data. I would then combine all of the files together afterwards into a cohesive dataset.
The issue I ran into is that I can't figure out how to tell pandas to both work backwards and split only at a certain row. One idea I had was to write the index number of the last date row to a variable and split the dataframe at that row number into a new file, but I don't know how to iterate from the bottom. I also know this approach is very messy, and I'm sure there is a more efficient way to achieve this.
What I have so far for identifying date rows and parsing the dates:
years = [2015,2016,2017,2018]
df['year?'] = df['column1'].isin(years).astype(int) # Detect date row
if df['year?'] == 1: # I receive errors here
    # Parse dates here
    datedf = (df['column1'].astype(str))+(df['column2'].astype(str))+(df['column3'].astype(str))
    df['date'] = pd.to_datetime(datedf, format='%Y%m%d') # Write date to 'date' column
Upvotes: 1
Views: 156
Reputation: 42926
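We can solve this problem in several steps: select the date rows, build a datetime column from them, concatenate them back with the other rows and forward-fill the date, then drop the date rows.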
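# Select the date rows: column1 starts with four digits (assumes column1 holds strings)
df2 = df[df['column1'].str.match(r'\d{4}')].copy()
# Join year, month and day with separators so that e.g. 2018-1-11 and 2018-11-1 stay distinct
df2['date'] = pd.to_datetime(df2['column1'].astype(str) + '-' + df2['column2'].astype(str) + '-' + df2['column3'].astype(str), format='%Y-%m-%d')
# Put the date rows back in their original positions and forward-fill the date onto the rows below
df_final = pd.concat([df[~df['column1'].str.match(r'\d{4}')], df2]).sort_index().ffill()
# Drop the date rows, leaving only the ID rows
df_final = df_final[~df_final['column1'].str.match(r'\d{4}')]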
column1 column2 column3 date
1 ID01 a 1 2018-10-02
2 ID02 b 4 2018-10-02
3 ID03 c 2 2018-10-02
5 ID01 d 7 2018-10-03
6 ID02 e 6 2018-10-03
7 ID03 f 1 2018-10-03
Note: I used column1, column2 and column3 as column names.
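The same idea can also be written without the concat step. The following is only a sketch under a few assumptions: the data was read as strings, the column names are the ones from the question, and a date row is any row whose column1 is exactly four digits. The names is_date, parts, dates and result are illustrative.

```python
import pandas as pd

# Sample data from the question, held as strings (as it would be after reading a text file)
df = pd.DataFrame(
    [["2018", "10", "2"], ["ID01", "a", "1"], ["ID02", "b", "4"], ["ID03", "c", "2"],
     ["2018", "10", "3"], ["ID01", "d", "7"], ["ID02", "e", "6"], ["ID03", "f", "1"]],
    columns=["column1", "column2", "column3"],
)

# Assumption: a date row is one whose first column is exactly a four-digit year
is_date = df["column1"].str.match(r"\d{4}$")

# pd.to_datetime can assemble dates from columns named year, month and day
parts = df.loc[is_date, ["column1", "column2", "column3"]].astype(int)
parts.columns = ["year", "month", "day"]
dates = pd.to_datetime(parts)

# Align the dates to the full index (NaT on the ID rows), then carry each date downwards
df["date"] = dates.reindex(df.index).ffill()

# Keep only the ID rows
result = df[~is_date].reset_index(drop=True)
print(result)
```

Assembling the date from separate year, month and day columns avoids any ambiguity that comes from gluing unpadded digits together.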
Upvotes: 1
Reputation: 16172
Try this:
import pandas as pd

df = pd.DataFrame([[2018,10,2],['ID01','a',1],['ID02','b',4],['ID03','c',2],[2018,10,3],['ID01','d',7],['ID02','e',6],['ID03','f',1]])
data = []
for index, row in df.iterrows():
    if isinstance(row[0], int):
        # Date row: remember the date for the rows that follow
        dt = pd.to_datetime('{}-{}-{}'.format(row[0],row[1],row[2]))
    else:
        # Data row: store it together with the most recent date
        data.append([row[0],row[1],row[2],dt])
df = pd.DataFrame(data)
Output
0 1 2 3
0 ID01 a 1 2018-10-02
1 ID02 b 4 2018-10-02
2 ID03 c 2 2018-10-02
3 ID01 d 7 2018-10-03
4 ID02 e 6 2018-10-03
5 ID03 f 1 2018-10-03
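The output above has integer column labels. To get the column names from the question, the loop could be wrapped roughly as follows. This is a sketch, not a drop-in solution: attach_dates is a hypothetical helper name, and it assumes that only date rows start with an all-digit value.

```python
import pandas as pd

def attach_dates(raw, names=("column1", "column2", "column3")):
    """Hypothetical helper: copy each date row onto the data rows below it."""
    records = []
    current = None  # most recent date seen; None until the first date row
    for first, second, third in raw.itertuples(index=False, name=None):
        if str(first).isdigit():  # assumption: only date rows start with digits only
            current = pd.Timestamp(year=int(first), month=int(second), day=int(third))
        elif current is None:
            raise ValueError("data row found before any date row")
        else:
            records.append((first, second, third, current))
    return pd.DataFrame(records, columns=[*names, "date"])

raw = pd.DataFrame([[2018, 10, 2], ["ID01", "a", 1], ["ID02", "b", 4], ["ID03", "c", 2],
                    [2018, 10, 3], ["ID01", "d", 7], ["ID02", "e", 6], ["ID03", "f", 1]])
out = attach_dates(raw)
print(out)
```

Checking with str(first).isdigit() rather than isinstance(..., int) means the same function also works when the values were read from a file as strings.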
Upvotes: 1