Reputation: 837
I have a data frame where dates are located in the same column as other data, and I want to take those dates out of that column, create a separate Date column, and store them there. Here is my sample data.
import numpy as np
import pandas as pd

data = [['Monday, 13 January 2020','',''],['Task 1',13588,'Jack'],['','','Address 1'],['','','City 1'],['Task 2',13589,'Ammie'],['','','Address 2'],['','','City'],['Task 3',13589,'Amanda'],['','','Address 3'],['','','City 3'],['Tuesday, 14 January 2020','',''],['Task 4',13587,'Chelsea'],['','','Address 4'],['','','City 4'],['Task 5','13586','Ibrahim'],['','','Address 5'],['','','City 5'],['Task 6',13585,'Kate'],['','','Address 6'],['','','City 6']]
df = pd.DataFrame(data)
df.columns = ['Task', 'ID', 'Supervisor']
df = df.replace(np.nan, '')
df
Task ID Supervisor
0 Monday, 13 January 2020
1 Task 1 13588 Jack
2 Address 1
3 City 1
4 Task 2 13589 Ammie
5 Address 2
6 City
7 Task 3 13589 Amanda
8 Address 3
9 City 3
10 Tuesday, 14 January 2020
11 Task 4 13587 Chelsea
12 Address 4
13 City 4
14 Task 5 13586 Ibrahim
15 Address 5
16 City 5
17 Task 6 13585 Kate
18 Address 6
19 City 6
And I want to get the following output.
Date Task ID Supervisor
0 Monday, 13 January 2020 Task 1 13588 Jack Address 1 City 1
1 Monday, 13 January 2020 Task 2 13589 Ammie Address 2 City
2 Monday, 13 January 2020 Task 3 13589 Amanda Address 3 City 3
3 Tuesday, 14 January 2020 Task 4 13587 Chelsea Address 4 City 4
4 Tuesday, 14 January 2020 Task 5 13586 Ibrahim Address 5 City 5
5 Tuesday, 14 January 2020 Task 6 13585 Kate Address 6 City 6
Here is my attempt.
def rowMerger(a, b):
    try:
        rule1 = lambda x: x not in ['']
        u = a.loc[a.iloc[:, 0].apply(rule1) & a.iloc[:, 1].apply(rule1) & a.iloc[:, 2].apply(rule1)].index
        print(u)
        findMergerindexs = list(u)
        findMergerindexs.sort()
        a = pd.DataFrame(a)
        tabcolumns = pd.DataFrame(a.columns)
        totalcolumns = len(tabcolumns)
        b = pd.DataFrame(columns=list(tabcolumns))
        if len(findMergerindexs) > 0:
            for m in range(len(findMergerindexs)):
                if not (m == (len(findMergerindexs) - 1)):
                    startLoop = findMergerindexs[m]
                    endLoop = findMergerindexs[m + 1]
                else:
                    startLoop = findMergerindexs[m]
                    endLoop = len(a)
                listValues = []
                for i in range(totalcolumns):
                    value = ' '
                    for n in range(startLoop, endLoop):
                        value = value + ' ' + str(a.iloc[n, i])
                    listValues.insert(i, value.strip())
                b = b.append(pd.Series(listValues), ignore_index=True)
        else:
            print("File does not have a row for merging - please check the file manually")
        return b
    except:
        print("Error while merging the rows")
        return b
And this code is giving me the output below.
rowMerger(df,0)
0 1 2
0 Task 1 13588 Jack Address 1 City 1
1 Task 2 13589 Ammie Address 2 City
2 Task 3 Tuesday, 14 January 2020 13589 Amanda Address 3 City 3
3 Task 4 13587 Chelsea Address 4 City 4
4 Task 5 13586 Ibrahim Address 5 City 5
5 Task 6 13585 Kate Address 6 City 6
But the problem is that this code only merges the rows. I'm not sure how to duplicate the dates across their respective rows, as shown in the desired output, and put them in a separate column. Can anyone please help me achieve this?
Upvotes: 2
Views: 278
Reputation: 28699
@Alexandre's answer is great - this is an alternative where I avoid the regex extraction and the shift:
# convert empty cells to null
(df.replace("", np.nan)
 # create a new column containing only the dates;
 # we use the null cells in the Supervisor column to pick out the dates
 .assign(Date=lambda x: x.loc[x.Supervisor.isna(), 'Task'])
 .ffill()
 .dropna(subset=['ID'])
 # drop dates found in the Task column
 .query('Task != Date')
 .groupby(['Date', 'Task', 'ID'], as_index=False)
 .Supervisor.agg(' '.join)
)
Date Task ID Supervisor
0 Monday, 13 January 2020 Task 1 13588 Jack Address 1 City 1
1 Monday, 13 January 2020 Task 2 13589 Ammie Address 2 City
2 Monday, 13 January 2020 Task 3 13589 Amanda Address 3 City 3
3 Tuesday, 14 January 2020 Task 4 13587 Chelsea Address 4 City 4
4 Tuesday, 14 January 2020 Task 5 13586 Ibrahim Address 5 City 5
5 Tuesday, 14 January 2020 Task 6 13585 Kate Address 6 City 6
Upvotes: 2
Reputation: 985
So essentially we use a lambda to pick the dates out of the first column, and pd.Series.ffill to carry the last valid date down the rows.
So we're going to add these lines:
# Split out the Date column: a date cell has more than two space-separated tokens
df['Date'] = df.apply(lambda x: x['Task'] if len(x['Task'].split(' ')) > 2 else np.nan, axis=1).ffill()
# Clean the Task column: blank out the rows that held a date
df['Task'] = df.apply(lambda x: '' if len(x['Task'].split(' ')) > 2 else x['Task'], axis=1)
# Reorder the columns
df = df[['Date', 'Task', 'ID', 'Supervisor']]
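As a self-contained illustration of the forward-fill step, here is a trimmed-down version of the sample frame (the token-count test for spotting a date cell is the same assumption as above):

```python
import numpy as np
import pandas as pd

# Small frame in the question's shape: the first column mixes dates and tasks
data = [['Monday, 13 January 2020', '', ''], ['Task 1', 13588, 'Jack'],
        ['', '', 'Address 1'], ['Tuesday, 14 January 2020', '', ''],
        ['Task 4', 13587, 'Chelsea']]
df = pd.DataFrame(data, columns=['Task', 'ID', 'Supervisor'])

# A cell holding a date splits into more than two tokens ('Task 1' has two);
# keep the date, mark everything else NaN, then forward-fill downwards
df['Date'] = df['Task'].apply(
    lambda x: x if len(x.split(' ')) > 2 else np.nan).ffill()
print(df['Date'].tolist())
```

Every row now carries the most recent date seen above it.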
Upvotes: 1
Reputation: 5502
You can try the following:
task_mask = df.Task.str.match(r"Task\s+\d")
df.assign(Task=df.Task[task_mask],
          Date=pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift()) \
  .replace("", np.nan) \
  .dropna(how='all') \
  .ffill() \
  .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}) \
  .reset_index()
output
# Task ID Date Supervisor
# 0 Task 1 13588 Monday, 13 January 2020 Jack Address 1 City 1
# 1 Task 2 13589 Monday, 13 January 2020 Ammie Address 2 City
# 2 Task 3 13589 Monday, 13 January 2020 Amanda Address 3 City 3
# 3 Task 4 13587 Tuesday, 14 January 2020 Chelsea Address 4 City 4
# 4 Task 5 13586 Tuesday, 14 January 2020 Ibrahim Address 5 City 5
# 5 Task 6 13585 Tuesday, 14 January 2020 Kate Address 6 City 6
Explanations:

1. Filter the Task column into dates and task ids. pandas.Series.str.match does the work. The regex used is quite simple: r"Task\s+\d" means "Task" + any whitespace + a digit: task_mask = df.Task.str.match(r"Task\s+\d")
2. From this mask we can extract the Dates and the Tasks. The tasks are easily reachable from the task_mask with df.Task[task_mask].
3. The Dates are a tiny bit more difficult to extract: pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift() keeps the non-task rows and shifts them down one row, so each date lines up with its first task row.
4. Replace all empty strings with NaN using replace.
5. Remove empty rows (i.e. the old rows that held only a Date) using dropna with how="all".
6. Fill all NaN values with the previous non-NaN value using ffill.
7. Group by "Task", "ID" and "Date" and aggregate the rows using agg. The aggregation function is based on str.join: lambda x: " ".join(x).
8. Reset the index from the groupby using reset_index.

Hope this is clear!
Code + illustration
import numpy as np
import pandas as pd

# Create dataframe
data = [['Monday, 13 January 2020', '', ''], ['Task 1', 13588, 'Jack'], ['', '', 'Address 1'], ['', '', 'City 1'], ['Task 2', 13589, 'Ammie'], ['', '', 'Address 2'], ['', '', 'City'], ['Task 3', 13589, 'Amanda'], ['', '', 'Address 3'], ['', '', 'City 3'], [
    'Tuesday, 14 January 2020', '', ''], ['Task 4', 13587, 'Chelsea'], ['', '', 'Address 4'], ['', '', 'City 4'], ['Task 5', '13586', 'Ibrahim'], ['', '', 'Address 5'], ['', '', 'City 5'], ['Task 6', 13585, 'Kate'], ['', '', 'Address 6'], ['', '', 'City 6']]
df = pd.DataFrame(data)
df.columns = ['Task', 'ID', 'Supervisor']
print(df)
# Step 1
task_mask = df.Task.str.match(r"Task\s+\d")
print(task_mask)
# 0 False
# 1 True
# 2 False
# 3 False
# 4 True
# 5 False
# 6 False
# 7 True
# 8 False
# 9 False
# 10 False
# 11 True
# 12 False
# 13 False
# 14 True
# 15 False
# 16 False
# 17 True
# 18 False
# 19 False
# Name: Task, dtype: bool
# Step 2
print(df.Task[task_mask])
# 1 Task 1
# 4 Task 2
# 7 Task 3
# 11 Task 4
# 14 Task 5
# 17 Task 6
# Name: Task, dtype: object
# Step 3
print(pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
# 0 NaN
# 1 Monday, 13 January 2020
# 2 NaN
# 3
# 4
# 5 NaN
# 6
# 7
# 8 NaN
# 9
# 10
# 11 Tuesday, 14 January 2020
# 12 NaN
# 13
# 14
# 15 NaN
# 16
# 17
# 18 NaN
# 19
# dtype: object
# Step 4
print(df.assign(Task=df.Task[task_mask],
                Date=pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
      .replace("", np.nan))
# Task ID Supervisor Date
# 0 NaN NaN NaN NaN
# 1 Task 1 13588 Jack Monday, 13 January 2020
# 2 NaN NaN Address 1 NaN
# 3 NaN NaN City 1 NaN
# 4 Task 2 13589 Ammie NaN
# 5 NaN NaN Address 2 NaN
# 6 NaN NaN City NaN
# 7 Task 3 13589 Amanda NaN
# 8 NaN NaN Address 3 NaN
# 9 NaN NaN City 3 NaN
# 10 NaN NaN NaN NaN
# 11 Task 4 13587 Chelsea Tuesday, 14 January 2020
# 12 NaN NaN Address 4 NaN
# 13 NaN NaN City 4 NaN
# 14 Task 5 13586 Ibrahim NaN
# 15 NaN NaN Address 5 NaN
# 16 NaN NaN City 5 NaN
# 17 Task 6 13585 Kate NaN
# 18 NaN NaN Address 6 NaN
# 19 NaN NaN City 6 NaN
# Step 5
print(df.assign(Task=df.Task[task_mask],
                Date=pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
      .replace("", np.nan)
      .dropna(how='all'))
# Task ID Supervisor Date
# 1 Task 1 13588 Jack Monday, 13 January 2020
# 2 NaN NaN Address 1 NaN
# 3 NaN NaN City 1 NaN
# 4 Task 2 13589 Ammie NaN
# 5 NaN NaN Address 2 NaN
# 6 NaN NaN City NaN
# 7 Task 3 13589 Amanda NaN
# 8 NaN NaN Address 3 NaN
# 9 NaN NaN City 3 NaN
# 11 Task 4 13587 Chelsea Tuesday, 14 January 2020
# 12 NaN NaN Address 4 NaN
# 13 NaN NaN City 4 NaN
# 14 Task 5 13586 Ibrahim NaN
# 15 NaN NaN Address 5 NaN
# 16 NaN NaN City 5 NaN
# 17 Task 6 13585 Kate NaN
# 18 NaN NaN Address 6 NaN
# 19 NaN NaN City 6 NaN
# Step 6
print(df.assign(Task=df.Task[task_mask],
                Date=pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
      .replace("", np.nan)
      .dropna(how='all')
      .ffill())
# Task ID Supervisor Date
# 1 Task 1 13588 Jack Monday, 13 January 2020
# 2 Task 1 13588 Address 1 Monday, 13 January 2020
# 3 Task 1 13588 City 1 Monday, 13 January 2020
# 4 Task 2 13589 Ammie Monday, 13 January 2020
# 5 Task 2 13589 Address 2 Monday, 13 January 2020
# 6 Task 2 13589 City Monday, 13 January 2020
# 7 Task 3 13589 Amanda Monday, 13 January 2020
# 8 Task 3 13589 Address 3 Monday, 13 January 2020
# 9 Task 3 13589 City 3 Monday, 13 January 2020
# 11 Task 4 13587 Chelsea Tuesday, 14 January 2020
# 12 Task 4 13587 Address 4 Tuesday, 14 January 2020
# 13 Task 4 13587 City 4 Tuesday, 14 January 2020
# 14 Task 5 13586 Ibrahim Tuesday, 14 January 2020
# 15 Task 5 13586 Address 5 Tuesday, 14 January 2020
# 16 Task 5 13586 City 5 Tuesday, 14 January 2020
# 17 Task 6 13585 Kate Tuesday, 14 January 2020
# 18 Task 6 13585 Address 6 Tuesday, 14 January 2020
# 19 Task 6 13585 City 6 Tuesday, 14 January 2020
# Step 7
print(df.assign(Task=df.Task[task_mask],
                Date=pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift())
      .replace("", np.nan)
      .dropna(how='all')
      .ffill()
      .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}))
# Supervisor
# Task ID Date
# Task 1 13588 Monday, 13 January 2020 Jack Address 1 City 1
# Task 2 13589 Monday, 13 January 2020 Ammie Address 2 City
# Task 3 13589 Monday, 13 January 2020 Amanda Address 3 City 3
# Task 4 13587 Tuesday, 14 January 2020 Chelsea Address 4 City 4
# Task 5 13586 Tuesday, 14 January 2020 Ibrahim Address 5 City 5
# Task 6 13585 Tuesday, 14 January 2020 Kate Address 6 City 6
# Step 8
df = df.assign(Task=df.Task[task_mask],
               Date=pd.Series(np.where(~task_mask, df["Task"], np.nan)).shift()) \
       .replace("", np.nan) \
       .dropna(how='all') \
       .ffill() \
       .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}) \
       .reset_index()
print(df)
# Task ID Date Supervisor
# 0 Task 1 13588 Monday, 13 January 2020 Jack Address 1 City 1
# 1 Task 2 13589 Monday, 13 January 2020 Ammie Address 2 City
# 2 Task 3 13589 Monday, 13 January 2020 Amanda Address 3 City 3
# 3 Task 4 13587 Tuesday, 14 January 2020 Chelsea Address 4 City 4
# 4 Task 5 13586 Tuesday, 14 January 2020 Ibrahim Address 5 City 5
# 5 Task 6 13585 Tuesday, 14 January 2020 Kate Address 6 City 6
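One last detail: reset_index leaves the columns in Task, ID, Date order. If the Date-first layout from the question matters, a plain column selection reorders it. A minimal sketch with a one-row stand-in for the merged result:

```python
import pandas as pd

# Stand-in for the merged result, which reset_index leaves as Task, ID, Date, Supervisor
df = pd.DataFrame({'Task': ['Task 1'], 'ID': [13588],
                   'Date': ['Monday, 13 January 2020'],
                   'Supervisor': ['Jack Address 1 City 1']})
df = df[['Date', 'Task', 'ID', 'Supervisor']]  # column selection also reorders
print(list(df.columns))  # ['Date', 'Task', 'ID', 'Supervisor']
```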
Upvotes: 2