user86907

Reputation: 837

How to rearrange rows of a data frame using Python?

I have a data frame where the dates are located in the Task column, and I want to take those dates out of that column, create a Date column, and store them there. Here is my sample data.

import pandas as pd
import numpy as np

df=[['Monday, 13 January 2020','',''],['Task 1',13588,'Jack'],['','','Address 1'],['','','City 1'],['Task 2',13589,'Ammie'],['','','Address 2'],['','','City'],['Task 3',13589,'Amanda'],['','','Address 3'],['','','City 3'],['Tuesday, 14 January 2020','',''],['Task 4',13587,'Chelsea'],['','','Address 4'],['','','City 4'],['Task 5','13586','Ibrahim'],['','','Address 5'],['','','City 5'],['Task 6',13585,'Kate'],['','','Address 6'],['','','City 6']]

df=pd.DataFrame(df)
df.columns = ['Task','ID','Supervisor']
df=df.replace(np.nan,'')
df

    Task    ID  Supervisor
0   Monday, 13 January 2020     
1   Task 1  13588   Jack
2           Address 1
3           City 1
4   Task 2  13589   Ammie
5           Address 2
6           City
7   Task 3  13589   Amanda
8           Address 3
9           City 3
10  Tuesday, 14 January 2020        
11  Task 4  13587   Chelsea
12          Address 4
13          City 4
14  Task 5  13586   Ibrahim
15          Address 5
16          City 5
17  Task 6  13585   Kate
18          Address 6
19          City 6

And I want to get the following output.

    Date                    Task    ID      Supervisor
0 Monday, 13 January 2020   Task 1  13588   Jack Address 1 City 1
1 Monday, 13 January 2020   Task 2  13589   Ammie Address 2 City
2 Monday, 13 January 2020   Task 3  13589   Amanda Address 3 City 3
3 Tuesday, 14 January 2020  Task 4  13587   Chelsea Address 4 City 4
4 Tuesday, 14 January 2020  Task 5  13586   Ibrahim Address 5 City 5
5 Tuesday, 14 January 2020  Task 6  13585   Kate Address 6 City 6

Here is my attempt.

def rowMerger(a,b):
    try:
        rule1 = lambda x: x not in ['']
        u = a.loc[a.iloc[:,0].apply(rule1) & a.iloc[:,1].apply(rule1) & a.iloc[:,2].apply(rule1)].index
        print(u)
        findMergerindexs = list(u)
        findMergerindexs.sort()
        a = pd.DataFrame(a)
        tabcolumns = pd.DataFrame(a.columns)
        totalcolumns = len(tabcolumns)
        b = pd.DataFrame(columns = list(tabcolumns))
        if (len(findMergerindexs) > 0):
            for m in range(len(findMergerindexs)):
                if not (m == (len(findMergerindexs)-1)): 
                    startLoop = findMergerindexs[m]
                    endLoop = findMergerindexs[m+1]
                else:
                    startLoop = findMergerindexs[m]
                    endLoop = len(a)
                listValues = []
                for i in range(totalcolumns):
                    value = ' '
                    for n in range(startLoop,endLoop):
                        value = value + ' ' + str(a.iloc[n,i])
                    listValues.insert(i,(value.strip()))
                b = b.append(pd.Series(listValues),ignore_index = True)
        else:
            print("File is not having a row for merging instances - Please check the file manually for instance - ")
        return b
    except: 
        print("Error - While merging the rows")
    return b

And this code gives me the output below.

rowMerger(df,0)
       0       1             2
0   Task 1  13588   Jack Address 1 City 1
1   Task 2  13589   Ammie Address 2 City
2   Task 3 Tuesday, 14 January 2020 13589   Amanda Address 3 City 3
3   Task 4  13587   Chelsea Address 4 City 4
4   Task 5  13586   Ibrahim Address 5 City 5
5   Task 6  13585   Kate Address 6 City 6

But the problem is that this code only merges the rows. I am not sure how to duplicate the dates across their respective rows, as shown in the desired output, and put them in a separate column. Can anyone please help me achieve this?

Upvotes: 2

Views: 278

Answers (3)

sammywemmy

Reputation: 28699

@Alexandre's answer is great - this is an alternative where I avoid the regex extraction and the shift:

#convert empty cells to null
(df.replace("",np.nan)
 #create a new column containing only the dates
 #we'll use the null cells in Supervisor column to pick out the dates
 .assign(Date = lambda x: x.loc[x.Supervisor.isna(),'Task'])
 .ffill()
 .dropna(subset=['ID'])
 #drop Dates found in the Task column
 .query('Task != Date')
 .groupby(['Date','Task','ID'],as_index=False)
 .Supervisor.agg(' '.join)
)

        Date                    Task      ID      Supervisor
0   Monday, 13 January 2020     Task 1  13588   Jack Address 1 City 1
1   Monday, 13 January 2020     Task 2  13589   Ammie Address 2 City
2   Monday, 13 January 2020     Task 3  13589   Amanda Address 3 City 3
3   Tuesday, 14 January 2020    Task 4  13587   Chelsea Address 4 City 4
4   Tuesday, 14 January 2020    Task 5  13586   Ibrahim Address 5 City 5
5   Tuesday, 14 January 2020    Task 6  13585   Kate Address 6 City 6

Upvotes: 2

Daniel Scott

Reputation: 985

Essentially, we use a lambda to separate the Date from the task number, and pd.Series.fillna(method='ffill') to carry the last valid date forward.

We're going to add these lines:

# Split out the Date column: rows with more than two words in column 0 are
# date rows; keep the full date string and forward-fill it onto the rows below
# (assumes df still has its default integer column labels 0, 1, 2)
df['Date'] = df.apply(lambda x: x[0] if len(x[0].split(' ')) > 2 else np.nan, axis=1).fillna(method='ffill')

# Clean the Task column: keep the task label on task rows, blank out date rows
df['Task'] = df.apply(lambda x: x[0] if len(x[0].split(' ')) <= 2 else '', axis=1)

# Rename and reorder remaining columns
df['ID'] = df[1]
df['Supervisor'] = df[2]
df = df[['Date','Task','ID','Supervisor']]
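
These lines only reshape the columns; the Address and City rows still sit under blank Task cells. A minimal sketch of that last consolidation step, assuming the reshaped df from above (out is just a name picked here; the forward-fill plus groupby/join mirrors the other answers):

# Consolidate each task's rows: forward-fill Task and ID into the
# Address/City rows, drop the date rows (their Supervisor is empty),
# then join the Supervisor fragments into one string per task.
out = (df.replace('', np.nan)
         .assign(Task=lambda x: x.Task.ffill(), ID=lambda x: x.ID.ffill())
         .dropna(subset=['Supervisor'])
         .groupby(['Date', 'Task', 'ID'], as_index=False, sort=False)
         .Supervisor.agg(' '.join))
print(out)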

Upvotes: 1

Alexandre B.

Reputation: 5502

You can try the following:

task_mask = df.Task.str.match(r"Task\s+\d")
df.assign(Task = df.Task[task_mask],
          Date = pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()) \
        .replace("", np.NaN) \
        .dropna(how='all') \
        .ffill() \
        .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}) \
        .reset_index()

output

#      Task     ID                      Date                Supervisor
# 0  Task 1  13588   Monday, 13 January 2020     Jack Address 1 City 1
# 1  Task 2  13589   Monday, 13 January 2020      Ammie Address 2 City
# 2  Task 3  13589   Monday, 13 January 2020   Amanda Address 3 City 3
# 3  Task 4  13587  Tuesday, 14 January 2020  Chelsea Address 4 City 4
# 4  Task 5  13586  Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# 5  Task 6  13585  Tuesday, 14 January 2020     Kate Address 6 City 6

Explanations:

  1. Filter the Task column to separate the dates from the task ids.

    • One solution is to use a regex that matches the task ids. pandas.Series.str.match does the work. The regex is quite simple: r"Task\s+\d" means Task + any whitespace + a digit.
task_mask = df.Task.str.match(r"Task\s+\d")
  2. From this mask, we can extract the Dates and the Tasks. The tasks are easily reachable from the task_mask with df.Task[task_mask].

  3. The Dates are a bit more difficult to extract.

    • We use np.where to either keep the Task value or set NaN.
    • Then, we convert this array into a pd.Series.
    • Finally, we shift all values by 1 using shift. Shifting the rows lets us easily remove the NaN rows in step 5.
pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()
  4. Replace all empty strings with NaN using replace.

  5. Remove the empty rows (i.e. the old rows that held only a Date) using dropna with how="all".

  6. Fill each NaN value with the previous non-NaN value using ffill.

  7. Group by "Task", "ID" and "Date" and aggregate the rows using agg. The aggregation function is based on str.join: lambda x: " ".join(x).

  8. Reset the index produced by groupby using reset_index.

Hope this is clear!


Code + illustration

# Create dataframe
data = [['Monday, 13 January 2020', '', ''], ['Task 1', 13588, 'Jack'], ['', '', 'Address 1'], ['', '', 'City 1'], ['Task 2', 13589, 'Ammie'], ['', '', 'Address 2'], ['', '', 'City'], ['Task 3', 13589, 'Amanda'], ['', '', 'Address 3'], ['', '', 'City 3'], [
    'Tuesday, 14 January 2020', '', ''], ['Task 4', 13587, 'Chelsea'], ['', '', 'Address 4'], ['', '', 'City 4'], ['Task 5', '13586', 'Ibrahim'], ['', '', 'Address 5'], ['', '', 'City 5'], ['Task 6', 13585, 'Kate'], ['', '', 'Address 6'], ['', '', 'City 6']]
df = pd.DataFrame(data)
df.columns = ['Task', 'ID', 'Supervisor']
print(df)

# Step 1
task_mask = df.Task.str.match(r"Task\s+\d")
print(task_mask)
# 0     False
# 1      True
# 2     False
# 3     False
# 4      True
# 5     False
# 6     False
# 7      True
# 8     False
# 9     False
# 10    False
# 11     True
# 12    False
# 13    False
# 14     True
# 15    False
# 16    False
# 17     True
# 18    False
# 19    False
# Name: Task, dtype: bool

# Step 2
print(df.Task[task_mask])
# 1     Task 1
# 4     Task 2
# 7     Task 3
# 11    Task 4
# 14    Task 5
# 17    Task 6
# Name: Task, dtype: object

# Step 3
print(pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift())
# 0                          NaN
# 1      Monday, 13 January 2020
# 2                          NaN
# 3
# 4
# 5                          NaN
# 6
# 7
# 8                          NaN
# 9
# 10
# 11    Tuesday, 14 January 2020
# 12                         NaN
# 13
# 14
# 15                         NaN
# 16
# 17
# 18                         NaN
# 19
# dtype: object

# Step 4
print(df.assign(Task=df.Task[task_mask],
                Date=pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift())
        .replace("", np.NaN))
#       Task     ID Supervisor                      Date
# 0      NaN    NaN        NaN                       NaN
# 1   Task 1  13588       Jack   Monday, 13 January 2020
# 2      NaN    NaN  Address 1                       NaN
# 3      NaN    NaN     City 1                       NaN
# 4   Task 2  13589      Ammie                       NaN
# 5      NaN    NaN  Address 2                       NaN
# 6      NaN    NaN       City                       NaN
# 7   Task 3  13589     Amanda                       NaN
# 8      NaN    NaN  Address 3                       NaN
# 9      NaN    NaN     City 3                       NaN
# 10     NaN    NaN        NaN                       NaN
# 11  Task 4  13587    Chelsea  Tuesday, 14 January 2020
# 12     NaN    NaN  Address 4                       NaN
# 13     NaN    NaN     City 4                       NaN
# 14  Task 5  13586    Ibrahim                       NaN
# 15     NaN    NaN  Address 5                       NaN
# 16     NaN    NaN     City 5                       NaN
# 17  Task 6  13585       Kate                       NaN
# 18     NaN    NaN  Address 6                       NaN
# 19     NaN    NaN     City 6                       NaN

# Step 5:
print(df.assign(Task = df.Task[task_mask],
                Date = pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()) \
        .replace("", np.NaN) \
        .dropna(how='all'))
#       Task     ID Supervisor                      Date
# 1   Task 1  13588       Jack   Monday, 13 January 2020
# 2      NaN    NaN  Address 1                       NaN
# 3      NaN    NaN     City 1                       NaN
# 4   Task 2  13589      Ammie                       NaN
# 5      NaN    NaN  Address 2                       NaN
# 6      NaN    NaN       City                       NaN
# 7   Task 3  13589     Amanda                       NaN
# 8      NaN    NaN  Address 3                       NaN
# 9      NaN    NaN     City 3                       NaN
# 11  Task 4  13587    Chelsea  Tuesday, 14 January 2020
# 12     NaN    NaN  Address 4                       NaN
# 13     NaN    NaN     City 4                       NaN
# 14  Task 5  13586    Ibrahim                       NaN
# 15     NaN    NaN  Address 5                       NaN
# 16     NaN    NaN     City 5                       NaN
# 17  Task 6  13585       Kate                       NaN
# 18     NaN    NaN  Address 6                       NaN
# 19     NaN    NaN     City 6                       NaN

# Step 6:
print(df.assign(Task = df.Task[task_mask],
                 Date = pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()) \
            .replace("", np.NaN) \
            .dropna(how='all') \
            .ffill())
#       Task     ID Supervisor                      Date
# 1   Task 1  13588       Jack   Monday, 13 January 2020
# 2   Task 1  13588  Address 1   Monday, 13 January 2020
# 3   Task 1  13588     City 1   Monday, 13 January 2020
# 4   Task 2  13589      Ammie   Monday, 13 January 2020
# 5   Task 2  13589  Address 2   Monday, 13 January 2020
# 6   Task 2  13589       City   Monday, 13 January 2020
# 7   Task 3  13589     Amanda   Monday, 13 January 2020
# 8   Task 3  13589  Address 3   Monday, 13 January 2020
# 9   Task 3  13589     City 3   Monday, 13 January 2020
# 11  Task 4  13587    Chelsea  Tuesday, 14 January 2020
# 12  Task 4  13587  Address 4  Tuesday, 14 January 2020
# 13  Task 4  13587     City 4  Tuesday, 14 January 2020
# 14  Task 5  13586    Ibrahim  Tuesday, 14 January 2020
# 15  Task 5  13586  Address 5  Tuesday, 14 January 2020
# 16  Task 5  13586     City 5  Tuesday, 14 January 2020
# 17  Task 6  13585       Kate  Tuesday, 14 January 2020
# 18  Task 6  13585  Address 6  Tuesday, 14 January 2020
# 19  Task 6  13585     City 6  Tuesday, 14 January 2020

# Step 7
print(df.assign(Task = df.Task[task_mask],
                Date = pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()) \
        .replace("", np.NaN) \
        .dropna(how='all') \
        .ffill() \
        .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}))
#                                                      Supervisor
# Task   ID    Date
# Task 1 13588 Monday, 13 January 2020      Jack Address 1 City 1
# Task 2 13589 Monday, 13 January 2020       Ammie Address 2 City
# Task 3 13589 Monday, 13 January 2020    Amanda Address 3 City 3
# Task 4 13587 Tuesday, 14 January 2020  Chelsea Address 4 City 4
# Task 5 13586 Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# Task 6 13585 Tuesday, 14 January 2020     Kate Address 6 City 6

# Step 8
df = df.assign(Task = df.Task[task_mask],
               Date = pd.Series(np.where(~task_mask, df["Task"], np.NaN)).shift()) \
        .replace("", np.NaN) \
        .dropna(how='all') \
        .ffill() \
        .groupby(["Task", "ID", "Date"]).agg({"Supervisor": lambda x: " ".join(x)}) \
        .reset_index()
print(df)

#      Task     ID                      Date                Supervisor
# 0  Task 1  13588   Monday, 13 January 2020     Jack Address 1 City 1
# 1  Task 2  13589   Monday, 13 January 2020      Ammie Address 2 City
# 2  Task 3  13589   Monday, 13 January 2020   Amanda Address 3 City 3
# 3  Task 4  13587  Tuesday, 14 January 2020  Chelsea Address 4 City 4
# 4  Task 5  13586  Tuesday, 14 January 2020  Ibrahim Address 5 City 5
# 5  Task 6  13585  Tuesday, 14 January 2020     Kate Address 6 City 6

Upvotes: 2
