Eoin Vaughan
Eoin Vaughan

Reputation: 173

Pandas Dataframe: split column into multiple columns

I need to break a column in a DataFrame that at present collects multiple values (someone else's excel sheet unfortunately) for a categorical data field that can have multiple values.

As you can see below the column has 15 category codes seen in the column header.

Original DataFrame

I want to split the column based on the category codes seen in the column header ['Pamphlet'] and then transform the values collected for each record in the original column to be mapped to there respective new columns as a (1) for checked and (0) for unchecked instead of the raw value [1,2,4,5].

This is the code to split based on , between values but I need to put these into the new columns I need to set up by splitting the column ['Pamphlet'] up by the values in the header [15: 1) OSA\n2) Nutrition\n3) Activity\n4) etc.].

'''df_old['Pamphlets'].str.split(pat = ',', n = -1, expand = True)'''

Shape of desired DatFrame

If I could just get an outline of whats the best approach, if it is even possible to do this within Pandas, Thanks.

Upvotes: 2

Views: 6396

Answers (1)

run-out
run-out

Reputation: 3184

You need to go through your columns one by one and divide the headers, then create a new dataframe for each column made up of split columns, then join all that back to the original dataframe. It's a bit messy but doable.

You need to use a function and some loops to go through the columns.

First lets define the dataframe. (It would be much appreciated if in future questions you supply a replicatable dataframe and any other data.

data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)
print(df_full)

 1) Mail\n2) Email \n3) At PAC/TPAC  1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation
0                                   2                                                  5
1                                   1                                                  1
2                                   3                                                  4
3                                   2                                                  4
4                                   3                                                  2
5                                   1                                                  5
6                                   3                                                  1
7                                   2                                                  4
8                                   3                                                  3
9                                   1                                                  2

We will go through the dataframe column by column using a function. For now let's build the column manually for the first column. After we'll turn this next part into a function.

First, let's grab the first column.

s_col = df_full.iloc[:, 0]
print(s_col)

0    2
1    1
2    3
3    2
4    3
5    1
6    3
7    2
8    3
9    1
Name: 1) Mail\n2) Email \n3) At PAC/TPAC, dtype: int64

Split the header into individual pieces.

col = s_col.name.split("\n")
print(col)
['1) Mail', '2) Email ', '3) At PAC/TPAC']

Clean up any leading or trailing white space.

col = [x.strip() for x in col]
print(col)
['1) Mail', '2) Email', '3) At PAC/TPAC']

Create a new dataframe from series and column heads.

data = {col[x]: s_col.to_list() for x in range(len(col))}
df = pd.DataFrame(data)
print(df)
  1) Mail  2) Email  3) At PAC/TPAC
0        2         2               2
1        1         1               1
2        3         3               3
3        2         2               2
4        3         3               3
5        1         1               1
6        3         3               3
7        2         2               2
8        3         3               3
9        1         1               1

Create a copy to make changes to the values.

df_res = df.copy()

Go through the column headers, get the first number, then filter and apply bool.

for col in df.columns:
    value = pd.to_numeric(col[0])
    df_res.loc[df[col] == value, col] = 1
    df_res.loc[df[col] != value, col] = 0

print(df_res)
  1) Mail  2) Email  3) At PAC/TPAC
0        0         1               0
1        1         0               0
2        0         0               1
3        0         1               0
4        0         0               1
5        1         0               0
6        0         0               1
7        0         1               0
8        0         0               1
9        1         0               0

Now we have split a column into its components and assigned a bool value.

Let's step back and make the above a function so we can use it for each column in the original dataframe.

def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")

    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]

    # Create a new dataframe from series and column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)

    # Create a copy to make changes to the values.
    df_res = df.copy()

    # Go through the column headers, get the first number, then filter and apply bool.
    for col in df.columns:
        value = pd.to_numeric(col[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0

    return df_res

Now for the last step. Let's create a loop to go through the columns in the original dataframe, call the function to split each column, and then concat it to the original dataframe less the columns that were split.

for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])

    # Join it with the origianl full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)

print(df_full)
   1) Mail  2) Email  3) At PAC/TPAC  1) ACC  2) IM  3) PT  4) Smoking,  5) Cessation
0        0         1               0       0      0      0            0             1
1        1         0               0       1      0      0            0             0
2        0         0               1       0      0      0            1             0
3        0         1               0       0      0      0            1             0
4        0         0               1       0      1      0            0             0
5        1         0               0       0      0      0            0             1
6        0         0               1       1      0      0            0             0
7        0         1               0       0      0      0            1             0
8        0         0               1       0      0      1            0             0
9        1         0               0       0      1      0            0             0

Here is the full code...

data = {
    "1) Mail\n2) Email \n3) At PAC/TPAC": [2, 1, 3, 2, 3, 1, 3, 2, 3, 1],
    "1) ACC\n2) IM \n3) PT\n4) Smoking, \n5) Cessation": [5, 1, 4, 4, 2, 5, 1, 4, 3, 2],
}
df_full = pd.DataFrame(data)


def split_column(s_col):
    # Split the header into individual pieces.
    col = s_col.name.split("\n")

    # Clean up any leading or trailing white space.
    col = [x.strip() for x in col]

    # Create a new dataframe from series and column heads.
    data = {col[x]: s_col.to_list() for x in range(len(col))}
    df = pd.DataFrame(data)

    # Create a copy to make changes to the values.
    df_res = df.copy()

    # Go through the column headers, get the first number, then filter and apply bool.
    for col in df.columns:
        value = pd.to_numeric(col[0])
        df_res.loc[df[col] == value, col] = 1
        df_res.loc[df[col] != value, col] = 0

    return df_res


for c in df_full.columns:
    # Call the function to get the split columns in a new dataframe.
    df_split = split_column(df_full[c])

    # Join it with the origianl full dataframe but drop the current column.
    df_full = pd.concat([df_full.loc[:, ~df_full.columns.isin([c])], df_split], axis=1)

print(df_full)

Upvotes: 1

Related Questions