rshah
rshah

Reputation: 691

Expand all columns with lists into own columns

Given the following dataframe:

    FrameLen FrameCapLen  IPHdrLen  ... TLSRecordLen  TLSAppData PacketTime
0  [117, 66]   [117, 66]  [20, 20]  ...  [46.0, nan]  [nan, nan]   0.000045
1  [117, 66]   [117, 66]  [20, 20]  ...  [46.0, nan]  [nan, nan]   0.000024
2  [117, 66]   [117, 66]  [20, 20]  ...  [46.0, nan]  [nan, nan]   0.000027
3  [117, 66]   [117, 66]  [20, 20]  ...  [46.0, nan]  [nan, nan]   0.000024
4  [117, 66]   [117, 66]  [20, 20]  ...  [46.0, nan]  [nan, nan]   0.000011

How can I expand each column that contains a list into respective new columns, so that FrameLen would become FrameLen_1, FrameLen_2, ..., etc.?

I know this can be done manually for each column manually (see below) but how could this be done automatically?

Example of manual operation (which looks inefficient?):

tags = df['FrameLen'].apply(pd.Series)
tags = column.rename(columns = lambda x : 'FrameLen_' + str(x))
pd.concat([df[:], tags[:]], axis=1)
del df['FrameLen']

Upvotes: 0

Views: 85

Answers (2)

r-beginners
r-beginners

Reputation: 35155

It would be redundant to iterate through the columns, but a simple approach would be to join the split columns and update them with new ones

import pandas as pd
import numpy as np
import io

data = '''
,FrameLen,FrameCapLen,IPHdrLen,TLSRecordLen,TLSAppData,PacketTime
0,"[117, 66]","[117, 66]","[20, 20]","[46.0, nan]","[nan, nan]",0.000045
1,"[117, 66]","[117, 66]","[20, 20]","[46.0, nan]","[nan, nan]",0.000024
2,"[117, 66]","[117, 66]","[20, 20]","[46.0, nan]","[nan, nan]",0.000027
3,"[117, 66]","[117, 66]","[20, 20]","[46.0, nan]","[nan, nan]",0.000024
4,"[117, 66]","[117, 66]","[20, 20]","[46.0, nan]","[nan, nan]",0.000011
'''

df = pd.read_csv(io.StringIO(data), sep=',', index_col=0)
# new columns create
c1 = df['FrameLen'].str.replace(r'^\[|\]$','').str.split(',', expand=True)
c2 = df['FrameCapLen'].str.replace(r'^\[|\]$','').str.split(',', expand=True)
c3 = df['IPHdrLen'].str.replace(r'^\[|\]$','').str.split(',', expand=True)
c4 = df['TLSRecordLen'].str.replace(r'^\[|\]$','').str.split(',', expand=True)
c5 = df['TLSAppData'].str.replace(r'^\[|\]$','').str.split(',', expand=True)
# new columns name create
new_col = [['{}_0'.format(col),'{}_1'.format(col)] for col in df.columns[:5]]
new_col = sum(new_col,[])
new_col.append('PacketTime')
# concat()
df_all = pd.concat([c1,c2,c3,c4,c5,df[['PacketTime']]], axis=1)
df_all.columns = new_col

df_all
|    |   FrameLen_0 |   FrameLen_1 |   FrameCapLen_0 |   FrameCapLen_1 |   IPHdrLen_0 |   IPHdrLen_1 |   TLSRecordLen_0 | TLSRecordLen_1   |   TLSAppData_0 | TLSAppData_1   |   PacketTime |
|---:|-------------:|-------------:|----------------:|----------------:|-------------:|-------------:|-----------------:|:-----------------|---------------:|:---------------|-------------:|
|  0 |          117 |           66 |             117 |              66 |           20 |           20 |               46 | nan              |            nan | nan            |      4.5e-05 |
|  1 |          117 |           66 |             117 |              66 |           20 |           20 |               46 | nan              |            nan | nan            |      2.4e-05 |
|  2 |          117 |           66 |             117 |              66 |           20 |           20 |               46 | nan              |            nan | nan            |      2.7e-05 |
|  3 |          117 |           66 |             117 |              66 |           20 |           20 |               46 | nan              |            nan | nan            |      2.4e-05 |
|  4 |          117 |           66 |             117 |              66 |           20 |           20 |               46 | nan              |            nan | nan            |      1.1e-05 |

Upvotes: 0

Mayank Porwal
Mayank Porwal

Reputation: 34056

You can use a for loop with df.add_prefix to append column name:

Take below df for example:

In [93]: In [55]: df = pd.DataFrame({'FrameLen':[[117, 66], [117, 66], [117, 66]], 'FrameCapLen':[[117, 66],[117, 66],[117, 66]], 'IPHdrLen':[[20, 20],[20, 20],[20, 20]], 'PacketTime':[0.000045, 0.000024,
    ...:  0.000027]})    

In [94]: df 
Out[94]: 
    FrameLen FrameCapLen  IPHdrLen  PacketTime
0  [117, 66]   [117, 66]  [20, 20]    0.000045
1  [117, 66]   [117, 66]  [20, 20]    0.000024
2  [117, 66]   [117, 66]  [20, 20]    0.000027

In [90]: d = pd.DataFrame() ## empty dataframe

In [91]: for i,j in df.items(): 
    ...:     if df['{}'.format(i)].dtype == 'object': 
    ...:         temp = pd.DataFrame(df[i].values.tolist()).add_prefix('{}_'.format(i)) 
    ...:         d = pd.concat([d,temp], axis=1) 
    ...:     else: 
    ...:         d = pd.concat([d, df[i]], axis=1) 
    ...:   

In [92]: d 
Out[92]: 
   FrameLen_0  FrameLen_1  FrameCapLen_0  FrameCapLen_1  IPHdrLen_0  IPHdrLen_1  PacketTime
0         117          66            117             66          20          20    0.000045
1         117          66            117             66          20          20    0.000024
2         117          66            117             66          20          20    0.000027

Upvotes: 1

Related Questions