ReverseEngineer
ReverseEngineer

Reputation: 559

Finding missing data in pandas DataFrame

I am trying to find a way to look for the missing data in dataframe based on the data in my list. Each interface must have this 5 sub-interfaces.

sub_interface_list = ['1030', '1035', '1039', '1050', '1059']

df = pd.DataFrame({'Device': ['DeviceA', 'DeviceA', 'DeviceA', 'DeviceA', 'DeviceA', 'DeviceA', 'DeviceA', 'DeviceA', 'DeviceA'], 'Interface': ['Eth-Trunk100', 'Eth-Trunk100', 'Eth-Trunk100', 'Eth-Trunk100', 'Eth-Trunk100', 'Eth-Trunk101', 'Eth-Trunk101', 'Eth-Trunk101', 'Eth-Trunk101'], 'Sub_interface': ['1030', '1035', '1039', '1050', '1059', '1030', '1039', '1050', '1059']})

The dataframe looks like this

Device  Interface   Sub_interface
DeviceA Eth-Trunk100    1030
DeviceA Eth-Trunk100    1035
DeviceA Eth-Trunk100    1039
DeviceA Eth-Trunk100    1050
DeviceA Eth-Trunk100    1059
DeviceA Eth-Trunk101    1030
DeviceA Eth-Trunk101    1039
DeviceA Eth-Trunk101    1050
DeviceA Eth-Trunk101    1059

From the list we can see that Eth-Trunk101 is missing the sub_interface of 1035, and I would like to insert the 1035 into the last row of each interface. I know using dataframe.iterrows() and search for the missing element is easy, but is there any way in pandas that can be used without using the for loop?

** This is a test data set, my data is much bigger and using iteration will be very time consuming.

Upvotes: 3

Views: 121

Answers (2)

sammywemmy
sammywemmy

Reputation: 28644

You could use the complete function from pyjanitor to expose the missing values:

df.complete(['Interface', 'Sub_interface'])

      Interface Sub_interface   Device
0  Eth-Trunk100          1030  DeviceA
1  Eth-Trunk100          1035  DeviceA
2  Eth-Trunk100          1039  DeviceA
3  Eth-Trunk100          1050  DeviceA
4  Eth-Trunk100          1059  DeviceA
5  Eth-Trunk101          1030  DeviceA
6  Eth-Trunk101          1035      NaN
7  Eth-Trunk101          1039  DeviceA
8  Eth-Trunk101          1050  DeviceA
9  Eth-Trunk101          1059  DeviceA

You can use ffill to fill the null value:

df.complete(['Interface', 'Sub_interface']).ffill()

If you want to stick solely within Pandas (pyjanitor is a collection of convenient wrappers around Pandas), the solution below works well:

Create unique index of interface and sub_interface:

interface = pd.MultiIndex.from_product([df.Interface.unique(), 
                                        df.Sub_interface.unique()])

In [456]: interface
Out[456]: 
MultiIndex([('Eth-Trunk100', '1030'),
            ('Eth-Trunk100', '1035'),
            ('Eth-Trunk100', '1039'),
            ('Eth-Trunk100', '1050'),
            ('Eth-Trunk100', '1059'),
            ('Eth-Trunk101', '1030'),
            ('Eth-Trunk101', '1035'),
            ('Eth-Trunk101', '1039'),
            ('Eth-Trunk101', '1050'),
            ('Eth-Trunk101', '1059')],
           )

Set interface and sub_interface as index, reindex with interface and reset_index:

  df.set_index(['Interface', 'Sub_interface']).reindex(interface).reset_index()

 
      Interface Sub_interface   Device
0  Eth-Trunk100          1030  DeviceA
1  Eth-Trunk100          1035  DeviceA
2  Eth-Trunk100          1039  DeviceA
3  Eth-Trunk100          1050  DeviceA
4  Eth-Trunk100          1059  DeviceA
5  Eth-Trunk101          1030  DeviceA
6  Eth-Trunk101          1035      NaN
7  Eth-Trunk101          1039  DeviceA
8  Eth-Trunk101          1050  DeviceA
9  Eth-Trunk101          1059  DeviceA

Reindexing here works because the combination of interface and sub_interface is unique; if it is not unique, then a merge on outer is a better step; complete takes care of these checks in the background.

Also be wary of setting index with nulls; the Pandas docs suggests to avoid it - Although, so far with reindexing I have not noticed any issues.

You could also use unstack/stack, since the index is unique:

df.set_index(['Interface', 'Sub_interface']).unstack().stack(dropna = False).reset_index()

     Interface Sub_interface   Device
0  Eth-Trunk100          1030  DeviceA
1  Eth-Trunk100          1035  DeviceA
2  Eth-Trunk100          1039  DeviceA
3  Eth-Trunk100          1050  DeviceA
4  Eth-Trunk100          1059  DeviceA
5  Eth-Trunk101          1030  DeviceA
6  Eth-Trunk101          1035      NaN
7  Eth-Trunk101          1039  DeviceA
8  Eth-Trunk101          1050  DeviceA
9  Eth-Trunk101          1059  DeviceA

Upvotes: 4

Quang Hoang
Quang Hoang

Reputation: 150735

One way is to pivot then stack:

(df.assign(dummy=1)
   .pivot_table(index=['Device','Interface'], columns='Sub_interface', 
                values='dummy', fill_value=1)
   .reindex(sub_interface_list, fill_value=1, axis=1)
   .stack().reset_index(name='dummy')
   .drop('dummy', axis=1)
)

Output:

    Device     Interface Sub_interface
0  DeviceA  Eth-Trunk100          1030
1  DeviceA  Eth-Trunk100          1035
2  DeviceA  Eth-Trunk100          1039
3  DeviceA  Eth-Trunk100          1050
4  DeviceA  Eth-Trunk100          1059
5  DeviceA  Eth-Trunk101          1030
6  DeviceA  Eth-Trunk101          1035
7  DeviceA  Eth-Trunk101          1039
8  DeviceA  Eth-Trunk101          1050
9  DeviceA  Eth-Trunk101          1059

Upvotes: 1

Related Questions