Darkstarone

Reputation: 4730

Generate missing blocks of data in pandas dataframe

So I have a dataframe like so:

[5232 rows x 2 columns]
                                   0       2
0                                               
2018-02-01 00:00:00  2018-02-01 00:00:00  435.24
2018-02-01 00:30:00  2018-02-01 00:30:00  357.12
2018-02-01 01:00:00  2018-02-01 01:00:00  301.32
2018-02-01 01:30:00  2018-02-01 01:30:00  256.68
2018-02-01 02:00:00  2018-02-01 02:00:00  245.52
2018-02-01 02:30:00  2018-02-01 02:30:00  223.20
2018-02-01 03:00:00  2018-02-01 03:00:00  212.04
2018-02-01 03:30:00  2018-02-01 03:30:00  212.04
2018-02-01 04:00:00  2018-02-01 04:00:00  212.04
2018-02-01 04:30:00  2018-02-01 04:30:00  212.04
2018-02-01 05:00:00  2018-02-01 05:00:00  223.20
2018-02-01 05:30:00  2018-02-01 05:30:00  234.36

And what I can currently do is replace a portion of the values at random (say 10%) with NaN:

df_missing.loc[df_missing.sample(frac=0.1, random_state=100).index, 2] = np.nan

What I'd like to be able to do is the same thing, but with random contiguous blocks of size x, so that, say, 10% of the data ends up NaN in blocks.

For example, if the block size was 4 and the proportion was 30%, the above dataframe might look like:

[5232 rows x 2 columns]
                                   0       2
0                                               
2018-02-01 00:00:00  2018-02-01 00:00:00  435.24
2018-02-01 00:30:00  2018-02-01 00:30:00  357.12
2018-02-01 01:00:00  2018-02-01 01:00:00  NaN
2018-02-01 01:30:00  2018-02-01 01:30:00  NaN
2018-02-01 02:00:00  2018-02-01 02:00:00  NaN
2018-02-01 02:30:00  2018-02-01 02:30:00  NaN
2018-02-01 03:00:00  2018-02-01 03:00:00  212.04
2018-02-01 03:30:00  2018-02-01 03:30:00  212.04
2018-02-01 04:00:00  2018-02-01 04:00:00  212.04
2018-02-01 04:30:00  2018-02-01 04:30:00  212.04
2018-02-01 05:00:00  2018-02-01 05:00:00  223.20
2018-02-01 05:30:00  2018-02-01 05:30:00  234.36

I've figured out I can get the number of blocks with:

number_of_samples = int((df.shape[0] * proportion) / block_size)
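
For instance, plugging in the numbers above (5232 rows, 10% of the data in blocks of 4):

rows = 5232        # rows in the example dataframe
proportion = 0.1   # fraction of values to blank out
block_size = 4     # length of each NaN block

number_of_samples = int((rows * proportion) / block_size)
print(number_of_samples)  # 130 blocks of 4, i.e. 520 NaNs, ~9.9% of the data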

But I can't figure out how to actually create the missing blocks.

I've seen this question, which is helpful, but it has two caveats:

  1. It doesn't modify the original dataframe with NaN values, just returns samples.
  2. There's no guarantee the samples won't overlap (which I'd ideally like to avoid).

Could someone explain how to adapt that answer to address the two points above (or suggest a different solution)?

Upvotes: 1

Views: 210

Answers (2)

Darkstarone

Reputation: 4730

@caseWestern gave a great solution, which I partly based my own on:

import random

import numpy as np

def block_sample(df_length: int, number_of_samples: int, block_size: int):
    """ Generates the initial index of each block of block_size WITHOUT replacement.

        Does this by removing positions indx-(block_size-1):indx+block_size from
        the remaining candidates, so that the next start must be at least a
        block_size away from every previously chosen start.

        Raises
        ------
        ValueError: when more samples are requested than can fit.
    """
    full_range = list(range(df_length))
    for _ in range(number_of_samples):
        # random.sample raises ValueError once full_range is empty
        x = random.sample(full_range, 1)[0]
        indx = full_range.index(x)
        yield x
        # clamp the lower bound so a start near 0 doesn't wrap around the list
        del full_range[max(0, indx - (block_size - 1)):indx + block_size]

# assumes df_missing (as in the question) has a default integer index
df_length = len(df_missing)
block_size = 4
number_of_samples = int((df_length * 0.1) / block_size)  # 10% of values, in blocks of 4

try:
    for x in block_sample(df_length, number_of_samples, block_size):
        # .loc slicing is inclusive on both ends, so stop at x + block_size - 1
        df_missing.loc[x:x + block_size - 1, 2] = np.nan
except ValueError:
    pass
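
A quick sanity check afterwards (assuming, as in the question, the values live in column 2):

print(df_missing[2].isna().mean())  # should land near the requested proportion of 0.1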

Upvotes: 0

willk

Reputation: 3817

This code gets the job done, if rather inelegantly, by using if statements to check for overlaps between blocks. It also uses itertools.chain with argument unpacking (*) to flatten a list of lists into a single list:

import pandas as pd
import random
import numpy as np
from itertools import chain

# Example dataframe
df = pd.DataFrame({0: pd.date_range(start = pd.Timestamp(2018, 2, 1, 0, 0, 0),
                                    end = pd.Timestamp(2018, 2, 1, 10, 0, 0), freq = '30min'),
                   2: np.random.randn(21)})

# Set basic parameters
proportion = 0.4
block_size = 4
number_of_samples = int((df.shape[0] * proportion) / block_size)

# This will hold all indexes to be set to NaN
block_indexes = []

i = 0 

# Iterate until number of samples are found
while i < number_of_samples:
    
    # Choose a potential start and end
    potential_start = random.sample(list(df.index), 1)[0]
    potential_end = potential_start + block_size
    
    # Flatten the list of lists
    flattened_indexes = list(chain(*block_indexes))
    
    # Check to make sure potential start and potential end are not already in the indexes
    if potential_start not in flattened_indexes \
    and potential_end not in flattened_indexes:
        
        # If they are not, append the block indexes
        block_indexes.append(list(range(potential_start, potential_end)))
        
        i += 1
        
# Flatten the list of lists
block_indexes = list(chain(*block_indexes))

# Set the blocks to nan accounting for end of dataframe
df.loc[[x for x in block_indexes if x in df.index], 2] = np.nan

With the result applied to the example dataframe:

[Screenshot: the example dataframe with contiguous blocks of NaN in column 2]

I'm not sure how you want to handle the blocks at the end of the dataframe, but this code ignores any indexes that occur outside of the range of the dataframe index. I'm sure there is a more Pythonic way to write this code, and any comments would be appreciated!
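
A slightly more compact variant (just a sketch along the same lines, reusing df, block_size, and number_of_samples from above) keeps every used index in one set and tests a whole candidate block at once with isdisjoint, which also keeps blocks inside the dataframe from the start:

used = set()

# each accepted block contributes exactly block_size new indexes
while len(used) < number_of_samples * block_size:
    # pick a start so the block cannot run past the end of the dataframe
    start = random.randrange(len(df) - block_size + 1)
    candidate = range(start, start + block_size)
    # accept the block only if none of its indexes have been used before
    if used.isdisjoint(candidate):
        used.update(candidate)

df.loc[sorted(used), 2] = np.nan

Like the loop above, this assumes there is enough room for the requested number of blocks; if there isn't, it would loop forever.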

Upvotes: 2
