Reputation: 4730
So I have a dataframe like so:
[5232 rows x 2 columns]
0 2
0
2018-02-01 00:00:00 2018-02-01 00:00:00 435.24
2018-02-01 00:30:00 2018-02-01 00:30:00 357.12
2018-02-01 01:00:00 2018-02-01 01:00:00 301.32
2018-02-01 01:30:00 2018-02-01 01:30:00 256.68
2018-02-01 02:00:00 2018-02-01 02:00:00 245.52
2018-02-01 02:30:00 2018-02-01 02:30:00 223.20
2018-02-01 03:00:00 2018-02-01 03:00:00 212.04
2018-02-01 03:30:00 2018-02-01 03:30:00 212.04
2018-02-01 04:00:00 2018-02-01 04:00:00 212.04
2018-02-01 04:30:00 2018-02-01 04:30:00 212.04
2018-02-01 05:00:00 2018-02-01 05:00:00 223.20
2018-02-01 05:30:00 2018-02-01 05:30:00 234.36
And what I can currently do is replace a portion of the values (say 10%) at random with NaN:
df_missing.loc[df_missing.sample(frac=0.1, random_state=100).index, 2] = np.NaN
What I'd like to be able to do is the same thing, but with random blocks of size x, so that, say, 10% of the data is NaN in contiguous blocks.
For example, if the block size was 4 and the proportion was 30%, the above dataframe might look like this:
[5232 rows x 2 columns]
0 2
0
2018-02-01 00:00:00 2018-02-01 00:00:00 435.24
2018-02-01 00:30:00 2018-02-01 00:30:00 357.12
2018-02-01 01:00:00 2018-02-01 01:00:00 NaN
2018-02-01 01:30:00 2018-02-01 01:30:00 NaN
2018-02-01 02:00:00 2018-02-01 02:00:00 NaN
2018-02-01 02:30:00 2018-02-01 02:30:00 NaN
2018-02-01 03:00:00 2018-02-01 03:00:00 212.04
2018-02-01 03:30:00 2018-02-01 03:30:00 212.04
2018-02-01 04:00:00 2018-02-01 04:00:00 212.04
2018-02-01 04:30:00 2018-02-01 04:30:00 212.04
2018-02-01 05:00:00 2018-02-01 05:00:00 223.20
2018-02-01 05:30:00 2018-02-01 05:30:00 234.36
I've figured out I can get the number of blocks with:
number_of_samples = int((df.shape[0] * proportion) / block_size)
But I can't figure out how to actually create the missing blocks.
I've seen this question, which is helpful, but it has two caveats:
Could someone explain how to adapt that answer to address those points (or suggest a different solution)?
Upvotes: 1
Views: 210
Reputation: 4730
@caseWestern gave a great solution, on which I based my own:
import random
import numpy as np

def block_sample(df_length: int, number_of_samples: int, block_size: int):
    """Generate the starting index of each block of block_size WITHOUT replacement.

    Does this by removing positions indx-(block_size-1):indx+block_size from the
    possible values, so that the next start must be at least block_size away
    from every previously chosen start.

    Raises
    ------
    ValueError: when more samples are requested than remain possible.
    """
    full_range = list(range(df_length))
    for _ in range(number_of_samples):
        x = random.sample(full_range, 1)[0]
        indx = full_range.index(x)
        yield x
        # Clamp the slice start at 0 so a pick near the beginning of the range
        # doesn't produce a negative index that wraps around the list.
        del full_range[max(0, indx - (block_size - 1)):indx + block_size]

try:
    for x in block_sample(df_length, number_of_samples, block_size):
        # .loc slicing is inclusive, so x:x+block_size-1 blanks exactly block_size rows
        df_missing.loc[x:x + block_size - 1, 2] = np.nan
except ValueError:
    pass
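To see the generator in action, here is a minimal, self-contained sketch on a toy 20-row dataframe (the name `toy_df`, the 40% proportion, and the seed are illustrative, not from the question):

```python
import random

import numpy as np
import pandas as pd


def block_sample(df_length: int, number_of_samples: int, block_size: int):
    """Yield non-overlapping block start indices, as in the answer above."""
    full_range = list(range(df_length))
    for _ in range(number_of_samples):
        x = random.sample(full_range, 1)[0]
        indx = full_range.index(x)
        yield x
        # Clamp at 0 so picks near the start don't produce a negative slice index
        del full_range[max(0, indx - (block_size - 1)):indx + block_size]


# Toy dataframe: 20 rows, value column labelled 2 to mirror the question
toy_df = pd.DataFrame({2: np.arange(20.0)})

random.seed(0)
block_size = 4
number_of_samples = int((len(toy_df) * 0.4) / block_size)  # two blocks of four

try:
    for start in block_sample(len(toy_df), number_of_samples, block_size):
        # .loc slicing is inclusive, so start:start+block_size-1 covers block_size rows
        toy_df.loc[start:start + block_size - 1, 2] = np.nan
except ValueError:
    pass
```

Because each yielded start is at least `block_size` away from every other, the NaN blocks never overlap; a block whose start lands within `block_size` of the end of the frame is simply truncated by `.loc`.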
Upvotes: 0
Reputation: 3817
This code gets the job done in a rather inelegant fashion, using if statements to check for overlaps between blocks. It also uses itertools.chain with argument unpacking (*) to flatten a list of lists into a single list:
import pandas as pd
import random
import numpy as np
from datetime import datetime
from itertools import chain

# Example dataframe (pd.datetime was removed in pandas 2.0; use datetime.datetime)
df = pd.DataFrame({0: pd.date_range(start=datetime(2018, 2, 1, 0, 0, 0),
                                    end=datetime(2018, 2, 1, 10, 0, 0),
                                    freq='30min'),
                   2: np.random.randn(21)})

# Set basic parameters
proportion = 0.4
block_size = 4
number_of_samples = int((df.shape[0] * proportion) / block_size)

# This will hold all indexes to be set to NaN
block_indexes = []
i = 0

# Iterate until the required number of blocks has been found
while i < number_of_samples:
    # Choose a potential start and end
    potential_start = random.sample(list(df.index), 1)[0]
    potential_end = potential_start + block_size
    # Flatten the list of lists
    flattened_indexes = list(chain(*block_indexes))
    # Check that neither the potential start nor end is already used
    if potential_start not in flattened_indexes \
            and potential_end not in flattened_indexes:
        # If not, record the block's indexes
        block_indexes.append(list(range(potential_start, potential_end)))
        i += 1

# Flatten the list of lists
block_indexes = list(chain(*block_indexes))

# Set the blocks to NaN, accounting for the end of the dataframe
df.loc[[x for x in block_indexes if x in df.index], 2] = np.nan
With the result applied to the example dataframe: [screenshot of the resulting dataframe omitted]
I'm not sure how you want to handle blocks at the end of the dataframe, but this code ignores any indexes that fall outside the range of the dataframe's index. I'm sure there is a more Pythonic way to write this, and any comments would be appreciated!
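In the spirit of that "more Pythonic" request, one possible direction (a sketch of an alternative, not code from this answer; the function name `nan_blocks` is made up for illustration) is to draw block starts from a grid of non-overlapping candidate positions with numpy's `Generator.choice`, which removes the need for collision checks entirely:

```python
import numpy as np
import pandas as pd


def nan_blocks(df, col, proportion, block_size, seed=None):
    """Return a copy of df with ~proportion of `col` set to NaN in whole blocks.

    Partitions the row range into consecutive chunks of block_size and
    NaNs a random sample of whole chunks, so blocks can never overlap.
    """
    rng = np.random.default_rng(seed)
    n_blocks = int(len(df) * proportion / block_size)
    # Candidate starts on a block_size grid: 0, block_size, 2*block_size, ...
    candidate_starts = np.arange(0, len(df) - block_size + 1, block_size)
    starts = rng.choice(candidate_starts, size=n_blocks, replace=False)
    # Expand each start into the full block of positional row indexes
    rows = (starts[:, None] + np.arange(block_size)).ravel()
    out = df.copy()
    out.iloc[rows, out.columns.get_loc(col)] = np.nan
    return out


df = pd.DataFrame({0: pd.date_range('2018-02-01', periods=21, freq='30min'),
                   2: np.random.randn(21)})
df_missing = nan_blocks(df, 2, proportion=0.4, block_size=4, seed=100)
```

The tradeoff is that blocks can only begin at multiples of block_size, so the missingness pattern is slightly more regular than with fully random starts.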
Upvotes: 2