Carrie Lin

Reputation: 31

Avoid the same value in consecutive rows of a column when sampling a whole DataFrame in pandas

I want to shuffle my DataFrame with the DataFrame.sample method, but the sampling has to be conditional. The result must satisfy this condition: no two consecutive rows hold the same value in a certain column.

e.g.: A  B
      a  2
      b  3 
      c  3
      d  4

I want the sampled result to look like

e.g.  A  B
      a  2
      c  3
      d  4
      b  3 

(i.e. rows where B holds the same value are never consecutive)

How can I achieve this? I already have the attempt below, but as a newbie I am not familiar with writing loops:

def compareSyllable(data):
    for i in data:
        (data['B'] != data['B'].shift()).any()

while compareSyllable(data) == True: 
        for i in range(data):
            data_final=data.sample(frac=1)

Upvotes: 1

Views: 196

Answers (1)

piRSquared

Reputation: 294498

compare_syllable

I'd use the values attribute to get a NumPy array, because the syntax is cleaner for this comparison. And IIUC, you don't want any value to match from one row to the next, in any column. Comparing the array without its last row against the array without its first row gives a boolean array, and calling all on it checks every element at once, so every column is covered automatically.
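The question only asks about one column, so here is a minimal sketch of the same shifted comparison restricted to a single column. The name no_adjacent_repeats and its column parameter are made up for illustration; they are not part of pandas or of the functions in this answer.

```python
import pandas as pd

def no_adjacent_repeats(data, column):
    """Hypothetical helper: True if `column` never repeats a value in consecutive rows."""
    v = data[column].to_numpy()
    # Compare every row with the row right after it
    return bool((v[:-1] != v[1:]).all())

df = pd.DataFrame({"A": list("abcd"), "B": [2, 3, 3, 4]})

print(no_adjacent_repeats(df, "B"))                    # original order: 3 is followed by 3
print(no_adjacent_repeats(df.loc[[0, 2, 3, 1]], "B"))  # order a, c, d, b from the question
```

The first call reports False and the second True. A function like this could be wrapped in a lambda and used as the predicate if only column B matters.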

conditional_sample

I made this so that you can pass whatever parameters you want to pd.DataFrame.sample via pd.DataFrame.pipe. I've also parameterized the predicate that the conditional sampling depends on, which means you can write other condition functions and pass them in to suit other purposes.

pd.DataFrame.pipe

pipe is a DataFrame method whose first argument is a callable. pipe calls that callable with the DataFrame that called pipe as its first argument. The rest of the arguments given to pipe are passed on to the callable.
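To make the argument forwarding concrete, here is a small sketch showing that df.pipe(f, ...) gives the same result as f(df, ...). The function head_of is a made-up example and has nothing to do with the sampling code.

```python
import pandas as pd

def head_of(frame, n, columns=None):
    """Hypothetical helper: first n rows, optionally limited to some columns."""
    out = frame.head(n)
    return out if columns is None else out[columns]

df = pd.DataFrame({"A": list("abcd"), "B": [2, 3, 3, 4]})

piped = df.pipe(head_of, 2, columns=["B"])   # df becomes the first argument
direct = head_of(df, 2, columns=["B"])       # the equivalent plain call

print(piped.equals(direct))
```

Both calls return the same two-row frame, so the comparison prints True.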

def compare_syllable(data):
    v = data.values
    return (v[:-1] != v[1:]).all()

def conditional_sample(df, predicate, *args, **kwargs):
    d = df.sample(*args, **kwargs)
    while not predicate(d):
        d = df.sample(*args, **kwargs)
    return d

df.pipe(conditional_sample, compare_syllable, frac=1)

   A  B
2  c  3
0  a  2
1  b  3
3  d  4

I'll also point out that using pipe was just my choice. We could have called the function the normal way instead:

conditional_sample(
    df=df,
    predicate=compare_syllable,
    frac=1
)

   A  B
2  c  3
0  a  2
3  d  4
1  b  3

Demo

To further demonstrate that we can pass other parameters, I'll drop the frac=1 argument and pass 3 instead, which is the number of rows to sample.

df.pipe(conditional_sample, compare_syllable, 3)

   A  B
1  b  3
0  a  2
3  d  4

Which also satisfies the condition.
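Since the predicate is just a parameter, other conditions can be swapped in. As a rough sketch, the hypothetical predicate unique_b below (not from the question) accepts a sample only if no value of B appears twice in it. conditional_sample is repeated here so the snippet runs on its own.

```python
import pandas as pd

def conditional_sample(df, predicate, *args, **kwargs):
    # Same retry loop as in the answer: resample until the predicate accepts
    d = df.sample(*args, **kwargs)
    while not predicate(d):
        d = df.sample(*args, **kwargs)
    return d

def unique_b(data):
    """Hypothetical predicate: every value in column B occurs only once."""
    return data["B"].is_unique

df = pd.DataFrame({"A": list("abcd"), "B": [2, 3, 3, 4]})

# Sample 3 rows; rows b and c can never both be selected
picked = df.pipe(conditional_sample, unique_b, 3)
print(picked)
```

Don't pass a fixed integer random_state here: every retry would then draw the same sample, and the loop would never end if that sample fails the predicate.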


A version that includes infinite loop handling

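from itertools import count, islice

def compare_syllable(data):
    v = data.values
    # True only if no value repeats from one row to the next, in any column
    return (v[:-1] != v[1:]).all()

def conditional_sample(df, predicate, limit=None, **kwargs):
    """Resample until predicate accepts; limit=None means no limit.

    Returns None if no sample passes within limit attempts.
    """
    # islice(count(), None) never stops; islice(count(), 10) yields 10 attempts
    for _ in islice(count(), limit):
        d = df.sample(**kwargs)
        if predicate(d):
            return d
    return None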

df.pipe(conditional_sample, compare_syllable, limit=10, frac=1)

   A  B
2  c  3
3  d  4
1  b  3
0  a  2
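When the limit runs out, the function falls through and returns None, so the caller should check for that. Here is a minimal sketch with a made-up frame named stuck whose B column can never satisfy the condition. The functions are repeated so the snippet runs on its own.

```python
from itertools import count, islice

import pandas as pd

def compare_syllable(data):
    v = data.values
    return (v[:-1] != v[1:]).all()

def conditional_sample(df, predicate, limit=None, **kwargs):
    # Try at most `limit` samples; give up with None afterwards
    for _ in islice(count(), limit):
        d = df.sample(**kwargs)
        if predicate(d):
            return d
    return None

# Every B value is identical, so no row order can pass the predicate
stuck = pd.DataFrame({"A": list("abc"), "B": [3, 3, 3]})

result = conditional_sample(stuck, compare_syllable, limit=10, frac=1)
if result is None:
    print("no valid ordering found within 10 attempts")
```

Without the limit, the same call would loop forever on this frame.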

Upvotes: 2
