What is the purpose of the numpy random seed in this function?

I'm following a tutorial on the usage of Python in bioinformatics. In the tutorial a Mann-Whitney U test was performed via the function below.

numpy.random.seed is called on the first line after the imports but nowhere else. I was wondering what it is for, as it seemingly doesn't affect the results.

def mannwhitney(descriptor, verbose=False):

  import pandas as pd
  from numpy.random import seed
  from numpy.random import randn
  from scipy.stats import mannwhitneyu

  seed(1)

  # df_2class is defined outside this function
  selection = [descriptor, "Bioactivity_Class"]
  df = df_2class[selection]

  # descriptor values of the active and inactive compounds
  active = df[df.Bioactivity_Class == "active"][descriptor]
  inactive = df[df.Bioactivity_Class == "inactive"][descriptor]

  # compare the two samples
  stat, p = mannwhitneyu(active, inactive)

  # creating a result dataframe for easier interpretation
  alpha = 0.05

  if p > alpha:
    interpretation = "Same distribution (fail to reject H0)"
  else:
    interpretation = "Different distribution (reject H0)"

  results = pd.DataFrame({"Descriptor": descriptor, "Statistics": stat, "p": p,
                          "alpha": alpha, "Interpretation": interpretation},
                         index=[0])

  return results
        

Upvotes: 1

Views: 105

Answers (1)

anmol_gorakshakar

Reputation: 146

This is a great question. Seeding NumPy's random number generator makes the pseudo-random values it produces reproducible: the same seed always yields the same sequence of numbers.

Imagine you are following a tutorial that generates two random samples and compares them with a statistical test. Because the samples are drawn at random, the test will give you slightly different results every time you run the cell or script. To avoid this, people like to set the seed.
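As a minimal sketch of that situation (the function name random_test, the sample size and the shift of 0.5 are made up for illustration and do not come from the tutorial), seeding before drawing the samples makes the test result repeatable:

```python
import numpy as np
from scipy.stats import mannwhitneyu


def random_test(seed_value=None):
    """Draw two random samples and compare them with a Mann-Whitney U test."""
    if seed_value is not None:
        # Reset NumPy's global generator to a known state
        np.random.seed(seed_value)
    a = np.random.randn(50)        # sample from a standard normal
    b = np.random.randn(50) + 0.5  # second sample, shifted by 0.5
    stat, p = mannwhitneyu(a, b)
    return stat, p


seeded_1 = random_test(seed_value=1)
seeded_2 = random_test(seed_value=1)
unseeded = random_test()  # continues from the current state, so new draws

print(seeded_1 == seeded_2)  # same seed, same samples, same result
print(unseeded)              # will usually differ from the seeded runs
```

The seed only helps here because the random draws happen after the call to np.random.seed.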

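In your case, however, the data that goes into the Mann-Whitney test is probably deterministic, i.e. provided externally through descriptor and df_2class. If synthetic data were generated with NumPy's random functions after the call to seed(1), the seed would make sure that between independent runs your p-value and statistic are exactly the same, because the underlying synthetic data would be exactly the same. Keep in mind that a seed only affects numbers drawn after it is set, so it cannot change data that already exists in df_2class when the function is called.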

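If the data is in fact static (deterministic), then the seed is practically useless here, as nothing is randomly generated: randn is imported but never called, and mannwhitneyu with its default settings draws no random numbers.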
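One way to convince yourself of this is to run the test on fixed data with different seeds. The sketch below uses a small made-up DataFrame named toy with a hypothetical descriptor column "MW" as a stand-in for df_2class; the values are invented purely for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical stand-in for the tutorial's df_2class (values are invented)
toy = pd.DataFrame({
    "MW": [310.2, 295.1, 402.7, 350.0, 280.4, 415.9, 298.3, 367.5],
    "Bioactivity_Class": ["active"] * 4 + ["inactive"] * 4,
})


def run_test(seed_value):
    """Seed the generator, then test fixed data that never uses it."""
    np.random.seed(seed_value)
    active = toy.loc[toy["Bioactivity_Class"] == "active", "MW"]
    inactive = toy.loc[toy["Bioactivity_Class"] == "inactive", "MW"]
    stat, p = mannwhitneyu(active, inactive)
    return stat, p


results = [run_test(s) for s in (1, 2, 12345)]
same = all(r == results[0] for r in results)
print(same)  # the seed has no influence on the result
```

If your own function behaves the same way with several different seeds, the seed(1) line can be removed without changing anything.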

My best guess is to look at how df_2class is built and what descriptor selects from it, to see whether the seed is useful in this particular function.

Upvotes: 0
