Reputation: 21
I'm following a tutorial on the usage of Python in bioinformatics. In the tutorial a Mann-Whitney U test was performed via the function below.
numpy.random.seed was called on the first line after the imports but nowhere else. What is the purpose of this call, given that it seemingly doesn't affect the results?
def mannwhitney(descriptor, verbose=False):
    from numpy.random import seed
    from numpy.random import randn
    from scipy.stats import mannwhitneyu
    seed(1)
    selection = [descriptor, "Bioactivity_Class"]
    df = df_2class[selection]
    active = df[df.Bioactivity_Class == "active"]
    active = active[descriptor]
    selection = [descriptor, "Bioactivity_Class"]
    df = df_2class[selection]
    inactive = df[df.Bioactivity_Class == "inactive"]
    inactive = inactive[descriptor]
    stat, p = mannwhitneyu(active, inactive)
    # creating a result dataframe for easier interpretation
    alpha = 0.05
    if p > alpha:
        interpretation = "Same distribution (fail to reject H0)"
    else:
        interpretation = "Different distribution (reject H0)"
    results = pd.DataFrame({"Descriptor": descriptor, "Statistics": stat, "p": p,
                            "alpha": alpha, "Interpretation": interpretation},
                           index=[0])
    return results
Upvotes: 1
Views: 105
Reputation: 146
This is a great question. Seeding NumPy's random number generator makes randomly generated values reproducible.
Imagine you are following a tutorial that generates two random samples and compares them with a statistical test. Because the samples are random,
the test will give you slightly different results every time you run the cell or script. To avoid this, people set the seed.
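To make the effect of a seed concrete, here is a minimal sketch. The sample size of 5 and the variable names are my own choices for illustration and are not taken from the tutorial.

```python
import numpy as np

# Seeding resets the global generator to a known state.
np.random.seed(1)
first = np.random.randn(5)

# Re-seeding with the same value replays the same sequence.
np.random.seed(1)
second = np.random.randn(5)

# Drawing again without re-seeding continues the sequence,
# so these values differ from the previous draw.
third = np.random.randn(5)

same_after_reseed = np.array_equal(first, second)
differs_without_reseed = not np.array_equal(second, third)
print(same_after_reseed, differs_without_reseed)
```

The first two arrays match because both were drawn immediately after the same seed; the third does not because the generator has moved on.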
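In your case, however, the data that goes into the Mann-Whitney test looks deterministic: it is selected from df_2class using the column name passed in as descriptor. Note also that randn is imported but never called inside the function, and mannwhitneyu itself does not draw random numbers.
A seed only helps when random values are generated after it is set. If df_2class were built from synthetic, randomly generated data, the seed would have to be set before that data is generated; then your p-value and statistic would be exactly the same between independent runs, because the underlying synthetic data would be exactly the same.
If the data is in fact static / deterministic, then the seed is practically useless, as nothing is randomly generated.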
My best guess: look at how df_2class is created and what the descriptor column contains to see whether the seed is useful in this particular function.
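One rough way to check this empirically is to run the test on the same fixed data under two different seeds and compare the outputs. The sketch below uses made-up numbers in place of your active and inactive descriptor values, so treat it as an illustration rather than your actual data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical, hard-coded samples standing in for the
# "active" and "inactive" descriptor values.
active = [5.1, 6.3, 7.2, 8.0, 6.8]
inactive = [4.0, 4.9, 5.5, 6.1, 5.0]

# Run the test under one seed ...
np.random.seed(1)
stat_a, p_a = mannwhitneyu(active, inactive)

# ... and again under a completely different seed.
np.random.seed(12345)
stat_b, p_b = mannwhitneyu(active, inactive)

# With fixed input data the seed makes no difference.
seed_is_irrelevant = (stat_a == stat_b) and (p_a == p_b)
print(seed_is_irrelevant)
```

If the same comparison on your real function also gives identical results for different seeds, the seed(1) call can be removed without changing anything.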
Upvotes: 0