Reputation: 1448
I have longitudinal data as follows:
import pandas as pd
# Define the updated data with samples only in 'sample_A' or 'sample_B'
data = {
'gene_id': ['gene_1', 'gene_1', 'gene_1', 'gene_1', 'gene_1',
'gene_1', 'gene_1', 'gene_1', 'gene_1', 'gene_1',
'gene_2', 'gene_2', 'gene_2', 'gene_2', 'gene_2',
'gene_2', 'gene_2', 'gene_2', 'gene_2', 'gene_2',
'gene_3', 'gene_3', 'gene_3', 'gene_3', 'gene_3',
'gene_3', 'gene_3', 'gene_3', 'gene_3', 'gene_3'],
'position': [1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5],
'value': [5.1, 5.5, 5.7, 6.0, 6.3,
6.3, 6.5, 6.7, 6.8, 5.1,
2.3, 2.5, 2.7, 3.0, 3.1,
3.1, 3.2, 3.3, 3.4, 2.3,
3.7, 3.8, 3.9, 4.0, 4.0,
4.0, 4.1, 4.2, 4.3, 3.7],
'sample': ['sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A',
'sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A',
'sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A']
}
# Create the DataFrame
df = pd.DataFrame(data)
My goal is to cluster gene value profiles then see how those clusters correspond to samples. So for example here, a profile is defined as follows: take a sample, take a gene_id, now take all (position, value) tuples within the resulting subset.
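To make that definition concrete, here is a minimal sketch of how the profiles could be pulled out with a groupby. The values_by_key dict is only shorthand for this example (it is not part of my real data layout); it holds the same 30 rows as above, keyed by (sample, gene_id) with values in position order:

```python
import pandas as pd

# Shorthand for this example only: the 30 rows from the question,
# keyed by (sample, gene_id), values listed for positions 1..5.
values_by_key = {
    ('sample_A', 'gene_1'): [5.1, 5.5, 5.7, 6.0, 5.1],
    ('sample_B', 'gene_1'): [6.3, 6.5, 6.7, 6.8, 6.3],
    ('sample_A', 'gene_2'): [2.3, 2.5, 2.7, 3.0, 2.3],
    ('sample_B', 'gene_2'): [3.1, 3.2, 3.3, 3.4, 3.1],
    ('sample_A', 'gene_3'): [3.7, 3.8, 3.9, 4.0, 3.7],
    ('sample_B', 'gene_3'): [4.0, 4.1, 4.2, 4.3, 4.0],
}
df = pd.DataFrame(
    [(gene, pos, val, sample)
     for (sample, gene), vals in values_by_key.items()
     for pos, val in enumerate(vals, start=1)],
    columns=['gene_id', 'position', 'value', 'sample'],
)

# One profile = every (position, value) pair for one (sample, gene_id)
profiles = {
    key: list(zip(sub['position'].tolist(), sub['value'].tolist()))
    for key, sub in df.sort_values('position').groupby(['sample', 'gene_id'])
}

print(len(profiles))
print(profiles[('sample_A', 'gene_1')])
```

So this data contains six profiles, each a curve over five positions.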
By clustering, I mean grouping the curves traced by the profiles according to their shape and amplitude. As a start, a simple KMeans would be fine.
After clustering, the idea is to attach to each profile the sample it came from, then plot the cluster space and see how the samples are distributed across it.
I've seen solutions for this in R, but none in Python. Any help is appreciated.
Upvotes: 1
Views: 100
Reputation: 169
If I understood your question correctly, the code below should solve your problem. Basically, you reshape the data by pivoting the dataframe on the sample and gene_id columns, so that each profile consists of the values across the positions for that gene in that sample. Then you apply K-means clustering to the profiles (I used 3 clusters based on the number of genes, but you can change that easily). I used PCA to project the profiles onto two components so that they can be plotted. I also calculated the distribution of samples across the clusters:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
data = {
'gene_id': ['gene_1', 'gene_1', 'gene_1', 'gene_1', 'gene_1',
'gene_1', 'gene_1', 'gene_1', 'gene_1', 'gene_1',
'gene_2', 'gene_2', 'gene_2', 'gene_2', 'gene_2',
'gene_2', 'gene_2', 'gene_2', 'gene_2', 'gene_2',
'gene_3', 'gene_3', 'gene_3', 'gene_3', 'gene_3',
'gene_3', 'gene_3', 'gene_3', 'gene_3', 'gene_3'],
'position': [1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5],
'value': [5.1, 5.5, 5.7, 6.0, 6.3,
6.3, 6.5, 6.7, 6.8, 5.1,
2.3, 2.5, 2.7, 3.0, 3.1,
3.1, 3.2, 3.3, 3.4, 2.3,
3.7, 3.8, 3.9, 4.0, 4.0,
4.0, 4.1, 4.2, 4.3, 3.7],
'sample': ['sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A',
'sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A',
'sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A']
}
df = pd.DataFrame(data)
pivot_df = df.pivot_table(index=['sample', 'gene_id'], columns='position', values='value').reset_index()
# Perform K-means clustering
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
profiles = pivot_df.drop(columns=['sample', 'gene_id'])
kmeans.fit(profiles)
pivot_df['cluster'] = kmeans.labels_
# Reduce dimensions using PCA
pca = PCA(n_components=2)
profiles_pca = pca.fit_transform(profiles)
plot_df = pd.DataFrame(profiles_pca, columns=['PC1', 'PC2'])
plot_df['cluster'] = kmeans.labels_
plot_df['sample'] = pivot_df['sample']
plot_df['gene_id'] = pivot_df['gene_id']
colors = ['red', 'blue', 'green']
# Build the plot
plt.figure(figsize=(8,6))
markers = {'sample_A': 'o', 'sample_B': 's'}
for cluster in range(n_clusters):
    cluster_data = plot_df[plot_df['cluster'] == cluster]
    for sample in ['sample_A', 'sample_B']:
        sample_data = cluster_data[cluster_data['sample'] == sample]
        plt.scatter(sample_data['PC1'], sample_data['PC2'],
                    marker=markers[sample],
                    color=colors[cluster],
                    label=f'Cluster {cluster} - {sample}',
                    alpha=0.7)
plt.title('KMeans Clustering of Gene Profiles with Sample Distribution (PCA Reduced)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
sample_cluster_distribution = pivot_df.groupby(['cluster', 'sample']).size().unstack().fillna(0)
print(sample_cluster_distribution)
This will result in the following plot:
I hope this is what you wanted to do. Cheers!
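One follow-up on the number of clusters: I picked 3 from the number of genes, which is a guess. If you would rather let the data suggest a value, one common heuristic (not the only one) is the silhouette score. A minimal sketch, assuming the same six profiles that the pivot above produces, typed out directly so the snippet runs on its own; with so few profiles the scores are only indicative:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# The six (sample, gene_id) profiles: one row per profile,
# one column per position (the same layout as the pivot above).
profiles = pd.DataFrame(
    [[5.1, 5.5, 5.7, 6.0, 5.1],   # sample_A, gene_1
     [2.3, 2.5, 2.7, 3.0, 2.3],   # sample_A, gene_2
     [3.7, 3.8, 3.9, 4.0, 3.7],   # sample_A, gene_3
     [6.3, 6.5, 6.7, 6.8, 6.3],   # sample_B, gene_1
     [3.1, 3.2, 3.3, 3.4, 3.1],   # sample_B, gene_2
     [4.0, 4.1, 4.2, 4.3, 4.0]],  # sample_B, gene_3
    index=pd.MultiIndex.from_product(
        [['sample_A', 'sample_B'], ['gene_1', 'gene_2', 'gene_3']],
        names=['sample', 'gene_id']),
    columns=[1, 2, 3, 4, 5],
)
X = profiles.to_numpy()

# silhouette_score needs between 2 and n_profiles - 1 clusters
scores = {}
for k in range(2, len(X)):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('best k by silhouette:', best_k)
```

You can then pass best_k as n_clusters in the code above, or simply compare the scores against the value you had in mind.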
Upvotes: 0
Reputation: 15293
Don't pivot the dataframe. This is possible with a call to kmeans2. How many clusters you want is up to you.
There's an infinite number of ways to visualise this, so let's arbitrarily pick one: plot the original points by all four variables, with position and value on the x and y axes, sample as the hue and gene_id as the marker style; plot the cluster centroids as crosses; and then circle all points in a colour corresponding to their cluster:
import numpy as np
import scipy.cluster.vq  # import the submodule explicitly; a bare "import scipy" may not load it on older SciPy
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.DataFrame({
'gene_id': ('gene_1', 'gene_1', 'gene_1', 'gene_1', 'gene_1',
'gene_1', 'gene_1', 'gene_1', 'gene_1', 'gene_1',
'gene_2', 'gene_2', 'gene_2', 'gene_2', 'gene_2',
'gene_2', 'gene_2', 'gene_2', 'gene_2', 'gene_2',
'gene_3', 'gene_3', 'gene_3', 'gene_3', 'gene_3',
'gene_3', 'gene_3', 'gene_3', 'gene_3', 'gene_3'),
'position': (1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5),
'value': (5.1, 5.5, 5.7, 6.0, 6.3,
6.3, 6.5, 6.7, 6.8, 5.1,
2.3, 2.5, 2.7, 3.0, 3.1,
3.1, 3.2, 3.3, 3.4, 2.3,
3.7, 3.8, 3.9, 4.0, 4.0,
4.0, 4.1, 4.2, 4.3, 3.7),
'sample': ('sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A',
'sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A',
'sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A')
})
centroid_data, df['cluster_label'] = scipy.cluster.vq.kmeans2(
    data=df[['position', 'value']], k=4, seed=0,
)
centroids = pd.DataFrame(
    index=pd.RangeIndex(name='cluster_label', stop=len(centroid_data)),
    columns=('position', 'value'),
    data=centroid_data,
)
print(df)
print(centroids)
fig, ax = plt.subplots()
sns.scatterplot(ax=ax, data=df, x='position', y='value', hue='sample', style='gene_id')
cmap = plt.cm.rainbow(np.linspace(0, 1, len(centroids)))
for (label, cluster), color in zip(df.groupby('cluster_label'), cmap):
    ax.scatter(
        [centroids.loc[label, 'position']],
        [centroids.loc[label, 'value']], s=60, color=color, marker='+',
    )
    ax.scatter(
        cluster['position'], cluster['value'], s=120, color=color, marker='o', facecolors='none',
    )
plt.show()
gene_id position value sample cluster_label
0 gene_1 1 5.1 sample_A 1
1 gene_1 2 5.5 sample_A 1
2 gene_1 3 5.7 sample_A 1
3 gene_1 4 6.0 sample_A 0
4 gene_1 5 6.3 sample_B 0
5 gene_1 1 6.3 sample_B 1
6 gene_1 2 6.5 sample_B 1
7 gene_1 3 6.7 sample_B 1
8 gene_1 4 6.8 sample_B 0
9 gene_1 5 5.1 sample_A 0
10 gene_2 1 2.3 sample_A 3
11 gene_2 2 2.5 sample_A 3
12 gene_2 3 2.7 sample_A 3
13 gene_2 4 3.0 sample_A 2
14 gene_2 5 3.1 sample_B 2
15 gene_2 1 3.1 sample_B 3
16 gene_2 2 3.2 sample_B 3
17 gene_2 3 3.3 sample_B 3
18 gene_2 4 3.4 sample_B 2
19 gene_2 5 2.3 sample_A 2
20 gene_3 1 3.7 sample_A 3
21 gene_3 2 3.8 sample_A 3
22 gene_3 3 3.9 sample_A 3
23 gene_3 4 4.0 sample_A 2
24 gene_3 5 4.0 sample_B 2
25 gene_3 1 4.0 sample_B 3
26 gene_3 2 4.1 sample_B 3
27 gene_3 3 4.2 sample_B 3
28 gene_3 4 4.3 sample_B 2
29 gene_3 5 3.7 sample_A 2
position value
cluster_label
0 4.5 6.050000
1 2.0 5.966667
2 4.5 3.475000
3 2.0 3.400000
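To see how these clusters correspond to samples, one option is a cross-tabulation of cluster label against sample. A minimal sketch: to keep it short and reproducible it hard-codes the sample and cluster_label columns exactly as printed above, rather than re-running kmeans2 (whose labels can differ between SciPy versions):

```python
import pandas as pd

# Columns copied from the printed dataframe above (rows 0..29).
# Per gene, the sample pattern is 4x sample_A, 5x sample_B, 1x sample_A.
samples = (['sample_A'] * 4 + ['sample_B'] * 5 + ['sample_A']) * 3
labels = [
    1, 1, 1, 0, 0, 1, 1, 1, 0, 0,   # gene_1
    3, 3, 3, 2, 2, 3, 3, 3, 2, 2,   # gene_2
    3, 3, 3, 2, 2, 3, 3, 3, 2, 2,   # gene_3
]
result = pd.DataFrame({'sample': samples, 'cluster_label': labels})

# Rows: cluster labels; columns: samples; cells: number of points
counts = pd.crosstab(result['cluster_label'], result['sample'])
print(counts)
```

With the labels printed above, every cluster holds the same number of sample_A and sample_B points (2, 3, 4 and 6 of each for clusters 0 to 3). On a live dataframe the same call would be pd.crosstab(df['cluster_label'], df['sample']).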
Upvotes: 1