Robin_hood_963

Reputation: 65

Group the lines of an array based on a number

I have an array with 100 rows (lines) and 5 columns. I would like to split the rows into separate arrays based on the number in the 5th column. The 5th column contains integers from 0 to N (0, 1, 2, ..., N).

So for N=2, the values in the 5th column will be 0, 1 or 2.

I would therefore like to create 3 arrays, holding the rows labelled 0, 1 and 2 respectively.

Here is my code in Python for N=2 (three clusters):

df_array_with_clusters = ...

for i in range(len(df_array_with_clusters)):
    if df_array_with_clusters[i, -1] == 0:
        cluster_0[i, :] = df_array_with_clusters[i, :-1]
    elif df_array_with_clusters[i, -1] == 1:
        cluster_1[i, :] = df_array_with_clusters[i, :-1]
    else:
        cluster_2[i, :] = df_array_with_clusters[i, :-1]

Thanks.

Upvotes: 1

Views: 271

Answers (1)

joanis

Reputation: 12263

A solution using lists

Something like this should work for you:

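def distribute_into_clusters(data, N):
    # N is the number of clusters, i.e. the IDs run from 0 to N - 1.
    clusters = [[] for _ in range(N)]
    for row in data:
        # Cast to int: a float array yields float IDs, which cannot index a list.
        cluster_id = int(row[-1])
        # Store the row without its cluster ID column.
        clusters[cluster_id].append(row[:-1])
    return clusters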

This returns a list of clusters, each of which is a list of rows. Each row is a 1-D np.array with the cluster ID column removed.

If you want each cluster to be a 2-D array instead, change the return statement to this (with numpy imported as np):

    return [np.array(cluster) for cluster in clusters]
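
To make the shape of the result concrete, here is a small usage sketch. It repeats the list-based function so the block runs on its own. The sample values, the n_clusters parameter name and the int() cast are my assumptions, not something taken from your data.

```python
import numpy as np

def distribute_into_clusters(data, n_clusters):
    # n_clusters is the number of clusters (IDs 0 .. n_clusters - 1).
    clusters = [[] for _ in range(n_clusters)]
    for row in data:
        # int() guards against float IDs (assumption: data may be a float array).
        clusters[int(row[-1])].append(row[:-1])
    # Convert each list of rows into a 2-D array.
    return [np.array(cluster) for cluster in clusters]

# Hypothetical sample: 4 value columns plus a cluster ID in the last column.
data = np.array([
    [0.1, 0.2, 0.3, 0.4, 0],
    [1.1, 1.2, 1.3, 1.4, 2],
    [2.1, 2.2, 2.3, 2.4, 1],
    [3.1, 3.2, 3.3, 3.4, 0],
    [4.1, 4.2, 4.3, 4.4, 2],
    [5.1, 5.2, 5.3, 5.4, 2],
])

clusters = distribute_into_clusters(data, 3)
shapes = [c.shape for c in clusters]
print(shapes)  # [(2, 4), (1, 4), (3, 4)]
```

Note that the second argument is the number of clusters, so with IDs 0, 1, 2 you would pass 3.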

A NumPy solution

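Here's a second solution that selects each cluster's rows with NumPy boolean indexing instead of appending to lists. It may be more efficient, although the mask itself is still built with a Python loop over the rows.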

def distribute_into_clusters(data, N):
    return [
        data[[row[-1] == cluster_id for row in data]][:,:-1]
        for cluster_id in range(N)
    ]

  • [row[-1] == cluster_id for row in data] gives me a list of bools indicating which rows belong in cluster_id.
  • data[...] slices data, keeping only the rows where the bool is True.
  • [:,:-1] removes the cluster ID column.
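
The same three steps can also be sketched without the per-row Python loop, by comparing the whole last column at once. This is only a sketch under my own assumptions: the function name distribute_into_clusters_vectorized, the sample values, and deriving the cluster count from the largest ID are all made up for illustration.

```python
import numpy as np

def distribute_into_clusters_vectorized(data, n_clusters):
    # Cluster ID column, taken once for the whole array.
    labels = data[:, -1]
    # labels == cluster_id is a boolean mask over the rows;
    # ":-1" drops the cluster ID column from the selected rows.
    return [data[labels == cluster_id, :-1] for cluster_id in range(n_clusters)]

# Hypothetical sample: 2 value columns plus a cluster ID in the last column.
data = np.array([
    [10.0, 11.0, 0],
    [20.0, 21.0, 1],
    [30.0, 31.0, 0],
    [40.0, 41.0, 2],
])

# Assumption: IDs run from 0 to the largest value present in the column.
n_clusters = int(data[:, -1].max()) + 1

clusters = distribute_into_clusters_vectorized(data, n_clusters)
result = [c.tolist() for c in clusters]
print(result)
# [[[10.0, 11.0], [30.0, 31.0]], [[20.0, 21.0]], [[40.0, 41.0]]]
```

Each cluster here is already a 2-D array, and a cluster with no rows comes back as an empty array with the right number of columns.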

Upvotes: 1
