kirannambu
kirannambu

Reputation: 11

Excel group by substring

I am extremely new into Machine learning Feature of Python. I wanted to group i.e. create a cluster based on specific texts from rows. In Input Excel there are 3 columns Sr no, Name and Summary. I wanted to create a cluster based on the specific values from the summary text i.e. if the summary contains the text "Veg", then it should be in one cluster and if the text contains "Non Veg", then it should be in another cluster. Expected Output Output Excel, where the third column will contain the clustered value. All veg are grouped to Cluster 0 and Non Veg to cluster 1

K-means can solve this for me. But how to cluster based on the text from the summary. Kindly help. Thanks in advance.

Upvotes: 0

Views: 2390

Answers (2)

Vivek Sable
Vivek Sable

Reputation: 10213

You can use xlrd for read Excel file.

You can use pandas to read Excel file also.

Following Demo is with pandas

Steps

  1. Read Excel file and create Dataframe from it. pandas.read_excel method.
  2. Write a function which return cluster number according to Summary value in each row.

    Input to this function is row output is 0(Vegetarian), 1(Non Vegetarian), -1(not define)

  3. Apply this function to each row of Dataframe.

  4. Write final output back to Excel file by pandas.to_excel method.

code:

>>> import pandas as pd
>>> a = "43583564_input.xlsx"
>>> df = pd.read_excel(a)
>>> df
   sr. no Name              Summary
0       1   T1      I am Vegetarian
1       2   T2  I am Non Vegetarian
2       3   T3  I am Non Vegetarian
3       4   T4      I am Vegetarian
4       5   T5  I am Non Vegetarian
>>> def getCluster(row):
...     if row["Summary"]=="I am Non Vegetarian":
...         return 1
...     elif row["Summary"]=="I am Vegetarian":
...         return 0
...     else:
...         return -1
... 
>>> df["Cluster"] = df.apply(getCluster, axis=1)
>>> df
   sr. no Name              Summary  Cluster
0       1   T1      I am Vegetarian        0
1       2   T2  I am Non Vegetarian        1
2       3   T3  I am Non Vegetarian        1
3       4   T4      I am Vegetarian        0
4       5   T5  I am Non Vegetarian        1
>>> df.to_excel("43583564_output.xlsx")

Upvotes: 0

kabdulla
kabdulla

Reputation: 5419

I would go one further than suggestions in the comments and say that you don't need to use Python for this task. Why not just include the following formula in the cluster column:

=IF(ISNUMBER(SEARCH("non veg", D3)), 1, IF(ISNUMBER(SEARCH("veg", D3)), 0, -1))

Assuming the top-left corner of your tale is B2, and this is the formula in the first row (i.e. in cell E3 of the table). This should give 0 for any cells containing non veg, 1 for cells containing veg and -1 for any rows containing neither.

You can of course do something similar in Python as suggested by @juanpa.arrivillaga, but if your input and desired output are in excel, and there's an easy way to do it in excel, I would suggest that's easiest option.

Upvotes: 1

Related Questions