Reputation: 11
I am extremely new into Machine learning Feature of Python. I wanted to group i.e. create a cluster based on specific texts from rows. In there are 3 columns Sr no, Name and Summary. I wanted to create a cluster based on the specific values from the summary text i.e. if the summary contains the text "Veg", then it should be in one cluster and if the text contains "Non Veg", then it should be in another cluster. Expected Output , where the third column will contain the clustered value. All veg are grouped to Cluster 0 and Non Veg to cluster 1
K-means can solve this for me. But how to cluster based on the text from the summary. Kindly help. Thanks in advance.
Upvotes: 0
Views: 2390
Reputation: 10213
You can use xlrd
for read Excel file.
You can use pandas
to read Excel file also.
Following Demo is with pandas
Steps
Dataframe
from it. pandas.read_excel
method.Write a function which return cluster number according to Summary
value in each row.
Input to this function is row
output is 0(Vegetarian), 1(Non Vegetarian), -1(not define)
Apply this function to each row of Dataframe.
pandas.to_excel
method.code:
>>> import pandas as pd
>>> a = "43583564_input.xlsx"
>>> df = pd.read_excel(a)
>>> df
sr. no Name Summary
0 1 T1 I am Vegetarian
1 2 T2 I am Non Vegetarian
2 3 T3 I am Non Vegetarian
3 4 T4 I am Vegetarian
4 5 T5 I am Non Vegetarian
>>> def getCluster(row):
... if row["Summary"]=="I am Non Vegetarian":
... return 1
... elif row["Summary"]=="I am Vegetarian":
... return 0
... else:
... return -1
...
>>> df["Cluster"] = df.apply(getCluster, axis=1)
>>> df
sr. no Name Summary Cluster
0 1 T1 I am Vegetarian 0
1 2 T2 I am Non Vegetarian 1
2 3 T3 I am Non Vegetarian 1
3 4 T4 I am Vegetarian 0
4 5 T5 I am Non Vegetarian 1
>>> df.to_excel("43583564_output.xlsx")
Upvotes: 0
Reputation: 5419
I would go one further than suggestions in the comments and say that you don't need to use Python for this task. Why not just include the following formula in the cluster
column:
=IF(ISNUMBER(SEARCH("non veg", D3)), 1, IF(ISNUMBER(SEARCH("veg", D3)), 0, -1))
Assuming the top-left corner of your tale is B2, and this is the formula in the first row (i.e. in cell E3 of the table). This should give 0 for any cells containing non veg
, 1 for cells containing veg
and -1 for any rows containing neither.
You can of course do something similar in Python as suggested by @juanpa.arrivillaga, but if your input and desired output are in excel, and there's an easy way to do it in excel, I would suggest that's easiest option.
Upvotes: 1