Muhammad Rusli

Reputation: 97

Check data against another CSV using pandas

I have two CSV files. The first:

word,centroid
she,1
great,0
good,3
mother,2
father,2
After,4
before,4
.....

The second:

sentences,label
good mother,1
great father,1

I want to encode each sentence based on the cluster results. For the sentence "good mother", the word "good" is in centroid 3, so the array becomes [0,0,0,1,0]; the word "mother" is in centroid 2, so the array then becomes [0,0,1,1,0], and so on.

My code is complicated and wrong. Can anyone help me?

This is my code:

import pandas as pd
import re
array=[]
data = pd.read_csv('data/data_komentar.csv',encoding = "ISO-8859-1") 
df = pd.read_csv('data/hasil_cluster.csv',encoding = "ISO-8859-1") 
for index,row in data.iterrows():
    kalimat=row[0]
    words=re.sub(r'([^\s\w]|_)', '', str(kalimat))
    words= re.sub(r'[0-9]+', '', words)
    for word in words.split():    
        kata=word.lower()
        df = df[df.eq(kata)]
        if df.empty:
            print("empty")
        else:
            print(kata)
            if df['centroid;'] is 0:
                array=array+[1,0,0,0,0]
            if df['centroid'] is 1:
                array=array+[0,1,0,0,0]
            if df['centroid'] is 2:
                array=array+[0,0,1,0,0]
            if df['centroid;'] is 3:
                array=array+[0,0,0,1,0]
            if df['centroid;'] is 4:
                array=array+[0,0,0,0,1]
            print(array)

Upvotes: 0

Views: 63

Answers (1)

ilja

Reputation: 2692

You can use apply() on the sentences column of the DataFrame:

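import numpy as np

# Number of clusters, i.e. the length of each output array
MAX_CENTROIDS = 5

def get_centroids(sentence):
    # One counter per centroid, all starting at zero
    centroids = np.zeros(MAX_CENTROIDS, dtype=int)
    for word in sentence.split(' '):
        # Only words that appear in the cluster table contribute
        if word in df1['word'].values:
            # Look up the word's centroid and increment that position
            centroids[df1[df1['word']==word]['centroid'].values]+=1
    return centroids

# Compute one array per sentence and store it in a new column
df2['centroid'] = df2['sentences'].apply(get_centroids)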

Result df2: [image of the resulting DataFrame]

df1 is the DataFrame with your words and centroids, and df2 is the DataFrame with the sentences. You have to specify the number of centroids in MAX_CENTROIDS (the length of the centroid list).
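For reference, here is a self-contained sketch of the same idea that can be run without the CSV files. It rebuilds df1 and df2 from the sample rows in the question. The name word_to_centroid, the lowercasing of words, and returning a plain list instead of a NumPy array are assumptions made for this illustration, not part of the code above.

```python
import numpy as np
import pandas as pd

MAX_CENTROIDS = 5  # centroids 0..4 in the question's sample

# Sample rows copied from the question
df1 = pd.DataFrame({
    "word": ["she", "great", "good", "mother", "father", "After", "before"],
    "centroid": [1, 0, 3, 2, 2, 4, 4],
})
df2 = pd.DataFrame({
    "sentences": ["good mother", "great father"],
    "label": [1, 1],
})

# Build the word -> centroid lookup once instead of filtering df1 per word.
# Lowercasing both sides is an assumption (the sample mixes "After"/"before").
word_to_centroid = dict(zip(df1["word"].str.lower(), df1["centroid"]))

def get_centroids(sentence):
    # One counter per centroid; words missing from df1 are ignored
    centroids = np.zeros(MAX_CENTROIDS, dtype=int)
    for word in str(sentence).lower().split():
        idx = word_to_centroid.get(word)
        if idx is not None:
            centroids[idx] += 1
    return centroids.tolist()

df2["centroid"] = df2["sentences"].apply(get_centroids)
print(df2)
```

With the sample data, "good mother" gives [0, 0, 1, 1, 0] and "great father" gives [1, 0, 1, 0, 0]. Note that this counts occurrences, so a sentence with two words from the same centroid gets a 2 in that position; if you want a strict 0/1 indicator, assign `centroids[idx] = 1` instead.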

Edit

To read the data sample you provided:

import pandas as pd

# The encoding argument may not be needed on your system
df1 = pd.read_csv('hasil_cluster.csv', sep=',', encoding='iso-8859-1')

# Drop rows without a centroid
df1.dropna(inplace=True)

# Remove ; from every centroid value and convert the column to integers
df1['centroid'] = df1['centroid;'].apply(lambda x:str(x).replace(';', '')).astype(int)

# Remove the now unused column
df1.drop('centroid;', inplace=True, axis=1)
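To try the cleanup steps without the real file, the sketch below feeds a small in-memory CSV to read_csv. The sample text is an assumption about what hasil_cluster.csv looks like (a header named "centroid;" and values with a trailing semicolon), inferred from the column name used above; the row for "unknown" is a made-up example of a word without a centroid.

```python
import io
import pandas as pd

# Hypothetical stand-in for hasil_cluster.csv: trailing ';' on the header
# and on every centroid value, plus one row with no centroid at all.
raw = io.StringIO(
    "word,centroid;\n"
    "she,1;\n"
    "great,0;\n"
    "good,3;\n"
    "unknown,\n"
)

df1 = pd.read_csv(raw, sep=",")

# Drop rows without a centroid
df1 = df1.dropna()

# Strip the ';' and convert to integers using the vectorized string methods
df1["centroid"] = (
    df1["centroid;"].astype(str).str.replace(";", "", regex=False).astype(int)
)

# Remove the original, now unused column
df1 = df1.drop(columns="centroid;")
print(df1)
```

This avoids inplace=True and uses .str.replace instead of apply with a lambda, but the effect is the same as the steps above.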

Upvotes: 1
