Reputation: 335
I'm running text clustering on a table that contains medical terms, I want to cluster strings that have similar words, if two have have two words or more, should be included in one cluster more likely than if they only have one word in common.
I tried many techniques and I'm not getting any efficient results! I tried first using Levenshtein distance with both kmeans and AgglomerativeClustering( three linkage methods: ward, complete and avaerage). It returns poor results and this metric combined words having partially similar letters, like " dog" and " door".
I changed the distance metric into using TF-IDF and then run cosine similarity, and then I converted the similarity to distance by subtract each value by 1 ( distance = 1- similarity), cause I tried the wiki method by 2* acosine(similarity) and it returned nan values!
Anyway, with this distance metric, I tried also both algorithms, It return good clusters overall, except one huge one that don't contain similar words between them! no matter how I change the value of no of clusters, this huge cluster will still appear even If I choose large no of k ( close to n which is the length of input), and it's usually appearing in the beginning, either cluster 0, 1, 2, 3.. Why this is happening?? what I'm doing wrong? my dataset length is more than 5000. and this is part of the clusters output.
cluster no 0:['Prolonged INR', 'Prolonged PTT', 'Prolonged QT Interval']
cluster no 1:['GI bleeding', 'Gastrointestinal (GI) Bleeding', 'Lower GI bleeding']
cluster no 2:['ACS', 'Acetazolamide', 'Achondroplasia', 'Acrocyanosis', 'Acromegaly', 'Adenoidectomy', 'Adenomyosis', 'Afebrile', 'Antihistamine', 'Apheresis', 'Aplasia', 'Argatroban', 'Arthralgia', 'Arthrocentesis', 'Arthrography', 'Arthroplasty', 'Asbestosis', 'Ascorbate', 'Asian', 'Asterixis', 'Astigmatism', 'Astrocytoma', 'Asymptomatic', 'Atelectasis', 'Atherosclerosis', 'Atropine', 'Audiogram', 'Autonomic Dysreflexia', 'Autopsy', 'Bacteremia', 'Balanitis', 'Balanoposthitis', 'Breastfeeding', 'Breech Presentation', 'Bronchiectasis', 'Bronchiolitis', 'Bronchospasm', 'Cachexia', 'Caf� Au Lait Spot', 'Calcaneovalgus', 'Chalazion', 'Chemistry Panels', 'Chills', 'Cholelithiasis', 'Cholera', 'Chondroblastoma', 'Chondrosarcoma', 'Chorioamnionitis', 'Chorionic Villus Sampling (CVS)', 'Choroid Plexus Papilloma (CPP)', 'Circumcision', 'Citrate', 'Claudication', 'Clonus', 'Coccidioidomycosis', 'Coccygodynia', 'Costochondritis', 'Craniectomy', 'Craniofacial Anomalies', 'Craniopharyngioma', 'Craniosynostosis', 'Craniotomy', 'Cri du Chat', 'Croup', 'Cryofibrinogen', 'Cryoglobulin', 'Cyclophosphamide', 'Cystometry', 'D-Dimer', 'Dacryocystitis', 'Dacryocystorhinostomy (DCR)', 'Dacryostenosis', 'Dantrolene', 'Deformational Plagiocephaly', 'Delusions', 'Demeclocycline', 'Dentures', 'Dermabrasion', 'Deviated Septum', 'Electrolytes', 'Electronystagmography (ENG)', 'Embolectomy', 'Emmetropia', 'Empyema', 'Enchondroma', 'Encopresis', 'Enterovirus', 'Ependymoma', 'Epididymitis', 'Epirubicin', 'Episiotomy', 'Epispadias', 'Eribulin', 'Erythroderma', 'Esophagectomy', 'Essential Tremor', 'Foraminotomy', 'Frostnip/Frostbite', 'Gallstones', 'Gastritis', 'Gastrojejunostomy', 'Gastroschisis', 'Giardiasis', 'Gingivitis', 'Gingivostomatitis', 'Glaucoma', 'Gliomas', 'Glomerulonephritis', 'Glomerulosclerosis', 'Group B Streptococcus', 'Herpangina', 'Hiccups', 'Hidradenitis Suppurativa', 'Hirsutism', 'Hookworm', 'Hordeolum (Stye)', 'Hydatidiform Mole', 'Hydration', 'Hydrocelectomy', 'Hydrops Fetalis', 'Hyperbilirubinemia', 'Hyperlipidemia', 'Hyperopia', 'Hyperphosphatemia', 'Hyperreflexia', 'Hypnosis', 'Hypoparathyroidism', 'Hypopituitarism', 'Hypovolemia', 'Hypoxia', 'Hysterosalpingogram (HSG)', 'Hysteroscopy', 'Intussusception', 'Irritability', 'Isoproterenol', 'Ixabepilone', 'Jewish', 'Karyotype', 'Keratoconus', 'Ketonemia', 'Ketonuria', 'Kyphoplasty', 'Kyphosis', 'Labyrinthitis', 'Lactulose', 'Laminectomy', 'Laminotomy', 'Lapatinib', 'Laryngectomy', 'Laryngitis', 'Laryngomalacia', 'Laryngoscopy', 'Laxative', 'Lymphadenitis', 'Lymphangitis', 'Lymphocele', 'Malaise', 'Malaria', 'Malocclusion', 'Mammography', 'Mannitol', 'Mastalgia', 'Mastectomy', 'Mastitis', 'Mastoidectomy', 'Mastopexy', 'Mediastinoscopy', 'Megaureter', 'Melena', 'Meningioma', 'Menopause', 'Menorrhagia', 'Menstruation', 'Metatarsalgia', 'Metatarsus Adductus', 'Metoclopramide', 'Neomycin', 'Nephrectomy', 'Nephrolithiasis', 'Neuromyelitis Optica', 'Neurosonography', 'Neurosurgery', 'Nocturnal Enuresis', 'Norovirus', 'Pericardectomy', 'Perimenopause', 'Periventricular Leukomalacia', 'Pertuzumab', 'Phimosis', 'Phobia', 'Photorefractive Keratectomy (PRK)', 'Phytophotodermatitis', 'Pilomatrixoma', 'Pinworms', 'Pityriasis Rosea', 'Plain radiograph', 'Platelets', 'Pleurisy', 'Pneumococcus', 'Pneumoconiosis', 'Pneumonectomy', 'Psychosis', 'Pterygium', 'Ptosis', 'Pulpitis (Toothache)', 'Pyeloplasty', 'Quantitative Immunoglobulins', 'Rabies', 'Rales', 'Red wale marks', 'Refractive Error', 'Smallpox', 'Smoking Cessation', 'Snoring', 'Sonohysterography', 'Spasmodic Dysphonia', 'Spina Bifida', 'Terlipressin', 'Tetany', 'Thoracotomy', 'Thrombocythemia', 'Thrombophilia', 'Thrombophlebitis', 'Thyroidectomy', 'Tinnitus', 'Tonsillar enlargement', 'Torn Annulus', 'Toxoplasmosis', 'Trabeculectomy', 'Ureterolysis', 'Ureteroplasty', 'Ureterosigmoidostomy', 'Urethritis', 'Urethroplasty', 'Uroflowmetry', 'Urostomy', 'Urticaria (Hives)', 'Uvulitis', 'Uvulopalatopharyngoplasty (UPPP)', 'Valsalva Maneuver', 'Varicella (Chickenpox)', 'Vasculitis', 'Vasopressin', 'Vasopressor', 'Venography', 'Ventriculostomy', 'Vertebroplasty', 'Vesicoureteral Reflux (VUR)', 'Osteochondritis Dissecans (OCD)', 'Osteochondroma', 'Osteogenesis Imperfecta (OI)', 'Osteopenia', 'Osteophyte formation', 'Osteosarcoma', 'Overuse Injuries', 'Overweight', 'Pallister Killian', 'Pallor', 'Palpitation', 'Palpitations', 'Paraesthesia', 'Paranoia', 'Paraphimosis', 'Parasomnias', 'Parathyroidectomy', 'Paronychia', 'Parotidectomy', 'Peaked T waves', 'Pemphigus Vulgaris', 'Lepirudin', 'Lethargy', 'Letrozole', 'Lichen Planus', 'Liposarcoma', 'Listeriosis', 'Living will', 'Lordosis', 'Excessive urination', 'Exemestane', 'Exploratory Laparotomy', 'Facelift (Rhytidectomy)', 'Fainting', 'Fibrinogen', 'Fibromyalgia', 'Fluorouracil', 'Folliculitis', 'Fondaparinux', 'Bedbound', 'Bedrest', 'Bevacizumab', 'BiPAP', 'Biloma', 'Birthmark', 'Bisphosphonate', 'Bivalirudin', 'Blepharitis', 'Blepharoplasty', 'Blindness', 'Blister', 'Bloodborne Pathogens', 'Allopurinol', 'Alopecia', 'Amblyopia', 'Amenorrhea', 'Amniocentesis', 'Anastrozole', 'Anencephaly', 'Angiodysplasia', 'Angioembolization', 'Ankyloglossia', 'Ankylosing Spondylitis', 'Haptoglobin', 'HbA1C', 'Heatstroke', 'Height', 'Heliox', 'Hematemesis', 'Hematochezia', 'Hematocrit', 'Hematology', 'Hemifacial Microsomia', 'Hemochromatosis', 'Hemoglobinuria', 'Hemophagocytic Lymphohistiocytosis (HLH)', 'Hemothorax', 'Hepatoblastoma', 'Hepatomegaly', 'Hepatosplenomegaly', 'Hepatotoxicity', 'Her2neu', 'IgG Deficiencies', 'Ileostomy', 'Impetigo', 'Improving', 'Impulsiveness', 'Incontinentia Pigmenti', 'Restlessness', 'Retinitis Pigmentosa', 'Retinoblastoma', 'Reversible Dementias', 'Rhabdomyosarcoma', 'Rhinoplasty', 'Rifaximin', 'Rosacea', 'Roseola', 'STEMI', 'Sacroiliitis', 'Scabies', 'Schistocytes', 'Sciatica', 'Scleral Buckling', 'Scleroderma', 'Sclerotherapy', 'Scotoma', 'Selective Mutism', 'Digitalization', 'Dihydroergotamine', 'Discogram', 'Dislocations', 'Disorientation', 'Diverticulosis', 'Docetaxel', 'Domperidone', 'Dopamine', 'Doxorubicin', 'Drooling', 'Drowsiness', 'Duodenitis', "Dupuytren's Contracture", 'Dyskeratosis Congenita', 'Dyslipidemia', 'Dysmenorrhea', 'Dysphasia', 'Dyssomnias', 'Dysthymia', 'Dysuria', 'ESR', 'Eclampsia', 'Ectropion (Eublepharon)', 'Ehrlichiosis', 'Translocations', 'Transverse Myelitis', 'Trastuzumab', 'Trigeminal Neuralgia', 'Tympanoplasty', 'Unconscious', 'Underweight', 'Undescended Testes (Cryptorchidism)', 'Ureter obstructed', 'Colchicine', 'Coldness', 'Colectomy', 'Coloboma', 'Colostomy', 'Colposcopy', 'Comfort Measures Only (CMO)', 'Comorbid conditions', 'Compromised local circulation', 'Conivaptan', 'Constipation', 'Continence', 'Cor Pulmonale', 'Splinters', 'Spondylolisthesis', 'Spondylolysis', 'Stapedectomy', 'Steroid', 'Stillbirth', 'Stomatitis', 'Strabismus (Crossed Eyes)', 'Stridor', 'Stupor', 'Suicide plan', 'Sunburn', 'Suprasternal retractions', 'Sympathectomy', 'Tapeworm', 'Tattoo', 'Tau/A Beta42', 'Teething', 'Telangiectasias', 'Temper Tantrum', 'Temporal Arteritis', 'Microbiology', 'Microcephaly', 'Microdiskectomy', 'Micropenis', 'Midodrine', 'Miscarriage', 'Modified duke criteria', 'Molluscum Contagiosum', 'Monoamniotic twins', 'Mosaicism', 'Motorcycle accident', 'Myalgias', 'Myasthenia Gravis', 'Myelogram', 'Myoclonus', 'Myoglobinuria', 'Myopia', 'Myositis', 'Myxedema', 'NSAID', 'Narcolepsy', 'Nausea', 'Poliomyelitis', 'Poly-pharmacy', 'Polyhydramnios (Hydramnios)', 'Polymyalgia Rheumatica', 'Polymyositis', 'Postictal State', 'Presbycusis', 'Presbyopia', 'Presyncope', 'Proctectomy', 'Proctocolectomy', 'Pruritis Ani', 'Pseudotumor Cerebri', 'Vinorelbine', 'Vitrectomy', 'Voiding Cystourethrogram (VCUG)', 'Vomit', 'Vulvitis', "Wegener's Granulomatosis", 'Whiplash', 'Widening QRS', 'Wrinkles', 'X-linked Agammaglobulinemia', 'YAG Capsulotomy', 'Yersiniosis', 'caffeine', 'coagulopathy', 'dexamethasone', 'Infliximab', 'Insomnia', 'Insulinoma', 'Intravenous contrast extravasation', 'Obtundation', 'Octreotide', 'Odynophagia', 'Oligodendroglioma', 'Oligohydramnios', 'Oliguria', 'Omphalocele', 'Onychomycosis', 'Oophorectomy', 'Orchiectomy', 'Orchitis', 'Orthopnea', 'Carboplatin', 'Cardiomegaly', 'Cataracts', 'Cecostomy', 'Cephalopelvic Disproportion (CPD)']
cluster no 3:['Brain Malignancy', 'Brain metastasis']
cluster no 4:['Pubic Lice', 'Lice', 'Head Lice']
cluster no 5:['Assistive, Adaptive, Supportive or Protective Device Fitting', 'Gait Training Using an Assistive Device', 'Unsteady gait']
cluster no 6:['Removal of Soft Tissue Foreign Body', 'Soft Tissue Foreign Body']
cluster no 7:['Necrotizing pneumonia', 'Pneumocystis Pneumonia', 'Pneumocystis pneumonia', 'Pneumonia', 'Pneumonia', 'Mycoplasma Pneumonia', 'Walking Pneumonia']
cluster no 8:['Esophageal Atresia', 'Esophageal Dilation', 'Esophageal Manometry', 'Esophageal ring/web', 'Esophageal stricture']
What am I doing wrong? Are my techniques are wrong here? Here is my code, I change to other techniques easily using sklearn packages,:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import pprint
my_list = ['Cervical Cryotherapy', 'Cervical Disk Replacement Surgery', 'Cervical Disk Rupture', 'Cervical Disk Surgery', 'Cervical Epidural Injection', 'Cervical Fracture (exclude uncomplicated compression fractures)', 'Cervical Insufficiency (Cervical Incompetence)', 'Cervical Neck Brace', 'Cervical Radiculopathy', 'Cervical Spinal Fusion', 'Cervical Spine Disorder', 'Cervical Spondylosis', 'Cervical Subluxation', 'Cervical dilation', 'Cervical dislocation', 'Cervical effacement', 'Cervical ripening procedure', 'Cervicitis', 'Cervicitis (Non-STD)', 'Cervicitis (STD)', 'Cervix', 'Cervix closed', 'Cesarean Section (C-Section)', 'Cesarean section procedure', 'Chagas Disease']
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(my_list)
#print (tfidf_matrix.shape)
k=len(my_list)
dist = np.zeros((k,k))
for i in range(k):
dist[i] = cosine_similarity(tfidf_matrix[i:i+1], tfidf_matrix)
#print(dist
dist1 = np.subtract(np.ones((k,k),dtype=np.float), dist) ## convert to distance
#print(dist1)
data2=np.asarray(dist1)
arr_3d = data2.reshape((1,k,k))
#print(arr_3d)
for i in range(len(arr_3d)):
km = KMeans(n_clusters=5, init='k-means++')
km = km.fit(arr_3d[i])
centers = km.cluster_centers_
labels = km.labels_
print (labels)
print(type(labels))
Groups = {}
for element, label in zip(my_list, labels):
print 'element', element
print 'label', label
try:
Groups[str(label)].append(element)
except:
Groups[str(label)] = [element]
pprint.pprint(Groups)
EDIT: I'm using now only cosine similarity, and getting the same problem, big cluster with unrelated words, so it's not tf-idf problem!
WORD = re.compile(r'\w+')
def get_cosine(vec1, vec2):
intersection = set(vec1.keys()) & set(vec2.keys())
numerator = sum([vec1[x] * vec2[x] for x in intersection])
sum1 = sum([vec1[x]**2 for x in vec1.keys()])
sum2 = sum([vec2[x]**2 for x in vec2.keys()])
denominator = math.sqrt(sum1) * math.sqrt(sum2)
if not denominator:
return 0.0
else:
return float(numerator) / denominator
def text_to_vector(text):
words = WORD.findall(text)
return Counter(words)
k=len(my_list)
data1 = np.zeros((k,k))
for i,string1 in enumerate(my_list):
for j,string2 in enumerate(my_list):
data1[i][j] = 1-get_cosine(text_to_vector(string1), text_to_vector(string2))
print(data1)
k=len(my_list)
data2=np.asarray(data1)
arr_3d = data2.reshape((1,k,k))
Edit: I run LSA instead of TF-IDF which should be suitable for short text, but I got very very bad results! clusters that don't match:
vectorizer = CountVectorizer(min_df = 1, stop_words = 'english')
dtm = vectorizer.fit_transform(my_list)
lsa = TruncatedSVD(2, algorithm = 'arpack')
dtm_lsa = lsa.fit_transform(dtm)
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
similarity = np.asarray(numpy.asmatrix(dtm_lsa) * numpy.asmatrix(dtm_lsa).T)
#print(1-similarity)
k=len(my_list)
dist1 = np.subtract(np.ones((k,k),dtype=np.float), similarity)
#dist1.astype(float)
print(dist1)
Upvotes: 2
Views: 1674
Reputation: 77505
k-means is based on variance minimization.
It minimizes the sum of squared deviations, (x[i]-center[i])**2
for every object x
, the dimensions i
and the optimum (smallest cost) center center
. It cannot minimize arbitrary distances (see the many many many many question on this matter here).
There are two fatal problems in your code:
Upvotes: 1