Text labeling with machine learning

Question

I want to label a bunch of bank transactions according to a set of predefined classes (example below; it's a map in Clojure). I tried a naive Bayes approach, but sometimes it gives me a totally wrong label.

According to my research, I should use a supervised ML algorithm, something like a linear SVM tuned for multiclass classification. The problem is that I don't really know anything about ML. The second problem is that most Clojure libs are outdated.

{:label "5339134-17-CPR-FARMODISSEIA LD", :value -13271 :class :health}
{:label "PAG.SERV. 10297 779747511", :value -2889 :class :utilities}
{:label "5339134-14-CPR-GREEN PEPER", :value -1785 :class :restaurants}
{:label "5339134-03-LEV-Av Alm Kings", :value -4000 :class :atm}
{:label "5339134-02-LEV-Big Field, 1", :value -7000 :class :atm}
{:label "IMPOSTO DE SELO", :value -17 :class :banking}

Most of the similar transactions have like 90% similar text (see e.g. :atm), so I believe this should be an easy problem.

My questions:

What algorithms can I use?
How should I prepare the data? I believe I only have two features, the tx label and the tx value. Some tutorials I see use a bunch of vectors, but I don't know if/how to convert the string data to the proper ML format.

Any sample in either Clojure or Java will be greatly appreciated.

Sam Estep · Accepted Answer

Since you said in your question that

most of the similar transactions have like 90% similar text

I thought it would make sense to first figure out which transaction labels are similar to each other and group them together. Then you have a limited number of groups, and the group that each label falls into can be used as a nominal attribute in place of the text itself. If transactions in the same class have similar label text, then hopefully this should allow the classification algorithm to easily draw correlations between label and class.
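
To make "similar label text" concrete, here is a minimal sketch of Levenshtein (edit) distance in plain Clojure. It is written only as an illustration; the solution below relies on clj-fuzzy's implementation instead, and the names `levenshtein`, `atm-1`, `atm-2`, and `banking` are made up for this example.

```clojure
(defn levenshtein
  "Number of single-character insertions, deletions, and substitutions
  needed to turn string a into string b."
  [a b]
  (let [next-row
        (fn [prev ca]
          ;; Build the row for character ca from the previous row.
          (reduce (fn [row [j cb]]
                    (conj row
                          (min (inc (peek row))         ; skip cb
                               (inc (nth prev (inc j))) ; skip ca
                               (+ (nth prev j)          ; match or substitute
                                  (if (= ca cb) 0 1)))))
                  [(inc (first prev))]
                  (map-indexed vector b)))]
    ;; Start from the row for the empty prefix of a: [0 1 ... (count b)].
    (peek (reduce next-row (vec (range (inc (count b)))) a))))

(def atm-1 "5339134-03-LEV-Av Alm Kings")
(def atm-2 "5339134-02-LEV-Big Field, 1")
(def banking "IMPOSTO DE SELO")

(levenshtein "kitten" "sitting") ; => 3

;; The two ATM labels are closer to each other than to the banking label.
(< (levenshtein atm-1 atm-2) (levenshtein atm-1 banking)) ; => true
```

Labels whose pairwise distance stays under some threshold can then be treated as one group, which is what the clustering step does.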

I tried implementing a solution using these dependencies:

[[org.clojure/clojure "1.8.0"]
 [clj-fuzzy "0.4.0"]
 [cc.artifice/clj-ml "0.8.5"]
 [rm-hull/clustering "0.1.3"]]

After clustering the labels, the naïve Bayes approach seemed to work fine for me:

(require '[clj-fuzzy.metrics :as fm]
         '[clj-ml.classifiers :as classify]
         '[clj-ml.data :as data]
         '[clustering.core.qt :as qt])

(def data
  [{:label "5339134-17-CPR-FARMODISSEIA LD", :value -13271 :class :health}
   {:label "PAG.SERV. 10297 779747511", :value -2889 :class :utilities}
   {:label "5339134-14-CPR-GREEN PEPER", :value -1785 :class :restaurants}
   {:label "5339134-03-LEV-Av Alm Kings", :value -4000 :class :atm}
   {:label "5339134-02-LEV-Big Field, 1", :value -7000 :class :atm}
   {:label "IMPOSTO DE SELO", :value -17 :class :banking}])

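;; Group the labels with QT clustering on Levenshtein distance, then map
;; each label to a keyword naming its cluster.
(def clusters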
  (into {}
        (for [cluster (qt/cluster fm/levenshtein (map :label data) 13 1)
              s cluster]
          [s (keyword (str "cluster" (hash cluster)))])))

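;; :value is a numeric attribute; :label (the cluster keyword) and :class
;; are nominal attributes.
(def dataset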
  (-> (data/make-dataset "my-data"
                         [:value
                          {:label (seq (set (vals clusters)))}
                          {:class [:health :utilities :restaurants :atm :banking]}]
                         (map (juxt :value (comp clusters :label) :class) data))
      (data/dataset-set-class :class)))

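;; Map each original transaction to its instance in the dataset, so it can
;; be handed to the classifier.
(def data-map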
  (let [m (into {} (map (juxt data/instance-to-map identity)
                        (data/dataset-seq dataset)))]
    (into {} (for [x data]
               [x (-> x (update :label clusters) (update :value double) m)]))))

(def classifier
  (-> (classify/make-classifier :bayes :naive)
      (classify/classifier-train dataset)))

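;; Predict a class for every transaction (class value blanked out first)
;; and attach the result under :predicted.
(defn classify-all []
  (for [x data]
    (->> x
         data-map
         data/instance-set-class-missing
         (classify/classifier-classify classifier)
         (assoc x :predicted))))

(run! prn (classify-all))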
;; {:label "5339134-17-CPR-FARMODISSEIA LD", :value -13271, :class :health, :predicted :health}
;; {:label "PAG.SERV. 10297 779747511", :value -2889, :class :utilities, :predicted :utilities}
;; {:label "5339134-14-CPR-GREEN PEPER", :value -1785, :class :restaurants, :predicted :restaurants}
;; {:label "5339134-03-LEV-Av Alm Kings", :value -4000, :class :atm, :predicted :atm}
;; {:label "5339134-02-LEV-Big Field, 1", :value -7000, :class :atm, :predicted :atm}
;; {:label "IMPOSTO DE SELO", :value -17, :class :banking, :predicted :banking}

I'm quite new to ML though, so please let me know if there's something I've overlooked.

Also, in my implementation, I use QT clustering to do a one-time partition of the labels in the input dataset, but if the goal is to continue incorporating new data over time, it may be necessary to use a streaming clustering algorithm instead. It looks like this may be possible with k-means, but that would require implementation of a "Levenshtein averaging" function. In addition, I'm not sure if the clustering library I'm using supports iteration upon its initial result, so further implementation may be necessary.
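
Short of a true streaming algorithm, one simpler stopgap (a different technique, not something tried in this answer) is to give each new label the cluster of its nearest already-clustered label, and to treat it as unknown when nothing is close enough. A minimal sketch, assuming a label-to-cluster map shaped like `clusters` above; `assign-cluster`, `max-distance`, and the toy `length-distance` are hypothetical names introduced for illustration:

```clojure
(defn assign-cluster
  "Returns the cluster of the known label nearest to new-label, or nil
  when the nearest known label is farther away than max-distance."
  [distance label->cluster max-distance new-label]
  (when (seq label->cluster)
    (let [[nearest d] (apply min-key second
                             (for [known (keys label->cluster)]
                               [known (distance known new-label)]))]
      (when (<= d max-distance)
        (get label->cluster nearest)))))

;; Toy distance for demonstration only: difference in string length.
(defn length-distance [a b]
  (Math/abs (- (count a) (count b))))

(def known {"aaaa" :cluster-a, "bbbbbbbbbb" :cluster-b})

(assign-cluster length-distance known 2 "ccccc")   ; => :cluster-a
(assign-cluster length-distance known 2 "ccccccc") ; => nil
```

With the setup above, the call would look like `(assign-cluster fm/levenshtein clusters max-distance new-label)`, with a threshold tuned to the data. A nil result signals a label that needs a new cluster or a manual look.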
