Miguel Ping

Reputation: 18337

Text labeling with machine learning

I want to label a bunch of bank transactions according to a set of predefined classes (example below; each transaction is a map in Clojure). I tried a naive Bayes approach, but it sometimes assigns a completely wrong label.

According to my research, I should use a supervised ML algorithm, something like a linear SVM tuned for multiclass classification. The problem is that I don't really know anything about ML. A second problem is that most Clojure libraries are outdated.

{:label "5339134-17-CPR-FARMODISSEIA LD", :value -13271 :class :health}
{:label "PAG.SERV. 10297 779747511", :value -2889 :class :utilities}
{:label "5339134-14-CPR-GREEN PEPER", :value -1785 :class :restaurants}
{:label "5339134-03-LEV-Av Alm Kings", :value -4000 :class :atm}
{:label "5339134-02-LEV-Big Field, 1", :value -7000 :class :atm}
{:label "IMPOSTO DE SELO", :value -17 :class :banking}

Most of the similar transactions have like 90% similar text (see, e.g., the :atm entries), so I believe this should be an easy problem.

My questions:

Any sample in either clj or java will be greatly appreciated.

Upvotes: 2

Views: 922

Answers (1)

Sam Estep

Reputation: 13294

Since you said in your question that

most of the similar transactions have like 90% similar text

I thought it would make sense to first figure out which transaction labels are similar to each other and group them together. Then you have a limited number of groups, and the group that each label falls into can be used as a nominal attribute in place of the text itself. If transactions in the same class have similar label text, then hopefully this should allow the classification algorithm to easily draw correlations between label and class.
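Since the question also asks for Java, here is a minimal sketch of the grouping idea in plain Java. It is not the QT clustering used in my Clojure code: it is a simpler greedy grouping that I am swapping in for illustration, and the class name `LabelGrouping`, its method names, and the `maxDistance` parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names here are illustrative, not from any library.
final class LabelGrouping {

    // Standard dynamic-programming Levenshtein edit distance (two-row variant).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Greedy grouping (an assumption, not QT clustering): each label joins the
    // first group whose first member is within maxDistance, else starts a new group.
    static List<List<String>> group(List<String> labels, int maxDistance) {
        List<List<String>> groups = new ArrayList<>();
        for (String label : labels) {
            List<String> home = null;
            for (List<String> candidate : groups) {
                if (levenshtein(candidate.get(0), label) <= maxDistance) {
                    home = candidate;
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                groups.add(home);
            }
            home.add(label);
        }
        return groups;
    }
}
```

The two :atm labels from the question have the same length and differ in at most 13 character positions, so `LabelGrouping.group(labels, 13)` puts them in the same group. The index of a label's group can then stand in for the label text as a nominal attribute.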

I tried implementing a solution using these dependencies:

[[org.clojure/clojure "1.8.0"]
 [clj-fuzzy "0.4.0"]
 [cc.artifice/clj-ml "0.8.5"]
 [rm-hull/clustering "0.1.3"]]

After clustering the labels, the naïve Bayes approach seemed to work fine for me:

(require '[clj-fuzzy.metrics :as fm]
         '[clj-ml.classifiers :as classify]
         '[clj-ml.data :as data]
         '[clustering.core.qt :as qt])

(def data
  [{:label "5339134-17-CPR-FARMODISSEIA LD", :value -13271 :class :health}
   {:label "PAG.SERV. 10297 779747511", :value -2889 :class :utilities}
   {:label "5339134-14-CPR-GREEN PEPER", :value -1785 :class :restaurants}
   {:label "5339134-03-LEV-Av Alm Kings", :value -4000 :class :atm}
   {:label "5339134-02-LEV-Big Field, 1", :value -7000 :class :atm}
   {:label "IMPOSTO DE SELO", :value -17 :class :banking}])

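;; Cluster the labels by Levenshtein distance using QT clustering, then
;; map every label to a keyword that names the cluster it belongs to.
(def clusters
  (into {}
        (for [cluster (qt/cluster fm/levenshtein (map :label data) 13 1)
              s cluster]
          [s (keyword (str "cluster" (hash cluster)))])))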

(def dataset
  (-> (data/make-dataset "my-data"
                         [:value
                          {:label (seq (set (vals clusters)))}
                          {:class [:health :utilities :restaurants :atm :banking]}]
                         (map (juxt :value (comp clusters :label) :class) data))
      (data/dataset-set-class :class)))

;; Look up the dataset instance that corresponds to each original record.
(def data-map
  (let [m (into {} (map (juxt data/instance-to-map identity)
                        (data/dataset-seq dataset)))]
    (into {} (for [x data]
               [x (-> x (update :label clusters) (update :value double) m)]))))

(def classifier
  (-> (classify/make-classifier :bayes :naive)
      (classify/classifier-train dataset)))

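;; Classify every record with its class hidden and attach the prediction.
;; Note that these are the same records the classifier was trained on.
(defn predictions []
  (for [x data]
    (->> x
         data-map
         data/instance-set-class-missing
         (classify/classifier-classify classifier)
         (assoc x :predicted))))

(run! prn (predictions))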
;; {:label "5339134-17-CPR-FARMODISSEIA LD", :value -13271, :class :health, :predicted :health}
;; {:label "PAG.SERV. 10297 779747511", :value -2889, :class :utilities, :predicted :utilities}
;; {:label "5339134-14-CPR-GREEN PEPER", :value -1785, :class :restaurants, :predicted :restaurants}
;; {:label "5339134-03-LEV-Av Alm Kings", :value -4000, :class :atm, :predicted :atm}
;; {:label "5339134-02-LEV-Big Field, 1", :value -7000, :class :atm, :predicted :atm}
;; {:label "IMPOSTO DE SELO", :value -17, :class :banking, :predicted :banking}

I'm quite new to ML though, so please let me know if there's something I've overlooked.

Also, in my implementation, I use QT clustering to do a one-time partition of the labels in the input dataset. If the goal is to keep incorporating new data over time, it may be necessary to use a streaming clustering algorithm instead. This may be possible with k-means, but that would require implementing a "Levenshtein averaging" function, since k-means needs a way to compute the mean of each cluster. In addition, I'm not sure whether the clustering library I'm using supports incrementally updating its initial result, so further implementation may be necessary.
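As a rough illustration of what incremental assignment could look like, here is a minimal Java sketch. It is neither k-means nor QT clustering: to sidestep the averaging problem, it assumes each cluster is represented by the first label that created it, and a new label either joins the nearest representative within a threshold or starts a new cluster. The class `StreamingLabelClusters` and all of its members are hypothetical names of my own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the names are illustrative and not from any library.
final class StreamingLabelClusters {
    // One representative label per cluster (assumption: the first label seen).
    private final List<String> representatives = new ArrayList<>();
    private final int maxDistance;

    StreamingLabelClusters(int maxDistance) {
        this.maxDistance = maxDistance;
    }

    // Returns the index of the cluster for this label. If no representative is
    // within maxDistance, the label becomes the representative of a new cluster.
    int assign(String label) {
        int best = -1;
        int bestDistance = Integer.MAX_VALUE;
        for (int i = 0; i < representatives.size(); i++) {
            int d = levenshtein(representatives.get(i), label);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        if (best >= 0 && bestDistance <= maxDistance) {
            return best;
        }
        representatives.add(label);
        return representatives.size() - 1;
    }

    int clusterCount() {
        return representatives.size();
    }

    // Standard dynamic-programming Levenshtein edit distance (two-row variant).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Because representatives never change, the result depends on the order in which labels arrive, so this is only a starting point. The classifier would also need retraining whenever a new cluster appears, since that adds a new value to the nominal attribute.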

Upvotes: 1
