;; I often find the default behaviour of exceptions in emacs/slime a bit
;; annoying.

;; A debugger window pops up, but that moves the focus away from where you're
;; working, and the new window then needs dismissing, and has often buggered up
;; the size of the other windows, etc...

;; Also, when playing with experimental clojure versions, the behaviour is a bit
;; random, and often the expression just gets swallowed, with no clue as to what
;; it was.

;; Anyway, I wrote this little macro, which turns an exception into a return value:

(defmacro tryc [& e]
  `(try ~@e
        (catch Exception a# a#)))

(tryc (java.io.File "."))
;; #<ClassCastException java.lang.ClassCastException: java.lang.Class cannot be cast to clojure.lang.IFn>

;; Obviously you should only use this at the top level!
;; I find I'm using it all the time when writing code in emacs.
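A quick sketch of the behaviour (my example; the macro is repeated here so the snippet stands alone): tryc passes ordinary values through unchanged, and hands back the exception object as a value when the body throws.

```clojure
;; Same tryc as above: wrap the body in try/catch, return any Exception as a value.
(defmacro tryc [& e]
  `(try ~@e
        (catch Exception a# a#)))

(tryc (+ 1 2)) ; 3 -- normal results come back unchanged
(tryc (/ 1 0)) ; the ArithmeticException object itself, returned rather than thrown
```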
Sunday, January 30, 2011
Turning Exceptions into Return Values
Finding Something in a Vector, Parsing CSV Files
;; Indexing Into a Vector: A Sequence of Answers

;; The other day, I had a csv file to read. (CSV files are comma-separated
;; values, often produced by exporting spreadsheets; a reasonably standard
;; human-readable format for grids of data.)

;; There's a lovely library which takes all the hard work out of reading csv
;; files.

;; The library's on clojars, so you use it by adding this to your maven pom.xml:

;; <dependency>
;;  <groupId>clojure-csv</groupId>
;;  <artifactId>clojure-csv</artifactId>
;;  <version>1.2.0</version>
;; </dependency>

;; and then you can use clojure-csv.core/parse-csv to read your file in as a
;; sequence. It seems to magically handle all the intricacies of all the
;; different possible csv file arrangements.

;; Thank you David Santiago, it works like a charm.

;; But then I hit a problem: I wanted to read the e-mail addresses from every
;; record.

;; I've rather simplified here, but let's imagine that the data after parsing
;; looked like this:

(def csv-data [["Name" "E-mail 1"            "Address"       "E-mail 2"             "E-mail 3"]
               ["John" "john@mailinator.com" "3 Church St"   "xyz@learnclojure.com" ""]
               ["Fred" "fred@mailinator.com" "7 Park Street" "fred@gmail.com"       "fred@googlemail.com"]])

(def header (first csv-data))
(def data   (rest csv-data))

;; As it happened, most of the e-mail addresses were stored in the second column,
;; but some records had two or even three addresses, and it looked as though the
;; generating program might be labelling its columns according to the number of
;; e-mail addresses it needed to output.

;; It seemed very likely that another data set might have an "E-mail 4" column,
;; and it seemed unwise to rely on the needed columns always being 1, 3, 4 and
;; possibly 5. What if the program introduced another field entirely?

;; So obviously I needed a function to look up things in the header row somehow.

;; And there didn't seem to be one, although there were the interesting
;; functions keep-indexed and map-indexed, which I hadn't noticed before.
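Since those two functions get a passing mention, here's a tiny illustration of them (my example, not from the original post): map-indexed pairs each element with its index, and keep-indexed keeps only the non-nil results of calling (f index element).

```clojure
;; map-indexed: (f index element) over the collection
(map-indexed vector ["a" "b" "c"]) ; ([0 "a"] [1 "b"] [2 "c"])

;; keep-indexed: like map-indexed, but nil results are dropped
(keep-indexed (fn [i x] (when (= x "b") i)) ["a" "b" "c"]) ; (1)
```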
;; And I couldn't find one. So I figured that I could either write one, or I
;; could ask on Stack Overflow.

;; And so I did both, expecting to find that I'd re-invented something in the
;; standard library, or at least in contrib, that I hadn't been able to find.

;; http://stackoverflow.com/questions/4830900/how-do-i-find-the-index-of-an-item-in-a-vector

;; So the function(s) I came up with were:

(defn indices-of [f coll]
  (keep-indexed #(if (f %2) %1 nil) coll))

(defn first-index-of [f coll]
  (first (indices-of f coll)))

(defn find-thing [value coll]
  (first-index-of #(= % value) coll))

;; And here are some examples:

(indices-of #(= "Name" %) header) ; (0)
(indices-of (partial re-matches #".*E-mail.*") header) ; (1 3 4)
(first-index-of #(= "Name" %) header) ; 0
(find-thing "Name" header) ; 0

;; But I was a bit nervous about these solutions, because I thought I must just
;; have re-invented some sort of wheel, and also because they're happy to return
;; answers for sets and maps, where the question doesn't really make much sense.

;; fine
(find-thing "two" ["one" "two" "three" "two"]) ; 1
(find-thing "two" '("one" "two" "three")) ; 1

;; but these answers are a bit silly
(find-thing "two" #{"one" "two" "three"}) ; 1
(find-thing "two" {"one" "two" "two" "three"}) ; nil

;; But I went back to Stack Overflow, in order to answer my own question, and
;; found a couple of much better answers.

;; Brian Carper pointed out:

(.indexOf header "Name") ; 0

;; Which is clearly the answer for searching vectors.

;; And ponzao pointed out this lovely thing, originally due to Stuart Halloway:

(require 'clojure.contrib.seq)
(first (clojure.contrib.seq/positions #{"Name"} header)) ; 0

;; positions is like indices-of, but using a set as the predicate is really clever.
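One caveat worth knowing (my note, not from the original discussion): .indexOf is the Java method on java.util.List, so when the item isn't there it returns -1 rather than nil, which is easy to mistake for a real index.

```clojure
(def header ["Name" "E-mail 1" "Address" "E-mail 2" "E-mail 3"]) ; as above

(.indexOf header "Name")  ; 0
(.indexOf header "Phone") ; -1 -- not nil, so check for it explicitly
```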
;; anyway, now I could say:

(map #(% (.indexOf header "Name")) data) ; ("John" "Fred")
(map #(% (.indexOf header "E-mail 1")) data) ; ("john@mailinator.com" "fred@mailinator.com")

;; Or even, for fans of terseness and crypticity (and forgive me Lord, for I am
;; such a fan), something like:

(map #(vector
       (% (.indexOf header "Name"))
       (for [i (clojure.contrib.seq/positions (partial re-matches #"E-mail \d+") header)]
         (% i)))
     data)
;; (["John" ("john@mailinator.com" "xyz@learnclojure.com" "")]
;;  ["Fred" ("fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com")])

;; But a little later, there was another answer from cgrand, who warned me
;; against using indices on general principles. And I agree with that, so I
;; asked what I should do in the particular case of csv files. And there was an
;; answer from Alex Stoddard, which I believe is the answer to the question that
;; I should have asked.

;; We can make a map from strings to indices

(def hmap (zipmap header (iterate inc 0)))

;; And use it like this:

(map #(% (hmap "Name")) data) ; ("John" "Fred")

;; or this:

(def e-mails (filter (partial re-matches #"E-mail \d+") header))

(map #(vector (% (hmap "Name"))
              (for [e e-mails] (% (hmap e))))
     data)

;; or this:

(map #(vector (% (hmap "Name"))
              (for [e (filter (partial re-matches #"E-mail \d+") header)]
                (% (hmap e))))
     data)

;; to get:
;; (["John" ("john@mailinator.com" "xyz@learnclojure.com" "")]
;;  ["Fred" ("fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com")])

;; or even this (although now you do have to rely on the name coming before the e-mails):

(map #(for [e (filter (partial re-matches #"E-mail \d+|Name") header)]
        (% (hmap e)))
     data)

;; to get:
;; (("John" "john@mailinator.com" "xyz@learnclojure.com" "")
;;  ("Fred" "fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com"))

;; If we want to abstract out a pattern, then:

(defn columns [f header data]
  (let [hmap (zipmap header (iterate inc 0))
        cols (filter f header)]
    (map #(for [e cols] (% (hmap e))) data)))

;; allows:

(columns #{"Name"} header data) ; (("John") ("Fred"))
(columns (partial re-matches #"E-mail \d+") header data)
; (("john@mailinator.com" "xyz@learnclojure.com" "") ("fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com"))
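In the same spirit as Alex Stoddard's answer, another way to banish raw indices (a sketch of mine, repeating the sample data so the snippet stands alone) is to zipmap the header onto each row, giving one map per record:

```clojure
(def csv-data [["Name" "E-mail 1" "Address" "E-mail 2" "E-mail 3"]
               ["John" "john@mailinator.com" "3 Church St" "xyz@learnclojure.com" ""]
               ["Fred" "fred@mailinator.com" "7 Park Street" "fred@gmail.com" "fred@googlemail.com"]])

(def header (first csv-data))
(def data (rest csv-data))

;; one map per record, keyed by column name
(def records (map #(zipmap header %) data))

(map #(get % "Name") records)     ; ("John" "Fred")
(map #(get % "E-mail 2") records) ; ("xyz@learnclojure.com" "fred@gmail.com")
```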
Thursday, January 27, 2011
£500 if you can find me a job (version 1.0)
That worked a treat, £500 awarded to lucky winner who found me a fun short contract.
Now I'd like another one, so I repeat the offer:
If, within the next six months, I take a job which lasts longer than one month, and that is not obtained through an agency, then on the day the first cheque from that job cashes, I'll give £500 to the person who provided the crucial introduction.
If there are a number of people involved somehow, then I'll apportion it fairly between them. And if the timing conditions above are not quite met, or if someone points me at a short contract which the £500 penalty makes not worth taking, then I'll do something fair and proportional anyway. (The thing Simon pointed me at only lasted three weeks and I paid him in full anyway, because it was neat.)
And this offer applies even to personal friends, and to old contacts who I have not got round to calling yet, and to people who are themselves offering work, because why wouldn't it?
And obviously if I find one through my own efforts then I'll keep the money. But my word is generally thought to be good, and I have made a public promise on my own blog to this effect, so if I cheat you you can blacken my name and ruin my reputation for honesty, which is worth much more to me than £500.
Anyhow, my CV is at http://www.aspden.com, and any advice on how it could be improved will be gratefully received.
I'll also repeat the original advert: (job-hunt.contrib 1.0)
Anyone in Cambridge need a programmer? Obviously Clojure is a speciality, and my current obsession, but I'm also pretty good with C (especially the embedded variety), microcontrollers, and Python, and I have a particular facility with mathematical concepts and algorithms of all kinds. My obsessions can be pretty quickly changed when I need them to be.
I have a (deserved) reputation for being able to produce heavily optimised but nevertheless bug-free and readable code, but I also know how to hack together sloppy, bug-ridden prototypes, and I know which style is appropriate when, and how to slide along the continuum between them.
I've worked in telecoms, commercial research, banking, university research, a chip design company, server virtualization, a couple of startups, and occasionally completely alone.
I've worked on many sizes of machine. I've written programs for tiny 8-bit microcontrollers, and once upon a time every IBM machine in one building in Imperial College was running my partial differential equation solvers in parallel in the background.
I'm smart and I get things done. I'm confident enough in my own abilities that if I can't do something I admit it and find someone who can.
I also have various ancient and rusty skills with things like Java, C++, R, OCaml, Common LISP, Scheme, FORTRAN and Pascal which can be brushed up if necessary. Like all lispers, I occasionally write toy interpreters for made-up languages for fun.
If you're a local company using Java, who might be interested in giving Clojure a try (motivation here, in Paul Graham's classic Beating the Averages), I'd love to try to show you what all the fuss is about.
CV here if you're interested: http://www.aspden.com
I've never used a CV before, having always found work through word of mouth. So I expect that it can be improved. If anyone's got any suggestions as to how it could be better written, do please leave comments or e-mail cv@aspden.com.
Tuesday, January 25, 2011
K-means : An Algorithm for Clustering Data
;; K-means
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; K-means is a Clustering Algorithm.

;; We use it when we have some data, and we want to split the data into separate
;; categories.

;; For instance, an early biologist, let's call him Adam, might measure all
;; sorts of things about the living objects he encounters in the world.

;; (black, feathers, flies)
;; (green, tasty)
;; (green, slithers, poisonous)
;; ... and so on

;; After he collects enough data, he might go looking for structure in it.

;; Uninformed by theory, he might nevertheless notice that many things that do
;; not move are green, and that many things that are not green move.

;; He might name these two obvious groups the Animals and the Plants.

;; Further analysis of the data might split the Plants into Trees and Flowers,
;; and the Animals into Mammals, Birds, and Fish.

;; Theoretically, this process could continue further, extracting 'natural
;; categories' from the observed structure of the data, without any theory about
;; how the various properties come about.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Let's consider a very simple clustering situation. We have some numbers,
;; and we'd like to see if they form groups.

;; Suppose we want to cluster the points 2 3 5 6 10 11 100 101 102

(def data '(2 3 5 6 10 11 100 101 102))

;; You may be able to see some natural groupings in this data.

;; It's easy enough to say how far one number is from another:

(defn distance [a b]
  (if (< a b) (- b a) (- a b)))

;; To do K-means, we need to start with some guesses about where the clusters are.
;; They don't have to be terribly good guesses.
(def guessed-means '(0 10))

;; Given a particular point, we want to know which of our means is closest

(defn closest [point means distance]
  (first (sort-by #(distance % point) means)))

;; In our little dataset, 2 is closest to the guess of 0, and 100 is closest to
;; the guess of 10

(closest 2 guessed-means distance) ; 0
(closest 100 guessed-means distance) ; 10

;; So we can talk about the group of all points for which 0 is the best guess,
;; and the group of all points for which 10 is the best guess.

(defn point-groups [means data distance]
  (group-by #(closest % means distance) data))

(point-groups guessed-means data distance) ; {0 [2 3 5], 10 [6 10 11 100 101 102]}

;; We can take an average of a group of points

(defn average [& list]
  (/ (reduce + list) (count list)))

(average 6 10 11 100 101 102) ; 55

;; So we can take the average of each group, and use it as a new guess for where
;; the clusters are. If a mean doesn't have a group, then we'll leave it where
;; it is.

(defn new-means [average point-groups old-means]
  (for [o old-means]
    (if (contains? point-groups o)
      (apply average (get point-groups o))
      o)))

(new-means average (point-groups guessed-means data distance) guessed-means) ; (10/3 55)

;; So if we know we've got a particular set of points, and a particular idea of
;; distance, and a way of averaging things, that gives us a way of making a new
;; list of guesses from our original list of guesses

(defn iterate-means [data distance average]
  (fn [means]
    (new-means average (point-groups means data distance) means)))

((iterate-means data distance average) '(0 10)) ; (10/3 55)

;; and of course we can use that as a new guess, and improve it again.

((iterate-means data distance average) '(10/3 55)) ; (37/6 101)

;; We can do this repeatedly until it settles down.

(iterate (iterate-means data distance average) '(0 10))
; ((0 10) (10/3 55) (37/6 101) (37/6 101) .....)
;; K-means with two means seems to be trying to tell us that we've got a group
;; centred around 6 and another centred around 101.

;; These groups are:

(defn groups [data distance means]
  (vals (point-groups means data distance)))

(groups data distance '(37/6 101)) ; ([2 3 5 6 10 11] [100 101 102])

;; Ideally we'd like to iterate until the groups stop changing.
;; I described a function for doing this in a previous post:

(defn take-while-unstable
  ([sq] (lazy-seq (if-let [sq (seq sq)]
                    (cons (first sq) (take-while-unstable (rest sq) (first sq))))))
  ([sq last] (lazy-seq (if-let [sq (seq sq)]
                         (if (= (first sq) last) '()
                             (take-while-unstable sq))))))

(take-while-unstable '(1 2 3 4 5 6 7 7 7 7)) ; (1 2 3 4 5 6 7)

(take-while-unstable
 (map #(groups data distance %)
      (iterate (iterate-means data distance average) '(0 10))))
; (([2 3 5] [6 10 11 100 101 102])
;  ([2 3 5 6 10 11] [100 101 102]))

;; Shows that our first guesses group 2, 3 and 5 (nearer to 0 than 10) vs all the rest.
;; K-means modifies that instantly to separate out the large group of three.

;; We can make a function, which takes our data, notion of distance, and notion of average,
;; and gives us back a function which, for a given set of initial guesses at the means,
;; shows us how the group memberships change.
(defn k-groups [data distance average]
  (fn [guesses]
    (take-while-unstable
     (map #(groups data distance %)
          (iterate (iterate-means data distance average) guesses)))))

(def grouper (k-groups data distance average))

(grouper '(0 10))
; (([2 3 5] [6 10 11 100 101 102])
;  ([2 3 5 6 10 11] [100 101 102]))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Nothing we said above limits us to only having two guesses

(grouper '(1 2 3))
; (([2] [3 5 6 10 11 100 101 102])
;  ([2 3 5 6 10 11] [100 101 102])
;  ([2 3] [5 6 10 11] [100 101 102])
;  ([2 3 5] [6 10 11] [100 101 102])
;  ([2 3 5 6] [10 11] [100 101 102]))

;; The more means we start with, the more detailed our clustering.

(grouper '(0 1 2 3 4))
; (([2] [3] [5 6 10 11 100 101 102])
;  ([2] [3 5 6 10 11] [100 101 102])
;  ([2 3] [5 6 10 11] [100 101 102])
;  ([2 3 5] [6 10 11] [100 101 102])
;  ([2] [3 5 6] [10 11] [100 101 102])
;  ([2 3] [5 6] [10 11] [100 101 102]))

;; We have to be careful not to start off with too many means, or we just get
;; our data back:

(grouper (range 200))
; (([2] [3] [100] [5] [101] [6] [102] [10] [11]))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Generalizing to Other Spaces

;; In fact, nothing we said above depends on our inputs being numbers.
;; We can use any data where we can define a distance, and a method of averaging.

;; One of the easiest things to do this for would be vectors:

(defn vec-distance [a b]
  (reduce + (map #(* % %) (map - a b))))

(defn vec-average [& l]
  (map #(/ % (count l)) (apply map + l)))

(vec-distance [1 2 3] [5 6 7]) ; 48
(vec-average [1 2 3] [5 6 7]) ; (3 4 5)

;; Here's a little set of vectors

(def vector-data '([1 2 3] [3 2 1] [100 200 300] [300 200 100] [50 50 50]))

;; And choosing three guesses in a fairly simple-minded manner, we can see how
;; the algorithm divides them into three different groups.
((k-groups vector-data vec-distance vec-average) '([1 1 1] [2 2 2] [3 3 3]))
; (([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100] [50 50 50]])
;  ([[1 2 3] [3 2 1] [50 50 50]] [[100 200 300] [300 200 100]])
;  ([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100]] [[50 50 50]]))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Pedantic Footnote

;; Note that the algorithm as described above isn't quite the classic K-means.
;; I don't think the difference is terribly important, and I thought it would
;; complicate the explanation to deal with it.

;; In the usual K-means, if you have two identical means, then you're only
;; supposed to update one of them.

;; Here our two identical guesses are both getting updated:

(new-means average (point-groups '(0 0) '(0 1 2 3 4) distance) '(0 0)) ; (2 2)

;; Our update function:

(defn new-means [average point-groups old-means]
  (for [o old-means]
    (if (contains? point-groups o)
      (apply average (get point-groups o))
      o)))

;; Needs to be changed so that if there are two identical means, only one of
;; them will be changed:

(defn update-seq [sq f]
  (let [freqs (frequencies sq)]
    (apply concat
           (for [[k v] freqs]
             (if (= v 1)
               (list (f k))
               (cons (f k) (repeat (dec v) k)))))))

(defn new-means [average point-groups old-means]
  (update-seq old-means
              (fn [o]
                (if (contains? point-groups o)
                  (apply average (get point-groups o))
                  o))))

;; Now only one will get updated at once:

(new-means average (point-groups '(0 0) '(0 1 2 3 4) distance) '(0 0)) ; (2 0)

;; Now we don't lose groups when the means get aliased.
((k-groups '(0 1 2 3 4) distance average) '(0 1))
; (([0] [1 2 3 4]) ([0 1] [2 3 4]))

((k-groups '(0 1 2 3 4) distance average) '(0 0))
; (([0 1 2 3 4]) ([0] [1 2 3 4]) ([0 1] [2 3 4]))

((k-groups vector-data vec-distance vec-average) '([1 1 1] [1 1 1] [1 1 1]))
; (([[1 2 3] [3 2 1] [100 200 300] [300 200 100] [50 50 50]])
;  ([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100] [50 50 50]])
;  ([[1 2 3] [3 2 1] [50 50 50]] [[100 200 300] [300 200 100]])
;  ([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100]] [[50 50 50]]))

;; Although it's still possible that a mean never acquires any points, so we can
;; still get out fewer groups than means.

((k-groups '(0 1 2 3 4) distance average) '(0 5 10))
; (([0 1 2] [3 4]))
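As for where the initial guesses come from: a common heuristic (this helper is my own sketch, not part of the post's code) is simply to pick k distinct points from the data itself, which guarantees every mean starts out closest to at least one real point.

```clojure
;; choose k distinct data points at random to use as starting means
(defn random-guesses [k data]
  (take k (shuffle (vec (distinct data)))))

(random-guesses 2 '(2 3 5 6 10 11 100 101 102)) ; e.g. (11 100) -- varies run to run
```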
Cleaning Old Definitions from the REPL : shred-user
;; I often find myself restarting the clojure repl in order to get rid of old
;; definitions which introduce subtle bugs.

;; Consider:

(defn factorial [n]
  (if (< n 2) 1
      (* n (factorial (dec n))))) ; #'user/factorial

(factorial 10) ; 3628800

;; Suppose instead I wanted (factorial2 n) to be (* 1 2 2 2 3 2 4 .... 2 n)

(defn factorial2 [n]
  (if (< n 2) 1
      (* 2 n (factorial (dec n))))) ; #'user/factorial2

;; There is a subtle bug introduced by my failing to rename the recursive call

(map factorial2 '(1 2 3 4)) ; (1 4 24 192)

;; But (factorial2 4) should be (* 1 2 2 2 3 2 4) = 192

;; I suspect that there is an old definition hanging around confusing matters.

;; To find it, I can run the program from the command line, or restart the repl
;; and recompile, or:

(doseq [s (map first (ns-interns 'user))] (ns-unmap 'user s)) ; nil

(defn factorial2 [n]
  (if (< n 2) 1
      (* 2 n (factorial (dec n))))) ; Unable to resolve symbol: factorial in this context

;; Aha!

(defn factorial2 [n]
  (if (< n 2) 1
      (* 2 n (factorial2 (dec n))))) ; #'user/factorial2

(factorial2 4) ; 192

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; For convenience I'd like to make that a function:

(defn shred-user []
  (doseq [s (map first (ns-interns 'user))]
    (ns-unmap 'user s)))

(shred-user) ; nil
(shred-user) ; Unable to resolve symbol: shred-user in this context ; bugger!

(defn shred-user []
  (doseq [s (filter (complement #{'shred-user}) (map first (ns-interns 'user)))]
    (ns-unmap 'user s)))

(shred-user) ; nil
(shred-user) ; nil
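For completeness (my note, not from the post): when you already know which definition is stale, ns-unmap will remove just that one var, which is less drastic than shredding the whole namespace.

```clojure
(defn stale-fn [] :old) ; some definition we no longer want

(ns-unmap *ns* 'stale-fn) ; remove just this one mapping

(resolve 'stale-fn) ; nil -- the var is gone
```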
take-while-unstable
;; I often find myself wanting to iterate something until it converges.

(defn collatz [n]
  (if (even? n)
    (recur (/ n 2))
    (+ 1 (* 3 n))))

(iterate collatz 100) ; (100 76 58 88 34 52 40 16 4 4 4 4 4 4 ...)

;; And so I often find myself writing:

(defn take-while-unstable
  ([sq] (if (empty? sq) '()
            (cons (first sq) (#'take-while-unstable (rest sq) (first sq)))))
  ([sq last] (cond (empty? sq) '()
                   (= (first sq) last) '()
                   :else (cons (first sq) (#'take-while-unstable (rest sq) (first sq))))))

;; Which allows me to stop the iteration once it's settled down.

(take-while-unstable (iterate collatz 100)) ; (100 76 58 88 34 52 40 16 4)

;; You can see how it works here:

(require 'clojure.contrib.trace)

(binding [*print-length* 20] ; taking care not to look at the medusa
  (clojure.contrib.trace/dotrace (take-while-unstable)
    (take-while-unstable (iterate collatz 100))))

;; TRACE t15884: (take-while-unstable (100 76 58 88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 ...))
;; TRACE t15885: | (take-while-unstable (76 58 88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 100)
;; TRACE t15886: | | (take-while-unstable (58 88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 76)
;; TRACE t15887: | | | (take-while-unstable (88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 58)
;; TRACE t15888: | | | | (take-while-unstable (34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 88)
;; TRACE t15889: | | | | | (take-while-unstable (52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 34)
;; TRACE t15890: | | | | | | (take-while-unstable (40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 52)
;; TRACE t15891: | | | | | | | (take-while-unstable (16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 40)
;; TRACE t15892: | | | | | | | | (take-while-unstable (4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 16)
;; TRACE t15893: | | | | | | | | | (take-while-unstable (4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 4)
;; TRACE t15893: | | | | | | | | | => ()
;; TRACE t15892: | | | | | | | | => (4)
;; TRACE t15891: | | | | | | | => (16 4)
;; TRACE t15890: | | | | | | => (40 16 4)
;; TRACE t15889: | | | | | => (52 40 16 4)
;; TRACE t15888: | | | | => (34 52 40 16 4)
;; TRACE t15887: | | | => (88 34 52 40 16 4)
;; TRACE t15886: | | => (58 88 34 52 40 16 4)
;; TRACE t15885: | => (76 58 88 34 52 40 16 4)
;; TRACE t15884: => (100 76 58 88 34 52 40 16 4)

;; (100 76 58 88 34 52 40 16 4)

;; Anyway, this function comes in so handy that I thought I'd lazyize it and
;; give it some tests.

(use 'clojure.test)

(with-test
  (defn take-while-unstable
    ([sq] (lazy-seq (if-let [sq (seq sq)]
                      (cons (first sq) (take-while-unstable (rest sq) (first sq))))))
    ([sq last] (lazy-seq (if-let [sq (seq sq)]
                           (if (= (first sq) last) '()
                               (take-while-unstable sq))))))
  (is (= (take-while-unstable '()) '()))
  (is (= (take-while-unstable '(1)) '(1)))
  (is (= (take-while-unstable '(1 1)) '(1)))
  (is (= (take-while-unstable '(1 2)) '(1 2)))
  (is (= (take-while-unstable '(1 1 2)) '(1)))
  (is (= (take-while-unstable '(1 1 1)) '(1)))
  (is (= (take-while-unstable '(1 1 2 1)) '(1)))
  (is (= (take-while-unstable '(1 2 3 4 5 6 7 7 7 7)) '(1 2 3 4 5 6 7)))
  (is (= (take 10000 (take-while-unstable (range))) (take 10000 (range))))
  (is (= (take-while-unstable (concat (take 10000 (range)) '(10000) (drop 10000 (range)))) (range 10001)))
  (is (= (take-while-unstable '(nil)) '(nil)))
  (is (= (take-while-unstable '(nil nil)) '(nil)))
  (is (= (take-while-unstable '[nil nil]) '(nil)))
  (is (= (take-while-unstable [:a :b :c :d :d :e]) '(:a :b :c :d))))

(run-tests) ; {:type :summary, :test 1, :pass 14, :fail 0, :error 0}

;; Hopefully someone will find it useful. Can anyone break it?
Monday, January 17, 2011
A Very Gentle Introduction to Information Theory: Guessing the Entropy
;; A Very Gentle Introduction to Information Theory : Part IV

;; Entropy and Huffman Coding

;; Once again, I'll keep some code from the earlier parts, without explanation.

;; Probability distributions

(defn combine-keywords [& a]
  (keyword (apply str (mapcat #(drop 1 (str %)) a))))

(defn combine-distributions
  ([P] P)
  ([P1 P2] (into {} (for [[s1 p1] P1 [s2 p2] P2]
                      [(combine-keywords s1 s2) (* p1 p2)])))
  ([P1 P2 & Plist] (reduce combine-distributions (cons P1 (cons P2 Plist)))))

;; Coding and Decoding

(defn decoder
  ([code-tree stream] (decoder code-tree code-tree stream))
  ([current-code-tree code-tree stream]
     (lazy-seq
      (if (keyword? current-code-tree)
        (cons current-code-tree (decoder code-tree code-tree stream))
        (if-let [stream (seq stream)]
          (if (= (first stream) 1)
            (decoder (first current-code-tree) code-tree (rest stream))
            (decoder (second current-code-tree) code-tree (rest stream))))))))

(defn encoder [code stream] (mapcat code stream))

(defn make-encoder [code] (fn [s] (encoder code s)))
(defn make-decoder [code-tree] (fn [s] (decoder code-tree s)))

;; Huffman encoding

(defn symbols [prefix code-tree]
  (if (keyword? code-tree)
    (list prefix code-tree)
    (concat (symbols (cons 1 prefix) (first code-tree))
            (symbols (cons 0 prefix) (second code-tree)))))

(defn make-code [code-tree]
  (into {} (map (fn [[c s]] [s (reverse c)])
                (partition 2 (symbols '() code-tree)))))

;; The original make-code-tree was very slow, because it re-sorted the list every
;; iteration. We can speed it up considerably by using a priority queue instead
;; of sorting the list every iteration. Clojure doesn't have one built in, so
;; here's a poor man's version built out of a sorted map of lists.

(defn madd [m [k p]]
  (assoc m p (cons k (m p))))

(defn mpop [m]
  (let [[k vlist] (first m)]
    [k (first vlist)
     (if (empty? (rest vlist))
       (dissoc m k)
       (assoc m k (rest vlist)))]))

(defn mcombine [m]
  (let [[pa a pop1] (mpop m)
        [pb b pop2] (mpop pop1)]
    (madd pop2 [[a b] (+ pa pb)])))

(defn make-code-tree [P]
  (second (mpop (nth (iterate mcombine (reduce madd (sorted-map) P))
                     (dec (count P))))))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

(def fair-coin {:H 1 :T 1})
(def unfair-coin {:H 3 :T 1})

(def unfair-triples (combine-distributions unfair-coin unfair-coin unfair-coin))

(def triple-code-tree (make-code-tree unfair-triples))
(def triple-code (make-code triple-code-tree))

;; We don't need to estimate the cost of a code by generating a long random
;; stream and transmitting it. Given a probability distribution and a code, we
;; can just calculate the expected cost.

;; We make a distribution over the transmitted symbols:

(defn code-distribution [P code]
  (for [s (keys P)] [(P s) (code s)]))

;; e.g.
(code-distribution unfair-triples triple-code)
;; ([27 (0)] [9 (1 1 1)] [9 (1 1 0)] [3 (1 0 0 1 0)] [9 (1 0 1)] [3 (1 0 0 0 1)] [3 (1 0 0 0 0)] [1 (1 0 0 1 1)])

;; And from that, it's easy to calculate the expected length of a sequence:

(defn expected-code-length [P code]
  (let [cd (code-distribution P code)]
    (/ (reduce + (for [[k v] cd] (* k (count v))))
       (reduce + (map first cd)))))

;; So the expected cost per symbol is:
(expected-code-length unfair-triples triple-code) ; 79/32

;; or per coin-toss:
(float (/ 79/32 3)) ; 0.8229167

;; So we can get the noise out of our table:

(defn cost-for-n-code [P n]
  (let [Pn (apply combine-distributions (repeat n P))
        code (make-code (make-code-tree Pn))]
    (float (/ (expected-code-length Pn code) n))))

(cost-for-n-code unfair-coin 1) ; 1.0
(cost-for-n-code unfair-coin 2) ; 0.84375
(cost-for-n-code unfair-coin 3) ; 0.8229167
(cost-for-n-code unfair-coin 4) ; 0.8183594
(cost-for-n-code unfair-coin 5) ; 0.81777346
(cost-for-n-code unfair-coin 6) ; 0.8186849
(cost-for-n-code unfair-coin 7) ; 0.81685966
(cost-for-n-code unfair-coin 8) ; 0.81575775
(cost-for-n-code unfair-coin 9) ; 0.81493336
(cost-for-n-code unfair-coin 10) ; 0.8141917
(cost-for-n-code unfair-coin 11) ; 0.8137328
(cost-for-n-code unfair-coin 12) ; 0.81351095

;; It looks like something is converging, although the convergence isn't monotonic.

;; I'm now revising my estimate of the cost of sending the results of a 1:3
;; process to be about 0.813.

;; But we don't know whether that's the limit of the coding process, or a
;; property of the 1:3 distribution, or whether it's in some way specific to
;; transmitting over a binary channel.

;; Let's look at some other distributions.

;; For the fair coin distribution, huffman coding triples doesn't help at all.

(cost-for-n-code fair-coin 1) ; 1.0
(cost-for-n-code fair-coin 2) ; 1.0
(cost-for-n-code fair-coin 3) ; 1.0
(cost-for-n-code fair-coin 4) ; 1.0

;; But for an even choice between three things, it does:

(def triad {:A 1 :B 1 :C 1})

(cost-for-n-code triad 1) ; 1.6666666
(cost-for-n-code triad 2) ; 1.6111112
(cost-for-n-code triad 3) ; 1.6049383
(cost-for-n-code triad 4) ; 1.6049383
(cost-for-n-code triad 5) ; 1.5893004
(cost-for-n-code triad 6) ; 1.5992227
(cost-for-n-code triad 7) ; 1.5895878
(cost-for-n-code triad 8) ; 1.5939262

;; For a choice between four things, it makes no difference:

(def quad {:A 1 :B 1 :C 1 :D 1})

(cost-for-n-code quad 1) ; 2.0
(cost-for-n-code quad 2) ; 2.0
(cost-for-n-code quad 3) ; 2.0
(cost-for-n-code quad 4) ; 2.0
(cost-for-n-code quad 5) ; 2.0

;; For five it's a good thing to do:

(def quint {:A 1 :B 1 :C 1 :D 1 :E 1})

(cost-for-n-code quint 1) ; 2.4
(cost-for-n-code quint 2) ; 2.36
(cost-for-n-code quint 3) ; 2.3253334
(cost-for-n-code quint 4) ; 2.3404
(cost-for-n-code quint 5) ; 2.337856
(cost-for-n-code quint 6) ; 2.3252373

;; And again, for the next power of two, no difference.
(def octet {:A 1 :B 1 :C 1 :D 1 :E 1 :F 1 :G 1 :H 1})

(cost-for-n-code octet 1) ; 3.0
(cost-for-n-code octet 2) ; 3.0
(cost-for-n-code octet 3) ; 3.0

;; I think that we might have guessed that it would take three bits to decide
;; between eight equally likely things, and two bits for four things, but what
;; about the other numbers?

;; If 8 = 2*2*2 -> 3, and 4 = 2*2 -> 2, and 2 -> 1, what's the easiest pattern
;; we can fit to that?

(defn bits [n] (/ (Math/log n) (Math/log 2))) ;; Also known as log2

(map bits (range 2 10))
;; (1.0 1.5849625007211563 2.0 2.321928094887362 2.584962500721156
;;  2.807354922057604 3.0 3.1699250014423126)

;; So let's make a prediction
(def sextet {:A 1 :B 1 :C 1 :D 1 :E 1 :F 1})

(bits 6) ; 2.584962500721156

(cost-for-n-code sextet 1) ; 2.6666667
(cost-for-n-code sextet 2) ; 2.6111112
(cost-for-n-code sextet 3) ; 2.6049383
(cost-for-n-code sextet 4) ; 2.6049383
(cost-for-n-code sextet 5) ; 2.5893004
(cost-for-n-code sextet 6) ; 2.5992227

;; It looks as though the cost of coding an even distribution using Huffman
;; encoding of runs is pretty close to being the logarithm (to base 2) of the
;; number of symbols.
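That log2 pattern is Shannon's entropy formula in disguise. As an editor's sketch (not from the original post; `normalize` and `entropy` are made-up helper names), here is the expected-information formula H(P) = -Σ p·log2(p) applied to the distributions above:

```clojure
;; Editor's sketch: Shannon entropy of a distribution given as counts,
;; H(P) = -(sum of p * log2 p). `normalize` and `entropy` are hypothetical
;; helpers, not part of the original post.
(defn normalize [P]
  (let [total (reduce + (vals P))]
    (for [v (vals P)] (/ v total))))

(defn entropy [P]
  (- (reduce + (for [p (normalize P)]
                 (* p (/ (Math/log p) (Math/log 2)))))))

(entropy {:H 3 :T 1})      ; ~0.81128, where the unfair-coin table seems headed
(entropy {:A 1 :B 1 :C 1}) ; ~1.58496, i.e. (bits 3), ditto for the triad
```

The unfair-coin figure ~0.81128 sits just below the 0.81351 reached at n=12 above, which is at least consistent with the table converging to the entropy.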
Saturday, January 15, 2011
A Very Gentle Introduction to Information Theory: Huffman Coding
;; A Very Gentle Introduction to Information Theory : Part III

;; Entropy and Huffman Coding

;; Once again, I'll keep some code from the first two parts, without explanation

(defn random-stream [P]
  (let [pseq (vec (mapcat (fn [[k v]] (repeat v k)) P))]
    (for [i (range)] (rand-nth pseq))))

(defn cost [encoder decoder message]
  (let [coded (encoder message)]
    (if (= (decoder coded) message)
      (count coded)
      :fail)))

(defn decoder
  ([code-tree stream] (decoder code-tree code-tree stream))
  ([current-code-tree code-tree stream]
     (lazy-seq
      (if (keyword? current-code-tree)
        (cons current-code-tree (decoder code-tree code-tree stream))
        (if-let [stream (seq stream)]
          (if (= (first stream) 1)
            (decoder (first current-code-tree)  code-tree (rest stream))
            (decoder (second current-code-tree) code-tree (rest stream))))))))

(defn encoder [code stream] (mapcat code stream))

(defn make-encoder [code] (fn [s] (encoder code s)))
(defn make-decoder [code-tree] (fn [s] (decoder code-tree s)))

(defn combine-keywords [& a]
  (keyword (apply str (mapcat #(drop 1 (str %)) a))))

(defn split-keyword [a]
  (map #(keyword (str %)) (drop 1 (str a))))

(defn make-combination-encoder [code n]
  (fn [s] (encoder code (map #(apply combine-keywords %) (partition n s)))))

(defn make-combination-decoder [code-tree]
  (fn [s] (mapcat split-keyword (decoder code-tree s))))

;; So far we've looked at three probability distributions:
(def fair-coin {:H 1 :T 1})
(def unfair-coin {:H 3 :T 1})
(def unfair-pairs {:HH 9, :HT 3, :TH 3, :TT 1})

;; And two codes:
(def fair-code-tree [:H :T])
(def fair-code {:H '(1) :T '(0)})

(def unfair-code-tree [:HH [:HT [:TH :TT]]])
(def unfair-code {:HH '(1) :HT '(0 1) :TH '(0 0 1) :TT '(0 0 0)})

;; We should add a fourth probability distribution to represent pairs of fair
;; coin toss results
(def fair-pairs {:HH 1 :HT 1 :TH 1 :TT 1})

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; We found that

(defn estimate-cost [P encoder decoder]
  (let [n 100000
        c (cost encoder decoder (take n (random-stream P)))]
    (if (number? c) (float (/ c n)) c)))

;; Using the best code we can think of for the fair coin resulted in a
;; transmission cost of 1 (£/symbol)
(estimate-cost fair-coin (make-encoder fair-code) (make-decoder fair-code-tree)) ; 1.0

;; And that that was also the cost for the unfair coin with this code:
(estimate-cost unfair-coin (make-encoder fair-code) (make-decoder fair-code-tree)) ; 1.0

;; But that we could come up with a code for pairs of coin tosses
;; which did substantially better for pairs of unfair coin tosses
(estimate-cost unfair-pairs (make-encoder unfair-code) (make-decoder unfair-code-tree)) ; 1.68338

;; but substantially worse for pairs of fair coin tosses
(estimate-cost fair-pairs (make-encoder unfair-code) (make-decoder unfair-code-tree)) ; 2.24722

;; remember that that's the transmission cost per symbol, and that each symbol
;; represents two coin tosses

;; In case you think there's any sleight of hand going on there, here's how
;; we'd use the pairs code to transmit the original unpaired streams
(estimate-cost unfair-coin (make-combination-encoder unfair-code 2) (make-combination-decoder unfair-code-tree)) ; 0.84561
(estimate-cost fair-coin (make-combination-encoder unfair-code 2) (make-combination-decoder unfair-code-tree)) ; 1.12407

;; Notice that the costs here are per-toss, showing that the unfair code is
;; actually an improvement

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Now it seems that, if we can send the results of unfair-coin more efficiently
;; by considering {:HH 9, :HT 3, :TH 3, :TT 1}, the distribution of pairs of
;; tosses, then we should have a look at the distribution of triples, and work
;; out a code for that:

(defn combine-distributions
  ([P] P)
  ([P1 P2]
     (into {}
           (for [[s1 p1] P1
                 [s2 p2] P2]
             [(combine-keywords s1 s2) (* p1 p2)])))
  ([P1 P2 & Plist]
     (reduce combine-distributions (cons P1 (cons P2 Plist)))))

(def unfair-triples (combine-distributions unfair-coin unfair-coin unfair-coin))

;; unfair-triples is {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, :TTH 3, :TTT 1}

;; Now how should we work out a code for this distribution?

;; Huffman tells us that we should combine the two lowest probability events

;; so that {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, :TTH 3, :TTT 1}
;; goes to {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, {:TTH 3, :TTT 1} 4}

;; and then do it again, so that
;;         {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, {:TTH 3, :TTT 1} 4}
;; goes to {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, {:THT 3, {:TTH 3, :TTT 1} 4} 7}

;; and so on .....

(defn huffman-combine [P]
  (let [plist (sort-by second P)
        newelement (into {} (take 2 plist))]
    (into {} (cons [newelement (reduce + (vals newelement))] (drop 2 plist)))))

(nth (iterate huffman-combine unfair-triples) (dec (dec (count unfair-triples))))
;; {{{:HHT 9, :HTH 9} 18, {:THH 9, {{:TTT 1, :HTT 3} 4, {:THT 3, :TTH 3} 6} 10} 19} 37, :HHH 27}

;; At the end, we get a sort of nested binary probability distribution

;; You could think of this as being a way to generate the triples by tossing
;; strangely biased coins!

;; From that, we can generate our code tree directly by just throwing away the numbers
(require 'clojure.walk)

(defn make-code-tree [P]
  (clojure.walk/postwalk
   #(if (map? %) (into [] (map first %)) %)
   (nth (iterate huffman-combine P) (dec (dec (count P))))))

(def triple-code-tree (make-code-tree unfair-triples))
;; [[[:HHT :HTH] [:THH [[:TTT :HTT] [:THT :TTH]]]] :HHH]

;; If we have the decoder, then we can use it to generate the coder!

(defn symbols [prefix code-tree]
  (if (keyword? code-tree)
    (list prefix code-tree)
    (concat (symbols (cons 1 prefix) (first code-tree))
            (symbols (cons 0 prefix) (second code-tree)))))

(defn make-code [code-tree]
  (into {} (map (fn [[c s]] [s (reverse c)])
                (partition 2 (symbols '() code-tree)))))

(def triple-code (make-code triple-code-tree))
;; {:HHT (0 0 0), :HTH (0 0 1), :THH (0 1 0), :TTT (0 1 1 0 0),
;;  :HTT (0 1 1 0 1), :THT (0 1 1 1 0), :TTH (0 1 1 1 1), :HHH (1)}

;; Let's see how this does
(estimate-cost unfair-triples (make-encoder triple-code) (make-decoder triple-code-tree)) ; 2.4615

;; £2.46 per symbol, or 82p per toss

;; So while going from single tosses to pairs allowed us to go from 1 -> 0.85,
;; going from pairs to triples only allowed us to get from 0.85 -> 0.82.

;; Is there an end to this process?

(defn bit-rate [P n]
  (let [Pn (apply combine-distributions (repeat n P))
        tree (make-code-tree Pn)]
    (/ (estimate-cost Pn (make-encoder (make-code tree)) (make-decoder tree)) n)))

(bit-rate unfair-coin 1)  ; 1.0
(bit-rate unfair-coin 2)  ; 0.844435
(bit-rate unfair-coin 3)  ; 0.82466
(bit-rate unfair-coin 4)  ; 0.8196275
(bit-rate unfair-coin 5)  ; 0.818912
(bit-rate unfair-coin 6)  ; 0.81896996
(bit-rate unfair-coin 7)  ; 0.8166514
(bit-rate unfair-coin 8)  ; 0.815995
(bit-rate unfair-coin 9)  ; 0.81352335
(bit-rate unfair-coin 10) ; 0.81462896
(bit-rate unfair-coin 11) ; 0.8137691

;; To me, this is at least suggestive that there might be something fundamental
;; about a cost of 82p to transmit the result of a 3:1 random result over a
;; binary channel.
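To make the per-toss arithmetic concrete, here's an editor's sketch of encoding a short message by hand, with the triple code written out as a literal map so the snippet stands alone (the message chosen is hypothetical, not from the post):

```clojure
;; Editor's sketch: the triple code from the post, as a literal map.
(def triple-code
  {:HHT '(0 0 0), :HTH '(0 0 1), :THH '(0 1 0), :TTT '(0 1 1 0 0),
   :HTT '(0 1 1 0 1), :THT '(0 1 1 1 0), :TTH '(0 1 1 1 1), :HHH '(1)})

;; Four triples, i.e. twelve coin tosses:
(def message [:HHH :HHH :TTT :HTH])

;; mapcat-ing the code over the message is exactly what `encoder` does:
(def transmitted (mapcat triple-code message))
;; (1 1 0 1 1 0 0 0 0 1)

;; 10 bits for 12 tosses: about 0.83/toss for this (unusually lucky) message
(float (/ (count transmitted) 12)) ; 0.8333333
```

Messages heavy in :HHH, the cheap one-bit symbol, come out well under one bit per toss; a run of :TTTs would cost 5/3 bits per toss, which is why only the expected cost over the whole distribution settles near 82p.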