;; I often find the default behaviour of exceptions in emacs/slime a bit
;; annoying.

;; A debugger window pops up, but that moves the focus away from where you're
;; working, and the new window then needs dismissing, and has often buggered up
;; the size of the other windows, etc...

;; Also, when playing with experimental clojure versions, the behaviour is a bit
;; random, and often the expression just gets swallowed, with no clue as to what
;; it was.

;; Anyway, I wrote this little macro, which turns an exception into a return value:

(defmacro tryc [& e]
  `(try ~@e
        (catch Exception a# a#)))

(tryc (java.io.File "."))
;; #<ClassCastException java.lang.ClassCastException: java.lang.Class cannot be cast to clojure.lang.IFn>

;; Obviously you should only use this at the top level!
;; I find I'm using it all the time when writing code in emacs.
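A quick sketch of the behaviour (my example; the macro is repeated here so the snippet stands alone): tryc passes ordinary values through unchanged, and hands back the exception object as a value when the body throws.

```clojure
;; Same tryc as above: wrap the body in try/catch, return any Exception as a value.
(defmacro tryc [& e]
  `(try ~@e
        (catch Exception a# a#)))

(tryc (+ 1 2)) ; 3 -- normal results come back unchanged
(tryc (/ 1 0)) ; the ArithmeticException object itself, returned rather than thrown
```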
Sunday, January 30, 2011
Turning Exceptions into Return Values
Finding Something in a Vector, Parsing CSV Files
;; Indexing Into a Vector: A Sequence of Answers

;; The other day, I had a csv file to read. (CSV files are comma-separated
;; values, often produced by exporting spreadsheets; a reasonably standard
;; human-readable format for grids of data.)

;; There's a lovely library which takes all the hard work out of reading csv
;; files.

;; The library's on clojars, so you use it by adding this to your maven pom.xml:

;; <dependency>
;;  <groupId>clojure-csv</groupId>
;;  <artifactId>clojure-csv</artifactId>
;;  <version>1.2.0</version>
;; </dependency>

;; and then you can use clojure-csv.core/parse-csv to read your file in as a
;; sequence. It seems to magically handle all the intricacies of all the
;; different possible csv file arrangements.

;; Thank you David Santiago, it works like a charm.

;; But then I hit a problem: I wanted to read the e-mail addresses from every
;; record.

;; I've rather simplified here, but let's imagine that the data after parsing
;; looked like this:

(def csv-data [["Name" "E-mail 1"            "Address"       "E-mail 2"             "E-mail 3"]
               ["John" "john@mailinator.com" "3 Church St"   "xyz@learnclojure.com" ""]
               ["Fred" "fred@mailinator.com" "7 Park Street" "fred@gmail.com"       "fred@googlemail.com"]])

(def header (first csv-data))
(def data   (rest csv-data))

;; As it happened, most of the e-mail addresses were stored in the second column,
;; but some records had two or even three addresses, and it looked as though the
;; generating program might be labelling its columns according to the number of
;; e-mail addresses it needed to output.

;; It seemed very likely that another data set might have an "E-mail 4" column,
;; and it seemed unwise to rely on the needed columns always being 1, 3, 4 and
;; possibly 5. What if the program introduced another field entirely?

;; So obviously I needed a function to look up things in the header row somehow.

;; And there didn't seem to be one, although there were the interesting
;; functions keep-indexed and map-indexed, which I hadn't noticed before.
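Since those two functions get a passing mention, here's a tiny illustration of them (my example, not from the original post): map-indexed pairs each element with its index, and keep-indexed keeps only the non-nil results of calling (f index element).

```clojure
;; map-indexed: (f index element) over the collection
(map-indexed vector ["a" "b" "c"]) ; ([0 "a"] [1 "b"] [2 "c"])

;; keep-indexed: like map-indexed, but nil results are dropped
(keep-indexed (fn [i x] (when (= x "b") i)) ["a" "b" "c"]) ; (1)
```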
;; And I couldn't find one. So I figured that I could either write one, or I
;; could ask on Stack Overflow.

;; And so I did both, expecting to find that I'd re-invented something in the
;; standard library, or at least in contrib, that I hadn't been able to find.

;; http://stackoverflow.com/questions/4830900/how-do-i-find-the-index-of-an-item-in-a-vector

;; So the function(s) I came up with were:

(defn indices-of [f coll]
  (keep-indexed #(if (f %2) %1 nil) coll))

(defn first-index-of [f coll]
  (first (indices-of f coll)))

(defn find-thing [value coll]
  (first-index-of #(= % value) coll))

;; And here are some examples:

(indices-of #(= "Name" %) header) ; (0)
(indices-of (partial re-matches #".*E-mail.*") header) ; (1 3 4)
(first-index-of #(= "Name" %) header) ; 0
(find-thing "Name" header) ; 0

;; But I was a bit nervous about these solutions, because I thought I must just
;; have re-invented some sort of wheel, and also because they're happy to return
;; answers for sets and maps, where the question doesn't really make much sense.

;; fine
(find-thing "two" ["one" "two" "three" "two"]) ; 1
(find-thing "two" '("one" "two" "three")) ; 1

;; but these answers are a bit silly
(find-thing "two" #{"one" "two" "three"}) ; 1
(find-thing "two" {"one" "two" "two" "three"}) ; nil

;; But I went back to Stack Overflow, in order to answer my own question, and
;; found a couple of much better answers.

;; Brian Carper pointed out:

(.indexOf header "Name") ; 0

;; Which is clearly the answer for searching vectors.

;; And ponzao pointed out this lovely thing, originally due to Stuart Halloway:

(require 'clojure.contrib.seq)
(first (clojure.contrib.seq/positions #{"Name"} header)) ; 0

;; positions is like indices-of, but using a set as the predicate is really clever.
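One caveat worth knowing (my note, not from the original discussion): .indexOf is the Java method on java.util.List, so when the item isn't there it returns -1 rather than nil, which is easy to mistake for a real index.

```clojure
(def header ["Name" "E-mail 1" "Address" "E-mail 2" "E-mail 3"]) ; as above

(.indexOf header "Name")  ; 0
(.indexOf header "Phone") ; -1 -- not nil, so check for it explicitly
```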
;; anyway, now I could say:

(map #(% (.indexOf header "Name")) data) ; ("John" "Fred")
(map #(% (.indexOf header "E-mail 1")) data) ; ("john@mailinator.com" "fred@mailinator.com")

;; Or even, for fans of terseness and crypticity (and forgive me Lord, for I am
;; such a fan), something like:

(map #(vector
       (% (.indexOf header "Name"))
       (for [i (clojure.contrib.seq/positions (partial re-matches #"E-mail \d+") header)]
         (% i)))
     data)
;; (["John" ("john@mailinator.com" "xyz@learnclojure.com" "")]
;;  ["Fred" ("fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com")])

;; But a little later, there was another answer from cgrand, who warned me
;; against using indices on general principles. And I agree with that, so I
;; asked what I should do in the particular case of csv files. And there was an
;; answer from Alex Stoddard, which I believe is the answer to the question that
;; I should have asked.

;; We can make a map from strings to indices

(def hmap (zipmap header (iterate inc 0)))

;; And use it like this:

(map #(% (hmap "Name")) data) ; ("John" "Fred")

;; or this:

(def e-mails (filter (partial re-matches #"E-mail \d+") header))

(map #(vector (% (hmap "Name"))
              (for [e e-mails] (% (hmap e))))
     data)

;; or this:

(map #(vector (% (hmap "Name"))
              (for [e (filter (partial re-matches #"E-mail \d+") header)]
                (% (hmap e))))
     data)

;; to get:
;; (["John" ("john@mailinator.com" "xyz@learnclojure.com" "")]
;;  ["Fred" ("fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com")])

;; or even this (although now you do have to rely on the name coming before the e-mails):

(map #(for [e (filter (partial re-matches #"E-mail \d+|Name") header)]
        (% (hmap e)))
     data)

;; to get:
;; (("John" "john@mailinator.com" "xyz@learnclojure.com" "")
;;  ("Fred" "fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com"))

;; If we want to abstract out a pattern, then:

(defn columns [f header data]
  (let [hmap (zipmap header (iterate inc 0))
        cols (filter f header)]
    (map #(for [e cols] (% (hmap e))) data)))

;; allows:

(columns #{"Name"} header data) ; (("John") ("Fred"))
(columns (partial re-matches #"E-mail \d+") header data)
; (("john@mailinator.com" "xyz@learnclojure.com" "") ("fred@mailinator.com" "fred@gmail.com" "fred@googlemail.com"))
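In the same spirit as Alex Stoddard's answer, another way to banish raw indices (a sketch of mine, repeating the sample data so the snippet stands alone) is to zipmap the header onto each row, giving one map per record:

```clojure
(def csv-data [["Name" "E-mail 1" "Address" "E-mail 2" "E-mail 3"]
               ["John" "john@mailinator.com" "3 Church St" "xyz@learnclojure.com" ""]
               ["Fred" "fred@mailinator.com" "7 Park Street" "fred@gmail.com" "fred@googlemail.com"]])

(def header (first csv-data))
(def data (rest csv-data))

;; one map per record, keyed by column name
(def records (map #(zipmap header %) data))

(map #(get % "Name") records)     ; ("John" "Fred")
(map #(get % "E-mail 2") records) ; ("xyz@learnclojure.com" "fred@gmail.com")
```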
Thursday, January 27, 2011
£500 if you can find me a job (version 1.0)
That worked a treat, £500 awarded to lucky winner who found me a fun short contract.
Now I'd like another one, so I repeat the offer:
If, within the next six months, I take a job which lasts longer than one month, and that is not obtained through an agency, then on the day the first cheque from that job cashes, I'll give £500 to the person who provided the crucial introduction.
If there are a number of people involved somehow, then I'll apportion it fairly between them. And if the timing conditions above are not quite met, or if someone points me at a short contract which the £500 penalty makes not worth taking, then I'll do something fair and proportional anyway. (The thing Simon pointed me at only lasted three weeks and I paid him in full anyway, because it was neat.)
And this offer applies even to personal friends, and to old contacts who I have not got round to calling yet, and to people who are themselves offering work, because why wouldn't it?
And obviously if I find one through my own efforts then I'll keep the money. But my word is generally thought to be good, and I have made a public promise on my own blog to this effect, so if I cheat you you can blacken my name and ruin my reputation for honesty, which is worth much more to me than £500.
Anyhow, my CV is at http://www.aspden.com, and any advice on how it could be improved will be gratefully received.
I'll also repeat the original advert: (job-hunt.contrib 1.0)
Anyone in Cambridge need a programmer? Obviously Clojure is a speciality, and my current obsession, but I'm also pretty good with C (especially the embedded variety), microcontrollers, and Python, and I have a particular facility with mathematical concepts and algorithms of all kinds. My obsessions can be pretty quickly changed when I need them to be.
I have a (deserved) reputation for being able to produce heavily optimised but nevertheless bug-free and readable code, but I also know how to hack together sloppy, bug-ridden prototypes, and I know which style is appropriate when, and how to slide along the continuum between them.
I've worked in telecoms, commercial research, banking, university research, a chip design company, server virtualization, a couple of startups, and occasionally completely alone.
I've worked on many sizes of machine. I've written programs for tiny 8-bit microcontrollers, and once upon a time every IBM machine in one building in Imperial College was running my partial differential equation solvers in parallel in the background.
I'm smart and I get things done. I'm confident enough in my own abilities that if I can't do something I admit it and find someone who can.
I also have various ancient and rusty skills with things like Java, C++, R, OCaml, Common LISP, Scheme, FORTRAN and Pascal which can be brushed up if necessary. Like all lispers, I occasionally write toy interpreters for made-up languages for fun.
If you're a local company using Java, who might be interested in giving Clojure a try (motivation here, in Paul Graham's classic Beating the Averages), I'd love to try to show you what all the fuss is about.
CV here if you're interested: http://www.aspden.com
I've never used a CV before, having always found work through word of mouth. So I expect that it can be improved. If anyone's got any suggestions as to how it could be better written, do please leave comments or e-mail cv@aspden.com.
Tuesday, January 25, 2011
K-means : An Algorithm for Clustering Data
;; K-means
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; K-means is a Clustering Algorithm.

;; We use it when we have some data, and we want to split the data into separate
;; categories.

;; For instance, an early biologist, let's call him Adam, might measure all
;; sorts of things about the living objects he encounters in the world.

;; (black, feathers, flies)
;; (green, tasty)
;; (green, slithers, poisonous)
;; ... and so on

;; After he collects enough data, he might go looking for structure in it.

;; Uninformed by theory, he might nevertheless notice that many things that do
;; not move are green, and that many things that are not green move.

;; He might name these two obvious groups the Animals and the Plants.

;; Further analysis of the data might split the Plants into Trees and Flowers,
;; and the Animals into Mammals, Birds, and Fish.

;; Theoretically, this process could continue further, extracting 'natural
;; categories' from the observed structure of the data, without any theory about
;; how the various properties come about.

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Let's consider a very simple clustering situation. We have some numbers,
;; and we'd like to see if they form groups.

;; Suppose we want to cluster the points 2 3 5 6 10 11 100 101 102

(def data '(2 3 5 6 10 11 100 101 102))

;; You may be able to see some natural groupings in this data.

;; It's easy enough to say how far one number is from another:

(defn distance [a b]
  (if (< a b) (- b a) (- a b)))

;; To do K-means, we need to start with some guesses about where the clusters are.
;; They don't have to be terribly good guesses.
(def guessed-means '(0 10))

;; Given a particular point, we want to know which of our means is closest

(defn closest [point means distance]
  (first (sort-by #(distance % point) means)))

;; In our little dataset, 2 is closest to the guess of 0, and 100 is closest to
;; the guess of 10

(closest 2 guessed-means distance) ; 0
(closest 100 guessed-means distance) ; 10

;; So we can talk about the group of all points for which 0 is the best guess,
;; and the group of all points for which 10 is the best guess.

(defn point-groups [means data distance]
  (group-by #(closest % means distance) data))

(point-groups guessed-means data distance) ; {0 [2 3 5], 10 [6 10 11 100 101 102]}

;; We can take an average of a group of points

(defn average [& list]
  (/ (reduce + list) (count list)))

(average 6 10 11 100 101 102) ; 55

;; So we can take the average of each group, and use it as a new guess for where
;; the clusters are. If a mean doesn't have a group, then we'll leave it where
;; it is.

(defn new-means [average point-groups old-means]
  (for [o old-means]
    (if (contains? point-groups o)
      (apply average (get point-groups o))
      o)))

(new-means average (point-groups guessed-means data distance) guessed-means) ; (10/3 55)

;; So if we know we've got a particular set of points, and a particular idea of
;; distance, and a way of averaging things, that gives us a way of making a new
;; list of guesses from our original list of guesses

(defn iterate-means [data distance average]
  (fn [means]
    (new-means average (point-groups means data distance) means)))

((iterate-means data distance average) '(0 10)) ; (10/3 55)

;; and of course we can use that as a new guess, and improve it again.

((iterate-means data distance average) '(10/3 55)) ; (37/6 101)

;; We can do this repeatedly until it settles down.

(iterate (iterate-means data distance average) '(0 10))
; ((0 10) (10/3 55) (37/6 101) (37/6 101) .....)
;; K-means with two means seems to be trying to tell us that we've got a group
;; centred around 6 and another centred around 101.

;; These groups are:

(defn groups [data distance means]
  (vals (point-groups means data distance)))

(groups data distance '(37/6 101)) ; ([2 3 5 6 10 11] [100 101 102])

;; Ideally we'd like to iterate until the groups stop changing.
;; I described a function for doing this in a previous post:

(defn take-while-unstable
  ([sq] (lazy-seq (if-let [sq (seq sq)]
                    (cons (first sq) (take-while-unstable (rest sq) (first sq))))))
  ([sq last] (lazy-seq (if-let [sq (seq sq)]
                         (if (= (first sq) last) '()
                             (take-while-unstable sq))))))

(take-while-unstable '(1 2 3 4 5 6 7 7 7 7)) ; (1 2 3 4 5 6 7)

(take-while-unstable
 (map #(groups data distance %)
      (iterate (iterate-means data distance average) '(0 10))))
; (([2 3 5] [6 10 11 100 101 102])
;  ([2 3 5 6 10 11] [100 101 102]))

;; Shows that our first guesses group 2, 3 and 5 (nearer to 0 than 10) vs all the rest.
;; K-means modifies that instantly to separate out the large group of three.

;; We can make a function, which takes our data, notion of distance, and notion of average,
;; and gives us back a function which, for a given set of initial guesses at the means,
;; shows us how the group memberships change.
(defn k-groups [data distance average]
  (fn [guesses]
    (take-while-unstable
     (map #(groups data distance %)
          (iterate (iterate-means data distance average) guesses)))))

(def grouper (k-groups data distance average))

(grouper '(0 10))
; (([2 3 5] [6 10 11 100 101 102])
;  ([2 3 5 6 10 11] [100 101 102]))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Nothing we said above limits us to only having two guesses

(grouper '(1 2 3))
; (([2] [3 5 6 10 11 100 101 102])
;  ([2 3 5 6 10 11] [100 101 102])
;  ([2 3] [5 6 10 11] [100 101 102])
;  ([2 3 5] [6 10 11] [100 101 102])
;  ([2 3 5 6] [10 11] [100 101 102]))

;; The more means we start with, the more detailed our clustering.

(grouper '(0 1 2 3 4))
; (([2] [3] [5 6 10 11 100 101 102])
;  ([2] [3 5 6 10 11] [100 101 102])
;  ([2 3] [5 6 10 11] [100 101 102])
;  ([2 3 5] [6 10 11] [100 101 102])
;  ([2] [3 5 6] [10 11] [100 101 102])
;  ([2 3] [5 6] [10 11] [100 101 102]))

;; We have to be careful not to start off with too many means, or we just get
;; our data back:

(grouper (range 200))
; (([2] [3] [100] [5] [101] [6] [102] [10] [11]))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Generalizing to Other Spaces

;; In fact, nothing we said above depends on our inputs being numbers.
;; We can use any data where we can define a distance, and a method of averaging.

;; One of the easiest things to do this for would be vectors:

(defn vec-distance [a b]
  (reduce + (map #(* % %) (map - a b))))

(defn vec-average [& l]
  (map #(/ % (count l)) (apply map + l)))

(vec-distance [1 2 3] [5 6 7]) ; 48
(vec-average [1 2 3] [5 6 7]) ; (3 4 5)

;; Here's a little set of vectors

(def vector-data '([1 2 3] [3 2 1] [100 200 300] [300 200 100] [50 50 50]))

;; And choosing three guesses in a fairly simple-minded manner, we can see how
;; the algorithm divides them into three different groups.
((k-groups vector-data vec-distance vec-average) '([1 1 1] [2 2 2] [3 3 3]))
; (([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100] [50 50 50]])
;  ([[1 2 3] [3 2 1] [50 50 50]] [[100 200 300] [300 200 100]])
;  ([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100]] [[50 50 50]]))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Pedantic Footnote

;; Note that the algorithm as described above isn't quite the classic K-means.
;; I don't think the difference is terribly important, and I thought it would
;; complicate the explanation to deal with it.

;; In the usual K-means, if you have two identical means, then you're only
;; supposed to update one of them.

;; Here our two identical guesses are both getting updated:

(new-means average (point-groups '(0 0) '(0 1 2 3 4) distance) '(0 0)) ; (2 2)

;; Our update function:

(defn new-means [average point-groups old-means]
  (for [o old-means]
    (if (contains? point-groups o)
      (apply average (get point-groups o))
      o)))

;; Needs to be changed so that if there are two identical means, only one of
;; them will be changed:

(defn update-seq [sq f]
  (let [freqs (frequencies sq)]
    (apply concat
           (for [[k v] freqs]
             (if (= v 1)
               (list (f k))
               (cons (f k) (repeat (dec v) k)))))))

(defn new-means [average point-groups old-means]
  (update-seq old-means
              (fn [o]
                (if (contains? point-groups o)
                  (apply average (get point-groups o))
                  o))))

;; Now only one will get updated at once:

(new-means average (point-groups '(0 0) '(0 1 2 3 4) distance) '(0 0)) ; (2 0)

;; Now we don't lose groups when the means get aliased.
((k-groups '(0 1 2 3 4) distance average) '(0 1))
; (([0] [1 2 3 4]) ([0 1] [2 3 4]))

((k-groups '(0 1 2 3 4) distance average) '(0 0))
; (([0 1 2 3 4]) ([0] [1 2 3 4]) ([0 1] [2 3 4]))

((k-groups vector-data vec-distance vec-average) '([1 1 1] [1 1 1] [1 1 1]))
; (([[1 2 3] [3 2 1] [100 200 300] [300 200 100] [50 50 50]])
;  ([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100] [50 50 50]])
;  ([[1 2 3] [3 2 1] [50 50 50]] [[100 200 300] [300 200 100]])
;  ([[1 2 3] [3 2 1]] [[100 200 300] [300 200 100]] [[50 50 50]]))

;; Although it's still possible that a mean never acquires any points, so we can
;; still get out fewer groups than means.

((k-groups '(0 1 2 3 4) distance average) '(0 5 10))
; (([0 1 2] [3 4]))
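As for where the initial guesses come from: a common heuristic (this helper is my own sketch, not part of the post's code) is simply to pick k distinct points from the data itself, which guarantees every mean starts out closest to at least one real point.

```clojure
;; choose k distinct data points at random to use as starting means
(defn random-guesses [k data]
  (take k (shuffle (vec (distinct data)))))

(random-guesses 2 '(2 3 5 6 10 11 100 101 102)) ; e.g. (11 100) -- varies run to run
```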
Cleaning Old Definitions from the REPL : shred-user
;; I often find myself restarting the clojure repl in order to get rid of old
;; definitions which introduce subtle bugs.

;; Consider:

(defn factorial [n]
  (if (< n 2) 1
      (* n (factorial (dec n))))) ; #'user/factorial

(factorial 10) ; 3628800

;; Suppose instead I wanted (factorial2 n) to be (* 1 2 2 2 3 2 4 .... 2 n)

(defn factorial2 [n]
  (if (< n 2) 1
      (* 2 n (factorial (dec n))))) ; #'user/factorial2

;; There is a subtle bug introduced by my failing to rename the recursive call

(map factorial2 '(1 2 3 4)) ; (1 4 24 192)

;; But (factorial2 4) should be (* 1 2 2 2 3 2 4) = 192

;; I suspect that there is an old definition hanging around confusing matters.

;; To find it, I can run the program from the command line, or restart the repl
;; and recompile, or:

(doseq [s (map first (ns-interns 'user))] (ns-unmap 'user s)) ; nil

(defn factorial2 [n]
  (if (< n 2) 1
      (* 2 n (factorial (dec n))))) ; Unable to resolve symbol: factorial in this context

;; Aha!

(defn factorial2 [n]
  (if (< n 2) 1
      (* 2 n (factorial2 (dec n))))) ; #'user/factorial2

(factorial2 4) ; 192

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; For convenience I'd like to make that a function:

(defn shred-user []
  (doseq [s (map first (ns-interns 'user))]
    (ns-unmap 'user s)))

(shred-user) ; nil
(shred-user) ; Unable to resolve symbol: shred-user in this context ; bugger!

(defn shred-user []
  (doseq [s (filter (complement #{'shred-user}) (map first (ns-interns 'user)))]
    (ns-unmap 'user s)))

(shred-user) ; nil
(shred-user) ; nil
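For completeness (my note, not from the post): when you already know which definition is stale, ns-unmap will remove just that one var, which is less drastic than shredding the whole namespace.

```clojure
(defn stale-fn [] :old) ; some definition we no longer want

(ns-unmap *ns* 'stale-fn) ; remove just this one mapping

(resolve 'stale-fn) ; nil -- the var is gone
```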
take-while-unstable
;; I often find myself wanting to iterate something until it converges.

(defn collatz [n]
  (if (even? n)
    (recur (/ n 2))
    (+ 1 (* 3 n))))

(iterate collatz 100) ; (100 76 58 88 34 52 40 16 4 4 4 4 4 4 ...)

;; And so I often find myself writing:

(defn take-while-unstable
  ([sq] (if (empty? sq) '()
            (cons (first sq) (#'take-while-unstable (rest sq) (first sq)))))
  ([sq last] (cond (empty? sq) '()
                   (= (first sq) last) '()
                   :else (cons (first sq) (#'take-while-unstable (rest sq) (first sq))))))

;; Which allows me to stop the iteration once it's settled down.

(take-while-unstable (iterate collatz 100)) ; (100 76 58 88 34 52 40 16 4)

;; You can see how it works here:

(require 'clojure.contrib.trace)

(binding [*print-length* 20] ; taking care not to look at the medusa
  (clojure.contrib.trace/dotrace (take-while-unstable)
    (take-while-unstable (iterate collatz 100))))

;; TRACE t15884: (take-while-unstable (100 76 58 88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 ...))
;; TRACE t15885: | (take-while-unstable (76 58 88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 100)
;; TRACE t15886: | | (take-while-unstable (58 88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 76)
;; TRACE t15887: | | | (take-while-unstable (88 34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 58)
;; TRACE t15888: | | | | (take-while-unstable (34 52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 88)
;; TRACE t15889: | | | | | (take-while-unstable (52 40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 34)
;; TRACE t15890: | | | | | | (take-while-unstable (40 16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 52)
;; TRACE t15891: | | | | | | | (take-while-unstable (16 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 40)
;; TRACE t15892: | | | | | | | | (take-while-unstable (4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 16)
;; TRACE t15893: | | | | | | | | | (take-while-unstable (4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ...) 4)
;; TRACE t15893: | | | | | | | | | => ()
;; TRACE t15892: | | | | | | | | => (4)
;; TRACE t15891: | | | | | | | => (16 4)
;; TRACE t15890: | | | | | | => (40 16 4)
;; TRACE t15889: | | | | | => (52 40 16 4)
;; TRACE t15888: | | | | => (34 52 40 16 4)
;; TRACE t15887: | | | => (88 34 52 40 16 4)
;; TRACE t15886: | | => (58 88 34 52 40 16 4)
;; TRACE t15885: | => (76 58 88 34 52 40 16 4)
;; TRACE t15884: => (100 76 58 88 34 52 40 16 4)

;; (100 76 58 88 34 52 40 16 4)

;; Anyway, this function comes in so handy that I thought I'd lazyize it and
;; give it some tests.

(use 'clojure.test)

(with-test
  (defn take-while-unstable
    ([sq] (lazy-seq (if-let [sq (seq sq)]
                      (cons (first sq) (take-while-unstable (rest sq) (first sq))))))
    ([sq last] (lazy-seq (if-let [sq (seq sq)]
                           (if (= (first sq) last) '()
                               (take-while-unstable sq))))))
  (is (= (take-while-unstable '()) '()))
  (is (= (take-while-unstable '(1)) '(1)))
  (is (= (take-while-unstable '(1 1)) '(1)))
  (is (= (take-while-unstable '(1 2)) '(1 2)))
  (is (= (take-while-unstable '(1 1 2)) '(1)))
  (is (= (take-while-unstable '(1 1 1)) '(1)))
  (is (= (take-while-unstable '(1 1 2 1)) '(1)))
  (is (= (take-while-unstable '(1 2 3 4 5 6 7 7 7 7)) '(1 2 3 4 5 6 7)))
  (is (= (take 10000 (take-while-unstable (range))) (take 10000 (range))))
  (is (= (take-while-unstable (concat (take 10000 (range)) '(10000) (drop 10000 (range)))) (range 10001)))
  (is (= (take-while-unstable '(nil)) '(nil)))
  (is (= (take-while-unstable '(nil nil)) '(nil)))
  (is (= (take-while-unstable '[nil nil]) '(nil)))
  (is (= (take-while-unstable [:a :b :c :d :d :e]) '(:a :b :c :d))))

(run-tests) ; {:type :summary, :test 1, :pass 14, :fail 0, :error 0}

;; Hopefully someone will find it useful. Can anyone break it?
Monday, January 17, 2011
A Very Gentle Introduction to Information Theory: Guessing the Entropy
;; A Very Gentle Introduction to Information Theory : Part IV

;; Entropy and Huffman Coding

;; Once again, I'll keep some code from the earlier parts, without explanation.

;; Probability distributions

(defn combine-keywords [& a]
  (keyword (apply str (mapcat #(drop 1 (str %)) a))))

(defn combine-distributions
  ([P] P)
  ([P1 P2] (into {} (for [[s1 p1] P1 [s2 p2] P2]
                      [(combine-keywords s1 s2) (* p1 p2)])))
  ([P1 P2 & Plist] (reduce combine-distributions (cons P1 (cons P2 Plist)))))

;; Coding and Decoding

(defn decoder
  ([code-tree stream] (decoder code-tree code-tree stream))
  ([current-code-tree code-tree stream]
     (lazy-seq
      (if (keyword? current-code-tree)
        (cons current-code-tree (decoder code-tree code-tree stream))
        (if-let [stream (seq stream)]
          (if (= (first stream) 1)
            (decoder (first current-code-tree) code-tree (rest stream))
            (decoder (second current-code-tree) code-tree (rest stream))))))))

(defn encoder [code stream] (mapcat code stream))

(defn make-encoder [code] (fn [s] (encoder code s)))
(defn make-decoder [code-tree] (fn [s] (decoder code-tree s)))

;; Huffman encoding

(defn symbols [prefix code-tree]
  (if (keyword? code-tree)
    (list prefix code-tree)
    (concat (symbols (cons 1 prefix) (first code-tree))
            (symbols (cons 0 prefix) (second code-tree)))))

(defn make-code [code-tree]
  (into {} (map (fn [[c s]] [s (reverse c)])
                (partition 2 (symbols '() code-tree)))))

;; The original make-code-tree was very slow, because it re-sorted the list every
;; iteration. We can speed it up considerably by using a priority queue instead
;; of sorting the list every iteration. Clojure doesn't have one built in, so
;; here's a poor man's version built out of a sorted map of lists.

(defn madd [m [k p]]
  (assoc m p (cons k (m p))))

(defn mpop [m]
  (let [[k vlist] (first m)]
    [k (first vlist)
     (if (empty? (rest vlist))
       (dissoc m k)
       (assoc m k (rest vlist)))]))

(defn mcombine [m]
  (let [[pa a pop1] (mpop m)
        [pb b pop2] (mpop pop1)]
    (madd pop2 [[a b] (+ pa pb)])))

(defn make-code-tree [P]
  (second (mpop (nth (iterate mcombine (reduce madd (sorted-map) P))
                     (dec (count P))))))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

(def fair-coin {:H 1 :T 1})
(def unfair-coin {:H 3 :T 1})

(def unfair-triples (combine-distributions unfair-coin unfair-coin unfair-coin))

(def triple-code-tree (make-code-tree unfair-triples))
(def triple-code (make-code triple-code-tree))

;; We don't need to estimate the cost of a code by generating a long random
;; stream and transmitting it. Given a probability distribution and a code, we
;; can just calculate the expected cost.

;; We make a distribution over the transmitted symbols:

(defn code-distribution [P code]
  (for [s (keys P)] [(P s) (code s)]))

;; e.g.
(code-distribution unfair-triples triple-code)
;; ([27 (0)] [9 (1 1 1)] [9 (1 1 0)] [3 (1 0 0 1 0)] [9 (1 0 1)] [3 (1 0 0 0 1)] [3 (1 0 0 0 0)] [1 (1 0 0 1 1)])

;; And from that, it's easy to calculate the expected length of a sequence:

(defn expected-code-length [P code]
  (let [cd (code-distribution P code)]
    (/ (reduce + (for [[k v] cd] (* k (count v))))
       (reduce + (map first cd)))))

;; So the expected cost per symbol is:
(expected-code-length unfair-triples triple-code) ; 79/32

;; or per coin-toss:
(float (/ 79/32 3)) ; 0.8229167

;; So we can get the noise out of our table:

(defn cost-for-n-code [P n]
  (let [Pn (apply combine-distributions (repeat n P))
        code (make-code (make-code-tree Pn))]
    (float (/ (expected-code-length Pn code) n))))

(cost-for-n-code unfair-coin 1) ; 1.0
(cost-for-n-code unfair-coin 2) ; 0.84375
(cost-for-n-code unfair-coin 3) ; 0.8229167
(cost-for-n-code unfair-coin 4) ; 0.8183594
(cost-for-n-code unfair-coin 5) ; 0.81777346
(cost-for-n-code unfair-coin 6) ; 0.8186849
(cost-for-n-code unfair-coin 7) ; 0.81685966
(cost-for-n-code unfair-coin 8) ; 0.81575775
(cost-for-n-code unfair-coin 9) ; 0.81493336
(cost-for-n-code unfair-coin 10) ; 0.8141917
(cost-for-n-code unfair-coin 11) ; 0.8137328
(cost-for-n-code unfair-coin 12) ; 0.81351095

;; It looks like something is converging, although the convergence isn't monotonic.

;; I'm now revising my estimate of the cost of sending the results of a 1:3
;; process to be about 0.813.

;; But we don't know whether that's the limit of the coding process, or a
;; property of the 1:3 distribution, or whether it's in some way specific to
;; transmitting over a binary channel.

;; Let's look at some other distributions.

;; For the fair coin distribution, huffman coding triples doesn't help at all.

(cost-for-n-code fair-coin 1) ; 1.0
(cost-for-n-code fair-coin 2) ; 1.0
(cost-for-n-code fair-coin 3) ; 1.0
(cost-for-n-code fair-coin 4) ; 1.0

;; But for an even choice between three things, it does:

(def triad {:A 1 :B 1 :C 1})

(cost-for-n-code triad 1) ; 1.6666666
(cost-for-n-code triad 2) ; 1.6111112
(cost-for-n-code triad 3) ; 1.6049383
(cost-for-n-code triad 4) ; 1.6049383
(cost-for-n-code triad 5) ; 1.5893004
(cost-for-n-code triad 6) ; 1.5992227
(cost-for-n-code triad 7) ; 1.5895878
(cost-for-n-code triad 8) ; 1.5939262

;; For a choice between four things, it makes no difference:

(def quad {:A 1 :B 1 :C 1 :D 1})

(cost-for-n-code quad 1) ; 2.0
(cost-for-n-code quad 2) ; 2.0
(cost-for-n-code quad 3) ; 2.0
(cost-for-n-code quad 4) ; 2.0
(cost-for-n-code quad 5) ; 2.0

;; For five it's a good thing to do:

(def quint {:A 1 :B 1 :C 1 :D 1 :E 1})

(cost-for-n-code quint 1) ; 2.4
(cost-for-n-code quint 2) ; 2.36
(cost-for-n-code quint 3) ; 2.3253334
(cost-for-n-code quint 4) ; 2.3404
(cost-for-n-code quint 5) ; 2.337856
(cost-for-n-code quint 6) ; 2.3252373

;; And again, for the next power of two, no difference.
(def octet {:A 1 :B 1 :C 1 :D 1 :E 1 :F 1 :G 1 :H 1})

(cost-for-n-code octet 1) ; 3.0
(cost-for-n-code octet 2) ; 3.0
(cost-for-n-code octet 3) ; 3.0

;; I think that we might have guessed that it would take three bits to decide
;; between eight equally likely things, and two bits for four things, but what
;; about the other numbers?

;; If 8 = 2*2*2 -> 3, and 4 = 2*2 -> 2, and 2 -> 1, what's the easiest pattern
;; we can fit to that?

(defn bits [n] (/ (Math/log n) (Math/log 2))) ;; Also known as log2

(map bits (range 2 10))
;; (1.0 1.5849625007211563 2.0 2.321928094887362 2.584962500721156
;;  2.807354922057604 3.0 3.1699250014423126)

;; So let's make a prediction
(def sextet {:A 1 :B 1 :C 1 :D 1 :E 1 :F 1})

(bits 6) ; 2.584962500721156

(cost-for-n-code sextet 1) ; 2.6666667
(cost-for-n-code sextet 2) ; 2.6111112
(cost-for-n-code sextet 3) ; 2.6049383
(cost-for-n-code sextet 4) ; 2.6049383
(cost-for-n-code sextet 5) ; 2.5893004
(cost-for-n-code sextet 6) ; 2.5992227

;; It looks as though the cost of coding an even distribution using Huffman
;; encoding of runs is pretty close to being the logarithm (to base 2) of the
;; number of symbols.
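That log2 pattern is Shannon's entropy formula in disguise. As an editor's sketch (not from the original post; `normalize` and `entropy` are made-up helper names), here is the expected-information formula H(P) = -Σ p·log2(p) applied to the distributions above:

```clojure
;; Editor's sketch: Shannon entropy of a distribution given as counts,
;; H(P) = -(sum of p * log2 p). `normalize` and `entropy` are hypothetical
;; helpers, not part of the original post.
(defn normalize [P]
  (let [total (reduce + (vals P))]
    (for [v (vals P)] (/ v total))))

(defn entropy [P]
  (- (reduce + (for [p (normalize P)]
                 (* p (/ (Math/log p) (Math/log 2)))))))

(entropy {:H 3 :T 1})      ; ~0.81128, where the unfair-coin table seems headed
(entropy {:A 1 :B 1 :C 1}) ; ~1.58496, i.e. (bits 3), ditto for the triad
```

The unfair-coin figure ~0.81128 sits just below the 0.81351 reached at n=12 above, which is at least consistent with the table converging to the entropy.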
Saturday, January 15, 2011
A Very Gentle Introduction to Information Theory: Huffman Coding
;; A Very Gentle Introduction to Information Theory : Part III

;; Entropy and Huffman Coding

;; Once again, I'll keep some code from the first two parts, without explanation

(defn random-stream [P]
  (let [pseq (vec (mapcat (fn [[k v]] (repeat v k)) P))]
    (for [i (range)] (rand-nth pseq))))

(defn cost [encoder decoder message]
  (let [coded (encoder message)]
    (if (= (decoder coded) message)
      (count coded)
      :fail)))

(defn decoder
  ([code-tree stream] (decoder code-tree code-tree stream))
  ([current-code-tree code-tree stream]
     (lazy-seq
      (if (keyword? current-code-tree)
        (cons current-code-tree (decoder code-tree code-tree stream))
        (if-let [stream (seq stream)]
          (if (= (first stream) 1)
            (decoder (first current-code-tree)  code-tree (rest stream))
            (decoder (second current-code-tree) code-tree (rest stream))))))))

(defn encoder [code stream] (mapcat code stream))

(defn make-encoder [code] (fn [s] (encoder code s)))
(defn make-decoder [code-tree] (fn [s] (decoder code-tree s)))

(defn combine-keywords [& a]
  (keyword (apply str (mapcat #(drop 1 (str %)) a))))

(defn split-keyword [a]
  (map #(keyword (str %)) (drop 1 (str a))))

(defn make-combination-encoder [code n]
  (fn [s] (encoder code (map #(apply combine-keywords %) (partition n s)))))

(defn make-combination-decoder [code-tree]
  (fn [s] (mapcat split-keyword (decoder code-tree s))))

;; So far we've looked at three probability distributions:
(def fair-coin {:H 1 :T 1})
(def unfair-coin {:H 3 :T 1})
(def unfair-pairs {:HH 9, :HT 3, :TH 3, :TT 1})

;; And two codes:
(def fair-code-tree [:H :T])
(def fair-code {:H '(1) :T '(0)})

(def unfair-code-tree [:HH [:HT [:TH :TT]]])
(def unfair-code {:HH '(1) :HT '(0 1) :TH '(0 0 1) :TT '(0 0 0)})

;; We should add a fourth probability distribution to represent pairs of fair
;; coin toss results
(def fair-pairs {:HH 1 :HT 1 :TH 1 :TT 1})

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; We found that

(defn estimate-cost [P encoder decoder]
  (let [n 100000
        c (cost encoder decoder (take n (random-stream P)))]
    (if (number? c) (float (/ c n)) c)))

;; Using the best code we can think of for the fair coin resulted in a
;; transmission cost of 1 (£/symbol)
(estimate-cost fair-coin (make-encoder fair-code) (make-decoder fair-code-tree)) ; 1.0

;; And that that was also the cost for the unfair coin with this code:
(estimate-cost unfair-coin (make-encoder fair-code) (make-decoder fair-code-tree)) ; 1.0

;; But that we could come up with a code for pairs of coin tosses
;; which did substantially better for pairs of unfair coin tosses
(estimate-cost unfair-pairs (make-encoder unfair-code) (make-decoder unfair-code-tree)) ; 1.68338

;; but substantially worse for pairs of fair coin tosses
(estimate-cost fair-pairs (make-encoder unfair-code) (make-decoder unfair-code-tree)) ; 2.24722

;; remember that that's the transmission cost per symbol, and that each symbol
;; represents two coin tosses

;; In case you think there's any sleight of hand going on there, here's how
;; we'd use the pairs code to transmit the original unpaired streams
(estimate-cost unfair-coin (make-combination-encoder unfair-code 2) (make-combination-decoder unfair-code-tree)) ; 0.84561
(estimate-cost fair-coin (make-combination-encoder unfair-code 2) (make-combination-decoder unfair-code-tree)) ; 1.12407

;; Notice that the costs here are per-toss, showing that the unfair code is
;; actually an improvement

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;; Now it seems that, if we can send the results of unfair-coin more efficiently
;; by considering {:HH 9, :HT 3, :TH 3, :TT 1}, the distribution of pairs of
;; tosses, then we should have a look at the distribution of triples, and work
;; out a code for that:

(defn combine-distributions
  ([P] P)
  ([P1 P2]
     (into {}
           (for [[s1 p1] P1
                 [s2 p2] P2]
             [(combine-keywords s1 s2) (* p1 p2)])))
  ([P1 P2 & Plist]
     (reduce combine-distributions (cons P1 (cons P2 Plist)))))

(def unfair-triples (combine-distributions unfair-coin unfair-coin unfair-coin))

;; unfair-triples is {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, :TTH 3, :TTT 1}

;; Now how should we work out a code for this distribution?

;; Huffman tells us that we should combine the two lowest probability events

;; so that {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, :TTH 3, :TTT 1}
;; goes to {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, {:TTH 3, :TTT 1} 4}

;; and then do it again, so that
;;         {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, :THT 3, {:TTH 3, :TTT 1} 4}
;; goes to {:HHH 27, :HHT 9, :HTH 9, :HTT 3, :THH 9, {:THT 3, {:TTH 3, :TTT 1} 4} 7}

;; and so on .....

(defn huffman-combine [P]
  (let [plist (sort-by second P)
        newelement (into {} (take 2 plist))]
    (into {} (cons [newelement (reduce + (vals newelement))] (drop 2 plist)))))

(nth (iterate huffman-combine unfair-triples) (dec (dec (count unfair-triples))))
;; {{{:HHT 9, :HTH 9} 18, {:THH 9, {{:TTT 1, :HTT 3} 4, {:THT 3, :TTH 3} 6} 10} 19} 37, :HHH 27}

;; At the end, we get a sort of nested binary probability distribution

;; You could think of this as being a way to generate the triples by tossing
;; strangely biased coins!

;; From that, we can generate our code tree directly by just throwing away the numbers
(require 'clojure.walk)

(defn make-code-tree [P]
  (clojure.walk/postwalk
   #(if (map? %) (into [] (map first %)) %)
   (nth (iterate huffman-combine P) (dec (dec (count P))))))

(def triple-code-tree (make-code-tree unfair-triples))
;; [[[:HHT :HTH] [:THH [[:TTT :HTT] [:THT :TTH]]]] :HHH]

;; If we have the decoder, then we can use it to generate the coder!

(defn symbols [prefix code-tree]
  (if (keyword? code-tree)
    (list prefix code-tree)
    (concat (symbols (cons 1 prefix) (first code-tree))
            (symbols (cons 0 prefix) (second code-tree)))))

(defn make-code [code-tree]
  (into {} (map (fn [[c s]] [s (reverse c)])
                (partition 2 (symbols '() code-tree)))))

(def triple-code (make-code triple-code-tree))
;; {:HHT (0 0 0), :HTH (0 0 1), :THH (0 1 0), :TTT (0 1 1 0 0),
;;  :HTT (0 1 1 0 1), :THT (0 1 1 1 0), :TTH (0 1 1 1 1), :HHH (1)}

;; Let's see how this does
(estimate-cost unfair-triples (make-encoder triple-code) (make-decoder triple-code-tree)) ; 2.4615

;; £2.46 per symbol, or 82p per toss

;; So while going from single tosses to pairs allowed us to go from 1 -> 0.85,
;; going from pairs to triples only allowed us to get from 0.85 -> 0.82.

;; Is there an end to this process?

(defn bit-rate [P n]
  (let [Pn (apply combine-distributions (repeat n P))
        tree (make-code-tree Pn)]
    (/ (estimate-cost Pn (make-encoder (make-code tree)) (make-decoder tree)) n)))

(bit-rate unfair-coin 1)  ; 1.0
(bit-rate unfair-coin 2)  ; 0.844435
(bit-rate unfair-coin 3)  ; 0.82466
(bit-rate unfair-coin 4)  ; 0.8196275
(bit-rate unfair-coin 5)  ; 0.818912
(bit-rate unfair-coin 6)  ; 0.81896996
(bit-rate unfair-coin 7)  ; 0.8166514
(bit-rate unfair-coin 8)  ; 0.815995
(bit-rate unfair-coin 9)  ; 0.81352335
(bit-rate unfair-coin 10) ; 0.81462896
(bit-rate unfair-coin 11) ; 0.8137691

;; To me, this is at least suggestive that there might be something fundamental
;; about a cost of 82p to transmit the result of a 3:1 random result over a
;; binary channel.
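To make the per-toss arithmetic concrete, here's an editor's sketch of encoding a short message by hand, with the triple code written out as a literal map so the snippet stands alone (the message chosen is hypothetical, not from the post):

```clojure
;; Editor's sketch: the triple code from the post, as a literal map.
(def triple-code
  {:HHT '(0 0 0), :HTH '(0 0 1), :THH '(0 1 0), :TTT '(0 1 1 0 0),
   :HTT '(0 1 1 0 1), :THT '(0 1 1 1 0), :TTH '(0 1 1 1 1), :HHH '(1)})

;; Four triples, i.e. twelve coin tosses:
(def message [:HHH :HHH :TTT :HTH])

;; mapcat-ing the code over the message is exactly what `encoder` does:
(def transmitted (mapcat triple-code message))
;; (1 1 0 1 1 0 0 0 0 1)

;; 10 bits for 12 tosses: about 0.83/toss for this (unusually lucky) message
(float (/ (count transmitted) 12)) ; 0.8333333
```

Messages heavy in :HHH, the cheap one-bit symbol, come out well under one bit per toss; a run of :TTTs would cost 5/3 bits per toss, which is why only the expected cost over the whole distribution settles near 82p.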