Once More, This Time With Clojure

If you happened to read my post from the other day entitled My New “Top Artists Last 7 Days” Widget, you know that I went through three iterations of getting it going. The final solution, written in Ruby worked well. Until bands like Motörhead, Mötley Crüe and Einstürzende Neubauten showed up in the list. At that point, the HTML parsing library I was using would barf, and processing would stop, leaving the list showing on the blog in an incomplete state. It wasn’t the library’s fault; apparently Ruby still has problems dealing with non-ASCII characters. I did everything I thought I needed to do to tell Ruby that it would be dealing with UTF-8 encoding, but it just kept right on barfing.

I was left with only two choices: stop listening to any band with an umlaut in the name (and God help me if any of my Scandinavian bands popped up, with the Ø or å characters), or rewrite the stupid program, again, in a language that I knew could easily deal with UTF-8.

Since I’ve been working in Clojure a lot lately, it seemed lika the logical choice. I spent about an hour working on it last night, and I ended up with a working program and a bit more Clojure experience. Here’s the program for your edification, with a description to follow:

(ns lastfmfetch.core
(:gen-class))

(require '[clj-http.client :as client])
(import '(java.io PrintStream)
'(org.htmlcleaner HtmlCleaner))

(defn get-artist-and-playcount [cell]
(let [title (.getAttributeByName cell "title")
[match artist playcount] (re-matches #"^(.+), played ([wd]+)s*S*$" title)
playcountStr (if (= playcount "once") "1" playcount)]
[artist playcountStr]))

(defn get-url [cell]
(let [links (.getElementsByName cell "a" true)
a (first links)
href (.getAttributeByName a "href")]
(str "http://last.fm" href)))

(defn fetch-data [filename]
(let [response (client/get "http://www.last.fm/user/joeyGibson/charts?rangetype=week&subtype=artists")
cleaner (HtmlCleaner.)]
(if (= (:status response) 200)
(with-open [out (PrintStream. filename "UTF-8")]
(.println out "<html><head><meta charset="UTF-8"/></head><body><ol>")
(doto (.getProperties cleaner)
(.setOmitComments true)
(.setPruneTags "script,style"))
(when-let [node (.clean cleaner (:body response))]
(let [subjectCells (take 5 (.getElementsByAttValue node "class" "subjectCell" true true))]
(doseq [cell subjectCells]
(let [[artist playcount] (get-artist-and-playcount cell)
url (get-url cell)]
(.println out (str "<li><a href='" url "'>" artist "</a>, Plays: " playcount "</li>"))))))
(.println out "</ol></body></html>")))))

;; Main
(defn -main [& args]
(if (< (count args) 1)
(println "Usage: lastfmfetch <output_file>")
(fetch-data (first args))))

I ended up using a library called clj-http to handle the fetching of the URL. It’s a Clojure wrapper for the Apache HTTP Commons library, and was really easy to use. I’m using Leningen, by the way, so including clj-http was just a matter of including a line in the project.clj file. I also used a Java library called HTMLCleaner, that fixes broken HTML and makes it available as a DOM. Since it is also in Maven Central, it was easy to include by adding another line to the project file.

(defproject lastfmfetch "1.0.0-SNAPSHOT"
:description "Fetch chart data from Last.fm"
:dependencies [[org.clojure/clojure "1.3.0"]
[net.sourceforge.htmlcleaner/htmlcleaner "2.2"]
[ clj-http "0.2.6"]]
:main lastfmfetch.core)

The -main function begins on line 38, but all it really does is check that there is a single command-line argument, and exits with a usage message if there is not. It then calls the fetch-data function, which begins on line 20.

On line 21, we declare two locals; one that will contain the results of fetching the web page, and one that is the HTML cleaner. If the fetch of the URL was successful, the status code will be the standard HTTP 200. If we got that, we then open a PrintStream on the filename given, specifying that it should be encoded with UTF-8. (I’ve been working with Java for a very long time, and I always assumed that since Java strings are Unicode, files created with Java would default to UTF-8. That is not the case. That’s why there’s a second argument when creating the PrintStream, and why I’m not using a PrintWriter.) We then print the first part of the output HTML file, set a couple of options to HTML Cleaner that cause it to strip comments, style and script sections from the HTML, and then start doing the real work.

On line 29, we declare a local called node that will contain the output of HTML Cleaner if it successfully parsed and cleaned the HTML. That’s what when-let does; it assigns the local as long as the function returns something truthy and then executes its body. If that function doesn’t return something truthy, the rest of the code is skipped. We then take the first five elements from the HTML that have an attribute called “class” with a value of “subjectCell”. These are table cells. We then loop over them, extracting the artist and playcount value, and the URL. We do these things in two separate functions.

The function called get-artist-and-playcount, starting on line 8, takes the table cell as input. It then gets the attribute called “title” and uses a regular expression to pull out the artist and playcount values. If the playcount is the word “once,” it converts it to a 1, so all the values are numeric. It then returns the two values as a vector.

The function called get-url, starting on line 14, also takes the table cell as input. It then gets all the “a” elements from the cell (there’s only one), and then gets the “href” attribute’s value, which is the URL.

Back at line 34, we take the three values we extracted with the two support functions and concatenates them together into HTML that will be a single line in an ordered list. We then output all the necessary closing tags to make the HTML well-formed, and we’re done.

While the Clojure code is a bit more dense than the Ruby code, it’s actually four lines shorter. And it handles Unicode characters, which makes me happy.

Advertisements

JUnitLaunchFixer 1.0.4 Released

I’ve just released a new version of my Eclipse plugin called JUnitLaunchFixer. If you’ve never heard of it, it lets you set the default heap size for new debug launchers in Eclipse. You can read more about it in the original announcement. This new version fixes a bug that caused it not to accept new parameters in one case, and it also adds the ability to set a default for the maximum permgen size, which is something I’d been wanting to add for a while.

If you already have it installed, go to Help>Check for Updates in Eclipse, and you’ll see it. If you don’t have it installed, the update site is here:

http://junitlaunchfixer.googlecode.com/svn/trunk/JUnitLaunchFixer-update-site/

 

99 Scala Problems #28 – I Like My Solution Better

I’ve been working through this list of 99 Scala Problems, which is modeled after this list of 99 Prolog Problems. As I’ve been going through them, I have been comparing my solutions to those provided (obviously). Sometimes, my solution is more or less the same as the “official” solution. Sometimes, theirs is better. In the case of problem 28, I think mine is far easier to read and understand.

Problem 28 has two parts. The first part reads:

a) We suppose that a list contains elements that are lists themselves. The objective is to sort the elements of the list according to their length. E.g. short lists first, longer lists later, or vice versa.

Running the function should look like this:

[scala]
scala> lsort(List(List("a", "b", "c"), List("d", "e"), List("f", "g", "h"), List("d", "e"), List("i", "j", "k", "l"), List("m", "n"), List("o")))

res0: List[List[java.lang.String]] = List(List(o), List(d, e), List(d, e), List(m, n), List(a, b, c), List(f, g, h), List(i, j, k, l))
[/scala]

For this part, my solution was almost identical. Here’s what I came up with:

[scala]
def lsort[T](ls: List[List[T]]) = {
ls.sortWith {(a, b) => a.length < b.length}
}
[/scala]

You can see that this function takes a List of type T, and then calls the sortWith method on that list, passing in a function value that sorts the lists based on their length, shortest to longest. The “official” solution was only slightly different:

[scala]
def lsort[A](ls: List[List[A]]): List[List[A]] =
ls sort { _.length < _.length }
[/scala]

Here, they used A instead of T, but that doesn’t affect anything, and they specified the return type, while I left mine inferred. Instead of assigning each bucket of the list to a named variable, as I did, they use the underscore placeholder. The two functions are functionally (get it?) identical, but theirs is a bit shorter because they removed the outer braces, and were able to skip the parameter list, since they used the underscores.

Now, the second part is where I diverge from the official solution. Here’s the problem description:

b) Again, we suppose that a list contains elements that are lists themselves. But this time the objective is to sort the elements according to their length frequency; i.e. in the default, sorting is done ascendingly, lists with rare lengths are placed [first], others with a more frequent length come later.

And the expected call and result is

[scala]
scala> lsortFreq(List(List("a", "b", "c"), List("d", "e"), List("f", "g", "h"), List("d", "e"), List("i", "j", "k", "l"), List("m", "n"), List("o")))

res1: List[List[java.lang.String]] = List(List(i, j, k, l), List(o), List(a, b, c), List(f, g, h), List(d, e), List(d, e), List(m, n))[/scala]

First, let’s look at what they presented as the solution. It referenced functions from other files, but I have included them all here for easy of viewing.

[scala]
def lsortFreq[A](ls: List[List[A]]): List[List[A]] = {
val freqs = Map(encode(ls map { _.length } sort { _ < _ }) map { _.swap }:_*)
ls sort { (e1, e2) => freqs(e1.length) < freqs(e2.length) }
}

def encode[T](ls: List[T]): List[(Int, T)] = {
val packedList = pack(ls)
packedList map {list => (list.length, list.head)}
}

def pack[T](ls: List[T]): List[List[T]] = ls match {
case Nil => Nil
case h :: tail => (h :: tail.takeWhile(_ == h)) :: pack(tail.dropWhile(_ == h))
}
[/scala]

I think this is very confusing code. It’s calling the encode function which does run-length encoding of the passed-in thing. It then uses a Map of these encodings to sort the passed-in list. The presence of five underscores in the first line, obscures where those parameters are coming from, and the final underscore is actually part of the _* method of the Array class!

My solution, while being a longer function, is far more readable, in my opinion. And, it’s the same number of lines as the three-method solution. Here it is

[scala]
def lsortFreq[T](ls: List[List[T]]) = {
val lengthMap = scala.collection.mutable.Map[Int, Int]()

for (l <- ls) {
val len = l.length
if (!lengthMap.contains(len)) {
lengthMap(len) = 1
} else {
lengthMap(len) += 1
}
}

ls sortWith {(a, b) => lengthMap(a.length) < lengthMap(b.length)}
}
[/scala]

In my function, I created a mutable Map and then iterate over the list, getting each item’s length, and then keep a running tally of how many items had that length. The map has these lengths as its keys, and the number of items with that length as its values. Get it? I then sort the original list by having each item in the comparison lookup how many items share its length, and use that as the sort criterion.

I have no idea which of these solutions is more efficient. For small problems like this, I doubt there’s any measurable difference. But I do believe that my solution is easier to read and understand. So much so, in fact, that I think someone who is not familiar with Scala would be able to easily figure out what it’s doing. I don’t know that the same can be said of the other solution.

I got criticized for promoting terse code in this article, so this is my attempt at balance. 🙂

Note: I did change the inputs to these functions from symbols to strings. The code formatter I use on the blog wasn’t colorizing things properly when there were symbols involved.

Procedural vs. Functional

With the rise of Scala and Clojure, there’s been a lot of talk lately about procedural vs. functional styles of coding. Most developers are accustomed to procedural coding, and functional can be hard to get a handle on. I was working through Programming in Scala (again) this morning, and I came upon this function:

// Procedural implementation
def longestWord(words: Array[String]) = {
  var word = words(0)
  var idx = 0

  for (i <- 1 until words.length)
    if (words(i).length > word.length) {
      word = words(i)
      idx = i
    }

  (word, idx)
}

The purpose of this function is to find the longest word in the passed-in array, and return a tuple with that longest word, and its index in the array. You can see that in this function, we have two vars, one for the current longest word, and another for its index in the array. We then use a for expression to walk the array, reassigning word and idx when we find a longer word. This is very much like how you would write this in Java.

I decided to rewrite this function in a more functional style, just to see how my functional chops are coming along. Here’s what I ended up with:

// A more functional implementation
def longestWord(words: Array[String]) =
  (("", -1) /: words.zipWithIndex) {(old, cur) =>
    if (cur._1.length > old._1.length) cur
    else old
  }

First of all, notice how much shorter this function is than the first one. Also, notice that there is only a single expression in the function, so the outer curly braces aren’t necessary. What this expression is doing is calling zipWithIndex on the passed-in array, which results in an array of tuples containing each word and its index. We then call foldLeft using its operator name of /:, with its initial argument being a tuple with an empty string and -1 for an index. What foldLeft does is apply the function value passed to it to pairs of arguments. On the first pass, the arguments are what was passed in and the first element in the array. On the second iteration, the arguments are the result of the first pass and the second element in the array. This then continues through the entire array. What is returned after the final pass will be a tuple that contains the longest word in the array, and its index.

Now, I don’t claim to be a functional master or anything, but I think this is a decent illustration of how the functional style can reduce the lines of code, and the number of mutable variables, while making the code easier to read and understand.

Slides From My Presentation on Operator Overloading In Scala

Last night I spoke at the Atlanta Scala Enthusiats meeting about operator overloading and a little on implicit conversions. I think the talk went well as I got lots of really good questions from the audience, and they laughed at my jokes. This presentation grew out of a blog post I wrote a few months ago entitled Scala Gets Operator Overloading Right; I beefed it up and made some slides and more code samples. Incidentally, if you Google for “scala operator overloading” that blog post is the first result.

For those of you who weren’t there, here are my slides and the code samples that go with them. I wrote these samples against Scala 2.7.7.final. They should work with the latest Scala 2.8, but I haven’t verified this.

And here’s the source code: oopres.zip

What the Heck Is a Tuple, Anyway?

Yesterday I was talking with a friend about Scala and the subject of tuples came up. We both had a bit of a laugh that neither of us was sure how to pronounce it, though we both leaned toward TUH-ple instead of TOO-ple. Anyway, the utility of tuples in Scala was not immediately apparent to him, so I thought I’d take a whack at explaining it here.

A Tuple in Scala is an immutable container used for storing two or more objects, possibly of different types. While a List or Array can only store objects that all have the same type, Tuples can store objects of any type. The most common use of tuples is when you have a method that needs to return more than one value, but creating a class for that return value is more trouble than it’s worth. It’s true that for same-type objects you could return a List, and for different-type objects you could return a List[Any], but both of these have downsides, which we’ll discuss.

Let’s look at a very contrived example. Let’s say you created a function that takes a string and returns the starting index of the first numbers if finds and the numbers themselves. That code might look like this

def reFind(str: String) = {
	val re = """(d+)""".r

	val m = re findFirstMatchIn str

	m match {
		case Some(m) => (m.start, str.substring(m.start, m.end))
		case None => (0, "")
	}
}

(I’ve removed any error checking for brevity.) You can see here that we’re creating a regular expression that looks for one or more numbers grouped together. We then match that against the passed-in string. The matching method returns a Some[Match], so we pattern match against that to see if we actually got a match. If we did, we create a tuple with the starting index of the match, and the match itself, and return it. If not, we return a tuple with 0 for the starting index and an empty string.

Calling this function looks like this

scala> val t = reFind("foo 123 bar")
t: (Int, java.lang.String) = (4,123)

You can see that what was returned is something with type (Int, java.lang.String); that’s actually an instance of Scala’s Tuple2 class. (There’s a synonym for Tuple2, called Pair.)

Now that we have this tuple, what do we do with it? If you want to access the values it contains, you do it in a way that might seem a bit strange at first. To get at the elements, you could do this

scala> val i = t._1
i: Int = 4

scala> val m = t._2
m: java.lang.String = 123

There are two things to point out here. First, unlike Lists and Arrays, you don’t use the () notation. You use a method consisting of an underscore and the index of the part you want. Second, unlike Lists and Arrays, tuples are 1-based instead of 0-based. (According to Programming In Scala, this is a nod to Haskell and ML.) Also notice the types of the vals you are assigning. That’s one of the benefits of using a Tuple instead of something like List[Any]; you still get compile-time type safety. Had you instead written the function like this

def reFind(str: String) = {
	val re = """(d+)""".r

	val m = re findFirstMatchIn str

	m match {
		case Some(m) => List[Any](m.start, str.substring(m.start, m.end))
		case None => List[Any](0, "")
	}
}

and called it, look what happens when you try to store the Int index in a local variable

scala> val l = reFind("foo 123 bar")
l: List[Any] = List(4, 123)

scala> val i: Int = l(0)
<console>:10: error: type mismatch;
 found   : Any
 required: Int
       val i: Int = l(0)
                    ^

You would get a similar error trying to assign the String element to a local String val. That’s the major downside to using a List[Any]. (In the first example I used Scala’s type inference to set the types of the local variables; this time I wanted to be explicit to show the failure.)

As I mentioned earlier, you could define a class just to handle the return values of this function. There is nothing wrong with that solution, and some will find it superior to using a tuple, because you can assign meaningful names to the elements. You could define it like this

class ReResult(val index: Int, val part: String)

def reFind(str: String) = {
	val re = """(d+)""".r

	val m = re findFirstMatchIn str

	m match {
		case Some(m) => new ReResult(m.start, str.substring(m.start, m.end))
		case None => new ReResult(0, "")
	}
}

and call it like this

scala> val l = reFind("foo 123 bar")
l: ReResult = ReResult@57c52e72

scala> val i: Int = l.index
i: Int = 4

scala> val m: String = l.part
m: String = 123

If you think this is more maintainable, then by all means, use it. If you just want to easily return more than one value from a function, then consider using a tuple.

Another point on tuples is that you can assign all the elements of a tuple to local variables in a single step, rather than using multiple calls. So this is equivalent to all the assignments from the earlier examples

scala> val (i: Int, m: String) = l
i: Int = 4
m: String = 123

Depending on what you’re doing, this could be a useful way to get at the elements.

And one more thing. There are tuple classes that range from two elements all the way up to twenty-two. The classes are named Tuple2, Tuple3 … Tuple22. The () notation for creating tuples applies all the way up to twenty-two arguments, so you rarely need to actually use the class names. For example,

scala> val t = (23, "foo", 18.0)
t: (Int, java.lang.String, Double) = (23,foo,18.0)

scala> t.getClass
res31: java.lang.Class[_] = class scala.Tuple3

scala> val t1 = ('a', "quick", 23, "year-old", """foxy""".r, List(1, 2, 3))
t1: (Char, java.lang.String, Int, java.lang.String, scala.util.matching.Regex, List[Int]) = (a,quick,23,year-old,foxy,List(1, 2, 3))

scala> t1.getClass
res32: java.lang.Class[_] = class scala.Tuple6

I’m not going to provide an example of creating a Tuple22; that is left as an exercise. 🙂 I would argue that if you need more than three elements, you really should define a class to hold them. I think that beyond three elements it gets difficult to keep them straight. Tuples are great for holding two or three pieces of information, but don’t go crazy with them.

2 Solutions To Project Euler Problem #1

In an effort to not go a whole month without blogging, and in the interest of posting some code samples, I give you two solutions to Project Euler: Problem #1. If you’ve never heard of it, Project Euler (pronounced “oiler” after the Swiss mathematician Leonhard Paul Euler) is a set of increasingly difficult programming challenges. Participants can write their programs in any language and the only goal is to solve the problems and learn something. There are no prizes and you don’t have to show your work.

I had looked at this project years ago, and I swear I thought I had already solved some of them, but maybe I only thought about doing it. Anyway, I have two solutions to the first problem; one in Groovy and the other in Scala. Here, then, is how Project Euler describes the problem

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

So, the goal is to take a series of numbers from 0 to 1000, exclusive, find all the numbers divisible by 3 or 5 and add them together. Here’s the Groovy solution.

def subList = (0..<1000).findAll {it % 3 == 0 || it % 5 == 0} 

def sum = subList.inject(0) {i, sum -> sum + i}

println "Sum = ${sum}"

This is a very simple program, and I could have written it as a one-line program. I broke it up into a few lines for clarity. As you can see, the first line creates an exclusive range from 0 to 1000. It then calls the findAll method on that range, passing in a closure that will return true if the passed-in digit from the range, called “it” here, is evenly divisible by 3 or 5. The result of findAll is another collection, containing only those values that passed the divisibility test. We then take that list, passing 0 into the inject method, which will neatly sum the values up and return that value. Easy peasy.

Now here’s the Scala version. You’ll notice it is very similar to the Groovy solution.

val subList = for {
    i <- List.range(0, 1000)
    if i % 3 == 0 || i % 5 == 0
} yield i 

val sum = subList.foldLeft(0) {(i, j) => i + j}

println("Sum = " + sum)

I used a sequence comprehension to generate the sublist here. The bit beginning with “for” generates an exclusive range from 0 to 1000, which is then iterated over, assigning each value to “i”. Then, if it is divisible by 3 or 5, it is yielded up by the comprehension. The result is a collection of just those numbers that we want, assigned to the val called subList. We then call foldLeft on that sublist, doing exactly what we did in the Groovy solution. Again, pretty simple.

Now, I could have solved this one in an almost identical fashion to the Groovy solution by using the filter method of lists, but I wanted to solve it first using a list comprehension. Here is the second solution

val subList = List.range(0, 1000) filter {i => i % 3 == 0 || i % 5 == 0}

val sum = subList.foldLeft(0) {(i, j) => i + j}

println("Sum = " + sum)

I timed the solutions and all three finished in just over a second. The second Scala solution seemed ever-so-slightly faster than the other two.

As I get time, I will work on additional problems from the site and post the answers here. I don’t know that I’ll always do solutions in two languages, but I might.