If you happened to read my post from the other day entitled My New “Top Artists Last 7 Days” Widget, you know that I went through three iterations of getting it going. The final solution, written in Ruby, worked well. Until bands like Motörhead, Mötley Crüe and Einstürzende Neubauten showed up in the list. At that point, the HTML parsing library I was using would barf and processing would stop, leaving the list on the blog in an incomplete state. It wasn’t the library’s fault; apparently Ruby still has problems dealing with non-ASCII characters. I did everything I thought I needed to do to tell Ruby that it would be dealing with UTF-8 encoding, but it just kept right on barfing.
I was left with only two choices: stop listening to any band with an umlaut in the name (and God help me if any of my Scandinavian bands popped up, with the Ø or å characters), or rewrite the stupid program, again, in a language that I knew could easily deal with UTF-8.
Since I’ve been working in Clojure a lot lately, it seemed like the logical choice. I spent about an hour working on it last night, and I ended up with a working program and a bit more Clojure experience. Here’s the program for your edification, with a description to follow:
(ns lastfmfetch.core
  (:gen-class))

(require '[clj-http.client :as client])
(import '(java.io PrintStream)
        '(org.htmlcleaner HtmlCleaner))

(defn get-artist-and-playcount [cell]
  (let [title (.getAttributeByName cell "title")
        [match artist playcount] (re-matches #"^(.+), played ([\w\d]+)\s*\S*$" title)
        playcountStr (if (= playcount "once") "1" playcount)]
    [artist playcountStr]))

(defn get-url [cell]
  (let [links (.getElementsByName cell "a" true)
        a (first links)
        href (.getAttributeByName a "href")]
    (str "http://last.fm" href)))

(defn fetch-data [filename]
  (let [response (client/get "http://www.last.fm/user/joeyGibson/charts?rangetype=week&subtype=artists")
        cleaner (HtmlCleaner.)]
    (if (= (:status response) 200)
      (with-open [out (PrintStream. filename "UTF-8")]
        (.println out "<html><head><meta charset=\"UTF-8\"/></head><body><ol>")
        (doto (.getProperties cleaner)
          (.setOmitComments true)
          (.setPruneTags "script,style"))
        (when-let [node (.clean cleaner (:body response))]
          (let [subjectCells (take 5 (.getElementsByAttValue node "class" "subjectCell" true true))]
            (doseq [cell subjectCells]
              (let [[artist playcount] (get-artist-and-playcount cell)
                    url (get-url cell)]
                (.println out (str "<li><a href='" url "'>" artist "</a>, Plays: " playcount "</li>"))))))
        (.println out "</ol></body></html>")))))

;; Main
(defn -main [& args]
  (if (< (count args) 1)
    (println "Usage: lastfmfetch <output_file>")
    (fetch-data (first args))))
I ended up using a library called clj-http to handle the fetching of the URL. It’s a Clojure wrapper for the Apache HttpComponents client library, and was really easy to use. I’m using Leiningen, by the way, so including clj-http was just a matter of adding a line to the project.clj file. I also used a Java library called HtmlCleaner, which fixes broken HTML and makes it available as a DOM. Since it is also in Maven Central, it was easy to include by adding another line to the project file.
(defproject lastfmfetch "1.0.0-SNAPSHOT"
  :description "Fetch chart data from Last.fm"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [net.sourceforge.htmlcleaner/htmlcleaner "2.2"]
                 [clj-http "0.2.6"]]
  :main lastfmfetch.core)
The -main function begins on line 38, but all it really does is check that there is at least one command-line argument, and print a usage message if there is not. Otherwise, it calls the fetch-data function, which begins on line 20.
On line 21, we declare two locals: one that will contain the result of fetching the web page, and one that is the HTML cleaner. If the fetch of the URL was successful, the status code will be the standard HTTP 200. If we got that, we then open a PrintStream on the filename given, specifying that it should be encoded with UTF-8. (I’ve been working with Java for a very long time, and I always assumed that since Java strings are Unicode, files created with Java would default to UTF-8. That is not the case. That’s why there’s a second argument when creating the PrintStream, and why I’m not using a PrintWriter.) We then print the first part of the output HTML file, set a couple of options on HtmlCleaner that cause it to strip comments, style and script sections from the HTML, and then start doing the real work.
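To make the encoding point concrete, here is a minimal sketch that uses only java.io, so it runs without the project’s dependencies. The temp file and the names tmp, artist and round-tripped are just for illustration; they aren’t part of the program above.

```clojure
(import '(java.io File PrintStream))

;; Hypothetical scratch file, used only for this demonstration.
(def tmp (File/createTempFile "utf8-demo" ".html"))

;; "Motörhead", written with a \u escape so that the encoding of this
;; source file itself doesn't matter.
(def artist "Mot\u00f6rhead")

;; The second constructor argument names the charset explicitly, so the
;; bytes written do not depend on the platform's default encoding.
(with-open [out (PrintStream. (.getPath tmp) "UTF-8")]
  (.println out (str "<li>" artist "</li>")))

;; Reading the file back as UTF-8 should round-trip the umlaut.
(def round-tripped (slurp tmp :encoding "UTF-8"))
```

If you drop the "UTF-8" argument, the PrintStream falls back to whatever the platform default is, which is exactly the assumption that bit me.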
On line 29, we declare a local called node that will contain the output of HtmlCleaner if it successfully parsed and cleaned the HTML. That’s what when-let does: it binds the local and executes its body only if the expression returns something truthy. If it doesn’t, the rest of the code is skipped. We then take the first five elements from the HTML that have an attribute called “class” with a value of “subjectCell”. These are table cells. We then loop over them, extracting the artist and playcount values, and the URL. We do these things in two separate functions.
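If when-let is new to you, this small sketch shows the behavior in isolation. The parse-or-nil function is a made-up stand-in for a call like .clean that might return nil; it has nothing to do with HtmlCleaner itself.

```clojure
;; Hypothetical stand-in for a parser call that may return nil.
(defn parse-or-nil [s]
  (when (seq s)
    (.toUpperCase s)))

(defn describe [s]
  ;; node is bound, and the body runs, only when parse-or-nil
  ;; returns something truthy; otherwise when-let returns nil.
  (when-let [node (parse-or-nil s)]
    (str "parsed: " node)))

(describe "abc") ; body runs, returns "parsed: ABC"
(describe "")    ; body is skipped, returns nil
```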
The function called get-artist-and-playcount, starting on line 8, takes the table cell as input. It then gets the attribute called “title” and uses a regular expression to pull out the artist and playcount values. If the playcount is the word “once,” it converts it to a 1, so all the values are numeric. It then returns the two values as a vector.
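Here is the same regular expression exercised on plain strings, without HtmlCleaner in the picture. The sample titles are my guess at what the “title” attribute looked like, based on what the pattern expects; treat them as assumptions rather than captured data. The parse-title name is also just for this sketch.

```clojure
(defn parse-title [title]
  ;; re-matches returns [whole-match group1 group2] on success, or nil.
  ;; Destructuring nil just leaves artist and playcount as nil.
  (let [[_ artist playcount] (re-matches #"^(.+), played ([\w\d]+)\s*\S*$" title)]
    [artist (if (= playcount "once") "1" playcount)]))

(parse-title "Motörhead, played 12 times") ; => ["Motörhead" "12"]
(parse-title "Mötley Crüe, played once")   ; => ["Mötley Crüe" "1"]
(parse-title "no match here")              ; => [nil nil]
```

Note that a title that doesn’t fit the pattern comes back as [nil nil] rather than throwing, so a caller that cares would need to check for that.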
The function called get-url, starting on line 14, also takes the table cell as input. It then gets all the “a” elements from the cell (there’s only one), and then gets the “href” attribute’s value, which is the URL.
Back at line 34, we take the three values we extracted with the two support functions and concatenate them into the HTML for a single item in an ordered list. We then output all the necessary closing tags to make the HTML well-formed, and we’re done.
While the Clojure code is a bit more dense than the Ruby code, it’s actually four lines shorter. And it handles Unicode characters, which makes me happy.