Once More, This Time With Clojure

If you happened to read my post from the other day entitled My New “Top Artists Last 7 Days” Widget, you know that I went through three iterations of getting it going. The final solution, written in Ruby, worked well, until bands like Motörhead, Mötley Crüe and Einstürzende Neubauten showed up in the list. At that point, the HTML parsing library I was using would barf, and processing would stop, leaving the list showing on the blog in an incomplete state. It wasn’t the library’s fault; apparently Ruby still has problems dealing with non-ASCII characters. I did everything I thought I needed to do to tell Ruby that it would be dealing with UTF-8 encoding, but it just kept right on barfing.

I was left with only two choices: stop listening to any band with an umlaut in the name (and God help me if any of my Scandinavian bands popped up, with the Ø or å characters), or rewrite the stupid program, again, in a language that I knew could easily deal with UTF-8.

Since I’ve been working in Clojure a lot lately, it seemed like the logical choice. I spent about an hour working on it last night, and I ended up with a working program and a bit more Clojure experience. Here’s the program for your edification, with a description to follow:

(ns lastfmfetch.core
  (:gen-class))

(require '[clj-http.client :as client])
(import '(java.io PrintStream)
        '(org.htmlcleaner HtmlCleaner))

(defn get-artist-and-playcount [cell]
  (let [title (.getAttributeByName cell "title")
        [match artist playcount] (re-matches #"^(.+), played ([\w\d]+)\s*\S*$" title)
        playcountStr (if (= playcount "once") "1" playcount)]
    [artist playcountStr]))

(defn get-url [cell]
  (let [links (.getElementsByName cell "a" true)
        a (first links)
        href (.getAttributeByName a "href")]
    (str "http://last.fm" href)))

(defn fetch-data [filename]
  (let [response (client/get "http://www.last.fm/user/joeyGibson/charts?rangetype=week&subtype=artists")
        cleaner (HtmlCleaner.)]
    (if (= (:status response) 200)
      (with-open [out (PrintStream. filename "UTF-8")]
        (.println out "<html><head><meta charset=\"UTF-8\"/></head><body><ol>")
        (doto (.getProperties cleaner)
          (.setOmitComments true)
          (.setPruneTags "script,style"))
        (when-let [node (.clean cleaner (:body response))]
          (let [subjectCells (take 5 (.getElementsByAttValue node "class" "subjectCell" true true))]
            (doseq [cell subjectCells]
              (let [[artist playcount] (get-artist-and-playcount cell)
                    url (get-url cell)]
                (.println out (str "<li><a href='" url "'>" artist "</a>, Plays: " playcount "</li>"))))))
        (.println out "</ol></body></html>")))))

;; Main
(defn -main [& args]
  (if (< (count args) 1)
    (println "Usage: lastfmfetch <output_file>")
    (fetch-data (first args))))

I ended up using a library called clj-http to handle the fetching of the URL. It’s a Clojure wrapper for the Apache HttpComponents client library, and was really easy to use. I’m using Leiningen, by the way, so including clj-http was just a matter of including a line in the project.clj file. I also used a Java library called HtmlCleaner, which fixes broken HTML and makes it available as a DOM. Since it is also in Maven Central, it was easy to include by adding another line to the project file.

(defproject lastfmfetch "1.0.0-SNAPSHOT"
  :description "Fetch chart data from Last.fm"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [net.sourceforge.htmlcleaner/htmlcleaner "2.2"]
                 [clj-http "0.2.6"]]
  :main lastfmfetch.core)

The -main function begins on line 38, but all it really does is check that there is at least one command-line argument, and print a usage message if there is not. Otherwise, it calls the fetch-data function, which begins on line 20.

On line 21, we declare two locals: one that will contain the results of fetching the web page, and one that is the HTML cleaner. If the fetch of the URL was successful, the status code will be the standard HTTP 200. If we got that, we then open a PrintStream on the filename given, specifying that it should be encoded with UTF-8. (I’ve been working with Java for a very long time, and I always assumed that since Java strings are Unicode, files created with Java would default to UTF-8. That is not the case; without an explicit encoding, Java falls back to the platform default. That’s why there’s a second argument when creating the PrintStream, and why I’m not using a PrintWriter.) We then print the first part of the output HTML file, set a couple of options on HtmlCleaner that cause it to strip comments, style and script sections from the HTML, and then start doing the real work.
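For comparison with the Ruby scripts from the earlier post, here is a minimal sketch of the same idea in Ruby: name the output encoding explicitly instead of trusting the default. It assumes Ruby 2.0 or later (where source files default to UTF-8); the `write_chart` helper and the file name are made up for illustration and are not part of the original scripts.

```ruby
require "tmpdir"

# Hypothetical helper: write each line to `path`, forcing UTF-8 on output.
def write_chart(path, lines)
  # The "w:UTF-8" mode string sets the file's external encoding explicitly.
  File.open(path, "w:UTF-8") do |out|
    lines.each { |line| out.puts(line) }
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "artists.html")
  write_chart(path, ["<li>Motörhead</li>", "<li>Mötley Crüe</li>"])

  # Read the file back as UTF-8 and check that the bytes are well formed.
  text = File.read(path, encoding: "UTF-8")
  puts text.valid_encoding?
end
```

This only covers the writing side. The trouble described above was in parsing the fetched HTML, which this sketch does not touch.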

On line 29, we declare a local called node that will contain the output of HTML Cleaner if it successfully parsed and cleaned the HTML. That’s what when-let does; it assigns the local as long as the function returns something truthy and then executes its body. If that function doesn’t return something truthy, the rest of the code is skipped. We then take the first five elements from the HTML that have an attribute called “class” with a value of “subjectCell”. These are table cells. We then loop over them, extracting the artist and playcount value, and the URL. We do these things in two separate functions.

The function called get-artist-and-playcount, starting on line 8, takes the table cell as input. It then gets the attribute called “title” and uses a regular expression to pull out the artist and playcount values. If the playcount is the word “once,” it converts it to a 1, so all the values are numeric. It then returns the two values as a vector.
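The earlier Ruby script only matched titles of the form “played N times,” so a title ending in “played once” would have slipped through. As a rough illustration of what the Clojure function does, here is the same parsing step sketched in Ruby. The helper name `artist_and_playcount` is hypothetical, and the sample titles are assumptions based on the pattern, not captured Last.fm output.

```ruby
# Hypothetical Ruby counterpart of get-artist-and-playcount.
# Returns [artist, playcount] as strings, or nil if the title does not match.
def artist_and_playcount(title)
  match = /\A(.+), played (\w+)\s*\S*\z/.match(title)
  return nil unless match

  artist, playcount = match.captures
  # Normalize the word "once" so every play count is numeric.
  playcount = "1" if playcount == "once"
  [artist, playcount]
end

p artist_and_playcount("Frank Sinatra, played 42 times")
p artist_and_playcount("Frank Sinatra, played once")
```

Returning nil for a non-matching title is a choice made for this sketch; the Clojure version simply destructures whatever `re-matches` returns.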

The function called get-url, starting on line 14, also takes the table cell as input. It then gets all the “a” elements from the cell (there’s only one), and then gets the “href” attribute’s value, which is the URL.

Back at line 34, we take the three values we extracted with the two support functions and concatenate them into HTML that will be a single line in an ordered list. We then output all the necessary closing tags to make the HTML well-formed, and we’re done.
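To make the output step concrete, here is a small Ruby sketch of the same assembly: keep the first five rows and wrap each one in a list item. The `render_chart` helper and the sample rows are invented for illustration. Like the program above, it does no HTML escaping.

```ruby
# Hypothetical helper: turn [artist, url, playcount] rows into the list markup,
# keeping only the first `limit` rows.
def render_chart(rows, limit = 5)
  header = "<html><head><meta charset=\"UTF-8\"/></head><body><ol>"
  footer = "</ol></body></html>"

  items = rows.first(limit).map do |artist, url, playcount|
    "<li><a href='#{url}'>#{artist}</a>, Plays: #{playcount}</li>"
  end

  ([header] + items + [footer]).join("\n")
end

# Made-up sample data, only to show the shape of the input.
rows = [
  ["Artist One", "http://last.fm/music/Artist+One", "12"],
  ["Artist Two", "http://last.fm/music/Artist+Two", "1"]
]
puts render_chart(rows)
```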

While the Clojure code is a bit more dense than the Ruby code, it’s actually four lines shorter. And it handles Unicode characters, which makes me happy.


My New “Top Artists Last 7 Days” Widget

Note Redux: I changed my approach, yet again. Scroll farther down to see the latest.

Note: I changed my approach on this, so scroll down to see how I’m doing it now.

I’ve been wanting a widget or an auto-post on the blog for a while that would show my most-listened-to bands over the previous week. Tumblr users have had something like this for a while, and there were efforts to do this for WordPress before, but they either don’t seem to work with the latest versions of WP, or they only pulled top tracks (not artists), or they pulled album covers, instead of text. All of that is to say that I couldn’t find anything pre-made to use.

So, I had to roll my own. I did so in about 10 minutes using the PHP Code Widget and the script on this page. The only drawback to this is you have to get a developer account with last.fm, but it’s free, so no big deal there. I installed the PHP Code Widget, then pasted the script into a new widget. The only changes I had to make were to replace the appropriate bits in the script with my info, and to escape a couple of double-quotes. Now if you look down the right side of the blog, below the Twitter and Facebook links, you’ll see a rolling record of my top-artists. In case you were wondering what I’ve been listening to. 🙂

The only thing I’m not sure about is how this will work with the two levels of caching I use (WP Super Cache and Cloudflare). I suppose we’ll see in the next few days, eh?

11/23/2011 Update: I decided that I didn’t like the way I was doing this, for a couple of reasons. First, each time someone viewed the page, it would be making a call to Last.fm for my stats. This is too often. Also, the values returned using the developer API were at odds with what you can get just going through the web. So what I did was write a Ruby script to pull the feed once a day, parse it and output HTML to a file. I then used the PHP Code Widget to include it. Far simpler, in my opinion.

Here’s the Ruby code:

[ruby]
#!/usr/bin/ruby

require 'rexml/document'
require 'open-uri'

include REXML

open("http://ws.audioscrobbler.com/2.0/user/joeyGibson/weeklyartistchart.xml") do |http|
  response = http.read
  doc = REXML::Document.new response

  index = 0

  File.open(ARGV[0], "w") do |out|
    out.write("<html><head>\n")
    out.write("<meta charset=\"UTF-8\"/>\n")
    out.write("<body><ol>\n")

    doc.elements.each("weeklyartistchart/artist") do |artist|
      break if index == 5

      out.write "<li><a href=\"#{artist.elements['url'].text}\">#{artist.elements['name'].text}</a>, Plays: #{artist.elements['playcount'].text}</li>\n"

      index += 1
    end

    out.puts("</ol></body></html>\n")
  end
end
[/ruby]

and here’s the PHP that loads it:

[php]
<?php include("/tmp/artists.html"); ?>
[/php]

That’s it.
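The post does not say how the once-a-day run was scheduled (cron is a likely candidate). As an alternative sketch, the script itself could check the age of the output file and only refetch when it is more than a day old. The `stale?` helper below is hypothetical and not part of the original script.

```ruby
require "tmpdir"

# Hypothetical helper: true when the cached file is missing or older than
# `max_age` seconds (one day by default).
def stale?(path, max_age = 24 * 60 * 60, now = Time.now)
  !File.exist?(path) || (now - File.mtime(path)) > max_age
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "artists.html")
  puts stale?(path)              # no file yet, so a fetch is needed
  File.write(path, "<ol></ol>")
  puts stale?(path)              # just written, so it can be reused
end
```

With a guard like this, running the script more often than once a day would still result in at most one call to Last.fm per day.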

11/26/2011 Update: Well, I’ve changed it again. I discovered that the RSS feed I was pulling is not updated with any sort of frequency. It certainly doesn’t represent the “last seven days” as it claims to. At any rate, it differs greatly from what Last.fm shows on the web. So I decided to grab the HTML and pull out the interesting bits. I wrote another Ruby script, this time using Hpricot to parse the HTML, which took about 10 minutes. So now, what you see on the right should be the current values for the “last seven days.” Here’s the latest script:

[ruby]
#!/usr/local/bin/ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'

open("http://www.last.fm/user/your-username-here/charts?rangetype=week&subtype=artists") do |http|
  doc = Hpricot.parse(http.read)

  count = 0

  File.open(ARGV[0], "w") do |out|
    out.write("<html><head>\n")
    out.write("<meta charset=\"UTF-8\"/>\n")
    out.write("<body><ol>\n")

    doc.search("td[@class=subjectCell]").each do |subjectCell|
      break if count == 5

      artistString = subjectCell.get_attribute("title")

      artistString =~ /^(.+), played (\d+) times$/
      artist = $1
      playCount = $2

      subjectCell.search("a").each do |a|
        url = a.get_attribute("href")
        url = "http://last.fm#{url}"

        str = "#{artist}, #{url}, #{playCount}"

        out.write "<li><a href=\"#{url}\">#{artist}</a>, Plays: #{playCount}</li>\n"
      end

      count += 1
    end

    out.puts("</ol></body></html>\n")
  end
end
[/ruby]

I’m hopeful this is the last change.

My Varied Musical Tastes

I use the last.fm application on my Mac to “scrobble” what songs I’m listening to. This allows me to keep a record of what’s been playing, but it mostly allows me to be a complete exhibitionist and show “the world” what sort of music I like. When I indicate that I “love” a song, that even shows up on my Friendfeed page, for added coverage.

This morning, I happened to be on my last.fm page and I noticed the graph below, showing my top 15 artists by the number of songs I’ve played by them.

[Image: chart of my top 15 artists on last.fm]

Pretty interesting, eh? I’ve always had diverse musical tastes, but Frank is still the man. If you go to my page, the little arrow icon indicates that you can listen to some of that artist’s work. Shame on them for not having anything by Eilen Jewell!