Scala’s Nice Regex Class

I’m a big fan of regular expressions, because they let you parse text in very concise, and sometimes complicated, ways. Though I agree with jwz about regular expressions in lots of cases, I still use them frequently. Perl was the first language that really let me use regex, way back in 1991. After that I used various implementations in C, Python, Ruby, Java and various other languages. While I was glad that Java 5 finally added regex support, I was disappointed at the implementation. It’s kind of clunky, and because Java doesn’t have regex syntax baked into the language, and no support for “raw” strings, you end up with a regex with twice as many backslashes as necessary.

Last night while reading Programming In Scala, I came upon the discussion of Scala’s regex class. Scala has a raw string, so you have exactly as many backslashes in your pattern as necessary, but each regex you create also defines a Scala extractor, so you can easily bind local variables to groups within the expression.

First, let’s look at the Java code.

[java]Pattern emailParser = Pattern.compile("([\w\d\-\_]+)(\+\d+)?@([\w\d\-\.]+)");

String s = "zippy@scalaisgreat.com";

Matcher m = emailParser.matcher(s);

if (m.matches())
{
String name = m.group(1);
String num = m.group(2);
String domain = m.group(3);

System.out.printf("Name: [%s], Num: [%s], Domain: [%s]n", name, num, domain);
}[/java]

That’s pretty simple, though the double-backslashes really annoy me. Running this code results in the local variables name and domain getting assigned parts of the email address. The variable called num is assigned null, because the email address didn’t contain a plus sign followed by a number. Now, here’s the same program in Scala.

[scala highlight=”1, 5″]val EmailParser = """([wd-_]+)(+d+)?@([wd-.]+)""".r

val s = "zippy@scalaisgreat.com"

val EmailParser(name, num, domain) = s

printf("Name: %s, Domain: %sn", name, domain)[/scala]

In line 1, we are calling the “r” method on a raw string. This method converts the String into a Regex and returns it. We’re then assigning it to a local val called EmailParser. Also notice line 5. In that one line, we are declaring three local vals and assigning them whatever the groups in the regex matched, or null if they didn’t match. Just like with the Java example, num will be null since there was no plus sign followed by a number. If you change the email address in either example to something like “zippy+23@scalaisgreat.com”, then all three variables will be assigned parts of the string.

Do you have to have this level of support to do regex? No. Does it make things a lot nicer? Indeed.

Now, I just discovered that in Scala if the regex doesn’t match at all, then a MatchError is thrown. That’s sort of a bummer, because it means you’ll have to add a try/catch around your code. Still, I like the extractor syntax that binds regex groups to local variables in one step.

You can see some more examples of regex in Scala over here.

5 thoughts on “Scala’s Nice Regex Class

  1. This is pretty cool, though I still can’t get over the lack of support of regex literals (especially given the support for XML literals).

    Even still, I think Ruby and PERL have much cleaner support for regex.

    • Yeah, I wish Scala had literal regex syntax. Then I could have written val EmailParser = /([wd-_]+)(+d+)?@([wd-.]+)/ which would be even nicer.

  2. One thing that is not perfect is that you name the variable starting with upper case letter. It’s like doing variable and type definition at the same time. The code is very nice but it just feels a little not correct. Though maybe using it longer will make it feel more natural to me.

    • Matt, I agree with you, it does feel a bit strange. But because extractors mimic the matching abilities of case classes, I guess that’s why they decided to present it that way. However, it’s just a convention, so you don’t have to use the uppercase first letter. In my example, emailParser would have worked just like EmailParser.

  3. > Also notice line 5. In that one line, we are declaring three local vals and assigning them whatever the groups in the regex matched, or null if they didn’t match.

    Is it just me, or is someone else too wondering why you don’t get an Option object back when a group doesn’t match? I presume there’s a good reason for this, so if somebody knows please enlighten me.

Comments are closed.