Friday, 26 April 2013

Secrets of the Scala Lexer 1: \uuuuunicode

As part of developing Scalariform, I ended up writing my own Scala parser and lexer. I bumped into a couple of quirky features in the syntax that I might share in this occasional blog series.

First up is Unicode escapes. Both Java and Scala support encoding Unicode characters as hexadecimal sequences:

val arrow = "\u2190"

You might also be aware that this encoding isn't a feature only of string or character literals (as is the case for "\n" and "\r" etc), but can also occur in any position in the source file. The translation from encoded to unencoded characters is conceptually the first stage in parsing Scala, performed before the characters are recognised as tokens. For example, the following two snippets encode the same valid Scala statement:

val x = 42
\u0076\u0061\u006c\u0020\u0078\u0020\u003d\u0020\u0034\u0032

What's even more strange is that you can add arbitrary number of "u"s to your escape sequence, still happily accepted by the compiler:

\u0076\uu0061\uuu006c\uuuu0020\uuuuu0078\uuuuuu0020\uuuuuuu003d\uuuuuuuu0020\uuuuuuuuuu0034\uuuuuuuuuuu0032

Cool. So, what is the purpose of this feature? Well, Scala borrowed it from Java, and Java's excuse is documented in the spec:
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
I can perhaps imagine the intention in a Java 1.0 world. Maybe you've got some source processing that only works with ASCII. You can encode your source as above, pump it through your tool, and then unencode it, and you've got your unicode characters and unicode escapes preserved unscathed by the process.

I'd be interested to know if anyone has ever used this technique in Java; I'd certainly wager that nobody has or will ever use it for Scala. In my humble opinion, it's something of a misfeature in 2013 — creating puzzlers, and complicating the implementation of language tools.


No comments:

Post a Comment