First up is Unicode escapes. Both Java and Scala support encoding Unicode characters as hexadecimal sequences:
val arrow = "\u2190"
You might also be aware that this encoding isn't a feature only of string or character literals (as is the case for "\n" and "\r" etc), but can also occur in any position in the source file. The translation from encoded to unencoded characters is conceptually the first stage in parsing Scala, performed before the characters are recognised as tokens. For example, the following two snippets encode the same valid Scala statement:
val x = 42
\u0076\u0061\u006c\u0020\u0078\u0020\u003d\u0020\u0034\u0032
What's even stranger is that you can add an arbitrary number of "u"s to each escape sequence, and the compiler still happily accepts it:
\u0076\uu0061\uuu006c\uuuu0020\uuuuu0078\uuuuuu0020\uuuuuuu003d\uuuuuuuu0020\uuuuuuuuuu0034\uuuuuuuuuuu0032
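To make that pre-tokenisation stage concrete, here is a minimal sketch of such a decoding pass. It is not the Scala compiler's actual code: the names `UnicodeEscapes` and `decode` are made up for illustration, and the rule that a backslash only starts an escape when an even number of backslashes precede it is taken from the Java spec. The source deliberately avoids writing a backslash directly followed by "u", even in comments, since the translation described above would apply to this code too.

```scala
object UnicodeEscapes {
  private def isHex(c: Char): Boolean = Character.digit(c, 16) >= 0

  /** Decodes backslash-u escapes in `src`, accepting one or more 'u's per escape. */
  def decode(src: String): String = {
    val out = new StringBuilder
    var i = 0
    var backslashes = 0 // contiguous backslashes seen immediately before position i
    while (i < src.length) {
      val c = src.charAt(i)
      val startsEscape =
        c == '\\' && backslashes % 2 == 0 &&
          i + 1 < src.length && src.charAt(i + 1) == 'u'
      if (startsEscape) {
        // Skip however many 'u's follow the backslash.
        var j = i + 1
        while (j < src.length && src.charAt(j) == 'u') j += 1
        if (j + 4 <= src.length && src.substring(j, j + 4).forall(isHex)) {
          out.append(Integer.parseInt(src.substring(j, j + 4), 16).toChar)
          i = j + 4
          backslashes = 0 // a decoded character never takes part in a later escape
        } else {
          // Malformed escape: a real compiler would report an error here;
          // this sketch just copies the backslash through unchanged.
          out.append(c)
          i += 1
          backslashes = 1
        }
      } else {
        out.append(c)
        backslashes = if (c == '\\') backslashes + 1 else 0
        i += 1
      }
    }
    out.toString
  }
}
```

Feeding it the ten escapes from the snippet above, with anywhere from one to ten "u"s each, gives back the plain text of `val x = 42`.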
Cool. So, what is the purpose of this feature? Well, Scala borrowed it from Java, and Java's excuse is documented in the spec:
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
I can perhaps imagine the intention in a Java 1.0 world. Maybe you've got some source processing that only works with ASCII. You can encode your source as above, pump it through your tool, and then decode it again, and your Unicode characters and Unicode escapes come through the process unscathed.
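The round trip the spec describes can be sketched in a few lines of Scala. This is an illustration under assumptions, not a tool that ships with Java or Scala: `AsciiRoundTrip`, `toAscii` and `fromAscii` are hypothetical names, and for brevity the regex ignores the spec's rule that a backslash preceded by an odd number of backslashes does not start an escape.

```scala
import java.util.regex.Matcher
import scala.util.matching.Regex

object AsciiRoundTrip {
  // A backslash, one or more 'u's, then exactly four hex digits.
  private val Escape: Regex = """\\(u+)([0-9a-fA-F]{4})""".r

  /** Unicode source to ASCII-only source, as described in the quoted spec text. */
  def toAscii(src: String): String = {
    // Step 1: give every escape already in the source one extra 'u'.
    val marked = Escape.replaceAllIn(src, m =>
      Matcher.quoteReplacement("\\" + m.group(1) + "u" + m.group(2)))
    // Step 2: replace each non-ASCII character with a single-'u' escape.
    marked.toSeq.map { c =>
      if (c < 128) c.toString else "\\" + f"u${c.toInt}%04x"
    }.mkString
  }

  /** Inverse of `toAscii`: restores the exact original Unicode source. */
  def fromAscii(ascii: String): String =
    Escape.replaceAllIn(ascii, m => {
      val us = m.group(1)
      val restored =
        if (us.length > 1) "\\" + us.drop(1) + m.group(2)     // drop one 'u'
        else Integer.parseInt(m.group(2), 16).toChar.toString // single 'u': decode
      Matcher.quoteReplacement(restored)
    })
}
```

Because every pre-existing escape gains a "u" on the way out, a single-"u" escape in the ASCII form can only have come from a real non-ASCII character. That is what lets `fromAscii` tell the two cases apart and return the original text exactly.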
I'd be interested to know if anyone has ever used this technique in Java; I'd certainly wager that nobody has or will ever use it for Scala. In my humble opinion, it's something of a misfeature in 2013 — creating puzzlers, and complicating the implementation of language tools.