Monday 29 April 2013

Secrets of the Scala Lexer 2: Blank Lines in Comments

Consider the following Scala snippet:
@tailrec
/*
 Comment 1
 Comment 2
 */
def foo: Int = ???
This compiles correctly. Consider what happens if we insert a blank line between Comment 1 and 2:
@tailrec
/*
 Comment 1

 Comment 2
 */
def foo: Int = ???
This time, we get a compile error ("expected start of definition"). So why is it that we can get syntax errors based solely on whether there is a blank line inside a multi-line comment?

The gory details can be found in the Scala Language Specification §1.2, but the summary is:
  • To support semicolon inference, newline characters are sometimes interpreted as a special newline token called "nl". The rules for when this occurs are moderately complex, but end up working quite intuitively in practice.
  • Two nl tokens can be inserted by the compiler in the following case: "if two tokens are separated by at least one completely blank line (i.e. a line which contains no printable characters), then two nl tokens are inserted."
  • At certain places in the syntax, an optional single nl token is accepted -- this includes after an annotation. This is also done to support semicolon inference.
  • However, two nl tokens are not permitted in some places (including after an annotation). I believe the intention is that a blank line is a clear sign that the code after it should be treated as separate from the code before it.
So by adding a completely blank line inside the comment, two nl tokens are inserted instead of one, as per the rule above, and two nl tokens are not permitted by the syntax after an annotation.
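As an aside, the comment isn't essential to trigger this: if my reading of the spec is right, a completely blank line directly between an annotation and the definition it annotates trips the same rule (bar and baz below are just illustrative names):
@tailrec
def bar: Int = ??? // a single newline after the annotation is accepted

@tailrec

def baz: Int = ??? // the blank line produces two nl tokens: "expected start of definition"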

Maybe this behaviour should be changed to ignore completely blank lines inside comments?

Updated (1/May/2013): I've raised this as issue SI-7434.

Other posts in this series:
  • Secrets of the Scala Lexer 1: \uuuuunicode (below)

Friday 26 April 2013

Secrets of the Scala Lexer 1: \uuuuunicode

As part of developing Scalariform, I ended up writing my own Scala parser and lexer. Along the way I bumped into a couple of quirky features of the syntax that I thought I'd share in this occasional blog series.

First up is Unicode escapes. Both Java and Scala support encoding Unicode characters as hexadecimal sequences:

val arrow = "\u2190"

You might also be aware that, unlike escapes such as "\n" and "\r", this encoding isn't restricted to string or character literals, but can occur at any position in the source file. The translation from encoded to unencoded characters is conceptually the first stage in parsing Scala, performed before the characters are recognised as tokens. For example, the following two snippets encode the same valid Scala statement:

val x = 42
\u0076\u0061\u006c\u0020\u0078\u0020\u003d\u0020\u0034\u0032

What's even more strange is that you can add an arbitrary number of "u"s to your escape sequence, and it will still be happily accepted by the compiler:

\u0076\uu0061\uuu006c\uuuu0020\uuuuu0078\uuuuuu0020\uuuuuuu003d\uuuuuuuu0020\uuuuuuuuuu0034\uuuuuuuuuuu0032
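To make this concrete, here's a rough sketch of what that pre-lexing translation might look like. This is illustrative only -- it isn't scalac's or Scalariform's actual implementation, and it ignores the rule that an escape only counts if its backslash is preceded by an even number of backslashes:

def decodeUnicodeEscapes(source: String): String = {
  val sb = new StringBuilder
  var i = 0
  while (i < source.length) {
    // An escape is a backslash, one or more u's, then four hex digits
    if (source.charAt(i) == '\\' && i + 1 < source.length && source.charAt(i + 1) == 'u') {
      var j = i + 1
      while (j < source.length && source.charAt(j) == 'u')
        j += 1 // any number of u's is accepted
      val hex = source.slice(j, j + 4)
      if (hex.length == 4 && hex.forall(c => Character.digit(c, 16) >= 0)) {
        sb.append(Integer.parseInt(hex, 16).toChar)
        i = j + 4
      } else {
        sb.append(source.charAt(i)) // malformed escape: pass the backslash through
        i += 1
      }
    } else {
      sb.append(source.charAt(i))
      i += 1
    }
  }
  sb.toString
}

Feeding it either of the escaped statements above gives back val x = 42, which is what the lexer then tokenises.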

Cool. So, what is the purpose of this feature? Well, Scala borrowed it from Java, and Java's excuse is documented in the Java Language Specification (§3.3, "Unicode Escapes"):
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
I can perhaps imagine the intention in a Java 1.0 world. Maybe you've got a source-processing tool that only works with ASCII. You can encode your source as above, pump it through the tool, and then decode it again, and your Unicode characters and Unicode escapes have been preserved unscathed by the process.
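For completeness, the encoding direction described in the quote might look something like the sketch below (again just an illustration with made-up names, and again ignoring the even-number-of-backslashes rule):

def asciify(source: String): String = {
  val sb = new StringBuilder
  var i = 0
  while (i < source.length) {
    val c = source.charAt(i)
    if (c == '\\' && i + 1 < source.length && source.charAt(i + 1) == 'u') {
      sb.append('\\').append('u').append('u') // an escape already in the source gains one extra u
      i += 2
    } else if (c > 127) {
      sb.append('\\').append('u').append("%04x".format(c.toInt)) // a non-ASCII character becomes an escape with a single u
      i += 1
    } else {
      sb.append(c)
      i += 1
    }
  }
  sb.toString
}

Restoring the original is then the reverse: strip one u from each multi-u escape and turn each single-u escape back into the character it denotes. Note that this restore step is not the same as the compiler's translation sketched earlier, which collapses every escape to its character no matter how many u's it carries.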

I'd be interested to know if anyone has ever used this technique in Java; I'd certainly wager that nobody has, or ever will, use it for Scala. In my humble opinion, it's something of a misfeature in 2013: it creates puzzlers and complicates the implementation of language tools.