Monday 29 April 2013

Secrets of the Scala Lexer 2: Blank Lines in Comments

Consider the following Scala snippet:
@tailrec
/*
 Comment 1
 Comment 2
 */
def foo: Int = ???
This compiles correctly. Now consider what happens if we insert a blank line between Comment 1 and Comment 2:
@tailrec
/*
 Comment 1

 Comment 2
*/
def foo: Int = ???
This time, we get a compile error ("expected start of definition"). So why is it that we can get syntax errors based solely on whether there is a blank line inside a multi-line comment?

The gory details can be found in the Scala Language Specification §1.2, but the summary is:
  • To support semicolon inference, newline characters are sometimes interpreted as a special newline token called "nl". The rules for when this occurs are moderately complex, but they work quite intuitively in practice.
  • Two nl tokens can be inserted by the compiler in the following case: "if two tokens are separated by at least one completely blank line (i.e. a line which contains no printable characters), then two nl tokens are inserted."
  • At certain points in the syntax, a single optional nl token is accepted -- this includes after an annotation. This, too, is done to support semicolon inference.
  • However, two nl tokens are not permitted in some places (including after an annotation). I believe the intention is that a blank line is a clear sign that the code after it should be separated from the code before it.
So by adding a completely blank line inside the comment, two nl tokens are inserted instead of one, as per the rule above, and two nl tokens are not permitted by the syntax after an annotation.
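One consequence is an easy workaround: any printable character on the offending line -- even a lone "*" -- stops it counting as completely blank, so only a single nl token is inserted and the definition parses again. Here is a minimal runnable variant (countdown is just an illustrative tail-recursive method, not from the snippet above):

```scala
import scala.annotation.tailrec

object BlankLineDemo {
  @tailrec
  /*
   Comment 1
   *
   Comment 2
   */
  def countdown(n: Int): Int =
    // The '*' above keeps the middle comment line non-blank, so this
    // annotated definition compiles despite the "gap" in the comment.
    if (n <= 0) 0 else countdown(n - 1)
}
```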

Maybe this behaviour should be changed to ignore completely blank lines inside comments?

Updated (1/May/2013): I've raised this as issue SI-7434.

Friday 26 April 2013

Secrets of the Scala Lexer 1: \uuuuunicode

As part of developing Scalariform, I ended up writing my own Scala lexer and parser. Along the way, I bumped into a few quirky features of the syntax that I thought I'd share in this occasional blog series.

First up is Unicode escapes. Both Java and Scala support encoding Unicode characters as hexadecimal sequences:

val arrow = "\u2190"

You might also be aware that, unlike escapes such as "\n" and "\r", this encoding isn't restricted to string or character literals: it can occur at any position in the source file. Translating encoded characters to unencoded ones is conceptually the first stage of parsing Scala, performed before the characters are recognised as tokens. For example, the following two snippets encode the same valid Scala statement:

val x = 42
\u0076\u0061\u006c\u0020\u0078\u0020\u003d\u0020\u0034\u0032

What's even more strange is that you can add an arbitrary number of "u"s to your escape sequences, and they are still happily accepted by the compiler:

\u0076\uu0061\uuu006c\uuuu0020\uuuuu0078\uuuuuu0020\uuuuuuu003d\uuuuuuuu0020\uuuuuuuuuu0034\uuuuuuuuuuu0032

Cool. So, what is the purpose of this feature? Well, Scala borrowed it from Java, and Java's excuse is documented in the spec:
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
I can perhaps imagine the intention in a Java 1.0 world. Maybe you've got some source-processing tool that only works with ASCII: you encode your source as above, pump it through the tool, then decode it, and both your Unicode characters and your Unicode escapes emerge unscathed.
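The round trip can be sketched in a few lines of Scala (AsciiRoundTrip is an illustrative name, not actual compiler code, and for brevity this ignores the rule that only a backslash preceded by an even number of backslashes starts an escape):

```scala
object AsciiRoundTrip {
  import java.util.regex.Matcher.quoteReplacement

  // Matches an escape: a backslash, one or more 'u's, four hex digits.
  private val Escape = """\\(u+)([0-9a-fA-F]{4})""".r

  // Add one 'u' to every existing escape, then turn each non-ASCII
  // character into a single-'u' escape.
  def encode(s: String): String =
    Escape
      .replaceAllIn(s, m => quoteReplacement("\\" + m.group(1) + "u" + m.group(2)))
      .flatMap(c => if (c > 127) f"\\u${c.toInt}%04x" else c.toString)

  // Reverse: escapes with multiple 'u's lose one; single-'u' escapes
  // become the corresponding character.
  def decode(s: String): String =
    Escape.replaceAllIn(s, m =>
      if (m.group(1).length > 1)
        quoteReplacement("\\" + m.group(1).dropRight(1) + m.group(2))
      else
        quoteReplacement(Integer.parseInt(m.group(2), 16).toChar.toString))
}
```

With this, decode(encode(source)) returns the original source, whether it contained raw Unicode characters or existing escapes.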

I'd be interested to know if anyone has ever used this technique in Java; I'd certainly wager that nobody has or will ever use it for Scala. In my humble opinion, it's something of a misfeature in 2013 -- creating puzzlers and complicating the implementation of language tools.


Friday 15 June 2012

ASCII Art Testing

Suppose we're unit testing a shortest-path graph algorithm. We might write a test like the following:
val graph = makeGraph(
  edge("A", "B", weight = 7),
  edge("A", "F", weight = 14),
  edge("B", "C", weight = 10),
  edge("B", "D", weight = 15),
  edge("C", "D", weight = 11),
  edge("C", "F", weight = 2),
  edge("D", "E", weight = 6),
  edge("E", "F", weight = 9))
 
val path = graph.findShortestPathBetween("A", "E")
  
// Assertions ...
Although this is a reasonable way to define a graph, whenever someone tries to understand this test, they'll inevitably resort to sketching the graph to get a grip on its structure. We might even include a little ASCII-art diagram as a comment to save readers the effort of having to draw it out themselves. The downside is that, like all comments, the diagram could become out of sync with the code.

ASCII Art to the Rescue!

What we can do instead is use ASCII art to directly specify the graph, and have a diagram parser translate it into the appropriate graph object at runtime:
val graph = makeGraph("""
             +-+             
    ---------|E|----------   
    |        +-+         |   
 [9]|                 [6]|   
    |                    |   
   +-+  [2]  +-+  [11]  +-+  
   |F|-------|C|--------|D|  
   +-+       +-+        +-+  
    |         |          |   
[14]|     [10]|      [15]|   
    |         |          |   
   +-+       +-+         |   
   |A|-------|B|----------   
   +-+  [7]  +-+       """)
The aim is to clearly communicate the data we're working with while remaining precisely machine-parsable. The fact that your tests look 10 times as awesome is an added bonus!

This DSL is for building graphs, but I've used the same sort of principle for 2D games (think Sudoku or Tetris), and others have used it for taxonomies (see ASCII Art as a DSL for Unit Tests) and image processing (see Building Image Unit Tests). You could go as far as to argue that, whenever you think pictorially about a domain, the tests should also be specified pictorially.
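To illustrate the 2D-games idea, here's a toy sketch (GridParser and its parse method are hypothetical names, not part of ascii-graphs) that parses an ASCII picture of a game board into a map of occupied cells:

```scala
object GridParser {
  // Returns a map from (row, column) to the character occupying that
  // cell, treating '.' as an empty cell.
  def parse(art: String): Map[(Int, Int), Char] =
    (for {
      (line, row) <- art.linesIterator.zipWithIndex
      (ch, col)   <- line.zipWithIndex
      if ch != '.'
    } yield (row, col) -> ch).toMap
}

// The picture *is* the test data: a 2x2 block of 'X's on a 4x4 board.
val board = GridParser.parse(
  """....
    |.XX.
    |.XX.
    |....""".stripMargin)
```

A test can then assert directly on board -- for instance that cell (1, 1) holds 'X' and that empty cells are absent from the map.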

Depending on the domain, perhaps what is really needed is to define test-case data in a custom editor designed for that purpose. Still, ASCII art has the advantage that it's directly embeddable into source code and plays nicely with version control.

A tool like JavE is invaluable, and most IDEs/editors support modes that can make working with ASCII art easier (Eclipse Block Selection, for example).

Parser

I've published a basic graph diagram parser on GitHub: ascii-graphs. It can be used like this:
import com.github.mdr.ascii._

val diagram = Diagram("""
             +-+             
    ---------|E|----------   
    |        +-+         |   
 [9]|                 [6]|   
    |                    |   
   +-+  [2]  +-+  [11]  +-+  
   |F|-------|C|--------|D|  
   +-+       +-+        +-+  
    |         |          |   
[14]|     [10]|      [15]|   
    |         |          |   
   +-+       +-+         |   
   |A|-------|B|----------   
   +-+  [7]  +-+      """)

// Print all the vertices neighbouring A along with the edge weight:
for {
  box ← diagram.allBoxes.find(_.text == "A")
  (edge, otherBox) ← box.connections()
  label ← edge.label
} println(box + " ==> " + label + " ==> " + otherBox)

Thursday 10 November 2011

Dropping Stanford’s Online AI Class

I have decided to stop following Stanford’s online AI course to focus on the sibling Machine Learning and Database classes. Simply put, I feel that I’m learning a great deal more from the ML and DB classes, and the AI class isn’t really giving me a sufficient return on investment for my time. I’d like to jot down a few thoughts as to why this is the case.

Now, it may seem churlish to criticise a course being put out for free, so let me first clarify that I do heartily appreciate the generous efforts of Sebastian Thrun and Peter Norvig in running this; it must take an inordinate amount of time and effort to do so. I’m also convinced that online courses like this are a tremendously important future direction for education. Even so, I believe that it’s appropriate to evaluate and give constructive criticism -- much like reviewing a piece of open source software built by volunteers, for example.

I think the biggest weakness of the AI course is the lack of any practical assignments. Both the ML and DB classes have hands-on assignments for each week's topics, which really helps to understand and motivate the material. The AI class lacks any equivalent; indeed, for some weeks, only half the topics even appeared in the homework problems. Clearly, if I were sufficiently motivated, I could go and implement these things for myself, but part of the purpose of a course is to provide external motivation. After a hard day at work, it's often only the threat of a deadline that will get me to study ;-)

The homeworks that have been set have been, in my opinion, rather low quality, with lots of ambiguity and errors requiring clarification after the fact -- or even the answer to one question being given away in the next! While I've scored well on them, I don't feel that corresponds to any deep understanding of the material. It's quite possible to answer many homework questions just by mimicking the mechanics of a lecture video by rote, without any real comprehension of what's going on.

The lectures themselves are divided up into lots of short videos with quizzes at the end. Unfortunately, I found the constant quizzes simply annoying. I understand that asking questions to provoke the listener to think through a topic is a good pedagogical technique, but it can definitely be overused. Rather than prompting me to think through a question, my response often ended up being “how the %£$! should I know?” Further, because of this, the videos aren't really set up for offline viewing. Most of my study time is on the train commuting, so this was a big downside. For the ML and DB classes, the videos are divided into a few chunks with download links, which makes them practical for offline use.

The pacing of some of the material could also be better -- the unit on probability, which rapidly went from easy examples with a biased coin to some involved Bayesian probability calculations, seemed particularly ambitious. There have also been a few technical issues caused by the large number of people using the site -- this one is entirely forgivable, given the size of the class, but the ML and DB class infrastructure has been very stable by comparison.

Again, to reiterate, I strongly support the concept behind these courses and appreciate the hard work that has gone into them. I’m also aware that there are plenty of people who are really enjoying the class. If I had more time available, I probably would stick with it, but it’s not really worth it for me at this point.