How does U-SQL use Roslyn?

Compiler theory

Before looking specifically at the U-SQL compiler, let’s start with a little bit of theory. How does a compiler generally work?

In most cases, a compiler defines a grammar that is used to parse the code into a tree.
Trees are a very good data structure for representing code, because they are easy to analyze and update using the visitor pattern.

In compiler theory, this tree is called an AST (Abstract Syntax Tree).

For example, with Roslyn, parsing Console.WriteLine("Hello" + "World"); gives a SyntaxTree in which the invocation takes a single argument: an AddExpression over two string literals.
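To see that tree for yourself, here is a minimal sketch that parses the statement and prints the kind of every node, indented by depth. It assumes a project referencing the Microsoft.CodeAnalysis.CSharp NuGet package and a C# version with top-level statements; it is only an illustration, not code from the U-SQL compiler.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

// ParseStatement accepts a lone statement, so no surrounding class is needed.
var statement = SyntaxFactory.ParseStatement("Console.WriteLine(\"Hello\" + \"World\");");

// Visit the nodes depth-first and indent each kind by its depth in the tree.
foreach (var node in statement.DescendantNodesAndSelf())
{
    var depth = node.Ancestors().Count();
    Console.WriteLine(new string(' ', depth * 2) + node.Kind());
}
```

The AddExpression should appear nested under the Argument of the InvocationExpression, with the two StringLiteralExpression nodes as its children.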

Building the syntax tree lets the compiler check that the syntax is correct, and gives it a good data structure for modifying the code during compilation (think of async, yield, etc. in C#).

Then compilers generally check the semantics in a separate pass (named the binder in U-SQL).

In the previous sample, it means checking that:

  • The AddExpression between a string and a string is allowed
  • string + string returns a string
  • Console.WriteLine with a string parameter resolves to exactly one method in the code context.
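With Roslyn, these three checks can be sketched through the SemanticModel. The snippet below is an illustration under assumptions, not the U-SQL binder: it assumes .NET Core or .NET 5+ (where the TRUSTED_PLATFORM_ASSEMBLIES entry is available to locate framework assemblies), and the compilation name "Demo" and the wrapper class C are arbitrary.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

var tree = CSharpSyntaxTree.ParseText(@"
using System;
class C
{
    void M() { Console.WriteLine(""Hello"" + ""World""); }
}");

// Assumption: .NET Core / .NET 5+, where this entry lists the framework assemblies.
var references = ((string)AppContext.GetData("TRUSTED_PLATFORM_ASSEMBLIES"))
    .Split(Path.PathSeparator)
    .Select(path => MetadataReference.CreateFromFile(path));

var compilation = CSharpCompilation.Create("Demo", new[] { tree }, references,
    new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));
var model = compilation.GetSemanticModel(tree);
var root = tree.GetRoot();

// string + string: the binder decides whether the operator applies and what it returns.
var add = root.DescendantNodes().OfType<BinaryExpressionSyntax>().Single();
Console.WriteLine(model.GetTypeInfo(add).Type);

// Console.WriteLine(string): overload resolution must end up with a single method.
var call = root.DescendantNodes().OfType<InvocationExpressionSyntax>().Single();
var method = model.GetSymbolInfo(call).Symbol;
Console.WriteLine(method);
```

If binding fails (for example an unknown method), GetSymbolInfo returns a null Symbol and the compilation's diagnostics describe the error.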

Finally, compilers generally have an error reporter step.

In U-SQL, we build two kinds of AST, in two steps.

At the beginning, we use Yacc to parse U-SQL code and build our U-SQL AST.

This AST identifies expression parts.

Then we transform these expressions into C# code and get a Roslyn SyntaxTree for them.

How do we get a Roslyn SyntaxTree from U-SQL?

Imagine the following U-SQL query:

@Q = SELECT r * r * Math.[PI] AS Area
FROM (VALUES (1.0, 2.0)) AS T(r);

OUTPUT @Q TO "sample.out"
USING Outputters.Csv();

 

In the binder pass, we generate C# code from the expressions of this query, after making them syntactically correct in C#.

For example, in our sample, we remove the square brackets around PI (in U-SQL, only SQL keywords are supposed to be upper case, so an upper-case C# identifier such as PI has to be written in brackets).
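As a rough illustration of that rewriting step, here is a naive sketch based on a regular expression. UnquoteIdentifiers is a hypothetical helper, not how the U-SQL compiler does it (the compiler works on its own AST); the pattern only handles simple cases and would, for instance, also touch brackets inside string literals.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: strips [Name] when the bracket does not follow an
// identifier character, ')' or ']', so that indexers such as values[i] are kept.
static string UnquoteIdentifiers(string expression) =>
    Regex.Replace(expression, @"(?<![\w\)\]])\[([A-Za-z_]\w*)\]", "$1");

Console.WriteLine(UnquoteIdentifiers("r * r * Math.[PI]"));     // r * r * Math.PI
Console.WriteLine(UnquoteIdentifiers("values[i] * Math.[PI]")); // values[i] * Math.PI
```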

Note that Roslyn parsing is very tolerant of invalid syntax. So it would also be possible to parse the expressions without updating them first, and then fix the incorrect C# syntax in the SyntaxTree itself.
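This tolerance is easy to observe. The sketch below (again assuming the Microsoft.CodeAnalysis.CSharp package) parses the bracketed expression as-is: Roslyn still hands back an expression node and attaches the syntax problems to it as diagnostics instead of failing.

```csharp
using System;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

// Parse the U-SQL expression unchanged, square brackets included.
var expression = SyntaxFactory.ParseExpression("r * r * Math.[PI]");

// A tree is produced anyway; the errors are reported as diagnostics on the nodes.
Console.WriteLine(expression.Kind());
Console.WriteLine(expression.ContainsDiagnostics);
foreach (var diagnostic in expression.GetDiagnostics())
{
    Console.WriteLine(diagnostic);
}
```

A rewriter working on such a tree would have to locate the nodes carrying diagnostics and replace them with valid ones, which is why fixing the text before parsing is the simpler route.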

After transforming these expressions into correct C# code (from the syntax point of view at least), we generate a string that looks like the following code for our sample:

namespace Microsoft.USql
{
   class C
   {
      void M()
      {
         var EXTRACT0 = r * r * Math.PI;
      }
      double r;
   }
}

At this point, we can just parse this code using SyntaxFactory.ParseSyntaxTree to get our SyntaxTree.
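Putting the two steps together, a minimal sketch could look like this. Wrap is a hypothetical helper that reproduces the generated code shown above; only SyntaxFactory.ParseSyntaxTree comes from the description, the rest is an assumption made for illustration.

```csharp
using System;
using System.Linq;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Hypothetical helper: embeds the expression in the wrapper class shown above.
static string Wrap(string expression) => $@"
namespace Microsoft.USql
{{
   class C
   {{
      void M()
      {{
         var EXTRACT0 = {expression};
      }}
      double r;
   }}
}}";

var tree = SyntaxFactory.ParseSyntaxTree(Wrap("r * r * Math.PI"));

// Find the initializer of EXTRACT0: the expression that came from the query.
var initializer = tree.GetRoot()
    .DescendantNodes()
    .OfType<VariableDeclaratorSyntax>()
    .Single(d => d.Identifier.ValueText == "EXTRACT0")
    .Initializer.Value;

Console.WriteLine(initializer);                 // r * r * Math.PI
Console.WriteLine(tree.GetDiagnostics().Any()); // False: the syntax is valid
```

Declaring r as a field of the wrapper class is what later allows the column reference to bind when a semantic model is built on this tree.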

Now that we have the SyntaxTree, we can play with Roslyn.

2 thoughts on “How does U-SQL use Roslyn?”

    1. The Yacc grammar per se is not publicly available. However, we are planning on publishing a fairly complete grammar (excluding the C# part) as part of our reference documentation. A not-yet-complete beta version is available at http://aka.ms/usql_reference.
