How to use annotations in Roslyn

TextSpan SyntaxTree annotation

In my previous post, I explained how to get a SyntaxTree from a U-SQL query.

As we will see in future posts, we can change this tree significantly.

Why change the tree? The main reason is to help the optimizer generate the best possible plan.

One way we do this is constant folding (I will write about it later): we try to do some of the calculation in the compiler itself in order to hand constants to the optimizer.

For example, imagine that we write a query like this:

DECLARE @X = 1 - 1;
@Q = SELECT (1.0 + 2 + @X) / @X AS Foo
FROM (VALUES(1)) AS T(r);

U-SQL, as a smart compiler, avoids useless operations at runtime by folding constants and transforms the query to:

@Q = SELECT double.PositiveInfinity AS Foo
FROM (VALUES(1)) AS T(r);

For the following query, the compilation will fail (division by 0).

DECLARE @X = 1 - 1;
@Q = SELECT (1 + 2) + 1 / @X AS Foo
FROM (VALUES(1)) AS T(r);
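
The difference between the two queries comes down to C# arithmetic, assuming the folded expressions follow the usual C# rules (which the two results above suggest): a floating-point division by zero yields infinity, whereas an integer division by zero is an error. Here is a minimal sketch in plain C#, with no Roslyn involved; the class and method names are only illustrative.

```csharp
using System;

static class DivisionByZeroDemo
{
    // The 1.0 literal makes the whole expression a double division.
    public static double FloatingPoint(int x) => (1.0 + 2 + x) / x;

    // 1 / x is an int / int division, evaluated before the addition.
    public static int Integer(int x) => (1 + 2) + 1 / x;

    // Reports whether the integer version fails for the given value.
    public static bool IntegerDivisionThrows(int x)
    {
        try
        {
            Integer(x);
            return false;
        }
        catch (DivideByZeroException)
        {
            return true;
        }
    }
}
```

With x = 0, FloatingPoint returns double.PositiveInfinity while Integer throws a DivideByZeroException, which is the runtime counterpart of the compilation failure described above.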

Roslyn can give us the TextSpan of a node. The TextSpan struct includes the starting position of an expression and its length.
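
As a small illustration, the sketch below reads the TextSpan of a division in a plain C# expression. The TextSpanDemo class is hypothetical, and the expression uses a C# identifier x in place of the U-SQL variable.

```csharp
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Text;

static class TextSpanDemo
{
    // Returns the span (start + length) of the first division in the expression text.
    public static TextSpan DivisionSpan(string expressionText)
    {
        var expression = SyntaxFactory.ParseExpression(expressionText);
        var division = expression.DescendantNodesAndSelf()
            .OfType<BinaryExpressionSyntax>()
            .First(b => b.IsKind(SyntaxKind.DivideExpression));
        return division.Span;
    }
}
```

For "(1 + 2) + 1 / x", the returned span has Start 10 and Length 5, which covers "1 / x".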

But after constant folding, the expression is no longer “(1 + 2) + 1 / @X” but “3 + 1 / 0”. So if we want to report a relevant error position to the user (“(1 + 2) + 1 ### / @X”), we need a way to find the position of the original expression and, ideally, a way to get the original node itself.

To update the tree, we generally use the visitor pattern.
Roslyn provides the CSharpSyntaxRewriter for this.

Roslyn syntax nodes are immutable.
This means that the BinaryExpressionSyntax associated with the division “(1.0 + 2 + @X) / @X” can’t be the same instance as “3.0 / 0” once we replace “(1.0 + 2 + @X)” with “3.0” and “@X” with “0”.

Note that this is also true when you build or update a parent node. E.g. in the following sample, e1 and n1 are not the same instances as e2 and n2.

var e1 = SyntaxFactory.IdentifierName("a");
var n1 = SyntaxFactory.IdentifierName("b");
var mae = SyntaxFactory.MemberAccessExpression(SyntaxKind.SimpleMemberAccessExpression, e1, n1);
var e2 = mae.Expression;
var n2 = mae.Name;

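So we can’t just use a basic dictionary keyed by node instance, for example: the instance we would look up after a transformation is not the one we stored.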
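
To make this concrete, here is a small sketch that wraps the sample above in a hypothetical NodeIdentityDemo class and compares the instances.

```csharp
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

static class NodeIdentityDemo
{
    public static (bool sameInstance, bool equivalent, bool hasParent) Compare()
    {
        var e1 = SyntaxFactory.IdentifierName("a");
        var n1 = SyntaxFactory.IdentifierName("b");
        var mae = SyntaxFactory.MemberAccessExpression(
            SyntaxKind.SimpleMemberAccessExpression, e1, n1);
        var e2 = mae.Expression;

        return (
            ReferenceEquals(e1, e2),   // different instances
            e1.IsEquivalentTo(e2),     // but structurally the same syntax
            e2.Parent == mae);         // e2 knows its parent, e1 has none
    }
}
```

Compare() returns (false, true, true): the node we created and the node we read back are equivalent, but a lookup by reference would miss.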

To keep a piece of information attached to a SyntaxNode even after the tree is updated, Roslyn provides SyntaxAnnotation.

SyntaxAnnotation has 2 properties: Kind and Data.

So if you want to attach a Boolean, you can treat the presence of the annotation as true and its absence as false. Since nodes are immutable, WithAdditionalAnnotations returns a new node that you have to keep:

node = node.WithAdditionalAnnotations(new SyntaxAnnotation("USQLAnnotation"));

If you want to attach a value, you can pass it as Data:

node = node.WithAdditionalAnnotations(new SyntaxAnnotation("USQLAnnotation", "foo"));
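
The sketch below shows an annotation surviving a later transformation. The AnnotationSurvivalDemo class and the expression are only illustrative, with a C# identifier x standing in for the U-SQL variable.

```csharp
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

static class AnnotationSurvivalDemo
{
    public static string Run()
    {
        var marker = new SyntaxAnnotation("USQLAnnotation", "foo");
        ExpressionSyntax root = SyntaxFactory.ParseExpression("(1 + 2) + 1 / x");

        // Annotate the division; this already produces a new tree.
        var division = root.DescendantNodes()
            .OfType<BinaryExpressionSyntax>()
            .First(b => b.IsKind(SyntaxKind.DivideExpression));
        root = root.ReplaceNode(division, division.WithAdditionalAnnotations(marker));

        // Transform something below the annotated node: x becomes the literal 0.
        var x = root.DescendantNodes().OfType<IdentifierNameSyntax>().Single();
        var zero = SyntaxFactory.LiteralExpression(
            SyntaxKind.NumericLiteralExpression, SyntaxFactory.Literal(0));
        root = root.ReplaceNode(x, zero);

        // The division is yet another instance now, but the annotation still finds it.
        var found = root.GetAnnotatedNodes(marker).Single();
        return found.GetAnnotations("USQLAnnotation").Single().Data;
    }
}
```

Run() returns "foo": the Data we attached is still reachable after both replacements.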

This is very easy to use, but the fact that SyntaxAnnotation is a sealed class and that Data is a string limits what we can do with it.

So as a workaround, we can create our own annotation logic.

Basically, we can use a singleton holding a Dictionary whose values are lists of our custom annotations.

However, as I wrote previously, to keep the annotation on the tree after a transformation we still need to use a Roslyn SyntaxAnnotation.

So we define a string key that is both our dictionary key and the Data of the Roslyn annotation.

In our Singleton, we define two methods:

  • AddAnnotation, which adds the SyntaxAnnotation carrying the key to the node and adds the custom annotation to the dictionary under that key
  • GetAnnotations, which gets the key from the node's SyntaxAnnotation and then returns our custom annotations from the dictionary.

using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis;

// Marker interface implemented by every custom annotation.
internal interface IUSqlAnnotation
{
}

internal class USqlAnnotationPool
{
    // Kind of the Roslyn SyntaxAnnotation whose Data carries our dictionary key.
    private const string SyntaxAnnotationKey = "USQL";

    private Dictionary<string, List<IUSqlAnnotation>> annotations = new Dictionary<string, List<IUSqlAnnotation>>();
    private int counter;

    private USqlAnnotationPool()
    {
    }

    private static readonly USqlAnnotationPool instance = new USqlAnnotationPool();

    public static USqlAnnotationPool Instance
    {
        get { return instance; }
    }

    public T AddAnnotation<T>(T node, IUSqlAnnotation annotation)
        where T : SyntaxNode
    {
        var annotationKey = node.GetAnnotations(SyntaxAnnotationKey).SingleOrDefault()?.Data;

        if (annotationKey == null)
        {
            // First custom annotation on this node: create a key and store it on the node.
            annotationKey = SyntaxAnnotationKey + (++this.counter).ToString();
            this.annotations.Add(annotationKey, new List<IUSqlAnnotation> { annotation });
            node = node.WithAdditionalAnnotations(new SyntaxAnnotation(SyntaxAnnotationKey, annotationKey));
        }
        else
        {
            // The node already has a key: extend its list without creating a new node.
            this.annotations[annotationKey].Add(annotation);
        }

        return node;
    }

    public IEnumerable<IUSqlAnnotation> GetAnnotations(SyntaxNode node)
    {
        var annotationKey = node.GetAnnotations(SyntaxAnnotationKey).SingleOrDefault()?.Data;
        if (annotationKey == null)
        {
            yield break;
        }

        foreach (var annotation in this.annotations[annotationKey])
        {
            yield return annotation;
        }
    }
}

 

Then we can define two extension methods to make usage easier: AddAnnotation and GetAnnotations<T>.

internal static class RoslynExtensions
{
    internal static T AddAnnotation<T>(this T node, IUSqlAnnotation annotation)
        where T : SyntaxNode
    {
        return USqlAnnotationPool.Instance.AddAnnotation(node, annotation);
    }

    internal static IEnumerable<T> GetAnnotations<T>(this SyntaxNode node)
        where T : IUSqlAnnotation
    {
        return USqlAnnotationPool.Instance.GetAnnotations(node).OfType<T>();
    }
}

Finally, to keep the TextSpan of the original tree, we can define our TextSpanAnnotation and add it to our SyntaxTree nodes using a syntax rewriter.

using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.Text;

internal struct TextSpanAnnotation : IUSqlAnnotation
{
    public TextSpan TextSpan { get; set; }
}

public class TextSpanAnnotator : CSharpSyntaxRewriter
{
    // False when a specific Visit override has already annotated the current node.
    private bool useDefaultAnnotation = true;

    public override SyntaxNode Visit(SyntaxNode node)
    {
        if (node == null)
        {
            return null;
        }

        // Read the span before visiting: a rewritten node is detached from the
        // original tree, so its positions no longer match the source text.
        var originalSpan = GetTextSpan(node.GetFirstToken(), node.GetLastToken());

        node = base.Visit(node);
        if (!this.useDefaultAnnotation)
        {
            this.useDefaultAnnotation = true;
            return node;
        }

        return AddAnnotation(node, originalSpan, useDefaultAnnotation: true);
    }

    private TextSpan GetTextSpan(SyntaxToken firstToken, SyntaxToken lastToken)
    {
        return new TextSpan(firstToken.SpanStart, lastToken.Span.End - firstToken.SpanStart);
    }

    private SyntaxNode AddAnnotation(SyntaxNode node, TextSpan span, bool useDefaultAnnotation = false)
    {
        this.useDefaultAnnotation = useDefaultAnnotation;
        return node.AddAnnotation(new TextSpanAnnotation { TextSpan = span });
    }
}

Then we can improve the TextSpan position using specific Visit overrides:

    public override SyntaxNode VisitBinaryExpression(BinaryExpressionSyntax node)
    {
        return AddAnnotation(base.VisitBinaryExpression(node), node.OperatorToken.Span);
    }

Finally, to add our TextSpanAnnotation into our tree nodes, we just need to use our TextSpanAnnotator:

syntaxTree = SyntaxFactory.SyntaxTree(new TextSpanAnnotator().Visit(syntaxTree.GetRoot()));

Then to get the position of a node in the original SyntaxTree, we can use the following code:

node.GetAnnotations<TextSpanAnnotation>().Single().TextSpan

Note that, contrary to Roslyn annotations, adding a new IUSqlAnnotation to a node that already has one does not generate a new node.

This is probably better for performance, but it is sometimes less convenient.

In this post, you saw how we extend the Roslyn annotation logic using our USqlAnnotationPool class.
With it, we are now able to attach whatever information we want to our SyntaxTree nodes.
This information follows the immutable syntax nodes through the different transformations we apply to our tree, and you will see in future posts that we use this mechanism in many different contexts.

How does U-SQL use Roslyn?

Compiler theory

Before looking specifically at the U-SQL compiler, let’s start with a little bit of theory. How does a compiler generally work?

In most cases, a compiler defines a grammar that is used to parse the code into a tree.
Trees are a very good data structure for representing code, and they are easy to analyze and update using the visitor pattern.

In compiler theory, this tree is called an AST (Abstract Syntax Tree).

For example, with Roslyn, we get the following SyntaxTree when parsing “Console.WriteLine("Hello" + "World");”
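
The shape of that tree can be printed with a few lines of code. This is only a sketch: the SyntaxTreeDump helper is hypothetical, and it lists syntax nodes only, not tokens or trivia.

```csharp
using System.Text;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

static class SyntaxTreeDump
{
    // Parses one statement and returns one indented line per syntax node.
    public static string Dump(string statementText)
    {
        var statement = SyntaxFactory.ParseStatement(statementText);
        var builder = new StringBuilder();
        Append(statement, 0, builder);
        return builder.ToString();
    }

    private static void Append(SyntaxNode node, int depth, StringBuilder builder)
    {
        builder.Append(' ', depth * 2)
               .Append(node.Kind())
               .Append(": ")
               .AppendLine(node.ToString());

        foreach (var child in node.ChildNodes())
        {
            Append(child, depth + 1, builder);
        }
    }
}
```

Calling SyntaxTreeDump.Dump("Console.WriteLine(\"Hello\" + \"World\");") lists, among others, an InvocationExpression whose argument is an AddExpression with two StringLiteralExpression children.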

Building the syntax tree lets the compiler check that the syntax is correct and gives it a good data structure for modifying the code during compilation (think of async / yield / etc. in C#).

Then compilers generally check the semantics in a separate pass (named the binder in U-SQL).

In the previous sample, it means:

  • Checking that an AddExpression between a string and a string is allowed
  • Determining that string + string returns a string
  • Checking that Console.WriteLine with a string parameter resolves to a single existing method in the code context.
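
With Roslyn, this kind of question is answered by a SemanticModel. The sketch below is not the U-SQL binder; it only shows how Roslyn itself determines that string + string returns a string. The SemanticCheckDemo class is hypothetical, and it assumes the core library can be referenced from typeof(object).Assembly.Location.

```csharp
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

static class SemanticCheckDemo
{
    // Returns the type Roslyn binds to the expression "Hello" + "World".
    public static ITypeSymbol TypeOfConcatenation()
    {
        var tree = CSharpSyntaxTree.ParseText(
            "class C { object M() { return \"Hello\" + \"World\"; } }");

        // Assumption: referencing the assembly that defines System.Object is
        // enough to resolve string in this tiny compilation.
        var compilation = CSharpCompilation.Create(
            "Demo",
            new[] { tree },
            new[] { MetadataReference.CreateFromFile(typeof(object).Assembly.Location) });

        var model = compilation.GetSemanticModel(tree);
        var add = tree.GetRoot().DescendantNodes()
            .OfType<BinaryExpressionSyntax>()
            .Single();

        return model.GetTypeInfo(add).Type;
    }
}
```

The returned symbol has SpecialType.System_String, which corresponds to the second bullet above.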

Finally, compilers generally have an error reporting step.

In U-SQL, we build 2 kinds of AST in 2 steps

First, we use Yacc to parse the U-SQL code and build our U-SQL AST.

This AST identifies the expression parts.

Then we transform these expressions into C# code and get a Roslyn SyntaxTree for them.

How do we get a Roslyn SyntaxTree from U-SQL?

Imagine the following U-SQL query:

@Q = SELECT r * r * Math.[PI] AS Area
FROM (VALUES (1.0, 2.0)) AS T(r);

OUTPUT @Q TO "sample.out"
USING Outputters.Csv();

 

In the binder pass, we generate C# code from this query, using the query's expressions after making them syntactically correct in C#.

For example, in our sample, we remove the square brackets around PI (in U-SQL, only SQL tokens are supposed to be upper case).

Note that Roslyn parsing is very tolerant of unknown syntax. So it would also be possible to parse the expressions without updating them first, and then update the SyntaxTree itself to fix the incorrect C# syntax.
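
A quick way to see this tolerance is to parse the expression with its square brackets still in place: Roslyn returns a tree and records diagnostics instead of throwing. The TolerantParsingDemo helper below is only an illustration.

```csharp
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

static class TolerantParsingDemo
{
    // Parses the text as a C# expression and reports whether the resulting
    // tree carries diagnostics. Parsing itself does not throw on bad syntax.
    public static bool ParsesWithDiagnostics(string expressionText)
    {
        var expression = SyntaxFactory.ParseExpression(expressionText);
        return expression.ContainsDiagnostics;
    }
}
```

ParsesWithDiagnostics("r * r * Math.[PI]") returns true, while the cleaned-up "r * r * Math.PI" returns false.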

After transforming these expressions into correct C# code (from a syntax point of view at least), we generate a string that looks like the following code for our sample:

namespace Microsoft.USql
{
   class C
   {
      void M()
      {
         var EXTRACT0 = r * r * Math.PI;
      }
      double r;
   }
}

At this point, we can just parse this code using SyntaxFactory.ParseSyntaxTree to get our SyntaxTree.
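
As a sketch of that last step, the code below parses the generated string and retrieves the expression assigned to EXTRACT0. The GeneratedCodeDemo class and the lookup by variable name are assumptions made for illustration, not necessarily how the U-SQL compiler locates its expressions.

```csharp
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

static class GeneratedCodeDemo
{
    // The generated C# code from the sample above.
    public const string Code = @"
namespace Microsoft.USql
{
   class C
   {
      void M()
      {
         var EXTRACT0 = r * r * Math.PI;
      }
      double r;
   }
}";

    // Parses the generated code and returns the expression assigned to EXTRACT0.
    public static ExpressionSyntax GetExtractExpression()
    {
        SyntaxTree tree = SyntaxFactory.ParseSyntaxTree(Code);
        var declarator = tree.GetRoot().DescendantNodes()
            .OfType<VariableDeclaratorSyntax>()
            .First(v => v.Identifier.Text == "EXTRACT0");
        return declarator.Initializer.Value;
    }
}
```

GetExtractExpression().ToString() gives back "r * r * Math.PI", now as a Roslyn ExpressionSyntax that we can analyze or rewrite.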

Now that we have the SyntaxTree, we can play with Roslyn.

Hello world!

Microsoft announced Azure Data Lake services for analytics in the cloud in September.

As part of it, Microsoft released a new language: U-SQL.

From a developer's point of view, U-SQL is a new SQL language with expressions written in C#.

My name’s Matthieu Mezil and I’m a developer on the U-SQL compiler team.

You can find more details on the motivation for U-SQL, some of our inspiration, the design philosophy behind the language, and a few examples of its major aspects in Michael's post.

As one of Roslyn's first fans, I will write about coding and mainly meta-programming in this blog, including the way the U-SQL compiler uses Roslyn for the C# part of the language.