Channel: Andrew Lock | .NET Escapades

Introducing collection expressions in C#12: Behind the scenes of collection expressions - Part 1

This series takes an in-depth look at collection expressions, which were introduced in C#12. It looks primarily at the code that's generated when you use collection expressions in your application, to understand how collection expressions work behind the scenes.

In this first post, I provide an introduction to C#12 collection expressions. There's already plenty of good introductions to collection expressions around the internet, including a post on the .NET blog, but one more can't hurt!

Classic collection initializers

We've had "collection initializers" in C# since C# 3.0. These use the {} pattern to initialize any IEnumerable implementation that has an Add() method. For example, this creates a new List<string> and initializes it with 5 values:

var values = new List<string> { "1", "2", "3", "4", "5" };

Behind the scenes, the compiler emits code that looks a bit like this:

List<string> values = new List<string>();
values.Add("1");
values.Add("2");
values.Add("3");
values.Add("4");
values.Add("5");

Arrays in C# are special, in that you can initialize them in even more ways than other collections, though the syntax looks similar to standard collection initializers:

var values1 = new[] { "1", "2", "3", "4" };
var values2 = new string[4] { "1", "2", "3", "4" };
string[] values3 = { "1", "2", "3", "4" };

These work differently to collection initializers; there's no Add() method that's being called here. Instead, the compiler generates the same initialization code that you would write if you had to do everything by hand:

string[] array = new string[4];
array[0] = "1";
array[1] = "2";
array[2] = "3";
array[3] = "4";

I intentionally didn't use a primitive type like int in the example above. If you create an int[] using an array initializer, the compiler loads the data of the array directly from a constant series of bytes, which is more efficient. We'll look in more detail at this mechanism in subsequent posts.

So we've looked at collection initializers for general collections (like List<T>) and arrays, but there are also stackalloc expressions, which have become much more useful in the world of Span<T>, as you don't need to use unsafe code:

Span<int> array = stackalloc []{ 1, 2, 3, 4 };

A stackalloc expression allocates a block of memory on the stack, in contrast to a standard new int[], which allocates memory on the heap. A stack-allocated memory block created in a method is automatically discarded when the method returns, so it does not add any pressure on the garbage collector. However, use with caution: if you stackalloc too much you'll cause a StackOverflowException and crash your application!

Behind the scenes, the compiler turns this stackalloc initializer into some unsafe code such as the following:

unsafe
{
    byte* num = stackalloc byte[16];
    *(int*)num = 1;
    *(int*)(num + 4) = 2;
    *(int*)(num + (nint)2 * (nint)4) = 3;
    *(int*)(num + (nint)3 * (nint)4) = 4;
    Span<int> array = new Span<int>(num, 4);
}

The code above is creating an array on the stack, and then moving through each element (using some pointer arithmetic), setting each value.
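
Given the warning about stack overflows, one common defensive pattern is to stackalloc only when the requested size is small, and fall back to a heap array otherwise. This is something you write yourself, not something the compiler does for you. The following is a minimal sketch of that idea; the MaxStackElements threshold and the SumOfSquares method are hypothetical names chosen purely for illustration, and the right threshold depends on your application:

```csharp
using System;

// Hypothetical threshold: how many ints we're willing to put on the stack.
const int MaxStackElements = 128;

Console.WriteLine(SumOfSquares(4)); // 0 + 1 + 4 + 9 = 14

int SumOfSquares(int count)
{
    // Small buffers live on the stack; larger ones fall back to the heap.
    Span<int> buffer = count <= MaxStackElements
        ? stackalloc int[count]
        : new int[count];

    for (var i = 0; i < buffer.Length; i++)
    {
        buffer[i] = i * i;
    }

    var total = 0;
    foreach (var value in buffer)
    {
        total += value;
    }

    return total;
}
```

Because both branches produce a Span<int>, the rest of the method doesn't need to know where the memory came from.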

Unifying syntax with collection expressions

So we've seen that there are at least three distinct scenarios where we're initializing collections:

  • Arrays
  • Collections like List<T>
  • ReadOnlySpan<T> with stackalloc

Each of these requires a slightly different syntax, for example:

int[] array = new[] { 1, 2, 3, 4 }; // One of several options!

List<int> list = new() { 1, 2, 3, 4 };
HashSet<int> hashset = new() { 1, 2, 3, 4 }; 

ReadOnlySpan<int> span = stackalloc [] { 1, 2, 3, 4 };

That's all a bit messy and annoying. Collection expressions, introduced in C#12, provide a simplified, unified, syntax across all these different collection types. For example:

int[] array = [1, 2, 3, 4];

List<int> list = [1, 2, 3, 4];
HashSet<int> hashset = [1, 2, 3, 4];

ReadOnlySpan<int> span = [ 1, 2, 3, 4 ];

The consistency of collection expressions across all the collection types is a real boon, but it's not the only advantage. Collection expressions can give performance benefits (which we'll look at in later posts) as well as additional features compared to collection initializers.

To reiterate, collection and array initializers use the "old" syntax new [] {}/new () {}, while collection expressions use the "new" syntax [ ].

We'll start by looking at an area where collection expressions can be used where collection initializers just can't.

Inferring interface types with collection expressions

Imagine you want to create a collection, but all you care about is that it implements IEnumerable<int>. You have to decide for yourself which backing type to use:

IEnumerable<int> list1 = new List<int> { 1, 2, 3, 4 };
IEnumerable<int> list2 = new HashSet<int> { 1, 2, 3, 4 };
IEnumerable<int> list3 = new int[] { 1, 2, 3, 4 };

So which should you use? Does it matter? If all you need to do is enumerate the list, then it probably shouldn't matter which type you choose, right? So what's the correct option?

It's also somewhat annoyingly verbose, as you have to write both the collection type and the IEnumerable<int> variable type you want.

With collection expressions, you can defer the choice to the compiler instead of explicitly specifying the backing type. And an added bonus is the extra terseness of collection expressions:

IEnumerable<int> ienumerable = [ 1, 2, 3, 4 ];
IList<int> ilist = [ 1, 2, 3, 4 ];
IReadOnlyCollection<int> icollection = [ 1, 2, 3, 4 ];

Behind the scenes, the compiler creates a collection that implements the required interface completely transparently, so you don't need to think about it.

Of course you might wonder what the collection is that's created behind the scenes, as I did. Stay tuned for the rest of the series, because the answer to that is "it depends"!
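
If you're curious in the meantime, you can ask the runtime which types you were given. The following is a small sketch that prints the concrete types rather than assuming them, because they're a compiler implementation detail and could differ between compiler and runtime versions:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Target-type the same elements to three different interfaces.
IEnumerable<int> ienumerable = [1, 2, 3, 4];
IList<int> ilist = [1, 2, 3, 4];
IReadOnlyCollection<int> icollection = [1, 2, 3, 4];

// Print the backing types the compiler chose; don't rely on these names.
Console.WriteLine(ienumerable.GetType());
Console.WriteLine(ilist.GetType());
Console.WriteLine(icollection.GetType());

// Whatever the backing types are, the contents are what you asked for.
Console.WriteLine(ienumerable.SequenceEqual(ilist)); // True
Console.WriteLine(icollection.Count); // 4
```

The only thing you can safely depend on is the interface you declared, which is rather the point.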

It's worth pointing out that while the compiler will automatically choose a concrete type for an interface collection, you do need to specify some type. You can't, for example, use var:

var values = [ 1, 2, 3, 4 ]; // ❌ Does not compile, CS9176: There is no target type for the collection expression
Sum(values);

Sum([ 1, 2, 3, 4 ]);  // ✅ This is fine

int Sum(IEnumerable<int> v) => v.Sum();

The problem is that, the way the C# compiler works, it can't infer that the type for values should be IEnumerable<int>, so it reports an error. It's possible that this could change in a future version of C#, but it would likely be solved by, for example, always choosing int[] in this situation, which isn't necessarily optimal, so I wouldn't hold your breath.

Efficient automatic stackalloc for ReadOnlySpan

It's a similar story for ReadOnlySpan<T> and Span<T> instances if you are only using collection initializers. If you just need some data in a Span<T> or ReadOnlySpan<T>, then with collection initializers you need to decide where to put that data and then grab the Span<T> from it:

Span<int> spans2 = stackalloc[] { 1, 2, 3, 4 }; // stackalloc an array
Span<int> spans3 = new[] { 1, 2, 3, 4 }; // allocate on the heap
Span<string> spans4 = new[] { "1", "2", "3", "4" }; // can't use stackalloc in this case

It's not a big decision to make in this case, as there's probably only 2 sensible options, but it's still something extra to think about. Plus you can't stackalloc the string[] without jumping through a bunch of InlineArray hoops.

With collection expressions you can, again, delegate the decision to the compiler, and it will do the Right Thing™.

ReadOnlySpan<int> readonlyspans = [ 1, 2, 3, 4 ];
Span<string> spans = [ "1", "2", "3", "4" ];

Later in the series you'll see that these cases of collection expressions in particular are heavily optimised!

Collection expressions make refactoring simpler

The examples I've shown so far have all been assigning collection expressions to variables, but you can use collection expressions directly as method arguments too, so you can do things like this:

using System.Linq;
using System.Collections.Generic;

// create a method that takes an IEnumerable
int Sum(IEnumerable<int> values) => values.Sum();

// Call the method using collection expressions
Sum([1, 2, 3, 4]);

A nice benefit of this pattern in particular is that if I change the signature of Sum(), I don't need to change the call-site. Contrast that with what you'd have to write if you were using collection initializers:

// if the method takes an array... 
int Sum1(int[] values) => values.Sum();
Sum1(new [] { 1, 2, 3, 4 }); // ...you have to use array syntax (one of several syntaxes!)

// if the method takes an IEnumerable<T>... 
int Sum2(IEnumerable<int> values) => values.Sum();
Sum2(new List<int> { 1, 2, 3, 4 }); // ...you have to use an explicit type e.g. List<T> or similar

// if the method takes a ReadOnlySpan<T>... 
int Sum3(ReadOnlySpan<int> values)
{
    // You can use foreach with ReadOnlySpan<T>
    // but it doesn't implement IEnumerable<T>, so can't
    // use the Linq convenience methods here!
    var total = 0;
    foreach (var value in values) 
    {
        total += value;
    }

    return total;
}

Sum3(new []{ 1, 2, 3, 4 }); // ...you have to choose between a standard array, 
Sum3(stackalloc int[] { 1, 2, 3, 4 }); // ... or use a stackalloc'd array (for example)

If we use collection expressions instead, then we can use the exact same syntax to call all three Sum() implementations:

Sum1([ 1, 2, 3, 4 ]);
Sum2([ 1, 2, 3, 4 ]);
Sum3([ 1, 2, 3, 4 ]);

And the compiler will use the most efficient implementation it can to create a collection of the required type.

That may seem like a small thing, and to an extent it is, but it's all these little convenience aspects that make collection expressions such a neat feature overall!

Empty collections

Another feature of collection expressions is that the compiler explicitly recognizes the empty collection syntax, so instead of writing something like this:

var empty1 = new int[]{}; // You should generally never do this...
var empty2 = Array.Empty<int>(); // ...instead, prefer this!

you can now use [] to generate an appropriate empty version of the collection:

int[] empty = [];

Collection expressions, again, have two main benefits over explicit initializers:

  • The compiler can choose the most efficient way to create the empty collection, choosing Array.Empty<int>() for example (or equivalent).
  • You can use a consistent syntax for all collection types.

The following shows a whole bunch of collection types, and how you can use [] to create an empty version of all of them. The comment for each line shows the code that the compiler generates for the specific type:

int[] array = []; // Array.Empty<int>()

HashSet<int> hashset = []; // new HashSet<int>()
List<int> list = []; // new List<int>()

IEnumerable<int> ienumerable = [];  // Array.Empty<int>()
ICollection<int> icollection = []; // new List<int>()
IList<int> ilist = []; // new List<int>()

IReadOnlyCollection<int> readonlycollection = [];  // Array.Empty<int>()
IReadOnlyList<int> readonlyList = [];  // Array.Empty<int>()

Span<int> span = []; // default(Span<int>)
ReadOnlySpan<int> readonlyspan = []; // default(ReadOnlySpan<int>)

ImmutableArray<int> immutablearray = []; // ImmutableCollectionsMarshal.AsImmutableArray(Array.Empty<int>())
ImmutableList<int> immutablelist = []; // ImmutableList.Create(default(ReadOnlySpan<int>));

As you can see, the compiler is as efficient as it can be; if the type is mutable, such as a HashSet<T> or List<T>, then it has no option other than to create a new instance of the type, but if it can get away with using a non-allocating version, such as Array.Empty<int>(), then it will!
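
You can observe the difference between the two cases from ordinary code. The following sketch assumes the compiler behaviour shown in the listing above (an empty int[] reusing the shared Array.Empty<int>() instance), and contrasts it with List<int>, where each [] has to produce a separate instance:

```csharp
using System;
using System.Collections.Generic;

int[] emptyArray = [];
List<int> first = [];
List<int> second = [];

// An empty array can safely be shared, as there's nothing in it to mutate.
Console.WriteLine(object.ReferenceEquals(emptyArray, Array.Empty<int>()));

// Mutable collections must be distinct instances,
// otherwise adding to one would change the other.
first.Add(1);
Console.WriteLine(first.Count);  // 1
Console.WriteLine(second.Count); // 0
Console.WriteLine(object.ReferenceEquals(first, second)); // False
```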

Building collections from others with the spread element

So far we've seen two benefits of collection expressions:

  • Consistent syntax
  • Efficient compiler-generated implementations

The other big feature in collection expressions is the spread element, written as two dots (..). This gives you the ability to more easily create collections from other collection instances.

The spread functionality (sometimes also called "splat") has been in other languages like Python, JavaScript, and Ruby for a long time, so it's nice to see it arrive in C# finally.

As a concrete example, let's say you have two IEnumerable<T> collections, and you want to concatenate them as an array. That's pretty easy to do with LINQ, as there are extension methods for doing exactly that:

int[] ConcatAsArray(IEnumerable<int> first, IEnumerable<int> second)
{
    return first.Concat(second).ToArray();
}

Great, but what if you now want to work with ReadOnlySpan<T> instead of IEnumerable<T>? Unfortunately, as we discussed before, ReadOnlySpan<T> doesn't implement IEnumerable<T>, so we might do something like this instead:

int[] ConcatAsArray(ReadOnlySpan<int> first, ReadOnlySpan<int> second)
{
    var list = new List<int>(first.Length + second.Length);
    list.AddRange(first);
    list.AddRange(second);
    return list.ToArray();
}

Which isn't terrible, but it's still annoying to have to think about for each different collection type. With collection expressions we get a nice shortcut that can be used with all supported collection types: the spread element. Both of the above overloads could be implemented in the same way:

int[] ConcatAsArray(IEnumerable<int> first, IEnumerable<int> second)
    => [..first, ..second];

int[] ConcatAsArray(ReadOnlySpan<int> first, ReadOnlySpan<int> second)
    => [..first, ..second];

And again, the consistency of collection expressions means that if you change the parameters or the return type of ConcatAsArray(), you don't need to change the collection expression at all, it just works!

The .. element means "write all the values from the collection", so to give another example:

int[] array = [1, 2, 3, 4];
IEnumerable<int> oddValues = array.Where(int.IsOddInteger); // 1, 3
int[] evenValues = [..array.Where(int.IsEvenInteger)]; // 2, 4 in array (using spread)
int[] allValues = [..oddValues, ..evenValues]; // 1, 3, 2, 4

The code above uses the spread element several times, but in each case it means "write all the elements of the collection". So in the final step, allValues contains all the elements from oddValues followed by all the values from evenValues.

You can also mix single values and spread collections together in your collection expression, for example:

int[] arr = [1, 2, 3, 4];
int[] myValues = [ 0, ..arr, 5 , 6]; // 0, 1, 2, 3, 4, 5, 6

and the final result is the combination as though you had iterated through arr and added each value.
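
One consequence of "iterating and adding each value" is that spreading an array into a new array copies the elements, so the result is independent of the source. A quick sketch to illustrate, contrasting a spread with a plain assignment:

```csharp
using System;

int[] source = [1, 2, 3];
int[] copy = [..source]; // spread: a new array containing the same values
int[] alias = source;    // plain assignment: the same array instance

// Changing the copy doesn't touch the source...
copy[0] = 99;
Console.WriteLine(source[0]); // 1

// ...but changing the alias does.
alias[0] = 42;
Console.WriteLine(source[0]); // 42
```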

Note that the spread element .. is different to the range operator (e.g. .. in 1..3 or 2..^1) which is used to index into an array. However, you can combine them, using a range to select a subset of elements, and then spreading them into a collection expression:

int[] primes = [2, 3, 5, 7, 11, 13, 17];
int[] some = [0, ..primes[1..^1]]; // 0, 3, 5, 7, 11, 13

This code takes the elements of the primes array from index 1 up to, but not including, the last element (i.e. 3, 5, 7, 11, 13) using the 1..^1 range operator, and then uses spread .. in the collection expression.

Collection expressions add a nice symmetry to creating collections (which can be particularly useful when refactoring from one collection to another), and they make combining collections much simpler with the spread element. But collection expressions aren't just about syntax. An important part is that collection expressions give the option for the compiler to optimize the code it generates. In the next post we'll take a look at how that works.

Summary

In this post I provided an introduction to collection expressions, contrasting them with collection initializers, array initializers and stackalloc initialization. I showed how having a single unified collection expression syntax makes refactoring your code easier and allows the compiler to generate optimised, specialised, code. Finally I showed how the spread element .. can be used in collection expressions to more easily build new collections from existing collections.

In the next post we'll look behind the scenes at the code the compiler actually generates when you use collection expressions in your code.
