Groovy  Apache Groovy

Classifying Iris Flowers with Deep Learning, Groovy and GraalVM

by paulk

Posted on Saturday June 25, 2022 at 10:52AM in Technology

iris_description.pngA classic data science dataset captures flower characteristics of Iris flowers. It captures the width and length of the sepals and petals for three species (Setosa, Versicolor, and Virginica).

The Iris project in the groovy-data-science repo is dedicated to this example. It includes a number of Groovy scripts and a Jupyter/BeakerX notebook highlighting this example comparing and contrasting various libraries and various classification algorithms.

Technologies/libraries covered
Data manipulationWekaTablesawEncogJSATDatavecTribuo
ClassificationWekaSmileEncogTribuoJSATDeep Learning4JDeep Netts
VisualizationXChartTablesaw Plot.lyJavaFX
Main aspects/algorithms coveredReading csv, dataframes, visualization, exploration, naive bayes, logistic regression, knn regression, softmax regression, decision trees, support vector machine
Other aspects/algorithms coveredneural networks, multilayer perceptron, PCA

Feel free to browse these other examples and the Jupyter/BeakerX notebook if you are interested in any of these additional techniques.


For this blog, let's just look at the Deep Learning examples. We'll look at solutions using Encog, Eclipse DeepLearning4J and Deep Netts (with standard Java and as a native image using GraalVM) but first a brief introduction.

Deep Learning

Deep learning falls under the branches of machine learning and artificial intelligence. It involves multiple layers (hence the "deep") of an artificial neural network. There are lots of ways to configure such networks and the details are beyond the scope of this blog post, but we can give some basic details. We will have four input nodes corresponding to the measurements of our four characteristics. We will have three output nodes corresponding to each possible class (species). We will also have one or more additional layers in between.


Each node in this network mimics to some degree a neuron in the human brain. Again, we'll simplify the details. Each node has multiple inputs, which are given a particular weight, as well as an activation function which will determine whether our node "fires". Training the model is a process which works out what the best weights should be.


The math involved for converting inputs to output for any node isn't too hard. We could write it ourselves (as shown here using matrices and Apache Commons Math for a digit recognition example) but luckily we don't have to. The libraries we are going to use do much of the work for us. They typically provide a fluent API which let's us specify, in a somewhat declarative way, the layers in our network.

Just before exploring our examples, we should pre-warn folks that while we do time running the examples, no attempt was made to rigorously ensure that the examples were identical across the different technologies. The different technologies support slightly different ways to set up their respective network layers. The parameters were tweaked so that when run there was typically at most one or two errors in the validation. Also, the initial parameters for the runs can be set with random or pre-defined seeds. When random ones are used, each run will have slightly different errors. We'd need to do some additional alignment of examples and use a framework like JMH if we wanted to get a more rigorous time comparison between the technologies. Never-the-less, it should give a very rough guide as to the speed to the various technologies.


Encog is a pure Java machine learning framework that was created in 2008. There is also a C# port for .Net users. Encog is a simple framework that supports a number of advanced algorithms not found elsewhere but isn't as widely used as other more recent frameworks.

The complete source code for our Iris classification example using Encog is here, but the critical piece is:

def model = new EncogModel(data).tap {
selectMethod(data, TYPE_FEEDFORWARD)
report = new ConsoleStatusReportable()
holdBackValidation(0.3, true, 1001) // test with 30%

def bestMethod = model.crossvalidate(5, true) // 5-fold cross-validation

println "Training error: " + pretty(calculateRegressionError(bestMethod, model.trainingDataset)) println "Validation error: " + pretty(calculateRegressionError(bestMethod, model.validationDataset))

When we run the example, we see:

paulk@pop-os:/extra/projects/iris_encog$ time groovy -cp "build/lib/*" IrisEncog.groovy 
1/5 : Fold #1
1/5 : Fold #1/5: Iteration #1, Training Error: 1.43550735, Validation Error: 0.73302237
1/5 : Fold #1/5: Iteration #2, Training Error: 0.78845427, Validation Error: 0.73302237
5/5 : Fold #5/5: Iteration #163, Training Error: 0.00086231, Validation Error: 0.00427126
5/5 : Cross-validated score:0.10345818553910753
Training error:  0.0009
Validation error:  0.0991
Prediction errors:
predicted: Iris-virginica, actual: Iris-versicolor, normalized input: -0.0556, -0.4167,  0.3898,  0.2500
Confusion matrix:            Iris-setosa     Iris-versicolor      Iris-virginica
         Iris-setosa                  19                   0                   0
     Iris-versicolor                   0                  15                   1
      Iris-virginica                   0                   0                  10

real	0m3.073s
user	0m9.973s
sys	0m0.367s

We won't explain all of the stats, but it basically says we have a pretty good model with low errors in prediction. If you see the green and purple points in the notebook image earlier in this blog, you'll see there are some points which are going to be hard to predict correctly all the time. The confusion matrix shows that the model predicted one flower incorrectly on the validation dataset.

One very nice aspect of this library is that it is a single jar dependency!

Eclipse DeepLearning4j

Eclipse DeepLearning4j is a suite of tools for running deep learning on the JVM. It has support for scaling up to Apache Spark as well as some integration with python at a number of levels. It also provides integration to GPUs and C/++ libraries for native integration.

The complete source code for our Iris classification example using DeepLearning4J is here, with the main part shown below:

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.activation(Activation.TANH) // global activation
.updater(new Sgd(0.1))
.layer(new DenseLayer.Builder().nIn(numInputs).nOut(3).build())
.layer(new DenseLayer.Builder().nIn(3).nOut(3).build())
.layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
.activation(Activation.SOFTMAX) // override activation with softmax for this layer

def model = new MultiLayerNetwork(conf)

model.listeners = new ScoreIterationListener(100)

1000.times { }

def eval = new Evaluation(3)
def output = model.output(test.features)
eval.eval(test.labels, output)
println eval.stats()

When we run this example, we see:

paulk@pop-os:/extra/projects/iris_dl4j$ time groovy -cp "build/lib/*" IrisDl4j.groovy 
[main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
[main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 4
[main] INFO org.nd4j.nativeblas.Nd4jBlas - Number of threads used for OpenMP BLAS: 4
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
[main] INFO org.deeplearning4j.optimize.listeners.ScoreIterationListener - Score at iteration 0 is 0.9707752535968273
[main] INFO org.deeplearning4j.optimize.listeners.ScoreIterationListener - Score at iteration 100 is 0.3494968712782093
[main] INFO org.deeplearning4j.optimize.listeners.ScoreIterationListener - Score at iteration 900 is 0.03135504326480282

========================Evaluation Metrics========================
 # of classes:    3
 Accuracy:        0.9778
 Precision:       0.9778
 Recall:          0.9744
 F1 Score:        0.9752
Precision, recall & F1: macro-averaged (equally weighted avg. of 3 classes)

=========================Confusion Matrix=========================
  0  1  2
 18  0  0 | 0 = 0
  0 14  0 | 1 = 1
  0  1 12 | 2 = 2

Confusion matrix format: Actual (rowClass) predicted as (columnClass) N times

real	0m5.856s
user	0m25.638s
sys	0m1.752s

Again the stats tell us that the model is good. One error in the confusion matrix for our testing dataset.
DeepLearning4J does have an impressive range of technologies that can be used to enhance performance in certain scenarios. For this example, I enabled AVX (Advanced Vector Extensions) support but didn't try using the CUDA/GPU support nor make use of any Apache Spark integration. The GPU option might have sped up the application but given the size of the dataset and the amount of calculations needed to train our network, it probably wouldn't have sped up much. For this little example, the overheads of putting the plumbing in place to access native C++ implementations and so forth, outweighed the gains. Those features generally would come into their own for much larger datasets or massive amounts of calculations; tasks like intensive video processing spring to mind.

The downside of the impressive scaling options is the added complexity. The code was slightly more complex than the other technologies we look at in this blog based around certain assumptions in the API which would be needed if we wanted to make use of Spark integration even though we didn't here. The good news is that once the work is done, if we did want to use Spark, that would now be relatively straight forward.

The other increase in complexity is the number of jar files needed in the classpath. I went with the easy option of using the nd4j-native-platform dependency plus added the org.nd4j:nd4j-native:1.0.0-M2:linux-x86_64-avx2 dependency for AVX support. This made my life easy but brought in over 170 jars including many for unneeded platforms. Having all those jars is great if users of other platforms want to also try the example but it can be a little troublesome with certain tooling that breaks with long command lines on certain platforms. I could certainly do some more work to shrink those dependency lists if it became a real problem.

[For the interested reader, the groovy-data-science repo has other DeepLearning4J examples. The Weka library can wrap DeepLearning4J as shown for this Iris example here. There are also two variants of the digit recognition example we alluded to earlier using one and two layer neural networks.]

Deep Netts

Deep Netts is a company offering a range of products and services related to deep learning. Here we are using the free open-source Deep Netts community edition pure java deep learning library. It provides support for the Java Visual Recognition API (JSR381). The expert group from JSR381 released their final spec earlier this year, so hopefully we'll see more compliant implementations soon.

The complete source code for our Iris classification example using Deep Netts is here and the important part is below:

var splits = dataSet.split(0.7d, 0.3d)  // 70/30% split
var train = splits[0]
var test = splits[1]

var neuralNet = FeedForwardNetwork.builder()
.addFullyConnectedLayer(5, ActivationType.TANH)
.addOutputLayer(numOutputs, ActivationType.SOFTMAX)

neuralNet.trainer.with {
maxError = 0.04f
learningRate = 0.01f
momentum = 0.9f
optimizer = OptimizerType.MOMENTUM


new ClassifierEvaluator().with {
println "CLASSIFIER EVALUATION METRICS\n${evaluate(neuralNet, test)}"
println "CONFUSION MATRIX\n$confusionMatrix"

When we run this command we see:

paulk@pop-os:/extra/projects/iris_graalvm$ time groovy -cp "build/lib/*" Iris.groovy 
16:49:27.089 [main] INFO deepnetts.core.DeepNetts - ------------------------------------------------------------------------
16:49:27.091 [main] INFO deepnetts.core.DeepNetts - TRAINING NEURAL NETWORK
16:49:27.091 [main] INFO deepnetts.core.DeepNetts - ------------------------------------------------------------------------
16:49:27.100 [main] INFO deepnetts.core.DeepNetts - Epoch:1, Time:6ms, TrainError:0.8584314, TrainErrorChange:0.8584314, TrainAccuracy: 0.5252525
16:49:27.103 [main] INFO deepnetts.core.DeepNetts - Epoch:2, Time:3ms, TrainError:0.52278274, TrainErrorChange:-0.33564866, TrainAccuracy: 0.52820516
16:49:27.911 [main] INFO deepnetts.core.DeepNetts - Epoch:3031, Time:0ms, TrainError:0.029988592, TrainErrorChange:-0.015680967, TrainAccuracy: 1.0
16:49:27.911 [main] INFO deepnetts.core.DeepNetts - Total Training Time: 820ms
16:49:27.911 [main] INFO deepnetts.core.DeepNetts - ------------------------------------------------------------------------
Accuracy: 0.95681506 (How often is classifier correct in total)
Precision: 0.974359 (How often is classifier correct when it gives positive prediction)
F1Score: 0.974359 (Harmonic average (balance) of precision and recall)
Recall: 0.974359 (When it is actually positive class, how often does it give positive prediction)

                          none    Iris-setosaIris-versicolor Iris-virginica
           none              0              0              0              0
    Iris-setosa              0             14              0              0
Iris-versicolor              0              0             18              1
 Iris-virginica              0              0              0             12

real	0m3.160s
user	0m10.156s
sys	0m0.483s

This is faster than DeepLearning4j and similar to Encog. This is to be expected given our small data set and isn't indicative of performance for larger problems.

Another plus is the dependency list. It isn't quite the single jar situation as we saw with Encog but not far off. There is the Encog jar, the JSR381 VisRec API which is in a separate jar, and a handful of logging jars.

Deep Netts with GraalVM

Another technology we might want to consider if performance is important to us is GraalVM. GraalVM is a high-performance JDK distribution designed to speed up the execution of applications written in Java and other JVM languages. We'll look at creating a native version of our Iris Deep Netts application. We used GraalVM 22.1.0 Java 17 CE and Groovy 4.0.3. We'll cover just the basic steps but there are other places for additional setup info and troubleshooting help like here, here and here.

Groovy has two natures. It's dynamic nature supports adding methods at runtime through metaprogramming and interacting with method dispatch processing through missing method interception and other tricks. Some of these tricks make heavy use of reflection and dynamic class loading and cause problems for GraalVM which is trying to determine as much information as it can at compile time. Groovy's static nature has a more limited set of metaprogramming capabilities but allows bytecode much closer to Java to be produced. Luckily, we aren't relying on any dynamic Groovy tricks for our example. We'll compile it up using static mode:

paulk@pop-os:/extra/projects/iris_graalvm$ groovyc -cp "build/lib/*" --compile-static Iris.groovy

Next we build our native application:

paulk@pop-os:/extra/projects/iris_graalvm$ native-image --report-unsupported-elements-at-runtime \ --initialize-at-run-time=groovy.grape.GrapeIvy, \ --initialize-at-build-time --no-fallback -H:ConfigurationFileDirectories=conf/ -cp ".:build/lib/*" Iris

We told GraalVM to initialize GrapeIvy at runtime (to avoid needing Ivy jars in the classpath since Groovy will lazily load those classes only if we use @Grab statements). We also did the same for the RandomWeights class to avoid it being locked into a random seed fixed at compile time.

Now we are ready to run our application:

paulk@pop-os:/extra/projects/iris_graalvm$ time ./iris ... CLASSIFIER EVALUATION METRICS Accuracy: 0.93460923 (How often is classifier correct in total) Precision: 0.96491224 (How often is classifier correct when it gives positive prediction) F1Score: 0.96491224 (Harmonic average (balance) of precision and recall) Recall: 0.96491224 (When it is actually positive class, how often does it give positive prediction) CONFUSION MATRIX none Iris-setosaIris-versicolor Iris-virginica none 0 0 0 0 Iris-setosa 0 21 0 0 Iris-versicolor 0 0 20 2 Iris-virginica 0 0 0 17 real 0m0.131s user 0m0.096s sys 0m0.029s

We can see here that the speed has dramatically increased. This is great, but we should note, that using GraalVM often involves some tricky investigation especially for Groovy which by default has its dynamic nature. There are a few features of Groovy which won't be available when using Groovy's static nature and some libraries might be problematical. As an example, Deep Netts has log4j2 as one of its dependencies. At the time of writing, there are still issues using log4j2 with GraalVM. We excluded the log4j-core dependency and used log4j-to-slf4j backed by logback-classic to sidestep this problem.

[Update: I put the Deep Netts GraalVM iris application with some more detailed instructions into its own subproject.]


We have seen a few different libraries for performing deep learning classification using Groovy. Each has its own strengths and weaknesses. There are certainly options to cater for folks wanting blinding fast startup speeds through to options which scale to massive computing farms in the cloud.

Using Groovy with Apache Wayang and Apache Spark

by paulk

Posted on Sunday June 19, 2022 at 01:01PM in Technology

wayang.pngApache Wayang (incubating) is an API for big data cross-platform processing. It provides an abstraction over other platforms like Apache Spark and Apache Flink as well as a default built-in stream-based "platform". The goal is to provide a consistent developer experience when writing code regardless of whether a light-weight or highly-scalable platform may eventually be required. Execution of the application is specified in a logical plan which is again platform agnostic. Wayang will transform the logical plan into a set of physical operators to be executed by specific underlying processing platforms.

Whiskey Clustering

groovy.pngWe'll take a look at using Apache Wayang with Groovy to help us in the quest to find the perfect single-malt Scotch whiskey. The whiskies produced from 86 distilleries have been ranked by expert tasters according to 12 criteria (Body, Sweetness, Malty, Smoky, Fruity, etc.). We'll use a KMeans algorithm to calculate the centroids. This is similar to the KMeans example in the Wayang documentation but instead of 2 dimensions (x and y coordinates), we have 12 dimensions corresponding to our criteria. The main point is that it is illustrative of typical data science and machine learning algorithms involving iteration (the typical map, filter, reduce style of processing).


KMeans is a standard data-science clustering technique. In our case, it groups whiskies with similar characteristics (according to the 12 criteria) into clusters. If we have a favourite whiskey, chances are we can find something similar by looking at other instances in the same cluster. If we are feeling like a change, we can look for a whiskey in some other cluster. The centroid is the notional "point" in the middle of the cluster. For us it reflects the typical measure of each criteria for a whiskey in that cluster.

Implementation Details

We'll start with defining a Point record:

record Point(double[] pts) implements Serializable {
static Point fromLine(String line) { new Point(line.split(',')[2..-1]*.toDouble() as double[]) }

We've made it Serializable (more on that later) and included a fromLine factory method to help us make points from a CSV file. We'll do that ourselves rather than rely on other libraries which could assist. It's not a 2D or 3D point for us but 12D corresponding to the 12 criteria. We just use a double array, so any dimension would be supported but the 12 comes from the number of columns in our data file.

We'll define a related TaggedPointCounter record. It's like a Point but tracks a cluster Id and count used when clustering the "points":

record TaggedPointCounter(double[] pts, int cluster, long count) implements Serializable {
TaggedPointCounter plus(TaggedPointCounter that) {
new TaggedPointCounter((0..<pts.size()).collect{ pts[it] + that.pts[it] } as double[], cluster, count + that.count)

TaggedPointCounter average() {
new TaggedPointCounter(pts.collect{ double d -> d/count } as double[], cluster, 0)

We have plus and average methods which will be helpful in the map/reduce parts of the algorithm.

Another aspect of the KMeans algorithm is assigning points to the cluster associated with their nearest centroid. For 2 dimensions, recalling pythagoras' theorem, this would be the square root of x squared plus y squared, where x and y are the distance of a point from the centroid in the x and y dimensions respectively. We'll do the same across all dimensions and define the following helper class to capture this part of the algorithm:

class SelectNearestCentroid implements ExtendedSerializableFunction<Point, TaggedPointCounter> {
Iterable<TaggedPointCounter> centroids

void open(ExecutionContext context) {
centroids = context.getBroadcast("centroids")

TaggedPointCounter apply(Point p) {
def minDistance = Double.POSITIVE_INFINITY
def nearestCentroidId = -1
for (c in centroids) {
def distance = sqrt((0..<p.pts.size()).collect{ p.pts[it] - c.pts[it] }.sum{ it ** 2 } as double)
if (distance < minDistance) {
minDistance = distance
nearestCentroidId = c.cluster
new TaggedPointCounter(p.pts, nearestCentroidId, 1)

In Wayang parlance, the SelectNearestCentroid class is a UDF, a User-Defined Function. It represents some chunk of functionality where an optimization decision can be made about where to run the operation.

Once we get to using Spark, the classes in the map/reduce part of our algorithm will need to be serializable. Method closures in dynamic Groovy aren't serializable. We have a few options to avoid using them. I'll show one approach here which is to use some helper classes in places where we might typically use method references. Here are the helper classes:

class Cluster implements SerializableFunction<TaggedPointCounter, Integer> {
Integer apply(TaggedPointCounter tpc) { tpc.cluster() }

class Average implements SerializableFunction<TaggedPointCounter, TaggedPointCounter> {
TaggedPointCounter apply(TaggedPointCounter tpc) { tpc.average() }

class Plus implements SerializableBinaryOperator<TaggedPointCounter> {
TaggedPointCounter apply(TaggedPointCounter tpc1, TaggedPointCounter tpc2) { }

Now we are ready for our KMeans script:

int k = 5
int iterations = 20

// read in data from our file
def url = WhiskeyWayang.classLoader.getResource('whiskey.csv').file
def pointsData = new File(url).readLines()[1..-1].collect{ Point.fromLine(it) }
def dims = pointsData[0].pts().size()

// create some random points as initial centroids
def r = new Random()
def initPts = (1..k).collect { (0..<dims).collect { r.nextGaussian() + 2 } as double[] }

// create planbuilder with Java and Spark enabled
def configuration = new Configuration()
def context = new WayangContext(configuration)
def planBuilder = new JavaPlanBuilder(context, "KMeans ($url, k=$k, iterations=$iterations)")

def points = planBuilder
.loadCollection(pointsData).withName('Load points')

def initialCentroids = planBuilder
.loadCollection((0..<k).collect{ idx -> new TaggedPointCounter(initPts[idx], idx, 0) })
.withName("Load random centroids")

def finalCentroids = initialCentroids
.repeat(iterations, currentCentroids -> SelectNearestCentroid())
.withBroadcast(currentCentroids, "centroids").withName("Find nearest centroid")
.reduceByKey(new Cluster(), new Plus()).withName("Add up points")
.map(new Average()).withName("Average points")

println 'Centroids:'
finalCentroids.each { c ->
println "Cluster$c.cluster: ${c.pts.collect{ sprintf('%.3f', it) }.join(', ')}"

Here, k is the desired number of clusters, and iterations is the number of times to iterate through the KMeans loop. The pointsData variable is a list of Point instances loaded from our data file. We'd use the readTextFile method instead of loadCollection if our data set was large. The initPts variable is some random starting positions for our initial centroids. Being random, and given the way the KMeans algorithm works, it is possible that some of our clusters may have no points assigned.

Our algorithm works by assigning, at each iteration, all the points to their closest current centroid and then calculating the new centroids given those assignments. Finally, we output the results.

Running with the Java streams-backed platform

As we mentioned earlier, Wayang selects which platform(s) will run our application. It has numerous capabilities whereby cost functions and load estimators can be used to influence and optimize how the application is run. For our simple example, it is enough to know that even though we specified Java or Spark as options, Wayang knows that for our small data set, the Java streams option is the way to go.

Since we prime the algorithm with random data, we expect the results to be slightly different each time the script is run, but here is one output:

> Task :WhiskeyWayang:run
Cluster0: 2.548, 2.419, 1.613, 0.194, 0.097, 1.871, 1.742, 1.774, 1.677, 1.935, 1.806, 1.613
Cluster2: 1.464, 2.679, 1.179, 0.321, 0.071, 0.786, 1.429, 0.429, 0.964, 1.643, 1.929, 2.179
Cluster3: 3.250, 1.500, 3.250, 3.000, 0.500, 0.250, 1.625, 0.375, 1.375, 1.375, 1.250, 0.250
Cluster4: 1.684, 1.842, 1.211, 0.421, 0.053, 1.316, 0.632, 0.737, 1.895, 2.000, 1.842, 1.737 ...

Which if plotted looks like this:

WhiskeyWayang Centroid Spider Plot

If you are interested, check out the examples in the repo links at the end of this article to see the code for producing this centroid spider plot or the Jupyter/BeakerX notebook in this project's github repo.

Running with Apache Spark

spark.pngGiven our small dataset size and no other customization, Wayang will choose the Java streams based solution. We could use Wayang optimization features to influence which processing platform it chooses, but to keep things simple, we'll just disable the Java streams platform in our configuration by making the following change in our code:


Now when we run the application, the output will be something like this (a solution similar to before but with 1000+ extra lines of Spark and Wayang log information - truncated for presentation purposes):

[main] INFO org.apache.spark.SparkContext - Running Spark version 3.3.0
[main] INFO org.apache.spark.util.Utils - Successfully started service 'sparkDriver' on port 62081.
Cluster4: 1.414, 2.448, 0.966, 0.138, 0.034, 0.862, 1.000, 0.483, 1.345, 1.690, 2.103, 2.138
Cluster0: 2.773, 2.455, 1.455, 0.000, 0.000, 1.909, 1.682, 1.955, 2.091, 2.045, 2.136, 1.818
Cluster1: 1.762, 2.286, 1.571, 0.619, 0.143, 1.714, 1.333, 0.905, 1.190, 1.952, 1.095, 1.524
Cluster2: 3.250, 1.500, 3.250, 3.000, 0.500, 0.250, 1.625, 0.375, 1.375, 1.375, 1.250, 0.250
Cluster3: 2.167, 2.000, 2.167, 1.000, 0.333, 0.333, 2.000, 0.833, 0.833, 1.500, 2.333, 1.667
[shutdown-hook-0] INFO org.apache.spark.SparkContext - Successfully stopped SparkContext
[shutdown-hook-0] INFO org.apache.spark.util.ShutdownHookManager - Shutdown hook called


A goal of Apache Wayang is to allow developers to write platform-agnostic applications. While this is mostly true, the abstractions aren't perfect. As an example, if I know I am only using the streams-backed platform, I don't need to worry about making any of my classes serializable (which is a Spark requirement). In our example, we could have omitted the "implements Serializable" part of the TaggedPointCounter record, and we could have used a method reference TaggedPointCounter::average instead of our Average helper class. This isn't meant to be a criticism of Wayang, after all if you want to write cross-platform UDFs, you might expect to have to follow some rules. Instead, it is meant to just indicate that abstractions often have leaks around the edges. Sometimes those leaks can be beneficially used, other times they are traps waiting for unknowing developers.

To summarise, if using the Java streams-backed platform, you can run the application on JDK17 (which uses native records) as well as JDK11 and JDK8 (where Groovy provides emulated records). Also, we could make numerous simplifications if we desired. When using the Spark processing platform, the potential simplifications aren't applicable, and we can run on JDK8 and JDK11 (Spark isn't yet supported on JDK17).


We have looked at using Apache Wayang to implement a KMeans algorithm that runs either backed by the JDK streams capabilities or by Apache Spark. The Wayang API hid from us some of the complexities of writing code that works on a distributed platform and some of the intricacies of dealing with the Spark platform. The abstractions aren't perfect but they certainly aren't hard to use and provide extra protection should we wish to move between platforms. As an added bonus, they open up numerous optimization possibilities.

Apache Wayang is an incubating project at Apache and still has work to do before it graduates but lots of work has gone on previously (it was previously known as Rheem and was started in 2015). Platform agnostic applications is a holy grail that has been desired for many years but is hard to achieve. It should be exciting to see how far Apache Wayang progresses in achieving this goal.

More Information

  • Repo containing the source code: WhiskeyWayang
  • Repo containing similar examples using a variety of libraries including Apache Commons CSV, Weka, Smile, Tribuo and others: Whiskey
  • A similar example using Apache Spark directly but with a built-in parallelized KMeans from the spark-mllib library rather than a hand-crafted algorithm: WhiskeySpark
  • A similar example using Apache Ignite directly but with a built-in clustered KMeans from the ignite-ml library rather than a hand-crafted algorithm: WhiskeyIgnite

GPars meets Virtual Threads

by paulk

Posted on Wednesday June 15, 2022 at 11:28AM in Technology

gpars-rgb.pngAn exciting preview feature coming in JDK19 is Virtual Threads (JEP 425). In my experiments so far, virtual threads work well with my favourite Groovy parallel and concurrency library GPars. GPars has been around a while (since Java 5 and Groovy 1.8 days) but still has many useful features. Let's have a look at a few examples.

If you want to try these out, make sure you have a recent JDK19 (currently EA) and enable preview features with your Groovy tooling.

Parallel Collections

First a refresher, to use the GPars parallel collections feature with normal threads, use the GParsPool.withPool method as follows:

withPool {
assert [1, 2, 3].collectParallel{ it ** 2 } == [1, 4, 9] }

For any Java readers, don't get confused with the collectParallel method name. Groovy's collect method (naming inspired by Smalltalk) is the equivalent of Java's map method. So, the equivalent Groovy code using the Java streams API would be something like:

assert [1, 2, 3].parallelStream().map(n -> n ** 2).collect(Collectors.toList()) == [1, 4, 9]

Now, let's bring virtual threads into the picture. Luckily, GPars parallel collection facilities provide a hook for using an existing custom executor service. This makes using virtual threads for such code easy:

withExistingPool(Executors.newVirtualThreadPerTaskExecutor()) {
assert [1, 2, 3].collectParallel{ it ** 2 } == [1, 4, 9]

Nice! But let's move onto some areas examples which might be less familiar to Java developers.

GPars has additional features for providing custom thread pools and the remaining examples rely on those features. The current version of GPars doesn't have a DefaultPool constructor that takes a vanilla executor service, so, we'll write our own class:

class VirtualPool implements Pool {
private final ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()
int getPoolSize() { pool.poolSize }
void execute(Runnable task) { pool.execute(task) }
ExecutorService getExecutorService() { pool }

It is essentially a delegate from the GPars Pool interface to the virtual threads executor service.

We'll use this in the remaining examples.


Agents provide a thread-safe non-blocking wrapper around an otherwise potentially mutable shared state object. They are inspired by agents in Clojure.

In our case we'll use an agent to "protect" a plain ArrayList. For this simple case, we could have used some synchronized list, but in general, agents eliminate the need to find thread-safe implementation classes or indeed care at all about the thread safety of the underlying wrapped object.

def mutableState = []     // a non-synchronized mutable list
def agent = new Agent(mutableState)

agent.attachToThreadPool(new VirtualPool()) // omit line for normal threads

agent { it << 'Dave' } // one thread updates list
agent { it << 'Joe' } // another thread also updating
assert agent.val.size() == 2


Actors allow for a message passing-based concurrency model. The actor model ensures that at most one thread processes the actor's body at any time. The GPars API and DSLs for actors are quite rich supporting many features. We'll look at a simple example here.

GPars manages actor thread pools in groups. Let's create one backed by virtual threads:

def vgroup = new DefaultPGroup(new VirtualPool())

Now we can write an encrypting and decrypting actor pair as follows:

def decryptor = {
loop {
react { String message ->
reply message.reverse()

def console = {
decryptor << 'lellarap si yvoorG'
react {
println 'Decrypted message: ' + it

console.join() // output: Decrypted message: Groovy is parallel


Dataflow offers an inherently safe and robust declarative concurrency model. Dataflows are also managed via thread groups, so we'll use vgroup which we created earlier.

We have three logical tasks which can run in parallel and perform their work. The tasks need to exchange data and they do so using dataflow variables. Think of dataflow variables as one-shot channels safely and reliably transferring data from producers to their consumers.

def df = new Dataflows()

vgroup.task {
df.z = df.x + df.y

vgroup.task {
df.x = 10

vgroup.task {
df.y = 5

assert df.z == 15

The dataflow framework works out how to schedule the individual tasks and ensures that a task's input variables are ready when needed.


We have had a quick glimpse at using virtual threads with Groovy and GPars. It is very early days, so expect much more to emerge in this space once virtual threads are released in preview in production versions of JDK19 and eventually beyond a preview feature.

Groovy 4.0.3 Released

by paulk

Posted on Wednesday June 15, 2022 at 08:16AM in Technology

groovy.pngDear community,

The Apache Groovy team is pleased to announce version 4.0.3 of Apache Groovy.
Apache Groovy is a multi-faceted programming language for the JVM.
Further details can be found at the website.

This release is a maintenance release of the GROOVY_4_0_X branch.
It is strongly encouraged that all users using prior
versions on this branch upgrade to this version.

This release includes 40 bug fixes/improvements as outlined in the changelog:

Sources, convenience binaries, downloadable documentation and an SDK
bundle can be found at:
We recommend you verify your installation using the information on that page.

Jars are also available within the major binary repositories.

We welcome your help and feedback and in particular want
to thank everyone who contributed to this release.

For more information on how to report problems, and to get involved,
visit the project website at

Best regards,

The Apache Groovy team.