
Reading and Writing CSV files with Groovy

by paulk


Posted on Monday July 25, 2022 at 02:26PM in Technology


In this post, we'll look at reading and writing CSV files using Groovy.

Aren't CSV files just text files?

For simple cases, we can treat CSV files no differently than we would other text files. Suppose we have the following data that we would like to write to a CSV file:

def data = [
    ['place', 'firstname', 'lastname', 'team'],
    ['1', 'Lorena', 'Wiebes', 'Team DSM'],
    ['2', 'Marianne', 'Vos', 'Team Jumbo Visma'],
    ['3', 'Lotte', 'Kopecky', 'Team SD Worx']
]

Groovy uses File or Path objects similar to Java. We'll use a File object here and, for our purposes, a temporary file, since we're only going to read the data back in and check it against our original data. Here is how to create a temporary file:

def file = File.createTempFile('FemmesStage1Podium', '.csv')

Writing our CSV (in this simple example) is as simple as joining the data with commas and the lines with line separator character(s):

file.text = data*.join(',').join(System.lineSeparator())

Here we "wrote" the entire file contents in one go but there are options for writing a line or character or byte at a time.

Reading the data in is just as simple. We read the lines and split on commas:

assert file.readLines()*.split(',') == data

In general, we might want to further process the data. Groovy provides nice options for this too. Suppose we have the following existing CSV file:
[image: HommesOverall.png — the existing CSV file of men's stage winners]
We can read in the file and select various columns of interest with code like below:

def file = new File('HommesStageWinners.csv')
def rows = file.readLines().tail()*.split(',')
int total = rows.size()
Set names = rows.collect { it[1] + ' ' + it[2] }
Set teams = rows*.getAt(3)
Set countries = rows*.getAt(4)
String result = "Across $total stages, ${names.size()} riders from " +
    "${teams.size()} teams and ${countries.size()} countries won stages."
assert result == 'Across 21 stages, 15 riders from 10 teams and 9 countries won stages.'

Here, the tail() method skips over the header line. Column 0 contains the stage number which we ignore. Column 1 contains the first name, column 2 the last name, column 3 the team, and column 4 the country of the rider. We store away the full names, teams and countries in sets to remove duplicates. We then create an overall result message using the size of those sets.

While the coding was fairly simple for this example, it isn't recommended to hand process CSV files in this fashion. The details of CSV can quickly get messy. What if the values themselves contain commas or newlines? Perhaps we can surround values in double quotes, but then what if a value contains a double quote? And so forth. For this reason, CSV libraries are recommended.
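As a tiny illustration of the problem, consider a hypothetical row (not from our dataset) whose team value contains an embedded comma; naive splitting yields five pieces instead of the four logical fields:

def line = '3,Lotte,Kopecky,"Team SD Worx, Belgium"'
assert line.split(',').size() == 5 // 5 pieces, but only 4 logical fields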

We'll look at three shortly, but first let's summarise some of the highlights of the tour by looking at multiple winners. Here is some code which summarises our CSV data:

def byValueDesc = { -it.value }
def bySize = { k, v -> [k, v.size()] }
def isMultiple = { it.value > 1 }
def multipleWins = { Closure select -> rows
    .groupBy(select)
    .collectEntries(bySize)
    .findAll(isMultiple)
    .sort(byValueDesc)
    .entrySet()
    .join(', ')
}
println 'Multiple wins by country:\n' + multipleWins{ it[4] }
println 'Multiple wins by rider:\n' + multipleWins{ it[1] + ' ' + it[2] }
println 'Multiple wins by team:\n' + multipleWins{ it[3] }

This summary has nothing in particular to do with CSV files but is included in honour of the great riding during the tour! Here's the output:

[image: MultipleWins.png — summary output of multiple wins by country, rider and team]

Okay, now let's look at our three CSV libraries.

Commons CSV

The Apache Commons CSV library makes writing and parsing CSV files easier. Here is the code for writing our CSV which makes use of the CSVPrinter class:

file.withWriter { w ->
    new CSVPrinter(w, CSVFormat.DEFAULT).printRecords(data)
}

And here is the code for reading it back in which uses the RFC4180 parser factory singleton:

file.withReader { r ->
    assert RFC4180.parse(r).records*.toList() == data
}

There are other singleton factories for tab-separated values and other common formats, and builders that let you set a whole variety of options such as escape characters, quoting behaviour, whether to use an enum to define header names, and whether to ignore empty lines or nulls.
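As a rough sketch of that configuration style (the particular options chosen here are illustrative rather than from our example), building a customised format looks something like:

def format = CSVFormat.DEFAULT.builder()
    .setDelimiter(';' as char)
    .setIgnoreEmptyLines(true)
    .setTrim(true)
    .build()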

For our more elaborate example, we have a tiny bit more work to do. We'll use the builder to tell the parser to skip the header row. We could have chosen to use the tail() trick we used earlier but we decided to use the parser features instead. The code would look like this:

file.withReader { r ->
    def rows = RFC4180.builder()
        .setHeader()
        .setSkipHeaderRecord(true)
        .build()
        .parse(r)
        .records
    assert rows.size() == 21
    assert rows.collect { it.firstname + ' ' + it.lastname }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

You can see here that we have used column names rather than column numbers during our processing. Using column names is another advantage of using the CSV library; it would be quite a lot of work to do that aspect by hand. Also note that, for simplicity, we didn't create the entire result message as in the earlier example. Instead, we just checked the size of all of the relevant sets that we calculated previously.

OpenCSV

The OpenCSV library handles the messy CSV details when needed but doesn't get in the way for simple cases. For our first example, the CSVReader and CSVWriter classes will be suitable. Here is the code for writing our CSV file in the same way as earlier:

file.withWriter { w ->
    new CSVWriter(w).writeAll(data.collect{ it as String[] })
}

And here is the code for reading data:

file.withReader { r ->
    assert new CSVReader(r).readAll() == data
}

If we look at the produced file, it is already a little fancier than earlier with double quotes around all data:

[image: FemmesPodiumStage1.png — the produced CSV with all values double quoted]

If we want to do more elaborate processing, the CSVReaderHeaderAware class is aware of the initial header row and its column names. Here is our more elaborate example which processes some of the data further:

file.withReader { r ->
    def rows = []
    def reader = new CSVReaderHeaderAware(r)
    while ((next = reader.readMap())) rows << next
    assert rows.size() == 21
    assert rows.collect { it.firstname + ' ' + it.lastname }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

You can see here that we have again used column names rather than column numbers during our processing. For simplicity, we followed the same style as in the Commons CSV example and just checked the size of all of the relevant sets that we calculated previously.

OpenCSV also supports transforming CSV files into JavaBean instances. First, we define our target class (or annotate an existing domain class):

class Cyclist {
    @CsvBindByName(column = 'firstname')
    String first
    @CsvBindByName(column = 'lastname')
    String last
    @CsvBindByName
    String team
    @CsvBindByName
    String country
}

For two of the columns, we've indicated that the column name in the CSV file doesn't match our class property. The annotation attribute caters for that scenario.

Then, we can use this code to convert our CSV file into a list of domain objects:

file.withReader { r ->
    List<Cyclist> rows = new CsvToBeanBuilder(r).withType(Cyclist).build().parse()
    assert rows.size() == 21
    assert rows.collect { it.first + ' ' + it.last }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

OpenCSV has many options we didn't show. When writing files, you can specify the separator and quote characters; when reading, you can specify column positions and types, and validate data.
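As a sketch of those writing options (the CSVWriter constructor takes separator, quote character, escape character and line-ending arguments), semicolon-separated unquoted output might look like:

file.withWriter { w ->
    def csv = new CSVWriter(w, ';' as char, CSVWriter.NO_QUOTE_CHARACTER,
            CSVWriter.DEFAULT_ESCAPE_CHARACTER, '\n')
    csv.writeAll(data.collect { it as String[] })
}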

Jackson Databind CSV

The Jackson Databind library supports the CSV format (as well as many others).

Writing CSV files from existing data is simple, as shown here for our running example:

file.withWriter { w ->
    new CsvMapper().writeValue(w, data)
}

This writes the data into our temporary file as we saw with previous examples. One minor difference is that, by default, only the values containing spaces will be double quoted, but as with the other libraries, there are many configuration options to tweak such settings.
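For instance, a sketch of one such tweak, forcing all strings to be quoted via a CsvGenerator feature (worth checking against your Jackson version), might be:

import com.fasterxml.jackson.dataformat.csv.CsvGenerator

def quotingMapper = new CsvMapper().configure(CsvGenerator.Feature.ALWAYS_QUOTE_STRINGS, true)
file.withWriter { w ->
    quotingMapper.writeValue(w, data)
}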

Reading the data can be achieved using the following code:

def mapper = new CsvMapper().readerForListOf(String).with(CsvParser.Feature.WRAP_AS_ARRAY)
file.withReader { r ->
    assert mapper.readValues(r).readAll() == data
}

Our more elaborate example is done in a similar way:

def schema = CsvSchema.emptySchema().withHeader()
def mapper = new CsvMapper().readerForMapOf(String).with(schema)
file.withReader { r ->
    def rows = mapper.readValues(r).readAll()
    assert rows.size() == 21
    assert rows.collect { it.firstname + ' ' + it.lastname }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

Here, we tell the library to make use of our header row and store each row of data in a map.

Jackson Databind also supports writing to classes including JavaBeans as well as records. Let's create a record to hold our cyclist information:

@JsonCreator
record Cyclist(
    @JsonProperty('stage') int stage,
    @JsonProperty('firstname') String first,
    @JsonProperty('lastname') String last,
    @JsonProperty('team') String team,
    @JsonProperty('country') String country) {
    String full() { "$first $last" }
}

Note that again we can indicate where our record component names may not match the names used in the CSV file; we simply supply the alternate name when specifying the property. There are other options, like indicating that a field is required or giving its column position, but we don't need those for our example. We've also added a full() helper method to return the full name of the cyclist.

Groovy will use native records on platforms that support them (JDK16+) or emulated records on earlier platforms.

Now we can write our code for record deserialization:

def schema = CsvSchema.emptySchema().withHeader()
def mapper = new CsvMapper().readerFor(Cyclist).with(schema)
file.withReader { r ->
    List<Cyclist> records = mapper.readValues(r).readAll()
    assert records.size() == 21
    assert records*.full().toSet().size() == 15
    assert records*.team.toSet().size() == 10
    assert records*.country.toSet().size() == 9
}

Conclusion

We have looked at writing and reading CSV files, working with the data as lists of strings, maps, domain classes and records. We had a look at handling simple cases by hand and also looked at the OpenCSV, Commons CSV and Jackson Databind CSV libraries.

Code for these examples:
https://github.com/paulk-asert/CsvGroovy

Code for other examples of using Groovy for Data Science:
https://github.com/paulk-asert/groovy-data-science




Groovy release train: 4.0.4, 3.0.12, 2.5.18

by paulk


Posted on Sunday July 24, 2022 at 12:55PM in Technology


It's been a productive time for the Apache Groovy project. We recently released versions 4.0.4, 3.0.12 and 2.5.18 with 42, 21 and 15 fixes and improvements respectively. Here are two quick highlights before getting into more details about the 4.0.4 release.

Eric Milles has been interacting for many months with the team from the hephaestus project, in particular Stefanos Chaliasos and Thodoris Sotiropoulos. You can think of hephaestus as a fuzzing tool for type checkers, and they have been putting Groovy's static compiler through its paces, finding plenty of edge cases for us to assess. We still have some work to do but we have made significant improvements and would welcome any feedback. If you're interested, consider diving further into the research behind hephaestus.

We've also had some great contributions from Sandip Chitale for Groovy's Object Browser. You can access this in a number of ways including the :inspect command in groovysh or in the GroovyConsole via the Script->Inspect Last or Script->Inspect Variables menu items. It's also hooked into the AST Browser if you're exploring code produced by the Groovy compiler.

[image: Object Browser screenshot — launching the object explorer when property rows are double-clicked]

Please find more details about the 4.0.4 release below.


Dear community,


The Apache Groovy team is pleased to announce version 4.0.4 of Apache Groovy.
Apache Groovy is a multi-faceted programming language for the JVM.
Further details can be found at the https://groovy.apache.org website.

This release is a maintenance release of the GROOVY_4_0_X branch.
It is strongly encouraged that all users using prior
versions on this branch upgrade to this version.

This release includes 42 bug fixes/improvements as outlined in the changelog:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318123&version=12351811

Sources, convenience binaries, downloadable documentation and an SDK
bundle can be found at: https://groovy.apache.org/download.html
We recommend you verify your installation using the information on that page.

Jars are also available within the major binary repositories.

We welcome your help and feedback and in particular want
to thank everyone who contributed to this release.

For more information on how to report problems, and to get involved,
visit the project website at https://groovy.apache.org/

Best regards,

The Apache Groovy team.




Comparators and Sorting in Groovy

by paulk


Posted on Thursday July 21, 2022 at 03:51PM in Technology


This blog post is inspired by the Comparator examples in the excellent Collections Refuelled talk and blog by Stuart Marks. That blog from 2017 highlights improvements in the Java collections library in Java 8 and 9, including numerous Comparator improvements. It is now 5 years old but still highly recommended for anyone using the Java collections library.

Rather than have a Student class as per the original blog example, we'll have a Celebrity class (and later record) which has the same first and last name fields and an additional age field. We'll compare initially by last name with nulls before non-nulls and then by first name and lastly by age.

As with the original blog, we'll cater for nulls, e.g. a celebrity known by a single name.

The Java comparator story recap

Our Celebrity class, if we wrote it in Java, would look something like:

public class Celebrity {                    // Java
    private String firstName;
    private String lastName;
    private int age;

    public Celebrity(String firstName, int age) {
        this(firstName, null, age);
    }

    public Celebrity(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    @Override
    public String toString() {
        return "Celebrity{" +
                "firstName='" + firstName +
                (lastName == null ? "" : "', lastName='" + lastName) +
                "', age=" + age +
                '}';
    }
}

It would look much nicer as a Java record (JDK16+) but we'll keep with the spirit of the original blog example for now. This is fairly standard boilerplate and in fact was mostly generated by IntelliJ IDEA. The only slightly interesting aspect is that we tweaked the toString method to not display null last names.

On JDK 8 with the old-style comparator coding, a main application which created and sorted some celebrities might look like this:

import java.util.ArrayList;            // Java
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        List<Celebrity> celebrities = new ArrayList<>();
        celebrities.add(new Celebrity("Cher", "Wang", 63));
        celebrities.add(new Celebrity("Cher", "Lloyd", 28));
        celebrities.add(new Celebrity("Alex", "Lloyd", 47));
        celebrities.add(new Celebrity("Alex", "Lloyd", 37));
        celebrities.add(new Celebrity("Cher", 76));
        Collections.sort(celebrities, (c1, c2) -> {
            String f1 = c1.getLastName();
            String f2 = c2.getLastName();
            int r1;
            if (f1 == null) {
                r1 = f2 == null ? 0 : -1;
            } else {
                r1 = f2 == null ? 1 : f1.compareTo(f2);
            }
            if (r1 != 0) {
                return r1;
            }
            int r2 = c1.getFirstName().compareTo(c2.getFirstName());
            if (r2 != 0) {
                return r2;
            }
            return Integer.compare(c1.getAge(), c2.getAge());
        });
        System.out.println("Celebrities:");
        celebrities.forEach(System.out::println);
    }
}

When we run this example, the output looks like this:

Celebrities:
Celebrity{firstName='Cher', age=76}
Celebrity{firstName='Alex', lastName='Lloyd', age=37}
Celebrity{firstName='Alex', lastName='Lloyd', age=47}
Celebrity{firstName='Cher', lastName='Lloyd', age=28}
Celebrity{firstName='Cher', lastName='Wang', age=63}

As pointed out in the original blog, this code is rather tedious and error-prone and can be improved greatly with comparator improvements in JDK8:

import java.util.Arrays;             // Java
import java.util.List;

import static java.util.Comparator.comparing;
import static java.util.Comparator.naturalOrder;
import static java.util.Comparator.nullsFirst;

public class Main {
    public static void main(String[] args) {
        List<Celebrity> celebrities = Arrays.asList(
                new Celebrity("Cher", "Wang", 63),
                new Celebrity("Cher", "Lloyd", 28),
                new Celebrity("Alex", "Lloyd", 47),
                new Celebrity("Alex", "Lloyd", 37),
                new Celebrity("Cher", 76));
        celebrities.sort(comparing(Celebrity::getLastName, nullsFirst(naturalOrder())).
                thenComparing(Celebrity::getFirstName).thenComparing(Celebrity::getAge));
        System.out.println("Celebrities:");
        celebrities.forEach(System.out::println);
    }
}

The original blog also points out the convenience factory methods from JDK9 for list creation which you might be tempted to consider here. For our case, we will be sorting in place, so the immutable lists returned by those methods won't help us here but Arrays.asList isn't much longer than List.of and works well for this example.

As well as being much shorter, the comparing and thenComparing methods and built-in comparators like nullsFirst and naturalOrder allow for far simpler composability. The sort on ArrayList is also more efficient than the sort that would have been used with the Collections.sort method on earlier JDKs. The output when running the example is the same as previously.

The Groovy comparator story

At about the same time that Java was evolving its comparator story, Groovy added some complementary features to tackle many of the same problems. We'll look at some of those features and also see how the JDK improvements we saw above can be used instead if preferred.

First off, let's create a Groovy Celebrity record:

@Sortable(includes = 'last,first,age')
@ToString(ignoreNulls = true, includeNames = true)
record Celebrity(String first, String last = null, int age) {}

And create our list of celebrities:

var celebrities = [
    new Celebrity("Cher", "Wang", 63),
    new Celebrity("Cher", "Lloyd", 28),
    new Celebrity("Alex", "Lloyd", 47),
    new Celebrity("Alex", "Lloyd", 37),
    new Celebrity(first: "Cher", age: 76)
]

The record definition is nice and concise. It would look good in recent Java versions too. A nice aspect of the Groovy solution is that it will provide emulated records on earlier JDKs and it also has some nice declarative transforms to tweak the record definition. We could leave off the @ToString annotation and we'd get a standard record-style toString. Or we could add a toString method to our record definition similar to what was done in the Java example. Using @ToString allows us to remove null last names from the toString in a declarative way. We'll cover the @Sortable annotation a little later.

Groovy's spaceship operator <=> allows us to write a nice compact version of the "tedious" code in the first Java version. It looks like this:

celebrities.sort { c1, c2 ->
    c1.last <=> c2.last ?: c1.first <=> c2.first ?: c1.age <=> c2.age
}
println 'Celebrities:\n' + celebrities.join('\n')

And the output looks like this:

Celebrities:
Celebrity(first:Cher, age:76)
Celebrity(first:Alex, last:Lloyd, age:37)
Celebrity(first:Alex, last:Lloyd, age:47)
Celebrity(first:Cher, last:Lloyd, age:28)
Celebrity(first:Cher, last:Wang, age:63)

We'd have a tiny bit more work to do if we wanted nulls last but the defaults work well for the example at hand.

Alternatively, we can make use of the "new in JDK8" methods mentioned earlier:

celebrities.sort(comparing(Celebrity::last, nullsFirst(naturalOrder())).
        thenComparing(c -> c.first).thenComparing(c -> c.age))
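If we wanted nulls last instead, that becomes a one-word change to the comparator, assuming a static import of Comparator.nullsLast:

celebrities.sort(comparing(Celebrity::last, nullsLast(naturalOrder())).
        thenComparing(c -> c.first).thenComparing(c -> c.age))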

But this is where we should come back and further explain the @Sortable annotation. That annotation is associated with an Abstract Syntax Tree (AST) transformation, or just transform for short, which provides us with an automatic compareTo method that takes into account the record's properties (and likewise if it was a class). Since we provided an includes annotation attribute with a list of property names, the order of those names determines the priority of the properties used in the comparator. We could equally include just some of the names in that list, or alternatively provide an excludes annotation attribute and mention just the properties we don't want included.

It also adds Comparable<Celebrity> to the list of implemented interfaces for our record. So, what does all this mean? It means we can just write:

celebrities.sort()

The transform associated with the @Sortable annotation also provides some additional comparators for us. To sort by age, we can use one of those comparators:

celebrities.sort(Celebrity.comparatorByAge())

Which gives this output:

Celebrities:
Celebrity(first:Cher, last:Lloyd, age:28)
Celebrity(first:Alex, last:Lloyd, age:37)
Celebrity(first:Alex, last:Lloyd, age:47)
Celebrity(first:Cher, last:Wang, age:63)
Celebrity(first:Cher, age:76)

In addition to the sort method, Groovy provides a toSorted method which sorts a copy of the list, leaving the original unchanged. So, to create a list sorted by first name we can use this code:

var celebritiesByFirst = celebrities.toSorted(Celebrity.comparatorByFirst())

Which if output in a similar way to previous examples gives:

Celebrities:
Celebrity(first:Alex, last:Lloyd, age:37)
Celebrity(first:Alex, last:Lloyd, age:47)
Celebrity(first:Cher, last:Lloyd, age:28)
Celebrity(first:Cher, last:Wang, age:63)
Celebrity(first:Cher, age:76)

If you are a fan of functional style programming, you might consider using List.of to define the original list and then use only toSorted method calls in further processing, along the lines of the sketch below.
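A rough sketch of that style (assuming JDK 9+ for List.of) might be:

var fixed = List.of(
    new Celebrity("Cher", "Wang", 63),
    new Celebrity(first: "Cher", age: 76))
var byAge = fixed.toSorted(Celebrity.comparatorByAge()) // original list untouched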

Mixing in some language integrated queries

Groovy also has a GQuery (aka GINQ) capability which provides a SQL-inspired DSL for working with collections. We can use GQueries to examine and order our collection. Here is an example:

println GQ {
    from c in celebrities
    select c.first, c.last, c.age
}
Which has this output:

+-------+-------+-----+
| first | last  | age |
+-------+-------+-----+
| Cher  |       | 76  |
| Alex  | Lloyd | 37  |
| Alex  | Lloyd | 47  |
| Cher  | Lloyd | 28  |
| Cher  | Wang  | 63  |
+-------+-------+-----+

In this case, it's using the natural ordering which @Sortable gives us.

Or we can sort by age:

println GQ {
    from c in celebrities
    orderby c.age
    select c.first, c.last, c.age
}

Which has this output:

+-------+-------+-----+
| first | last  | age |
+-------+-------+-----+
| Cher  | Lloyd | 28  |
| Alex  | Lloyd | 37  |
| Alex  | Lloyd | 47  |
| Cher  | Wang  | 63  |
| Cher  |       | 76  |
+-------+-------+-----+

Or we can sort by last name descending and then age:

println GQ {
    from c in celebrities
    orderby c.last in desc, c.age
    select c.first, c.last, c.age
}

Which has this output:

+-------+-------+-----+
| first | last  | age |
+-------+-------+-----+
| Cher  | Wang  | 63  |
| Cher  | Lloyd | 28  |
| Alex  | Lloyd | 37  |
| Alex  | Lloyd | 47  |
| Cher  |       | 76  |
+-------+-------+-----+

Conclusion

We have seen a little example of using comparators in Groovy. All the great JDK capabilities are available as well as the spaceship operator, the sort and toSorted methods, and the @Sortable AST transformation.


Testing your Java with Groovy, Spock, JUnit5, Jacoco, Jqwik and Pitest

by paulk


Posted on Friday July 15, 2022 at 08:26AM in Technology


This blog post covers a common scenario seen in the Groovy community: projects that use Java for their production code and Groovy for their tests. This can be a low-risk way for Java shops to try out and become more familiar with Groovy. We'll write our initial tests using the Spock testing framework and use JUnit5 later with our jqwik tests. You can usually use your favorite Java testing libraries if you switch to Groovy.

The system under test

For illustrative purposes, we will test a Java mathematics utility function sumBiggestPair. Given three numbers, it finds the two biggest and then adds them up. An initial stab at the code for this might look something like this:

public class MathUtil {

    public static int sumBiggestPair(int a, int b, int c) {
        int op1 = a;
        int op2 = b;
        if (c > a) {
            op1 = c;
        } else if (c > b) {
            op2 = c;
        }
        return op1 + op2;
    }

    private MathUtil() {}
}

Testing with Spock

An initial test could look like this:

class MathUtilSpec extends Specification {
    def "sum of two biggest numbers"() {
        expect:
        MathUtil.sumBiggestPair(2, 5, 3) == 8
    }
}

When we run this test, all tests pass:

[image: test results — all tests pass]
But if we look at the coverage report, generated with Jacoco, we see that our test hasn't covered all lines of code:

[image: Jacoco coverage report — incomplete line coverage]

We'll swap to using Spock's data-driven feature and include an additional testcase:

    def "sum of two biggest numbers"(int a, int b, int c, int d) {
expect:
MathUtil.sumBiggestPair(a, b, c) == d

where:
a | b | c | d
2 | 5 | 3 | 8
5 | 2 | 3 | 8
}

We can check our coverage again:

[image: Jacoco coverage report for MathUtil.java — 100% line coverage, incomplete branch coverage]

That is a little better. We now have 100% line coverage but not 100% branch coverage. Let's add one more testcase:

    def "sum of two biggest numbers"(int a, int b, int c, int d) {
expect:
MathUtil.sumBiggestPair(a, b, c) == d

where:
a | b | c | d
2 | 5 | 3 | 8
5 | 2 | 3 | 8
5 | 4 | 1 | 9
}

And now we can see that we have reached 100% line coverage and 100% branch coverage:

[image: Jacoco coverage report for MathUtil.java — 100% line and branch coverage]

At this point, we might be very confident in our code and ready to ship it to production. Before we do, we'll add one more testcase:

def "sum of two biggest numbers"(int a, int b, int c, int d) {
expect:
MathUtil.sumBiggestPair(a, b, c) == d

where:
a | b | c | d
2 | 5 | 3 | 8
5 | 2 | 3 | 8
5 | 4 | 1 | 9
3 | 2 | 6 | 9
}

When we re-run our tests, we discover that the last testcase fails:

[image: Spock test results — one failing testcase]

And examining the testcase, we can indeed see that there is a flaw in our algorithm. Basically, the else logic doesn't cater for the case when c is greater than both a and b!

[image: failing testcase details]

We succumbed to faulty expectations of what 100% coverage would give us.

A 100% code coverage example

The good news is that we can fix this. Here is an updated algorithm:

public static int sumBiggestPair(int a, int b, int c) {
    int op1 = a;
    int op2 = b;
    if (c > Math.min(a, b)) {
        op1 = c;
        op2 = Math.max(a, b);
    }
    return op1 + op2;
}

With this new algorithm, all 4 testcases now pass and we again have 100% line and branch coverage.

> Task :SumBiggestPairPitest:test
 Test sum of two biggest numbers [Tests: 4/4/0/0] [Time: 0.317 s]
 Test util.MathUtilSpec [Tests: 4/4/0/0] [Time: 0.320 s]
 Test Gradle Test Run :SumBiggestPairPitest:test [Tests: 4/4/0/0]

But haven't we been here before? How can we be sure there aren't additional testcases that might reveal another flaw in our algorithm? We could keep writing lots more testcases, but we'll look at two other techniques that can help.

Mutation testing with Pitest

An interesting but not widely used technique is mutation testing. It probably deserves to be more widely used. It can test the quality of a testsuite but has the drawback of sometimes being quite resource intensive. It modifies (mutates) production code and re-runs your testsuite. If your test suite still passes with modified code, it possibly indicates that your testsuite is lacking sufficient coverage. Earlier, we had an algorithm with a flaw and our testsuite didn't initially pick it up. You can think of mutation testing as adding a deliberate flaw and seeing whether your testsuite is good enough to detect that flaw.
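To make the idea concrete, here is the flavour of mutant a tool like Pitest might produce for our example (an illustration of the concept, not actual Pitest output):

// original condition in sumBiggestPair
if (c > a) {
    op1 = c;
}
// a typical "conditional boundary" mutant flips > to >=
if (c >= a) {
    op1 = c;
}
// if the whole testsuite still passes against the mutant, the mutation "survives"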

If you're a fan of test-driven development (TDD), it espouses a rule that not a single line of production code should be added unless a failing test forces that line to be added. A corollary is that if you change a single line of production code in any meaningful way, some test should fail.

So, let's have a look at what mutation testing says about our initial flawed algorithm. We'll use Pitest (also known as PIT). We'll go back to our initial algorithm and the point where we erroneously thought we had 100% coverage. When we run Pitest, we get the following result:

[image: Pitest report summary]

And looking at the code we see:

[image: Pitest report for MathUtil.java — surviving mutations highlighted]

With output including some statistics:

================================================================================
- Statistics
================================================================================
>> Line Coverage: 7/8 (88%)
>> Generated 6 mutations Killed 4 (67%)
>> Mutations with no coverage 0. Test strength 67%
>> Ran 26 tests (4.33 tests per mutation)

What is this telling us? Pitest mutated our code in ways that you might expect to break it but our testsuite passed (survived) in a couple of instances. That means one of two things. Either, there are multiple valid implementations of our algorithm and Pitest found one of those equivalent solutions, or our testsuite is lacking some key testcases. In our case, we know that the testsuite was insufficient.

Let's run it again but this time with all of our tests and the corrected algorithm.

[image: Pitest report for MathUtil.java — corrected algorithm]

The output when running the test has also changed slightly:

================================================================================
- Statistics
================================================================================
>> Line Coverage: 6/7 (86%)
>> Generated 4 mutations Killed 3 (75%)
>> Mutations with no coverage 0. Test strength 75%
>> Ran 25 tests (6.25 tests per mutation)

Our warnings from Pitest have reduced but not gone away completely, and our test strength has gone up but is still not 100%. It does mean that we are in better shape than before. But should we be concerned?

It turns out in this case, we don't need to worry (too much). As an example, an equally valid algorithm for our function under test would be to replace the conditional with "c >= Math.min(a, b)". Note the greater-than-equals operator rather than just greater-than. For this algorithm, a different path would be taken for the case when c equals a or b, but the end result would be the same. So, that would be an inconsequential or equivalent mutation. In such a case, there may be no additional testcase that we can write to keep Pitest happy. We have to be aware of this possible outcome when using this technique.

Finally, let's look at our build file that ran Spock, Jacoco and Pitest:

plugins {
    id 'info.solidsoft.pitest' version '1.7.4'
}
apply plugin: 'groovy'

repositories {
    mavenCentral()
}

dependencies {
    implementation "org.apache.groovy:groovy-test-junit5:4.0.3"
    testImplementation("org.spockframework:spock-core:2.2-M3-groovy-4.0") {
        transitive = false
    }
}

pitest {
    junit5PluginVersion = '1.0.0'
    pitestVersion = '1.9.2'
    timestampedReports = false
    targetClasses = ['util.*']
}

tasks.named('test') {
    useJUnitPlatform()
}

The astute reader might note some subtle hints which show that the latest Spock versions run on top of the JUnit 5 platform.

Using Property-based Testing

Property-based testing is another technology which probably deserves much more attention. Here we'll use jqwik, which runs on top of JUnit5, but you might also like to consider Genesis, which provides random generators and especially targets Spock.

Earlier, we looked at writing more tests to make our coverage stronger. Property-based testing can often lead to writing fewer tests. Instead, we generate many random tests automatically and see whether certain properties hold.

Previously, we fed in the inputs and the expected output. For property-based testing, the inputs are typically randomly-generated values, so we don't know the expected output in advance. Instead of testing directly against some known output, we'll just check various properties of the answer.

As an example, here is a test we could use:

@Property
void "result should be bigger than any individual and smaller than sum of all"(
        @ForAll @IntRange(min = 0, max = 1000) Integer a,
        @ForAll @IntRange(min = 0, max = 1000) Integer b,
        @ForAll @IntRange(min = 0, max = 1000) Integer c) {
    def result = sumBiggestPair(a, b, c)
    assert [a, b, c].every { individual -> result >= individual }
    assert result <= a + b + c
}

The @ForAll annotations indicate places where jqwik will insert random values. The @IntRange annotation indicates that we want the random values to be contained between 0 and 1000.

Here we are checking that (at least for small positive numbers) adding the two biggest numbers should be greater than or equal to any individual number and should be less than or equal to adding all three of the numbers. These are necessary but insufficient properties to ensure our system works.

When we run this we see the following output in the logs:

                              |--------------------jqwik--------------------
tries = 1000                  | # of calls to property
checks = 1000                 | # of not rejected calls
generation = RANDOMIZED       | parameters are randomly generated
after-failure = PREVIOUS_SEED | use the previous seed
when-fixed-seed = ALLOW       | fixing the random seed is allowed
edge-cases#mode = MIXIN       | edge cases are mixed in
edge-cases#total = 125        | # of all combined edge cases
edge-cases#tried = 117        | # of edge cases tried in current run
seed = -311315135281003183    | random seed to reproduce generated values

So, we wrote 1 test and 1000 testcases were executed. The number of tests run is configurable; we won't go into the details here. This looks great at first glance. It turns out, however, that this particular property is not very discriminating in terms of the bugs it can find. This test passes for both our original flawed algorithm and the fixed one. Let's try a different property:

@Property
void "sum of any pair should not be greater than result"(
        @ForAll @IntRange(min = 0, max = 1000) Integer a,
        @ForAll @IntRange(min = 0, max = 1000) Integer b,
        @ForAll @IntRange(min = 0, max = 1000) Integer c) {
    def result = sumBiggestPair(a, b, c)
    assert [a + b, b + c, c + a].every { sumOfPair -> result >= sumOfPair }
}

If we calculate the biggest pair, then surely it must be greater than or equal to any arbitrary pair. Trying this on our flawed algorithm gives: 

org.codehaus.groovy.runtime.powerassert.PowerAssertionError:
    assert [a + b, b + c, c + a].every { sumOfPair -> result >= sumOfPair }
            | | |  | | |  | | |  |
            1 1 0  0 2 2  2 3 1  false

                              |--------------------jqwik--------------------
tries = 12                    | # of calls to property
checks = 12                   | # of not rejected calls
generation = RANDOMIZED       | parameters are randomly generated
after-failure = PREVIOUS_SEED | use the previous seed
when-fixed-seed = ALLOW       | fixing the random seed is allowed
edge-cases#mode = MIXIN       | edge cases are mixed in
edge-cases#total = 125        | # of all combined edge cases
edge-cases#tried = 2          | # of edge cases tried in current run
seed = 4830696361996686755    | random seed to reproduce generated values

Shrunk Sample (6 steps)
-----------------------
  arg0: 1
  arg1: 0
  arg2: 2

Original Sample
---------------
  arg0: 247
  arg1: 32
  arg2: 267

Original Error
--------------
org.codehaus.groovy.runtime.powerassert.PowerAssertionError:
    assert [a + b, b + c, c + a].every { sumOfPair -> result >= sumOfPair }
            | | |  | | |  | | |  |
            | | 32 32| 267| | |  false
            | 279    299  | | 247
            247           | 514
                          267

Not only did it find a case which highlighted the flaw, but it shrunk it down to a very simple example. On our fixed algorithm, the 1000 tests pass!

The previous property can be refactored a little to calculate all three pairs and then find the maximum of those. This simplifies the condition somewhat:

@Property
void "result should be the same as alternative oracle implementation"(
        @ForAll @IntRange(min = 0, max = 1000) Integer a,
        @ForAll @IntRange(min = 0, max = 1000) Integer b,
        @ForAll @IntRange(min = 0, max = 1000) Integer c) {
    assert sumBiggestPair(a, b, c) == [a+b, a+c, b+c].max()
}

This approach, where an alternative implementation is used, is known as a test oracle. The alternative implementation might be less efficient, so not ideal for production code, but fine for testing. When revamping or replacing some software, the oracle might be the existing system. When run on our fixed algorithm, we again have 1000 testcases passing.

Let's go one step further and remove our @IntRange boundaries on the Integers:

@Property
void "result should be the same as alternative oracle implementation"(@ForAll Integer a, @ForAll Integer b, @ForAll Integer c) {
    assert sumBiggestPair(a, b, c) == [a+b, a+c, b+c].max()
}

When we run the test now, we might be surprised:

  org.codehaus.groovy.runtime.powerassert.PowerAssertionError:
    assert sumBiggestPair(a, b, c) == [a+b, a+c, b+c].max()
           |              |  |  |  |   |||  |||  |||  |
           -2147483648    0  1  |  |   0|1  0||  1||  2147483647
                                |  |    1    ||   |2147483647
                                |  false     ||   -2147483648
                                2147483647   |2147483647
                                             2147483647
Shrunk Sample (13 steps)
------------------------
  arg0: 0
  arg1: 1
  arg2: 2147483647

It fails! Is this another bug in our algorithm? Possibly. But it could equally be a bug in our property test. Further investigation is warranted.

It turns out that our algorithm suffers from integer overflow when trying to add 1 to Integer.MAX_VALUE. Our test partially suffers from the same problem, but when we call max(), the negative value will be discarded. There is no always-correct answer as to what should happen in this scenario. We go back to the customer and check the real requirement. In this case, let's assume the customer was happy for the overflow to occur, since that is what would happen if performing the operation long-hand in Java. With that knowledge, we should fix our test to at least pass correctly when overflow occurs.

We have a number of options to fix this. We already saw previously we can use @IntRange. This is one way to "avoid" the problem and we have a few similar approaches which do the same. We could use a more confined data type, e.g. Short:

@Property
void checkShort(@ForAll Short a, @ForAll Short b, @ForAll Short c) {
    assert sumBiggestPair(a, b, c) == [a+b, a+c, b+c].max()
}

Or we could use a customised provider method:

@Property
void checkIntegerConstrainedProvider(@ForAll('halfMax') Integer a,
                                     @ForAll('halfMax') Integer b,
                                     @ForAll('halfMax') Integer c) {
    assert sumBiggestPair(a, b, c) == [a+b, a+c, b+c].max()
}

@Provide
Arbitrary<Integer> halfMax() {
    int halfMax = Integer.MAX_VALUE >> 1
    return Arbitraries.integers().between(-halfMax, halfMax)
}

But rather than avoiding the problem, we could change our test so that it allows for the possibility of overflow within sumBiggestPair but doesn't compound the problem with its own overflow. E.g. we could use Longs to do our calculations within our test:

@Property
void checkIntegerWithLongCalculations(@ForAll Integer a, @ForAll Integer b, @ForAll Integer c) {
    def (al, bl, cl) = [a, b, c]*.toLong()
    assert sumBiggestPair(a, b, c) == [al+bl, al+cl, bl+cl].max().toInteger()
}

Finally, let's again look at our Gradle build file:

apply plugin: 'groovy'

repositories {
    mavenCentral()
}

dependencies {
    testImplementation project(':SumBiggestPair')
    testImplementation "org.apache.groovy:groovy-test-junit5:4.0.3"
    testImplementation "net.jqwik:jqwik:1.6.5"
}

test {
    useJUnitPlatform {
        includeEngines 'jqwik'
    }
}

More information

The examples in this blog post are excerpts from the following repo:

https://github.com/paulk-asert/property-based-testing

Versions used: Gradle 7.5, Groovy 4.0.3, jqwik 1.6.5, pitest 1.9.2, Spock 2.2-M3-groovy-4.0, Jacoco 0.8.8. Tested with JDK 8, 11, 17, 18.

There are many sites with valuable information about the technologies covered here. There are also some great books. Books on Spock include: Spock: Up and Running, Java Testing with Spock, and Spocklight Notebook. Books on Groovy include: Groovy in Action and Learning Groovy 3. If you want general information about using Java and Groovy together, consider Making Java Groovy. And there's a section on mutation testing in Practical Unit Testing With Testng And Mockito. The most recent book for property testing is for the Erlang and Elixir languages.

Conclusion

We have looked at testing Java code using Groovy and Spock with some additional tools like Jacoco, jqwik and Pitest. Generally, using Groovy to test Java is a straightforward experience. Groovy also lends itself to writing testing DSLs which allow non-hard-core programmers to write very simple looking tests; but that's a topic for another blog!


Parsing JSON with Groovy

by paulk


Posted on Sunday July 10, 2022 at 02:00PM in Technology


Groovy has excellent support for processing a range of structured data formats like JSON, TOML, YAML, etc. This blog post looks at JSON.

There is quite good documentation on this topic as part of the Groovy documentation. There are also numerous online sources for more details including Groovy - JSON tutorial, Working with JSON in Groovy, and Groovy Goodness: Relax... Groovy Will Parse Your Wicked JSON to name just a few. This post does a quick summary and provides more setup information and details about various options.

Batteries included experience

If you have installed the Groovy installation zip (or .msi on Windows), you will have the groovy-json module which includes JsonSlurper, so the bulk of the examples shown here and in the other mentioned links should work out of the box.

JsonSlurper is the main class for parsing JSON.

[image: JsonSlurper example in the GroovyConsole]

This example shows parsing JSON embedded in a string but there are other methods for parsing files, URLs and other streams.
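As a small sketch of those variants (the file path and URL here are placeholders):

def slurper = new groovy.json.JsonSlurper()
def fromFile = slurper.parse(new File('data.json'))
def fromUrl = slurper.parse(new URL('https://example.com/data.json'))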

Another example using groovysh:

paulk@pop-os:~$ groovysh
Groovy Shell (4.0.3, JVM: 18.0.1)
Type ':help' or ':h' for help.
--------------------------------------------------------------------------------------------------
groovy:000> new groovy.json.JsonSlurper().parseText('{ "myList": [1, 3, 5] }').myList
===> [1, 3, 5]

Or using a Jupyter/BeakerX notebook:

[image: JsonSlurper example in a Jupyter/BeakerX notebook]

Similarly, if you point your IDE to your Groovy distribution, you should be able to run the examples directly.

Gradle

If you are using a build tool like Gradle, you may prefer to reference your dependencies from a dependency repository rather than having a locally installed distribution.

Suppose you have the following test using JsonSlurper in the file src/test/groovy/JsonTest.groovy:

import groovy.json.JsonSlurper
import org.junit.Test

class JsonTest {
    @Test
    void testJson() {
        def text = '{"person":{"name":"Guillaume","age":33,"pets":["dog","cat"]}}'
        def json = new JsonSlurper().parseText(text)
        assert json.person.pets.size() == 2
    }
}

You can reference the relevant Groovy dependencies, in our case groovy-json and groovy-test, in a build.gradle file like below:

apply plugin: 'groovy'

repositories {
    mavenCentral()
}

dependencies {
    testImplementation "org.apache.groovy:groovy-json:4.0.3" // for JsonSlurper
    testImplementation "org.apache.groovy:groovy-test:4.0.3" // for tests
}

Both these artifacts bring in the core groovy artifact transitively, so there's no need to reference that explicitly.

Running gradle test should run the tests and produce a report:

[image: test results for class JsonTest]

You can if you prefer, use the groovy-all artifact like this:

apply plugin: 'groovy'

repositories {
    mavenCentral()
}

dependencies {
    testImplementation "org.apache.groovy:groovy-all:4.0.3"
}

This artifact contains no jars but has all of the common Groovy modules as transitive dependencies.

[Note: In early Groovy 4 versions you may have needed to reference Groovy as a platform, e.g.: testImplementation platform("org.apache.groovy:groovy-all:4.0.1"). This is now only required when using the groovy-bom artifact.]

Maven

When using the Maven build tool, you would instead create a pom.xml like this and make use of two plugins: gmavenplus-plugin and maven-surefire-plugin:

<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>myGroupId</groupId>
<artifactId>groovy-json-maven</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<build>
<plugins>
<plugin>
<groupId>org.codehaus.gmavenplus</groupId>
<artifactId>gmavenplus-plugin</artifactId>
<version>1.13.0</version>
<executions>
<execution>
<goals>
<goal>compileTests</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>3.0.0-M7</version>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.groovy</groupId>
<artifactId>groovy-json</artifactId>
<version>4.0.3</version>
</dependency>
<dependency>
<groupId>org.apache.groovy</groupId>
<artifactId>groovy-test</artifactId>
<version>4.0.3</version>
</dependency>
</dependencies>
</project>

Alternatively, you could once again reference the groovy-all artifact as per this alternate build file:

<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>myGroupId</groupId>
<artifactId>groovy-json-maven</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<build>
<plugins>
<plugin>
<groupId>org.codehaus.gmavenplus</groupId>
<artifactId>gmavenplus-plugin</artifactId>
<version>1.13.0</version>
<executions>
<execution>
<goals>
<goal>compileTests</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>3.0.0-M7</version>
<dependencies>
<dependency>
<groupId>org.apache.maven.surefire</groupId>
<artifactId>surefire-junit47</artifactId>
<version>3.0.0-M7</version>
</dependency>
</dependencies>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.groovy</groupId>
<artifactId>groovy-all</artifactId>
<type>pom</type>
<version>4.0.3</version>
</dependency>
</dependencies>
</project>

When referencing the groovy-all artifact, we specify that it is a pom artifact using "<type>pom</type>". We also needed to configure the surefire plugin to use JUnit4; the groovy-all artifact also brings in JUnit5 support, and the surefire plugin would otherwise use that by default and not find our JUnit4 test.

Running the test should yield:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running JsonTest
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.36 s - in JsonTest

Advanced features

To cater for different scenarios, JsonSlurper is powered by several internal implementation classes. You don't access these classes directly but rather set a parser type when instantiating your slurper.

Type               When to use
CHAR_BUFFER        Default, least-surprise parser with eager parsing of ints, dates, etc.
INDEX_OVERLAY      For REST calls, WebSocket messages, AJAX, inter-process communication. Fastest parser which uses indexes into some existing char buffer.
CHARACTER_SOURCE   For handling larger JSON files.
LAX                Allows comments and no quotes or single quotes in numerous situations.

Here is an example:

import groovy.json.JsonSlurper
import static groovy.json.JsonParserType.*

def slurper = new JsonSlurper(type: LAX)
def json = slurper.parseText('''{person:{'name':"Guillaume","age":33,"pets":["dog" /* ,"cat" */]}}''')
assert json.person.pets == ['dog']

Note the missing quotes for the person key, the single quotes for the name key and "cat" has been commented out. These changes wouldn't be allowed by a strict JSON parser.

Other JSON libraries

Groovy doesn't require you to use the groovy-json classes. You can use your favourite Java library with Groovy. You'll still benefit from many of Groovy's short-hand notations.

Here's an example using Gson:

@Grab('com.google.code.gson:gson:2.9.0')
import com.google.gson.JsonParser

def parser = new JsonParser()
def json = parser.parse('{"person":{"name":"Guillaume","age":33,"pets":["dog","cat"]}}')
assert json.person.pets*.asString == ['dog', 'cat']

Here's an example using the Jackson JSON support:

@Grab('com.fasterxml.jackson.core:jackson-databind:2.13.3')
import com.fasterxml.jackson.databind.ObjectMapper

def text = '{"person":{"name":"Guillaume","age":33,"pets":["dog","cat"]}}'
def json = new ObjectMapper().readTree(text)
assert json.person.pets*.asText() == ['dog', 'cat']

Integrated query

Groovy 4 also supports language integrated query syntax, known as GINQ or GQuery. We can use that with JSON too.

Suppose we have information in JSON format about fruits, their prices (per 100g) and the concentration of vitamin C (per 100g):

{
    "prices": [
        {"name": "Kakuda plum", "price": 13},
        {"name": "Camu camu", "price": 25},
        {"name": "Acerola cherries", "price": 39},
        {"name": "Guava", "price": 2.5},
        {"name": "Kiwifruit", "price": 0.4},
        {"name": "Orange", "price": 0.4}
    ],
    "vitC": [
        {"name": "Kakuda plum", "conc": 5300},
        {"name": "Camu camu", "conc": 2800},
        {"name": "Acerola cherries", "conc": 1677},
        {"name": "Guava", "conc": 228},
        {"name": "Kiwifruit", "conc": 144},
        {"name": "Orange", "conc": 53}
    ]
}

Now, suppose we are on a budget and want to select the most cost-effective fruits to buy to help us achieve our daily vitamin C requirements. We join the prices and vitC information and order by most cost-effective fruit. We’ll select the top 2 in case our first choice isn’t in stock when we go shopping. Our GQuery processing looks like this:

def jsonFile = new File('fruit.json')
def json = new JsonSlurper().parse(jsonFile)
assert GQ {
    from p in json.prices
    join c in json.vitC on c.name == p.name
    orderby c.conc / p.price in desc
    limit 2
    select p.name
}.toList() == ['Kakuda plum', 'Kiwifruit']

We can see that, for this data, Kakuda plums followed by Kiwifruit are our best choices.

Quick performance comparison

As a very crude measure of performance, JsonSlurper with all 4 parser types as well as Gson and Jackson were used to parse the timezone values from https://github.com/flowcommerce/json-reference and check that the current timezone in Brisbane is the same as the timezone in Sydney. The JSON file is by no means huge. It has just under 3000 lines and is under 60K in size. The best time (including compilation time) after 4 runs was taken - definitely a micro-benchmark which shouldn't be taken too seriously, but it might be a rough guide. Just for fun, a native version of the Groovy JsonSlurper script with the type set to INDEX_OVERLAY was made using GraalVM. Its timings are included too.

$ time groovy GroovyJsonIndexOverlay.groovy
real    0m1.365s
user    0m4.157s
sys     0m0.145s

$ time groovy GroovyJsonCharacterSource.groovy
real    0m1.447s
user    0m4.472s
sys     0m0.174s

$ time groovy GroovyJsonLax.groovy
real    0m1.452s
user    0m4.338s
sys     0m0.171s

$ time groovy GroovyJson.groovy
real    0m1.383s
user    0m4.050s
sys     0m0.165s

$ time groovy Gson.groovy
real    0m1.814s
user    0m5.543s
sys     0m0.209s

$ time groovy Jackson.groovy
real    0m2.007s
user    0m6.332s
sys     0m0.208s

$ time ./groovyjsonindexoverlay
real    0m0.015s
user    0m0.011s
sys     0m0.004s

Summary

We have seen the basics of setting up our projects to parse JSON using Groovy and some of the numerous options available to use depending on the scenario. We also saw how to use other JSON libraries, utilize GQuery syntax during our processing, and looked at some very crude performance figures.



Classifying Iris Flowers with Deep Learning, Groovy and GraalVM

by paulk


Posted on Saturday June 25, 2022 at 10:52AM in Technology


A classic data science dataset captures the flower characteristics of Iris flowers. It records the width and length of the sepals and petals for three species (Setosa, Versicolor, and Virginica).

The Iris project in the groovy-data-science repo is dedicated to this example. It includes a number of Groovy scripts and a Jupyter/BeakerX notebook highlighting this example comparing and contrasting various libraries and various classification algorithms.

Technologies/libraries covered
Data manipulation: Weka, Tablesaw, Encog, JSAT, Datavec, Tribuo
Classification: Weka, Smile, Encog, Tribuo, JSAT, Deep Learning4J, Deep Netts
Visualization: XChart, Tablesaw Plot.ly, JavaFX
Main aspects/algorithms covered: reading CSV, dataframes, visualization, exploration, naive bayes, logistic regression, knn regression, softmax regression, decision trees, support vector machine
Other aspects/algorithms covered: neural networks, multilayer perceptron, PCA

Feel free to browse these other examples and the Jupyter/BeakerX notebook if you are interested in any of these additional techniques.

[image: the Iris Jupyter/BeakerX notebook]

For this blog, let's just look at the Deep Learning examples. We'll look at solutions using Encog, Eclipse DeepLearning4J and Deep Netts (with standard Java and as a native image using GraalVM), but first a brief introduction.

Deep Learning

Deep learning falls under the branches of machine learning and artificial intelligence. It involves multiple layers (hence the "deep") of an artificial neural network. There are lots of ways to configure such networks and the details are beyond the scope of this blog post, but we can give some basic details. We will have four input nodes corresponding to the measurements of our four characteristics. We will have three output nodes corresponding to each possible class (species). We will also have one or more additional layers in between.

[image: deep neural network with four input nodes, hidden layers and three output nodes]

Each node in this network mimics to some degree a neuron in the human brain. Again, we'll simplify the details. Each node has multiple inputs, which are given a particular weight, as well as an activation function which will determine whether our node "fires". Training the model is a process which works out what the best weights should be.

[image: a single node — weighted inputs feeding an activation function]

The math involved for converting inputs to output for any node isn't too hard. We could write it ourselves (as shown here using matrices and Apache Commons Math for a digit recognition example) but luckily we don't have to. The libraries we are going to use do much of the work for us. They typically provide a fluent API which lets us specify, in a somewhat declarative way, the layers in our network.
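As a rough illustration of that math (our own sketch, not code from any of the libraries below), a node's output is just an activation function applied to the weighted sum of its inputs plus a bias:

// hypothetical single node using a tanh activation function
double nodeOutput(List<Double> inputs, List<Double> weights, double bias) {
    def sum = bias
    inputs.eachWithIndex { x, i -> sum += x * weights[i] }
    Math.tanh(sum) // "fires" strongly for sums well away from zero
}

assert nodeOutput([5.1, 3.5, 1.4, 0.2], [0.1, -0.2, 0.3, 0.4], 0.5) > 0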

Just before exploring our examples, we should pre-warn folks that while we do time running the examples, no attempt was made to rigorously ensure that the examples were identical across the different technologies. The different technologies support slightly different ways to set up their respective network layers. The parameters were tweaked so that when run there was typically at most one or two errors in the validation. Also, the initial parameters for the runs can be set with random or pre-defined seeds. When random ones are used, each run will have slightly different errors. We'd need to do some additional alignment of the examples and use a framework like JMH if we wanted a more rigorous time comparison between the technologies. Nevertheless, it should give a very rough guide as to the speed of the various technologies.

Encog

Encog is a pure Java machine learning framework that was created in 2008. There is also a C# port for .Net users. Encog is a simple framework that supports a number of advanced algorithms not found elsewhere but isn't as widely used as other more recent frameworks.

The complete source code for our Iris classification example using Encog is here, but the critical piece is:

def model = new EncogModel(data).tap {
selectMethod(data, TYPE_FEEDFORWARD)
report = new ConsoleStatusReportable()
data.normalize()
holdBackValidation(0.3, true, 1001) // test with 30%
selectTrainingType(data)
}

def bestMethod = model.crossvalidate(5, true) // 5-fold cross-validation

println "Training error: " + pretty(calculateRegressionError(bestMethod, model.trainingDataset)) println "Validation error: " + pretty(calculateRegressionError(bestMethod, model.validationDataset))

When we run the example, we see:

paulk@pop-os:/extra/projects/iris_encog$ time groovy -cp "build/lib/*" IrisEncog.groovy 
1/5 : Fold #1
1/5 : Fold #1/5: Iteration #1, Training Error: 1.43550735, Validation Error: 0.73302237
1/5 : Fold #1/5: Iteration #2, Training Error: 0.78845427, Validation Error: 0.73302237
...
5/5 : Fold #5/5: Iteration #163, Training Error: 0.00086231, Validation Error: 0.00427126
5/5 : Cross-validated score:0.10345818553910753
Training error:  0.0009
Validation error:  0.0991
Prediction errors:
predicted: Iris-virginica, actual: Iris-versicolor, normalized input: -0.0556, -0.4167,  0.3898,  0.2500
Confusion matrix:            Iris-setosa     Iris-versicolor      Iris-virginica
         Iris-setosa                  19                   0                   0
     Iris-versicolor                   0                  15                   1
      Iris-virginica                   0                   0                  10

real	0m3.073s
user	0m9.973s
sys	0m0.367s

We won't explain all of the stats, but it basically says we have a pretty good model with low errors in prediction. If you look at the green and purple points in the notebook image earlier in this blog, you'll see there are some points which are going to be hard to predict correctly all the time. The confusion matrix shows that the model predicted one flower incorrectly on the validation dataset.

One very nice aspect of this library is that it is a single jar dependency!
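
For instance, a Gradle build could declare it in one line (the version shown here is an assumption; check Maven Central for the latest):

dependencies {
    implementation 'org.encog:encog-core:3.4' // Encog's single jar
}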

Eclipse DeepLearning4j

Eclipse DeepLearning4j is a suite of tools for running deep learning on the JVM. It has support for scaling up to Apache Spark as well as some integration with Python at a number of levels. It also provides integration with GPUs and C++ libraries for native performance.

The complete source code for our Iris classification example using DeepLearning4J is here, with the main part shown below:

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(seed)
.activation(Activation.TANH) // global activation
.weightInit(WeightInit.XAVIER)
.updater(new Sgd(0.1))
.l2(1e-4)
.list()
.layer(new DenseLayer.Builder().nIn(numInputs).nOut(3).build())
.layer(new DenseLayer.Builder().nIn(3).nOut(3).build())
.layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
.activation(Activation.SOFTMAX) // override activation with softmax for this layer
.nIn(3).nOut(numOutputs).build())
.build()

def model = new MultiLayerNetwork(conf)
model.init()

model.listeners = new ScoreIterationListener(100)

1000.times { model.fit(train) }

def eval = new Evaluation(3)
def output = model.output(test.features)
eval.eval(test.labels, output)
println eval.stats()

When we run this example, we see:

paulk@pop-os:/extra/projects/iris_dl4j$ time groovy -cp "build/lib/*" IrisDl4j.groovy 
[main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
[main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 4
[main] INFO org.nd4j.nativeblas.Nd4jBlas - Number of threads used for OpenMP BLAS: 4
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
...
[main] INFO org.deeplearning4j.optimize.listeners.ScoreIterationListener - Score at iteration 0 is 0.9707752535968273
[main] INFO org.deeplearning4j.optimize.listeners.ScoreIterationListener - Score at iteration 100 is 0.3494968712782093
...
[main] INFO org.deeplearning4j.optimize.listeners.ScoreIterationListener - Score at iteration 900 is 0.03135504326480282

========================Evaluation Metrics========================
 # of classes:    3
 Accuracy:        0.9778
 Precision:       0.9778
 Recall:          0.9744
 F1 Score:        0.9752
Precision, recall & F1: macro-averaged (equally weighted avg. of 3 classes)


=========================Confusion Matrix=========================
  0  1  2
----------
 18  0  0 | 0 = 0
  0 14  0 | 1 = 1
  0  1 12 | 2 = 2

Confusion matrix format: Actual (rowClass) predicted as (columnClass) N times
==================================================================

real	0m5.856s
user	0m25.638s
sys	0m1.752s

Again the stats tell us that the model is good. One error in the confusion matrix for our testing dataset.

DeepLearning4J does have an impressive range of technologies that can be used to enhance performance in certain scenarios. For this example, I enabled AVX (Advanced Vector Extensions) support but didn't try using the CUDA/GPU support nor make use of any Apache Spark integration. The GPU option might have sped up the application but given the size of the dataset and the amount of calculations needed to train our network, it probably wouldn't have sped up much. For this little example, the overheads of putting the plumbing in place to access native C++ implementations and so forth outweighed the gains. Those features generally come into their own for much larger datasets or massive amounts of calculations; tasks like intensive video processing spring to mind.

The downside of the impressive scaling options is the added complexity. The code was slightly more complex than for the other technologies we look at in this blog, due to certain assumptions in the API which cater for Spark integration, even though we didn't use it here. The good news is that once that work is done, if we did want to use Spark, it would now be relatively straightforward.

The other increase in complexity is the number of jar files needed in the classpath. I went with the easy option of using the nd4j-native-platform dependency and added the org.nd4j:nd4j-native:1.0.0-M2:linux-x86_64-avx2 dependency for AVX support. This made my life easy but brought in over 170 jars, including many for unneeded platforms. Having all those jars is great if users of other platforms want to also try the example, but it can be a little troublesome with certain tooling that breaks with long command lines on certain platforms. I could certainly do some more work to shrink the dependency list if it became a real problem.

[For the interested reader, the groovy-data-science repo has other DeepLearning4J examples. The Weka library can wrap DeepLearning4J as shown for this Iris example here. There are also two variants of the digit recognition example we alluded to earlier using one and two layer neural networks.]

Deep Netts

Deep Netts is a company offering a range of products and services related to deep learning. Here we are using the free, open-source Deep Netts community edition, a pure Java deep learning library. It provides support for the Java Visual Recognition API (JSR 381). The expert group from JSR 381 released their final spec earlier this year, so hopefully we'll see more compliant implementations soon.

The complete source code for our Iris classification example using Deep Netts is here and the important part is below:

var splits = dataSet.split(0.7d, 0.3d)  // 70/30% split
var train = splits[0]
var test = splits[1]

var neuralNet = FeedForwardNetwork.builder()
.addInputLayer(numInputs)
.addFullyConnectedLayer(5, ActivationType.TANH)
.addOutputLayer(numOutputs, ActivationType.SOFTMAX)
.lossFunction(LossType.CROSS_ENTROPY)
.randomSeed(456)
.build()

neuralNet.trainer.with {
maxError = 0.04f
learningRate = 0.01f
momentum = 0.9f
optimizer = OptimizerType.MOMENTUM
}

neuralNet.train(train)

new ClassifierEvaluator().with {
println "CLASSIFIER EVALUATION METRICS\n${evaluate(neuralNet, test)}"
println "CONFUSION MATRIX\n$confusionMatrix"
}

When we run this script, we see:

paulk@pop-os:/extra/projects/iris_graalvm$ time groovy -cp "build/lib/*" Iris.groovy 
16:49:27.089 [main] INFO deepnetts.core.DeepNetts - ------------------------------------------------------------------------
16:49:27.091 [main] INFO deepnetts.core.DeepNetts - TRAINING NEURAL NETWORK
16:49:27.091 [main] INFO deepnetts.core.DeepNetts - ------------------------------------------------------------------------
16:49:27.100 [main] INFO deepnetts.core.DeepNetts - Epoch:1, Time:6ms, TrainError:0.8584314, TrainErrorChange:0.8584314, TrainAccuracy: 0.5252525
16:49:27.103 [main] INFO deepnetts.core.DeepNetts - Epoch:2, Time:3ms, TrainError:0.52278274, TrainErrorChange:-0.33564866, TrainAccuracy: 0.52820516
...
16:49:27.911 [main] INFO deepnetts.core.DeepNetts - Epoch:3031, Time:0ms, TrainError:0.029988592, TrainErrorChange:-0.015680967, TrainAccuracy: 1.0
TRAINING COMPLETED
16:49:27.911 [main] INFO deepnetts.core.DeepNetts - Total Training Time: 820ms
16:49:27.911 [main] INFO deepnetts.core.DeepNetts - ------------------------------------------------------------------------
CLASSIFIER EVALUATION METRICS
Accuracy: 0.95681506 (How often is classifier correct in total)
Precision: 0.974359 (How often is classifier correct when it gives positive prediction)
F1Score: 0.974359 (Harmonic average (balance) of precision and recall)
Recall: 0.974359 (When it is actually positive class, how often does it give positive prediction)

CONFUSION MATRIX
                           none     Iris-setosa Iris-versicolor  Iris-virginica
           none              0              0              0              0
    Iris-setosa              0             14              0              0
Iris-versicolor              0              0             18              1
 Iris-virginica              0              0              0             12


real	0m3.160s
user	0m10.156s
sys	0m0.483s

This is faster than DeepLearning4j and similar to Encog. This is to be expected given our small data set and isn't indicative of performance for larger problems.

Another plus is the dependency list. It isn't quite the single jar situation as we saw with Encog but not far off. There is the Deep Netts jar, the JSR 381 VisRec API jar, and a handful of logging jars.

Deep Netts with GraalVM

Another technology we might want to consider if performance is important to us is GraalVM. GraalVM is a high-performance JDK distribution designed to speed up the execution of applications written in Java and other JVM languages. We'll look at creating a native version of our Iris Deep Netts application. We used GraalVM 22.1.0 Java 17 CE and Groovy 4.0.3. We'll cover just the basic steps but there are other places for additional setup info and troubleshooting help like here, here and here.

Groovy has two natures. Its dynamic nature supports adding methods at runtime through metaprogramming and interacting with method dispatch processing through missing method interception and other tricks. Some of these tricks make heavy use of reflection and dynamic class loading and cause problems for GraalVM, which tries to determine as much information as it can at compile time. Groovy's static nature has a more limited set of metaprogramming capabilities but allows bytecode much closer to Java's to be produced. Luckily, we aren't relying on any dynamic Groovy tricks for our example. We'll compile it up using static mode:


paulk@pop-os:/extra/projects/iris_graalvm$ groovyc -cp "build/lib/*" --compile-static Iris.groovy
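
As an alternative to the command-line flag, we could annotate classes or methods with @CompileStatic for the same effect (a sketch only; the linked example isn't necessarily organised this way):

import groovy.transform.CompileStatic

@CompileStatic // static nature: GraalVM-friendly bytecode
class Iris {
    static void main(String[] args) {
        // ... the body of the script as before ...
    }
}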

Next we build our native application:


paulk@pop-os:/extra/projects/iris_graalvm$ native-image --report-unsupported-elements-at-runtime \
    --initialize-at-run-time=groovy.grape.GrapeIvy,deepnetts.net.weights.RandomWeights \
    --initialize-at-build-time --no-fallback -H:ConfigurationFileDirectories=conf/ -cp ".:build/lib/*" Iris

We told GraalVM to initialize GrapeIvy at runtime (to avoid needing Ivy jars in the classpath since Groovy will lazily load those classes only if we use @Grab statements). We also did the same for the RandomWeights class to avoid it being locked into a random seed fixed at compile time.

Now we are ready to run our application:


paulk@pop-os:/extra/projects/iris_graalvm$ time ./iris
...
CLASSIFIER EVALUATION METRICS
Accuracy: 0.93460923 (How often is classifier correct in total)
Precision: 0.96491224 (How often is classifier correct when it gives positive prediction)
F1Score: 0.96491224 (Harmonic average (balance) of precision and recall)
Recall: 0.96491224 (When it is actually positive class, how often does it give positive prediction)

CONFUSION MATRIX
                           none     Iris-setosa Iris-versicolor  Iris-virginica
           none               0               0               0               0
    Iris-setosa               0              21               0               0
Iris-versicolor               0               0              20               2
 Iris-virginica               0               0               0              17

real	0m0.131s
user	0m0.096s
sys	0m0.029s

We can see here that the speed has dramatically increased. This is great, but we should note that using GraalVM often involves some tricky investigation, especially for Groovy which is dynamic by default. There are a few features of Groovy which won't be available when using Groovy's static nature, and some libraries might be problematic. As an example, Deep Netts has log4j2 as one of its dependencies. At the time of writing, there are still issues using log4j2 with GraalVM. We excluded the log4j-core dependency and used log4j-to-slf4j backed by logback-classic to sidestep this problem.

[Update: I put the Deep Netts GraalVM iris application with some more detailed instructions into its own subproject.]

Conclusion

We have seen a few different libraries for performing deep learning classification using Groovy. Each has its own strengths and weaknesses. There are certainly options to cater for folks wanting blinding fast startup speeds through to options which scale to massive computing farms in the cloud.


Using Groovy with Apache Wayang and Apache Spark

by paulk


Posted on Sunday June 19, 2022 at 01:01PM in Technology


wayang.png

Apache Wayang (incubating) is an API for big data cross-platform processing. It provides an abstraction over other platforms like Apache Spark and Apache Flink as well as a default built-in stream-based "platform". The goal is to provide a consistent developer experience when writing code regardless of whether a light-weight or highly-scalable platform may eventually be required. Execution of the application is specified in a logical plan which is again platform agnostic. Wayang will transform the logical plan into a set of physical operators to be executed by specific underlying processing platforms.

Whiskey Clustering

groovy.png

We'll take a look at using Apache Wayang with Groovy to help us in the quest to find the perfect single-malt Scotch whiskey. The whiskies produced from 86 distilleries have been ranked by expert tasters according to 12 criteria (Body, Sweetness, Malty, Smoky, Fruity, etc.). We'll use a KMeans algorithm to calculate the centroids. This is similar to the KMeans example in the Wayang documentation, but instead of 2 dimensions (x and y coordinates), we have 12 dimensions corresponding to our criteria. The main point is that it is illustrative of typical data science and machine learning algorithms involving iteration (the typical map, filter, reduce style of processing).

whiskey_bottles.jpg

KMeans is a standard data-science clustering technique. In our case, it groups whiskies with similar characteristics (according to the 12 criteria) into clusters. If we have a favourite whiskey, chances are we can find something similar by looking at other instances in the same cluster. If we are feeling like a change, we can look for a whiskey in some other cluster. The centroid is the notional "point" in the middle of the cluster. For us it reflects the typical measure of each criterion for a whiskey in that cluster.

Implementation Details

We'll start with defining a Point record:

record Point(double[] pts) implements Serializable {
static Point fromLine(String line) { new Point(line.split(',')[2..-1]*.toDouble() as double[]) }
}

We've made it Serializable (more on that later) and included a fromLine factory method to help us make points from a CSV file. We'll do that ourselves rather than rely on other libraries which could assist. It's not a 2D or 3D point for us but 12D corresponding to the 12 criteria. We just use a double array, so any dimension would be supported but the 12 comes from the number of columns in our data file.
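
As a quick illustration of fromLine (the values here are made up; we're assuming the real file has a row id and distillery name in the first two columns followed by the 12 criteria):

def p = Point.fromLine('86,SomeDistillery,2,3,1,0,0,2,1,2,2,2,2,2')
assert p.pts().size() == 12 // the first two columns are dropped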

We'll define a related TaggedPointCounter record. It's like a Point but tracks a cluster Id and count used when clustering the "points":

record TaggedPointCounter(double[] pts, int cluster, long count) implements Serializable {
TaggedPointCounter plus(TaggedPointCounter that) {
new TaggedPointCounter((0..<pts.size()).collect{ pts[it] + that.pts[it] } as double[], cluster, count + that.count)
}

TaggedPointCounter average() {
new TaggedPointCounter(pts.collect{ double d -> d/count } as double[], cluster, 0)
}
}

We have plus and average methods which will be helpful in the map/reduce parts of the algorithm.
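
To see how they combine (using 2D points here for brevity, though our real points are 12D):

def a = new TaggedPointCounter([1d, 3d] as double[], 0, 1)
def b = new TaggedPointCounter([3d, 5d] as double[], 0, 1)
def avg = a.plus(b).average()         // add the points, then divide by count
assert avg.pts().toList() == [2d, 4d] // the centroid of our two points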

Another aspect of the KMeans algorithm is assigning points to the cluster associated with their nearest centroid. For 2 dimensions, recalling Pythagoras' theorem, the distance would be the square root of x squared plus y squared, where x and y are the distance of a point from the centroid in the x and y dimensions respectively. We'll do the same across all dimensions and define the following helper class to capture this part of the algorithm:

class SelectNearestCentroid implements ExtendedSerializableFunction<Point, TaggedPointCounter> {
Iterable<TaggedPointCounter> centroids

void open(ExecutionContext context) {
centroids = context.getBroadcast("centroids")
}

TaggedPointCounter apply(Point p) {
def minDistance = Double.POSITIVE_INFINITY
def nearestCentroidId = -1
for (c in centroids) {
def distance = sqrt((0..<p.pts.size()).collect{ p.pts[it] - c.pts[it] }.sum{ it ** 2 } as double)
if (distance < minDistance) {
minDistance = distance
nearestCentroidId = c.cluster
}
}
new TaggedPointCounter(p.pts, nearestCentroidId, 1)
}
}

In Wayang parlance, the SelectNearestCentroid class is a UDF, a User-Defined Function. It represents some chunk of functionality where an optimization decision can be made about where to run the operation.

Once we get to using Spark, the classes in the map/reduce part of our algorithm will need to be serializable. Method closures in dynamic Groovy aren't serializable. We have a few options to avoid using them. I'll show one approach here which is to use some helper classes in places where we might typically use method references. Here are the helper classes:

class Cluster implements SerializableFunction<TaggedPointCounter, Integer> {
Integer apply(TaggedPointCounter tpc) { tpc.cluster() }
}

class Average implements SerializableFunction<TaggedPointCounter, TaggedPointCounter> {
TaggedPointCounter apply(TaggedPointCounter tpc) { tpc.average() }
}

class Plus implements SerializableBinaryOperator<TaggedPointCounter> {
TaggedPointCounter apply(TaggedPointCounter tpc1, TaggedPointCounter tpc2) { tpc1.plus(tpc2) }
}

Now we are ready for our KMeans script:

int k = 5
int iterations = 20

// read in data from our file
def url = WhiskeyWayang.classLoader.getResource('whiskey.csv').file
def pointsData = new File(url).readLines()[1..-1].collect{ Point.fromLine(it) }
def dims = pointsData[0].pts().size()

// create some random points as initial centroids
def r = new Random()
def initPts = (1..k).collect { (0..<dims).collect { r.nextGaussian() + 2 } as double[] }

// create planbuilder with Java and Spark enabled
def configuration = new Configuration()
def context = new WayangContext(configuration)
.withPlugin(Java.basicPlugin())
.withPlugin(Spark.basicPlugin())
def planBuilder = new JavaPlanBuilder(context, "KMeans ($url, k=$k, iterations=$iterations)")

def points = planBuilder
.loadCollection(pointsData).withName('Load points')

def initialCentroids = planBuilder
.loadCollection((0..<k).collect{ idx -> new TaggedPointCounter(initPts[idx], idx, 0) })
.withName("Load random centroids")

def finalCentroids = initialCentroids
.repeat(iterations, currentCentroids ->
points.map(new SelectNearestCentroid())
.withBroadcast(currentCentroids, "centroids").withName("Find nearest centroid")
.reduceByKey(new Cluster(), new Plus()).withName("Add up points")
.map(new Average()).withName("Average points")
.withOutputClass(TaggedPointCounter)).withName("Loop").collect()

println 'Centroids:'
finalCentroids.each { c ->
println "Cluster$c.cluster: ${c.pts.collect{ sprintf('%.3f', it) }.join(', ')}"
}

Here, k is the desired number of clusters, and iterations is the number of times to iterate through the KMeans loop. The pointsData variable is a list of Point instances loaded from our data file. We'd use the readTextFile method instead of loadCollection if our data set was large. The initPts variable holds some random starting positions for our initial centroids. Being random, and given the way the KMeans algorithm works, it is possible that some of our clusters may end up with no points assigned.
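
A sketch of that readTextFile alternative might look something like the following (assuming the header line starts with 'RowID' and remembering that, for Spark, the lambdas would need to be serializable as discussed earlier):

def points = planBuilder
    .readTextFile("file:$url").withName('Load points')
    .filter(line -> !line.startsWith('RowID')).withName('Skip header')
    .map(line -> Point.fromLine(line)).withName('Create points')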

Our algorithm works by assigning, at each iteration, all the points to their closest current centroid and then calculating the new centroids given those assignments. Finally, we output the results.

Running with the Java streams-backed platform

As we mentioned earlier, Wayang selects which platform(s) will run our application. It has numerous capabilities whereby cost functions and load estimators can be used to influence and optimize how the application is run. For our simple example, it is enough to know that even though we specified Java or Spark as options, Wayang knows that for our small data set, the Java streams option is the way to go.

Since we prime the algorithm with random data, we expect the results to be slightly different each time the script is run, but here is one output:

> Task :WhiskeyWayang:run
Centroids:
Cluster0: 2.548, 2.419, 1.613, 0.194, 0.097, 1.871, 1.742, 1.774, 1.677, 1.935, 1.806, 1.613
Cluster2: 1.464, 2.679, 1.179, 0.321, 0.071, 0.786, 1.429, 0.429, 0.964, 1.643, 1.929, 2.179
Cluster3: 3.250, 1.500, 3.250, 3.000, 0.500, 0.250, 1.625, 0.375, 1.375, 1.375, 1.250, 0.250
Cluster4: 1.684, 1.842, 1.211, 0.421, 0.053, 1.316, 0.632, 0.737, 1.895, 2.000, 1.842, 1.737 ...

Which if plotted looks like this:

WhiskeyWayang Centroid Spider Plot

If you are interested, check out the examples in the repo links at the end of this article to see the code for producing this centroid spider plot or the Jupyter/BeakerX notebook in this project's github repo.

Running with Apache Spark

spark.png

Given our small dataset size and no other customization, Wayang will choose the Java streams based solution. We could use Wayang optimization features to influence which processing platform it chooses, but to keep things simple, we'll just disable the Java streams platform in our configuration by making the following change in our code:

WhiskeyWayang_DisableJava.png
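
The gist of the change shown in the image is to stop registering the Java platform when creating the context, along these lines (a sketch):

def context = new WayangContext(configuration)
//      .withPlugin(Java.basicPlugin()) // disabled so Wayang must choose Spark
        .withPlugin(Spark.basicPlugin())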

Now when we run the application, the output will be something like this (a solution similar to before but with 1000+ extra lines of Spark and Wayang log information - truncated for presentation purposes):

[main] INFO org.apache.spark.SparkContext - Running Spark version 3.3.0
[main] INFO org.apache.spark.util.Utils - Successfully started service 'sparkDriver' on port 62081.
...
Centroids:
Cluster4: 1.414, 2.448, 0.966, 0.138, 0.034, 0.862, 1.000, 0.483, 1.345, 1.690, 2.103, 2.138
Cluster0: 2.773, 2.455, 1.455, 0.000, 0.000, 1.909, 1.682, 1.955, 2.091, 2.045, 2.136, 1.818
Cluster1: 1.762, 2.286, 1.571, 0.619, 0.143, 1.714, 1.333, 0.905, 1.190, 1.952, 1.095, 1.524
Cluster2: 3.250, 1.500, 3.250, 3.000, 0.500, 0.250, 1.625, 0.375, 1.375, 1.375, 1.250, 0.250
Cluster3: 2.167, 2.000, 2.167, 1.000, 0.333, 0.333, 2.000, 0.833, 0.833, 1.500, 2.333, 1.667
...
[shutdown-hook-0] INFO org.apache.spark.SparkContext - Successfully stopped SparkContext
[shutdown-hook-0] INFO org.apache.spark.util.ShutdownHookManager - Shutdown hook called

Discussion

A goal of Apache Wayang is to allow developers to write platform-agnostic applications. While this is mostly true, the abstractions aren't perfect. As an example, if I know I am only using the streams-backed platform, I don't need to worry about making any of my classes serializable (which is a Spark requirement). In our example, we could have omitted the "implements Serializable" part of the TaggedPointCounter record, and we could have used a method reference TaggedPointCounter::average instead of our Average helper class. This isn't meant to be a criticism of Wayang, after all if you want to write cross-platform UDFs, you might expect to have to follow some rules. Instead, it is meant to just indicate that abstractions often have leaks around the edges. Sometimes those leaks can be beneficially used, other times they are traps waiting for unknowing developers.

To summarise, if using the Java streams-backed platform, you can run the application on JDK17 (which uses native records) as well as JDK11 and JDK8 (where Groovy provides emulated records). Also, we could make numerous simplifications if we desired. When using the Spark processing platform, the potential simplifications aren't applicable, and we can run on JDK8 and JDK11 (Spark isn't yet supported on JDK17).

Conclusion

We have looked at using Apache Wayang to implement a KMeans algorithm that runs either backed by the JDK streams capabilities or by Apache Spark. The Wayang API hid from us some of the complexities of writing code that works on a distributed platform and some of the intricacies of dealing with the Spark platform. The abstractions aren't perfect but they certainly aren't hard to use and provide extra protection should we wish to move between platforms. As an added bonus, they open up numerous optimization possibilities.

Apache Wayang is an incubating project at Apache and still has work to do before it graduates, but lots of work has gone on previously (it was previously known as Rheem and was started in 2015). Platform-agnostic applications are a holy grail that has been desired for many years but is hard to achieve. It should be exciting to see how far Apache Wayang progresses in achieving this goal.

More Information

  • Repo containing the source code: WhiskeyWayang
  • Repo containing similar examples using a variety of libraries including Apache Commons CSV, Weka, Smile, Tribuo and others: Whiskey
  • A similar example using Apache Spark directly but with a built-in parallelized KMeans from the spark-mllib library rather than a hand-crafted algorithm: WhiskeySpark
  • A similar example using Apache Ignite directly but with a built-in clustered KMeans from the ignite-ml library rather than a hand-crafted algorithm: WhiskeyIgnite


GPars meets Virtual Threads

by paulk


Posted on Wednesday June 15, 2022 at 11:28AM in Technology


gpars-rgb.png

An exciting preview feature coming in JDK19 is Virtual Threads (JEP 425). In my experiments so far, virtual threads work well with my favourite Groovy parallel and concurrency library GPars. GPars has been around a while (since Java 5 and Groovy 1.8 days) but still has many useful features. Let's have a look at a few examples.

If you want to try these out, make sure you have a recent JDK19 (currently EA) and enable preview features with your Groovy tooling.

Parallel Collections

First a refresher, to use the GPars parallel collections feature with normal threads, use the GParsPool.withPool method as follows:

withPool {
assert [1, 2, 3].collectParallel{ it ** 2 } == [1, 4, 9]
}

For any Java readers, don't get confused with the collectParallel method name. Groovy's collect method (naming inspired by Smalltalk) is the equivalent of Java's map method. So, the equivalent Groovy code using the Java streams API would be something like:

assert [1, 2, 3].parallelStream().map(n -> n ** 2).collect(Collectors.toList()) == [1, 4, 9]

Now, let's bring virtual threads into the picture. Luckily, GPars parallel collection facilities provide a hook for using an existing custom executor service. This makes using virtual threads for such code easy:

withExistingPool(Executors.newVirtualThreadPerTaskExecutor()) {
assert [1, 2, 3].collectParallel{ it ** 2 } == [1, 4, 9]
}

Nice! But let's move on to some examples which might be less familiar to Java developers.

GPars has additional features for providing custom thread pools and the remaining examples rely on those features. The current version of GPars doesn't have a DefaultPool constructor that takes a vanilla executor service, so we'll write our own class:

@AutoImplement
class VirtualPool implements Pool {
private final ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()
int getPoolSize() { pool.poolSize }
void execute(Runnable task) { pool.execute(task) }
ExecutorService getExecutorService() { pool }
}

It is essentially a delegate from the GPars Pool interface to the virtual threads executor service.

We'll use this in the remaining examples.

Agents

Agents provide a thread-safe non-blocking wrapper around an otherwise potentially mutable shared state object. They are inspired by agents in Clojure.

In our case we'll use an agent to "protect" a plain ArrayList. For this simple case, we could have used some synchronized list, but in general, agents eliminate the need to find thread-safe implementation classes or indeed care at all about the thread safety of the underlying wrapped object.

def mutableState = []     // a non-synchronized mutable list
def agent = new Agent(mutableState)

agent.attachToThreadPool(new VirtualPool()) // omit line for normal threads

agent { it << 'Dave' } // one thread updates list
agent { it << 'Joe' } // another thread also updating
assert agent.val.size() == 2

Actors

Actors allow for a message passing-based concurrency model. The actor model ensures that at most one thread processes the actor's body at any time. The GPars API and DSLs for actors are quite rich supporting many features. We'll look at a simple example here.

GPars manages actor thread pools in groups. Let's create one backed by virtual threads:

def vgroup = new DefaultPGroup(new VirtualPool())

Now we can write an encrypting and decrypting actor pair as follows:

def decryptor = vgroup.actor {
loop {
react { String message ->
reply message.reverse()
}
}
}

def console = vgroup.actor {
decryptor << 'lellarap si yvoorG'
react {
println 'Decrypted message: ' + it
}
}

console.join() // output: Decrypted message: Groovy is parallel

Dataflow

Dataflow offers an inherently safe and robust declarative concurrency model. Dataflows are also managed via thread groups, so we'll use vgroup which we created earlier.

We have three logical tasks which can run in parallel and perform their work. The tasks need to exchange data and they do so using dataflow variables. Think of dataflow variables as one-shot channels safely and reliably transferring data from producers to their consumers.

def df = new Dataflows()

vgroup.task {
df.z = df.x + df.y
}

vgroup.task {
df.x = 10
}

vgroup.task {
df.y = 5
}

assert df.z == 15

The dataflow framework works out how to schedule the individual tasks and ensures that a task's input variables are ready when needed.
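
Dataflow variables are one-shot, but GPars also offers multi-message channels. Here is a minimal sketch (independent of the example above) using GPars' DataflowQueue:

import groovyx.gpars.dataflow.DataflowQueue

def queue = new DataflowQueue()
queue << 10
queue << 5
assert queue.val + queue.val == 15 // val blocks until a value is available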

Conclusion

We have had a quick glimpse at using virtual threads with Groovy and GPars. It is very early days, so expect much more to emerge in this space once virtual threads appear as a preview feature in production versions of JDK19 and, eventually, move beyond preview.


Groovy 4.0.3 Released

by paulk


Posted on Wednesday June 15, 2022 at 08:16AM in Technology


groovy.pngDear community,

The Apache Groovy team is pleased to announce version 4.0.3 of Apache Groovy.
Apache Groovy is a multi-faceted programming language for the JVM.
Further details can be found at the https://groovy.apache.org website.

This release is a maintenance release of the GROOVY_4_0_X branch.
It is strongly encouraged that all users using prior
versions on this branch upgrade to this version.

This release includes 40 bug fixes/improvements as outlined in the changelog:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318123&version=12351650

Sources, convenience binaries, downloadable documentation and an SDK
bundle can be found at: https://groovy.apache.org/download.html
We recommend you verify your installation using the information on that page.

Jars are also available within the major binary repositories.

We welcome your help and feedback and in particular want
to thank everyone who contributed to this release.

For more information on how to report problems, and to get involved,
visit the project website at https://groovy.apache.org/

Best regards,

The Apache Groovy team.


Groovy 3 Highlights

by paulk


Posted on Thursday February 13, 2020 at 02:28AM in Technology


General Improvements

Groovy has both a dynamic nature (supporting code styles similar to Ruby and Python) as well as a static nature (supporting styles similar to Java, Kotlin and Scala). Groovy continues to improve both those natures - filling in any feature gaps. As just one example, Groovy has numerous facilities for better managing null values. You can use Groovy's null-safe navigation operator, piggyback on Java's Optional or provide a null-checking extension to the type checker. These are augmented in Groovy 3 with null-safe indexing for arrays, lists and maps and a new AST transformation @NullCheck for automatically instrumenting code with null checks.
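
As a quick sketch of that last feature (applied here to a single method; it can also annotate constructors and whole classes):

import groovy.transform.NullCheck

@NullCheck
String shout(String word) { word.toUpperCase() }

assert shout('groovy') == 'GROOVY'
// shout(null) now fails fast with an IllegalArgumentException
// rather than a NullPointerException somewhere deeper in the code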

In general, the language design borrows heavily from Java, so careful attention is paid to changes in Java and acted on accordingly if appropriate. A lot of work has been done getting Groovy ready for Java modules and for making it work well with JDK versions 9-15. Other work has dramatically improved the performance of bytecode generation which makes use of the JVMs invoke dynamic capabilities. Additional changes are already underway for further improvements in these areas in Groovy 4.

There are also many other performance improvements under the covers. More efficient type resolution occurs during compilation and more efficient bytecode is generated for numerous scenarios. The addition of a Maven BOM allows more flexible usage of Groovy from other projects.

Groovy also has particular strengths for scripting, testing, writing Domain Specific Languages (DSLs) and in domains like financial calculations and data science. On-going work has been made to ensure those strengths are maintained. The accuracy used for high-precision numbers has been improved and is configurable. Much of the tooling such as Groovy Console and groovysh have also been improved.

Other key strengths of Groovy such as its runtime and compile-time meta-programming capabilities have also seen many minor enhancements. All in all, this release represents the culmination of several years of activity. Over 500 new features, improvements and bug fixes have been added since Groovy 2.5. Just a few highlights are discussed below.

Parrot parser

Groovy has a new parser. While mostly an internal change within Groovy, the good news for users is that the new parser is more flexible and will allow the language to more rapidly change should the need arise.

New syntax

The new parser gave us the opportunity to add some new syntax features:

  • !in and !instanceof operators
assert 45 !instanceof Date
assert 4 !in [1, 3, 5, 7]
  • Elvis assignment operator
def first = 'Jane'
def last = null
first ?= 'John'
last ?= 'Doe'
assert [first, last] == ['Jane', 'Doe']
  • Identity comparison operators
assert cat === copyCat  // operator shorthand for is method
assert cat !== lion     // negated operator shorthand
  • Safe indexing (for maps, lists and arrays)
println map?['someKey'] // return null if map is null instead of throwing NPE

Java compatibility

The Groovy syntax can be thought of as a superset of Java syntax. It's considered good style to use the enhancements that Groovy provides when appropriate, but Groovy's aim is to still support as much of the Java syntax as possible to allow easy migration from Java or easy switching for folks working with both Java and Groovy.

The flexibility provided by the new parser allowed several syntax compatibility holes to be closed including:

  • do/while loop
def count = 5
def factorial = 1
do {
    factorial *= count--
} while(count > 1)
assert factorial == 120
  • Enhanced classic Java-style for loop (see multi-assignment for-loop example; note the comma in the last clause of the for statement)
  • Multi-assignment in combination with for loop
def count = 3
println 'The next three months are:'
for (def (era, yr, mo) = new Date(); count--; yr = mo == 11 ? yr + 1 : yr, mo = mo == 11 ? 0 : mo + 1) {
    println "$yr/$mo"
}
  • Java-style array initialization (but you might prefer Groovy's literal list notation)
def primes = new int[] {2, 3, 5, 7, 11}
  • Lambda expressions (but you might often prefer Groovy's Closures which support trampoline/tail recursion, partial application/currying, memoization/auto caching)
(1..10).forEach(e -> { println e })
assert (1..10).stream()
              .filter(e -> e % 2 == 0)
              .map(e -> e * 2)
              .toList() == [4, 8, 12, 16, 20]
def add = (int x, int y) -> { def z = y; return x + z }
assert add(3, 4) == 7
  • Method references (but you might often prefer Groovy's Method pointers which are Closures with the previously mentioned benefits)
assert ['1', '2', '3'] == Stream.of(1, 2, 3)
                                .map(String::valueOf)
                                .toList()
  • "var" reserved type (allows Java 10/11 features even when using JDK 8)
var two = 2                                                      // Java 10
IntFunction<Integer> twice = (final var x) -> x * two            // Java 11
assert [1, 2, 3].collect{ twice.apply(it) } == [2, 4, 6]
  • ARM Try with resources (Java 7 and 9 variations work on JDK 8 - but you might prefer Groovy's internal iteration methods for resources)
def file = new File('/path/to/file.ext')
def reader = file.newReader()
try(reader) {
    String line = null
    while (line = reader.readLine()) {
        println line
    }
}
  • Nested code blocks
  • Java-style non-static inner class instantiation
  • Interface default methods (but you might prefer Groovy's traits)
interface Greetable {
    String target()
    default String salutation() {
        'Greetings'
    }
    default String greet() {
        "${salutation()}, ${target()}"
    }
}

Split package changes

In preparation for Groovy's modular jars to be first class modules, several classes have moved packages. Some examples:

groovy.util.XmlParser => groovy.xml.XmlParser
groovy.util.XmlSlurper => groovy.xml.XmlSlurper
groovy.util.GroovyTestCase => groovy.test.GroovyTestCase

In most cases, both the old and new class are available in Groovy 3. But by Groovy 4, the old classes will be removed. See the release notes for a complete list of these changes.
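
For example, code using XmlSlurper migrates like this (a sketch):

// Groovy 2.5: import groovy.util.XmlSlurper
import groovy.xml.XmlSlurper // Groovy 3+

def root = new XmlSlurper().parseText('<greeting>hi</greeting>')
assert root.text() == 'hi'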

DGM improvements

Groovy adds many extension methods to existing Java classes. In Groovy 3, about 80 new such extension methods were added. We highlight just a few here:

  • average() on arrays and iterables
assert 3 == [1, 2, 6].average()

  • takeBetween() on String, CharSequence and GString
assert 'Groovy'.takeBetween('r', 'v') == 'oo'

  • shuffle() and shuffled() on arrays and iterables
def orig = [1, 3, 5, 7]
def mixed = orig.shuffled()
assert mixed.size() == orig.size()
assert mixed.toString() ==~ /\[(\d, ){3}\d\]/

  • collect{ } on Future
def executor = Executors.newSingleThreadExecutor() // any ExecutorService will do
Future<String> foobar = executor.submit{ "foobar" }
Future<Integer> foobarSize = foobar.collect{ it.size() } // async
assert foobarSize.get() == 6

  • minus() on LocalDate
def xmas = LocalDate.of(2019, Month.DECEMBER, 25)
def newYear = LocalDate.of(2020, Month.JANUARY, 1)
assert newYear - xmas == 7 // a week apart

Other Improvements

Improved Annotation Support

Recent versions of Java allow annotations in more places (JSR308). Groovy now also supports such use cases. This is important for frameworks like Spock, Micronaut, Grails, Jqwik and others, and also opens up the possibility for additional AST transformations (a key meta-programming feature of Groovy).

Groovydoc Enhancements

In addition to Groovydoc supporting the new parser, you can now embed Groovydoc comments in various ways:

  • They can be made available within the AST for use by AST transformations and other tools.
  • Groovydoc comments starting with a special /**@ opening comment delimiter can also be embedded into the class file. This provides a capability in Groovy inspired by languages like Ruby which can embed documentation into the standard binary jar and is thus always available rather than relying on a separate javadoc jar.

Getting Groovy

The official source releases are on the download page. Convenience binaries, downloadable documentation, an SDK bundle and pointers to various community artifacts can be found on that page along with information to allow you to verify your installation. You can use the zip installation on any platform with Java support, or consider using an installer for your platform or IDE.

The Windows installer for the latest versions of Groovy 3 is available from bintray. (community artifact)

For Linux users, the latest versions of Groovy 3 are also available in the Snap Store. (community artifact)

For Eclipse users, the latest versions of the Groovy 3 groovy-eclipse-batch plugin are available from bintray. (community artifact)

For IntelliJ users, the latest community editions of IDEA have Groovy 3 support.


Groovy 3.0.0-beta-2 Windows Installer Released (Community Release)

by Remko Popma


Posted on Monday July 15, 2019 at 10:30AM in Technology


The Windows installer for Groovy 3.0.0-beta-2 is now available from Bintray: https://bintray.com/groovy/Distributions/download_file?file_path=groovy-3.0.0-beta-2-installer.exe.

I've again included a preview of an msi built with WiX, which I'm seeking feedback on: https://bintray.com/groovy/Distributions/download_file?file_path=groovy-3.0.0-beta-2+%28preview+installer%29.msi

Be aware that you need to fully uninstall the NSIS-based Groovy installation before installing with an MSI installer.


Groovy 3.0.0-beta-2 Released

by Remko Popma


Posted on Monday July 15, 2019 at 10:25AM in Technology


The Apache Groovy team is pleased to announce version 3.0.0-beta-2 of Apache Groovy. Apache Groovy is a multi-faceted programming language for the JVM. Further details can be found at the https://groovy.apache.org website.

This is a pre-release of a new version of Groovy. We greatly appreciate any feedback you can give us when using this version.

This release includes 40 bug fixes/improvements as outlined in the changelog: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318123&version=12345498

Sources, convenience binaries, downloadable documentation and an SDK bundle can be found at: https://groovy.apache.org/download.html We recommend you verify your installation using the information on that page.

Jars are also available within the major binary repositories.

We welcome your help and feedback and in particular want to thank everyone who contributed to this release.

For more information on how to report problems, and to get involved, visit the project website at https://groovy.apache.org/

Note: Apache Groovy 3.0.0-beta-2 was compiled with JDK8, so the illegal access warnings will come back if you use JDK9+. But don't worry, we will do another release in a week or two. Please verify the issues you reported first and give us feedback, which will help us improve the quality of the next release.

Best regards,

The Apache Groovy team.


Groovy 2.5.7 and 3.0.0-beta-1 Windows Installers Released (Community Artifacts)

by Remko Popma


Posted on Sunday May 12, 2019 at 10:49PM in Technology


The Windows installer for Groovy 2.5.7 (Community Artifact) is now available from Bintray: https://bintray.com/groovy/Distributions/Windows-Installer/groovy-2.5.7-installer.

The Windows installer for Groovy 3.0.0-beta-1 (Community Artifact) is now available from Bintray: https://bintray.com/groovy/Distributions/download_file?file_path=groovy-3.0.0-beta-1-installer.exe.

These are also the first releases where a preview of the Windows installers is created with the WiX Toolset. You are invited to try them out and provide any feedback you might have. The intention is to eventually replace the current NSIS-based installer with this installer. It is believed to be reasonably stable. The maintainer of these installers has personally been using them instead of the NSIS-based installer for a while now. Here are the links to those installers:

Be aware that you need to fully uninstall the NSIS-based Groovy installation before installing with an MSI installer.


Groovy 3.0.0-beta-1 Released

by Remko Popma


Posted on Sunday May 12, 2019 at 10:41PM in Technology


Dear community,

The Apache Groovy team is pleased to announce version 3.0.0-beta-1 of Apache Groovy.
Apache Groovy is a multi-faceted programming language for the JVM.
Further details can be found at the https://groovy.apache.org website.

This is a pre-release of a new version of Groovy.
We greatly appreciate any feedback you can give us when using this version.

This release includes 109 bug fixes/improvements as outlined in the changelog:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318123&version=12344761

Sources, convenience binaries, downloadable documentation and an SDK
bundle can be found at: https://groovy.apache.org/download.html
We recommend you verify your installation using the information on that page.

Jars are also available within the major binary repositories.

We welcome your help and feedback and in particular want
to thank everyone who contributed to this release.

For more information on how to report problems, and to get involved,
visit the project website at https://groovy.apache.org/

Best regards,

The Apache Groovy team.


Groovy 2.5.7 Released

by Remko Popma


Posted on Sunday May 12, 2019 at 10:39PM in Technology


Dear community,

The Apache Groovy team is pleased to announce version 2.5.7 of Apache Groovy.
Apache Groovy is a multi-faceted programming language for the JVM.
Further details can be found at the https://groovy.apache.org website.

This release is a maintenance release of the GROOVY_2_5_X branch.
It is strongly encouraged that all users using prior
versions on this branch upgrade to this version.

This release includes 56 bug fixes/improvements as outlined in the changelog:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318123&version=12344939

Sources, convenience binaries, downloadable documentation and an SDK
bundle can be found at: https://groovy.apache.org/download.html
We recommend you verify your installation using the information on that page.

Jars are also available within the major binary repositories.

We welcome your help and feedback and in particular want
to thank everyone who contributed to this release.

For more information on how to report problems, and to get involved,
visit the project website at https://groovy.apache.org/

Best regards,

The Apache Groovy team.