Apache Groovy

Deck of cards with Groovy, JDK collections and Eclipse Collections

by paulk


Posted on Friday September 23, 2022 at 10:18AM in Technology


Once again, Donald Raab has produced an interesting blog post on Eclipse Collections; this one shows some code for modelling and manipulating cards with Java 17 and Eclipse Collections. His related katas are highly recommended.

Here is the same example in Groovy (4.0.5 was used here) with just a few tweaks:

enum Rank { ACE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, JACK, QUEEN, KING }

enum Suit { SPADES, DIAMONDS, HEARTS, CLUBS }

@Sortable(includes='suit,rank')
record Card(Rank rank, Suit suit) { }

var cards = Sets.cartesianProduct(EnumSet.allOf(Rank), EnumSet.allOf(Suit), Card::new)
var cardsBySuit = cards.groupBy(Card::suit)
var cardsByRank = cards.groupBy(Card::rank)

assert [cards, cardsByRank, cardsBySuit]*.size() == [52, 13, 4]

var random = new Random(42L)
var deck = cards.toList().shuffleThis(random).shuffleThis(random).shuffleThis(random).toStack()
(1..5).collect(i -> deck.pop(5).toSortedList()).each(this::println)

And here is the output:

[Card[rank=FOUR, suit=SPADES], Card[rank=FOUR, suit=DIAMONDS], Card[rank=SIX, suit=HEARTS], Card[rank=NINE, suit=CLUBS], Card[rank=JACK, suit=CLUBS]]
[Card[rank=FIVE, suit=DIAMONDS], Card[rank=TWO, suit=HEARTS], Card[rank=FIVE, suit=HEARTS], Card[rank=TEN, suit=CLUBS], Card[rank=QUEEN, suit=CLUBS]]
[Card[rank=FIVE, suit=SPADES], Card[rank=NINE, suit=SPADES], Card[rank=QUEEN, suit=SPADES], Card[rank=THREE, suit=DIAMONDS], Card[rank=TWO, suit=CLUBS]]
[Card[rank=EIGHT, suit=SPADES], Card[rank=TWO, suit=DIAMONDS], Card[rank=EIGHT, suit=DIAMONDS], Card[rank=KING, suit=HEARTS], Card[rank=FIVE, suit=CLUBS]]
[Card[rank=SIX, suit=SPADES], Card[rank=KING, suit=DIAMONDS], Card[rank=THREE, suit=HEARTS], Card[rank=TEN, suit=HEARTS], Card[rank=QUEEN, suit=HEARTS]]

We can do a similar example with the built-in JDK collections and make some additional tweaks for nicer output:

enum Rank {
    ACE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, JACK, QUEEN, KING
    String toString() { ['A', *'2'..'9', '10', 'J', 'Q', 'K'][ordinal()] }
}

enum Suit {
    SPADES, DIAMONDS, HEARTS, CLUBS
    String toString() { '♠♦♥♣'[ordinal()] }
}

@Sortable(includes='suit,rank')
record Card(Rank rank, Suit suit) {
    Card(List srPair) { this(srPair[1], srPair[0]) }
    String toString() { "${rank}${suit}" }
}

var cards = [Suit.values(), Rank.values()].combinations().collect(Card::new)
var cardsBySuit = cards.groupBy(Card::suit)
var cardsByRank = cards.groupBy(Card::rank)

assert [cards, cardsByRank, cardsBySuit]*.size() == [52, 13, 4]

var random = new Random(42L)
3.times { cards.shuffle(random) }
var deck = cards as Stack
5.times { println "Hand ${it+1}: ${(0..<5).collect{ deck.pop() }.sort()}" }

println "Remaining cards sorted:\n${deck.sort()}"

Which has this output:

[image: Cards3.png showing the five hands and the remaining sorted cards]

Both these Groovy examples run on JDK versions from 8 to 19
(emulated records are used in JDK versions 8 to 15).

Updated 25/Sep/2022: Use same ordering for the JDK collections version.


Encryption and decryption with Groovy

by paulk


Posted on Monday September 19, 2022 at 02:34PM in Technology


Inspired by this recent blog entry, here is an example showing how to encrypt and decrypt with Groovy.

Using the JDK crypto classes

First, we need some text to encrypt. We'll use an excerpt of the one from the aforementioned blog post:

var text = 'Contrary to popular belief, Lorem Ipsum is not simply random text. ' +
    'It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.'

Next, we'll create a factory for our cipher instance, generate a key, and set up an initialization vector.

First, the cipher factory:

var factory = { Cipher.getInstance('AES/CBC/PKCS5Padding') }

For our cipher algorithm, we are using the Advanced Encryption Standard (AES) algorithm, in Cipher Block Chaining (CBC) mode, with PKCS5 padding. We'll look at other options later.

Next we generate our secret key. Our secret key is our password. Only someone who has the password will be able to decrypt the encrypted message. We could use any random bits for our key, but like passwords, we want to choose a strong key rather than a weak one. Cryptographic libraries provide classes to generate such keys. We just need to provide the key size. AES supports 128, 192 and 256 bit keys. We'll choose 192 here:

var key = generateKey('AES', 192)

Our code uses this helper method:

def generateKey(String algorithm, Integer size) {
    var generator = KeyGenerator.getInstance(algorithm)
    generator.init(size)
    generator.generateKey()
}

Next, we generate an initialization vector:

var ivParameterSpec = randomParameterSpec(factory)

It uses this helper method (we're using the algorithm block size for our initialization vector size):

def randomParameterSpec(Closure<Cipher> factory) {
    var block = new byte[factory().blockSize]
    SecureRandom.instanceStrong.nextBytes(block)
    new IvParameterSpec(block)
}

An initialization vector is used to introduce some additional randomness to avoid repeating patterns in the input leading to repeating patterns in the encrypted bytes.
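
To see the effect, here is a small sketch (not from the original post) showing that encrypting the same plaintext twice under the same key, but with two fresh random IVs, produces two different ciphertexts:

```groovy
import javax.crypto.Cipher
import javax.crypto.KeyGenerator
import javax.crypto.spec.IvParameterSpec
import java.security.SecureRandom
import static javax.crypto.Cipher.ENCRYPT_MODE

var generator = KeyGenerator.getInstance('AES')
generator.init(192)
var key = generator.generateKey()

// encrypt with a freshly generated random IV each time
var encryptOnce = { byte[] plain ->
    var cipher = Cipher.getInstance('AES/CBC/PKCS5Padding')
    var iv = new byte[cipher.blockSize]
    SecureRandom.instanceStrong.nextBytes(iv)
    cipher.init(ENCRYPT_MODE, key, new IvParameterSpec(iv))
    cipher.doFinal(plain)
}

var plain = 'Same message, same key'.bytes
// same key, same plaintext, different IVs: different ciphertexts
assert encryptOnce(plain).toList() != encryptOnce(plain).toList()
```

Of course, to decrypt later you must keep the IV (it is not secret, and is often stored alongside the ciphertext).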

With all these things in place, we are almost ready to encrypt or decrypt, but first, let's define two more helper methods:

def encrypt(byte[] bytes, Key key, IvParameterSpec spec, Closure<Cipher> factory) {
    var cipher = factory()
    cipher.init(ENCRYPT_MODE, key, spec)
    cipher.doFinal(bytes)
}

def decrypt(byte[] bytes, Key key, IvParameterSpec spec, Closure<Cipher> factory) {
    var cipher = factory()
    cipher.init(DECRYPT_MODE, key, spec)
    cipher.doFinal(bytes)
}
And here is how we encrypt and decrypt:

var encrypted = encrypt(text.bytes, key, ivParameterSpec, factory)
println "Encrypted bytes : $encrypted"
println "Encrypted text : ${new String(encrypted)}"

var decrypted = decrypt(encrypted, key, ivParameterSpec, factory)
println "Decrypted bytes : $decrypted"
println "Decrypted text : ${new String(decrypted)}"

Which has this output:

Encrypted bytes : [-117, 36, 18, 69, -101, -8, 35, 93, -102, -49, -12, ..., -19, -100]
Encrypted text : ‹$E›ø#]šÏôæ”Á˜çp^µ³=L(Ö^_ŒC>CIË„ö,1É8ÆŸ.Š?vßG,Èw‰å¼zÜf>?µ›D¹éÆk€	°˜2êÔ}í©àhl$>?¹¡Kå3ÔO?±&…êî¶Ê–¾°®q®à—0ú‘ÔhO<H¦ç®Ç”ÈhAëjó QPyƒy6Ĥ*´un¼ï¯m¨´ÙjeJtëº\ó6ƪKªœíœ
Decrypted bytes : [67, 111, 110, 116, 114, 97, 114, 121, 32, 116, 111, 32, ..., 100, 46]
Decrypted text : Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.

We can see that everything worked as expected, since the final output matches our original input text.

Using the Bouncy Castle library

Alternatively, we can swap algorithms. There are numerous algorithms and modes supported by the JDK and others supported by third-party libraries. A nice summary can be found here.

We'll swap to use the CAST5 (CAST-128) algorithm which supports up to a 128-bit key. We'll use HMAC-SHA1 to generate our key.

import org.bouncycastle.jce.provider.BouncyCastleProvider
var bc = new BouncyCastleProvider()
factory = { Cipher.getInstance('CAST5', bc) }
key = generateKey('HmacSHA1', 128)
ivParameterSpec = randomParameterSpec(factory)

CAST5 is the default algorithm used in some versions of GPG and PGP. It isn't included by default in the JDK, so for this we'll use the Bouncy Castle library.

[Just a note: if you want to encrypt or decrypt GPG/PGP files, don't use the above code. Libraries like Bouncy Castle have dedicated classes for such scenarios.]

We now encrypt and decrypt as before:

encrypted = encrypt(text.bytes, key, ivParameterSpec, factory)
println "Encrypted text : ${new String(encrypted)}"
decrypted = decrypt(encrypted, key, ivParameterSpec, factory)
println "Decrypted text : ${new String(decrypted)}"

Which has this output:

Encrypted text : Mªá?r?v9£÷~4µT'›ÙÝÁl¿Þg¾0ñŽ¡?Ü=³9Q¬»3«ÖÁ¡µ ¾@4÷`FñÙŠfø7¥#›v¤Í–‰¼Ü¢ƒE6ôŽTÙlæÏz>o?àL›¡¢z1nÖo9]šOÔ¼SÔOÍ#Ý7LœÀî}ó5m%q•»l%/AWT´¢zH#t솱l¶£—Œ«©wˆÃ®>®Ü6ër-E
Decrypted text : Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.

Other useful functionality

Passing around binary data, like our secret key or the encrypted data, has its own problems. Groovy provides extension methods to encode such data (and corresponding decode methods). For example, we can encode our secret key in various ways:

var keyBytes = key.encoded
println keyBytes.encodeHex()
println keyBytes.encodeBase64()
println keyBytes.encodeBase64Url()

Which has this output (the key is random, so the output will differ for each run):

85a0d3f0ce0cbe6402dc9579fbffcf1d
haDT8M4MvmQC3JV5+//PHQ==
haDT8M4MvmQC3JV5-__PHQ
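
The corresponding decode extension methods reverse the process, so round-tripping recovers the original bytes:

```groovy
var bytes = 'some secret'.bytes

// hex and Base64 round-trips recover the original bytes
assert bytes.encodeHex().toString().decodeHex().toList() == bytes.toList()
assert bytes.encodeBase64().toString().decodeBase64().toList() == bytes.toList()
assert bytes.encodeBase64Url().toString().decodeBase64Url().toList() == bytes.toList()
```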

Groovy also provides extension methods for various checksums (but you might want to look at stronger checksum algorithms in security-sensitive scenarios):

println "SHA256 : ${text.sha256()}"
println "MD5 : ${text.md5()}"

Which has this output:

SHA256 : ccb184e35e4c32bafc730d84ec924ea2980035ea5fadb012e3b2b31abf4323c9
MD5 : 46c61a174c2dc99204521ca89f09f63c

If you are encrypting and decrypting entire files, the JDK has special classes for that too, which are also easy to use from Groovy. That's all for now.
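
As one sketch of what that might look like, the JDK's CipherInputStream and CipherOutputStream wrap ordinary streams so that bytes are enciphered or deciphered as they flow through (this example uses temp files and the same AES/CBC setup as above, rather than anything from the original post):

```groovy
import javax.crypto.Cipher
import javax.crypto.CipherInputStream
import javax.crypto.CipherOutputStream
import javax.crypto.KeyGenerator
import javax.crypto.spec.IvParameterSpec
import java.security.SecureRandom
import static javax.crypto.Cipher.*

var generator = KeyGenerator.getInstance('AES')
generator.init(192)
var key = generator.generateKey()
var iv = new byte[16]
SecureRandom.instanceStrong.nextBytes(iv)
var spec = new IvParameterSpec(iv)

var plain = File.createTempFile('plain', '.txt').tap { it.text = 'Some file content to protect' }
var secret = File.createTempFile('secret', '.bin')
var restored = File.createTempFile('restored', '.txt')

// encrypt: wrap the output stream so bytes are enciphered as they are written
var encCipher = Cipher.getInstance('AES/CBC/PKCS5Padding')
encCipher.init(ENCRYPT_MODE, key, spec)
secret.withOutputStream { os ->
    new CipherOutputStream(os, encCipher).withStream { cos ->
        plain.withInputStream { is -> cos << is }
    }
}

// decrypt: wrap the input stream so bytes are deciphered as they are read
var decCipher = Cipher.getInstance('AES/CBC/PKCS5Padding')
decCipher.init(DECRYPT_MODE, key, spec)
secret.withInputStream { is ->
    new CipherInputStream(is, decCipher).withStream { cis ->
        restored.bytes = cis.bytes
    }
}

assert restored.text == plain.text
```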

Conclusion

We have taken a brief look at encrypting and decrypting with Apache Groovy.


Calculating Fibonacci with Groovy revisited

by paulk


Posted on Thursday September 08, 2022 at 10:59AM in Technology


In a recent post, we looked at using matrices with Groovy, including using matrices to calculate Fibonacci terms. But do you need that complexity to calculate Fibonacci? No, not at all. You can write various one-liners for that scenario (to repeat the calculation from that post):

Stream.iterate([1, 1], f -> [f[1], f.sum()]).limit(8).toList()*.head()

The previous post was more about using matrices than Fibonacci per se, though hopefully learning that the Fibonacci matrix was a specialisation of the Leslie matrix was an added bonus.

Let's have a look at a few other options to write Fibonacci methods in Groovy.

Iterative style

Unless you learned a functional programming language as your first language, you may have written an iterative Factorial or Fibonacci as one of your first programming learning exercises. Such an algorithm for Fibonacci could look something like this:

def fib(n) {
    if (n <= 1) return n

    def previous = n.valueOf(1), next = previous, sum
    (n-2).times {
        sum = previous
        previous = next
        next = sum + previous
    }
    next
}

assert fib(10) == 55
assert fib(50L) == 12586269025L
assert fib(100G) == 354224848179261915075G

The only interesting part to this solution is the use of dynamic idioms. We didn't provide an explicit type for n, so duck-typing means the method works fine for Integer, Long and BigInteger values. This implementation does all calculations using the type of the supplied n, so the user of the method controls that aspect.

Groovy gives the option to also specify an explicit type like Number or use TypeChecked or CompileStatic for type inference if you wanted. We'll see an example of those options later.

Recursive

Once you mastered iterative programming, your next programming learning exercise may have been the recursive version of Factorial or Fibonacci. For Fibonacci, you may have coded something like this:

def fib(n) {
    if (n <= 1) return n
    fib(n - 1) + fib(n - 2)
}

assert fib(10) == 55
assert fib(50L) == 12586269025L
assert fib(100G) == 354224848179261915075G

This naïve version is incredibly inefficient. Calling fib(6), for instance, ends up calculating fib(2) five times.
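
We can see the repetition by instrumenting the naïve version with a counter (a small sketch, not from the original post):

```groovy
// instrument the naive recursion to count how many times each value is computed
def fibCounting(n, counts) {
    counts[n] += 1
    if (n <= 1) return n
    fibCounting(n - 1, counts) + fibCounting(n - 2, counts)
}

def calls = [:].withDefault { 0 }
fibCounting(6, calls)
assert calls[2] == 5   // fib(2) was evaluated five times
assert calls[1] == 8   // and fib(1) eight times
```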

There are several ways to avoid that repetition. One option is to use the @Memoized annotation. Adding this annotation on a method causes the compiler to inject appropriate code for caching results into the method. This is ideal for pure functions like Fibonacci since they always return the same output for a given input. There are annotation attributes to adjust how big such a cache might be, but that sophistication isn't needed here.

@Memoized
def fib(n) {
    if (n <= 1) return n
    fib(n - 1) + fib(n - 2)
}

assert fib(10) == 55
assert fib(50L) == 12586269025L
assert fib(100G) == 354224848179261915075G

This now runs just as quickly as the iterative method. If you happened to use a Closure instead of a method, you can call one of the memoize methods on Closure.
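
For example, a memoized Closure version might look like this:

```groovy
// classic Groovy idiom: declare the variable first so the closure
// can recursively reference (the memoized version of) itself
def fib
fib = { n -> n <= 1 ? n : fib(n - 1) + fib(n - 2) }.memoize()

assert fib(10) == 55
assert fib(50L) == 12586269025L
```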

A problem with this approach (and with recursion in general) is that you hit a stack overflow exception for larger values of n, e.g. fib(500G). Groovy supports tail call elimination with the @TailRecursive annotation. In this case the compiler injects an "unrolled" non-recursive version of the algorithm. For the "unrolling" to succeed, the algorithm needs to be re-worked so that at most one call to fib occurs in any return statement. Here is a version of the algorithm re-worked in this way:

@TailRecursive
static fib(n, a, b) {
    if (n == 0) return a
    if (n == 1) return b
    fib(n - 1, b, a + b)
}

assert fib(10, 0, 1) == 55
assert fib(50L, 0L, 1L) == 12586269025L
assert fib(100G, 0G, 1G) == 354224848179261915075G
assert fib(500G, 0G, 1G) == 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125G

This is slightly more complicated to understand than the original if you haven't seen it before, but it is now both efficient and able to handle large values of n. We can compile statically for even faster speed like this:

@TailRecursive
@CompileStatic
static fib(Number n, Number a, Number b) {
    if (n == 0) return a
    if (n == 1) return b
    fib(n - 1, b, a + b)
}

assert fib(10, 0, 1) == 55
assert fib(50L, 0L, 1L) == 12586269025L
assert fib(100G, 0G, 1G) == 354224848179261915075G
assert fib(500G, 0G, 1G) == 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125G

If you are using a Closure, you would look at using the trampoline method on Closure to achieve a similar result.
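
A trampolined Closure version of the same reworked algorithm might look like this:

```groovy
// trampoline() avoids deep recursion by returning continuations
// that are invoked in a loop instead of growing the call stack
def fib
fib = { n, a, b ->
    n == 0 ? a : fib.trampoline(n - 1, b, a + b)
}
fib = fib.trampoline()

assert fib(10, 0, 1) == 55
assert fib(500G, 0G, 1G) == 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125G
```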

Streams

We saw the Stream based "one-liner" solution at the start of this blog post. Let's adopt the duck-typing idioms we have used so far and define a fib method. It could look like this:

def fib(n) {
    def zero = n.valueOf(0)
    def one = n.valueOf(1)
    Stream.iterate([zero, one], t -> [t[1], t.sum()])
            .skip(n.longValue())
            .findFirst().get()[0]
}

assert fib(10) == 55
assert fib(50L) == 12586269025L
assert fib(100G) == 354224848179261915075G

Bytecode and AST transforms

Finally, just so you know all your options, here is a version using the @Bytecode AST transform which lets you write JVM bytecode directly in your Groovy! Note this falls into the category of "don't ever ever do this" but just so you know you can, it is included here:

@Bytecode
int fib(int i) {
    l0
    iload 1
    iconst_2
    if_icmpgt l1
    iconst_1
    _goto l2
    l1
    frame SAME
    aload 0
    iload 1
    iconst_2
    isub
    invokevirtual '.fib', '(I)I'
    aload 0
    iload 1
    iconst_1
    isub
    invokevirtual '.fib', '(I)I'
    iadd
    l2
    frame same1, 'I'
    ireturn
}

assert fib(10) == 55

Please read the caveats for that transform before considering using it for anything but extreme situations. It's meant more as a fun thing to try than something anyone would want to do in production.

Have fun writing your own algorithms!



Solving cryptarithmetic puzzles with Groovy and constraint programming using Choco, JaCoP, and OR-Tools

by paulk


Posted on Monday September 05, 2022 at 01:43PM in Technology



Introduction

When writing solutions to problems, we frequently strive to hide away implementation details. In object-oriented (OO) programming, we might build a rich hierarchy of classes with well-thought-out methods so that our final solution can be expressed in terms of simple nouns and verbs (class instances and methods) in our domain model. When applying functional programming idioms, we strive to emphasise the relationship between inputs and outputs and hide away side effects and iterative steps. Constraint programming (in the same family as logic programming) also strives to hide away details. Instead of expressing an iterative implementation, it focuses on expressing declarative properties of a solution. A solver is responsible for working out the exact implementation steps.

When using constraint programming, we develop a model consisting of variables, the domain of values each variable may hold, and additional constraints between the variables. A solver does all the work. It may employ heuristic searching, inference, propagation, symmetry and backtracking to find possible solutions. We may be able to (or want to, or need to) guide the solver as to which techniques and strategies it should use. Constraint programming has been used to solve various kinds of problems including scheduling problems, and excels at problems with combinatorial possibilities that are too irregular for other mathematical optimisations. Frequently used illustrative problems are Sudoku and Wordle solvers, N-Queen problems, and other kinds of puzzles. We'll just look at cryptarithmetic puzzles.

Cryptarithmetic Problems

Cryptarithmetic problems (also known as alphametics, verbal arithmetic, cryptarithms, and word addition) are a type of mathematical game where a mathematical equation is presented in which the digits have been replaced by letters. Traditionally, each letter represents a unique number, and numbers don't start with the digit zero. If we look at one sample problem:

TO
+GO
=OUT

We can reason about what the solution can be by hand:

  • T, O, U, and G must be all different (game rule)
  • T, G, and O will be between 1 and 9, and U is between 0 and 9 (game rule)
  • If we added the two biggest 2-digit numbers (99 + 99), we'd get 198, so O must be 1
  • Looking at the right-most "units" column, 1 + 1 equals 2, so T must be 2
  • Looking at the "tens" column, we know there is a carry of 1 (since O is 1) and we know T is 2, so G must be 8 or 9. If G was 9, U would be 1 but it can't be the same as O, so G must be 8 and U must be 0.
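
We can confirm this hand reasoning with a quick exhaustive check (a small sketch, not part of the original article):

```groovy
// search all digit assignments satisfying TO + GO = OUT
// with distinct letters and no leading zeros
var solutions = []
for (t in 1..9)
    for (g in 1..9)
        for (o in 1..9)
            for (u in 0..9)
                if ([t, g, o, u].toSet().size() == 4 &&
                        (10 * t + o) + (10 * g + o) == 100 * o + 10 * u + t)
                    solutions << [T: t, G: g, O: o, U: u]

// exactly the solution we reasoned out by hand: 21 + 81 = 102
assert solutions == [[T: 2, G: 8, O: 1, U: 0]]
```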

When solving by hand, we typically reason about individual columns and account for the "carry" to the next column. We'll come back to this point later but first, let's look at a slightly bigger problem:

SEND
+MORE
=MONEY

Solving with Brute Force

This problem isn't huge, so we can solve it with brute force. We simply try all possible values for the letters in the puzzle:

for (s in 1..9)
    for (e in 0..9)
        for (n in 0..9)
            for (d in 0..9)
                for (m in 1..9)
                    for (o in 0..9)
                        for (r in 0..9)
                            for (y in 0..9)
                                if ([s, e, n, d, m, o, r, y].toSet().size() == 8) {
                                    def send = 1000 * s + 100 * e + 10 * n + d
                                    def more = 1000 * m + 100 * o + 10 * r + e
                                    def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
                                    if (send + more == money) {
                                        println "s = $s, e = $e, n = $n, d = $d"
                                        println "m = $m, o = $o, r = $r, y = $y"
                                    }
                                }

This isn't very efficient though. It calculates 81 million combinations for the variables before skipping all but 1.5 million of them (since most won't be unique). All up it might execute in the low tens of seconds.

Alternatively, Groovy supports calculating permutations, so we can simplify our solution to a single for loop (with some tests to eliminate unhelpful iterations):

def digits = 0..9
for (p in digits.permutations()) {
    if (p[-1] < p[-2]) continue
    def (s, e, n, d, m, o, r, y) = p
    if (s == 0 || m == 0) continue
    def send = 1000 * s + 100 * e + 10 * n + d
    def more = 1000 * m + 100 * o + 10 * r + e
    def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
    if (send + more == money) {
        println "s = $s, e = $e, n = $n, d = $d"
        println "m = $m, o = $o, r = $r, y = $y"
    }
}

This has the advantage of only generating unique combinations. It will execute in seconds.

Running either of these solutions yields:

s = 9, e = 5, n = 6, d = 7
m = 1, o = 0, r = 8, y = 2

Using Constraint Programming

For the brute force approaches, we had a condition which checked whether a potential candidate answer was a correct solution, and we had to be very explicit about how the potential candidates were created. For constraint programming, we instead define variables to represent the problem and any known bounds on those variables, and we specify any other known properties of the solution, which in our case will be similar to the condition we previously used to check the answer. Let's examine how to do that with three libraries, one with a variation.

Choco

Here is the code using the Choco library:

new Model("SEND+MORE=MONEY").with {
    def (S, M) = ['S', 'M'].collect { intVar(it, 1, 9) }
    def (E, N, D, O, R, Y) = ['E', 'N', 'D', 'O', 'R', 'Y'].collect { intVar(it, 0, 9) }

    allDifferent(S, E, N, D, M, O, R, Y).post()

    IntVar[] ALL = [
        S, E, N, D,
        M, O, R, E,
        M, O, N, E, Y ]
    int[] COEFFS = [
        1000, 100, 10, 1,
        1000, 100, 10, 1,
        -10000, -1000, -100, -10, -1 ]

    scalar(ALL, COEFFS, "=", 0).post()

    println solver.findSolution()
}

We define our variables and their bounds (domain). We use an allDifferent global constraint to specify the uniqueness requirement and a scalar constraint that ensures that our variables multiplied by their respective scalar coefficients equal 0. This lets us factor in whether the particular variable is representing the "units" column, the "10s" column, the "100s" column etc. This captures the "puzzle addition" constraint. We then ask the solver to find the solution. We could just as easily have asked for all solutions (if more than one existed).

This is typical of how we solve such problems. We either define constraints directly between one or more variables or use whatever global constraints our library might support. If our library doesn't support the constraint we need, we find a way to express it using multiple simpler constraints.
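
As a trivial illustration of such decomposition, an allDifferent constraint over some variables is logically just the conjunction of pairwise inequalities. In plain Groovy terms (a sketch of the idea, not library code):

```groovy
// allDifferent(xs) holds exactly when every pair of values differs
def allDifferent(list) {
    def pairwise = []
    for (i in 0..<list.size())
        for (j in (i + 1)..<list.size())
            pairwise << (list[i] != list[j])
    pairwise.every()
}

assert allDifferent([9, 5, 6, 7, 1, 0, 8, 2])
assert !allDifferent([9, 5, 6, 5])
```

A constraint library would post one "not equal" constraint per pair in the same way (though a dedicated global constraint usually propagates more efficiently).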

The end result is that our code is more declarative than our brute force approaches, and the solution is found in tens of milliseconds. The solver has very efficient strategies for solving such puzzles.

JaCoP

We can solve the same problem using JaCoP:

def store = new Store()
def (S, M) = ['S', 'M'].collect { new IntVar(store, it, 1, 9) }
def (E, N, D, O, R, Y) = ['E', 'N', 'D', 'O', 'R', 'Y'].collect { new IntVar(store, it, 0, 9) }
var ctr = new Alldifferent(S, E, N, D, M, O, R, Y)
store.impose(ctr)

IntVar[] ALL = [
    S, E, N, D,
    M, O, R, E,
    M, O, N, E, Y ]
int[] COEFFS = [
    1000, 100, 10, 1,
    1000, 100, 10, 1,
    -10000, -1000, -100, -10, -1 ]
var lin = new LinearInt(ALL, COEFFS, "==", 0)
store.impose(lin)

var label = new DepthFirstSearch()
var select = new InputOrderSelect(store, ALL, new IndomainMin())
label.labeling(store, select)

There are some slight differences in this API, but nearly everything has a one-to-one correspondence to what we saw earlier. We are explicitly selecting search strategies and selection strategies here whereas with Choco, defaults were chosen for us. In both cases, explicit creation of such classes allows the strategies to be altered for particular scenarios if needed.

When run, the output looks like this:

Labeling has finished with return value of true
DFS1: DFS([S = 9, E = 5, N = 6, D = 7, M = 1, O = 0, R = 8, Y = 2], InputOrder, (org.jacop.search.IndomainMin@45394b31))

We can see here that the code is very similar, as is the execution time.

OR-Tools

We can repeat the solution using OR-Tools. Here is the code:

Loader.loadNativeLibraries()

new Solver('Send+More=Money').with {
def s = makeIntVar(1, 9, 's')
def e = makeIntVar(0, 9, 'e')
def n = makeIntVar(0, 9, 'n')
def d = makeIntVar(0, 9, 'd')
def m = makeIntVar(1, 9, 'm')
def o = makeIntVar(0, 9, 'o')
def r = makeIntVar(0, 9, 'r')
def y = makeIntVar(0, 9, 'y')

IntVar[] all = [s, e, n, d, m, o, r, y]
IntVar[] scalar = [s, e, n, d, m, o, r, e, m, o, n, e, y]
int[] coeffs = [
1000, 100, 10, 1, // S E N D +
1000, 100, 10, 1, // M O R E =
-10000, -1000, -100, -10, -1 // M O N E Y
]

addConstraint(makeScalProdEquality(scalar, coeffs, 0))
addConstraint(makeAllDifferent(all))

def db = makePhase(all, INT_VAR_DEFAULT, INT_VALUE_DEFAULT)
newSearch(db)
while (nextSolution()) {
println all.join(' ')
}
endSearch()

// Statistics
println "Solutions: ${solutions()}"
println "Failures: ${failures()}"
println "Branches: ${branches()}"
println "Wall time: ${wallTime()}ms"
}

which has this output:

s(9) e(5) n(6) d(7) m(1) o(0) r(8) y(2)
Solutions: 1
Failures: 5
Branches: 10
Wall time: 60ms

OR-Tools is written in C++ but has interfaces for numerous languages including Java - which is perfect for Groovy use.

Choco with JSR331

It is great to have multiple libraries to pick from, but having a standard API can help when switching between such libraries. This is where JSR331 comes in. It defines a standard API for interacting with constraint solvers and linear solvers. Here we use a JSR331 implementation backed by an earlier version of the Choco library. The code looks like this:

import javax.constraints.*

ProblemFactory.newProblem('SEND+MORE=MONEY').with {
    def (S, M) = ['S', 'M'].collect { variable(it, 1, 9) }
    def (E, N, D, O, R, Y) = ['E', 'N', 'D', 'O', 'R', 'Y'].collect { variable(it, 0, 9) }

    postAllDifferent(S, E, N, D, M, O, R, Y)

    Var[] ALL = [
        S, E, N, D,
        M, O, R, E,
        M, O, N, E, Y]
    int[] COEFFS = [
        1000, 100, 10, 1,
        1000, 100, 10, 1,
        -10000, -1000, -100, -10, -1]

    post(COEFFS, ALL, '=', 0)

    def solver = getSolver()
    def solution = solver.findSolution()
    println solution ?: 'No solution'
    solver.logStats()
}

It is quite similar to the earlier examples but now exclusively uses the JSR331 classes from the javax.constraints package. There are several implementations of that API, so it would indeed be possible to swap between them. When run, the output is:

Solution #1:
	 S[9] M[1] E[5] N[6] D[7] O[0] R[8] Y[2]

Having said that, at the time of writing, JSR331 popularity doesn't appear to be on the rise. Most folks using constraint programming libraries seem to be using the direct library classes. Indeed, the version of the Choco implementation used by the JSR331 implementation is over 10 years old.

Incorporating Carry

The scalar product global constraint we have used in the previous examples is very powerful and probably would be our first choice for this problem. We can, however, model constraint programming problems in multiple ways, so let's look at a solution that avoids that global constraint.

Instead, we will develop a model that mirrors how we reasoned about the original TO + GO = OUT problem that we solved by hand. For that, we just considered a column at a time and accounted for the carry. We'll explicitly introduce variables to hold the carry (0 if no carry, or 1 if there is a carry) into our model. Then we'll express the mathematical constraints that are applicable for each column.

Here is the code:

new Model("SEND+MORE=MONEY").with {
    def (S, M) = ['S', 'M'].collect { intVar(it, 1, 9) }
    def (E, N, D, O, R, Y) = ['E', 'N', 'D', 'O', 'R', 'Y'].collect { intVar(it, 0, 9) }
    def C = (0..3).collect { intVar("C$it", 0, 1) }

    allDifferent(S, E, N, D, M, O, R, Y).post()
    C[3].eq(M).post()                                  //    C3 C2 C1 C0
    C[2].add(S).add(M).eq(O.add(C[3].mul(10))).post()  //        S  E  N  D
    C[1].add(E).add(O).eq(N.add(C[2].mul(10))).post()  //        M  O  R  E
    C[0].add(N).add(R).eq(E.add(C[1].mul(10))).post()  //     -------------
    D.add(E).eq(Y.add(C[0].mul(10))).post()            //     M  O  N  E  Y

    println solver.findSolution()
}

We can see that there is now no scalar product global constraint any more but instead the constraints for each column.

When run, the output looks like this:

Solution: S=9, M=1, E=5, N=6, D=7, O=0, R=8, Y=2, C0=1, C1=1, C2=0, C3=1, sum_exp_1=9,
sum_exp_2=10, (C3*10)=10, sum_exp_3=10, sum_exp_4=6, sum_exp_5=6, (C2*10)=0, sum_exp_6=6,
sum_exp_7=7, sum_exp_8=15, (C1*10)=10, sum_exp_9=15, sum_exp_10=12, (C0*10)=10, sum_exp_11=12,

We can see that as we were defining our constraints for each column, subexpressions were being created in the model, and these are reflected in the solution. They are, if you like, temporary calculations along the way to getting the answer, or more accurately a snapshot of ever-changing temporary calculations. They don't form part of the answer that interests us, so we are free to print out just the part of the solution that interests us if we wanted.

Creating a DSL

The previous example has lots of calls to add and mul methods. We can create a little bit of a DSL to provide some syntactic sugar to our previous examples to allow use of Groovy's operator overloading, support ranges when specifying the domain of a variable, and a few other niceties. Our code becomes:

model("SEND+MORE=MONEY") {
    def (S, M) = ['S', 'M'].collect { intVar(it, 1..9) }
    def (E, N, D, O, R, Y) = ['E', 'N', 'D', 'O', 'R', 'Y'].collect { intVar(it, 0..9) }
    def C = intVarArray(4, 0..1)

    [allDifferent(S, E, N, D, M, O, R, Y), //    C3 C2 C1 C0
     C[3].eq(M),                           //        S  E  N  D
     (C[2] + S + M).eq(O + C[3] * 10),     //        M  O  R  E
     (C[1] + E + O).eq(N + C[2] * 10),     //     -------------
     (C[0] + N + R).eq(E + C[1] * 10),     //     M  O  N  E  Y
     (D + E).eq(Y + C[0] * 10)]*.post()

    println solver.findSolution()
}

It has the same output as previously.

You might wonder how the solver finds the solution. You can watch the variables in the debugger and use tools like choco-cpviz, but it is quite a convoluted process until you are used to it. We'll try to give you a flavor of what is going on here. Basically, there will be various steps of pruning wherever possible and branching with possible backtracking. Below are some snapshots for our example above.

To start with, we have nearly 90 light green squares which represent our problem search space. We walk our way through the rules looking for ways to prune the search space:

[images: choco_step1.png through choco_step14.png, successive snapshots of the search space being pruned]

As we are locking in the value of variables, we can substitute them into and simplify our constraints. When we reapply them, they will be quicker to evaluate and may reveal more information.

At this point we only have 2 of our variables locked down but our search space is nearly half what we started with and we have simplified some of our constraints. We would continue branching and solving at this point until we find our solution or determine that no solution is possible.

Looking at other languages

The example repo also contains solutions for this problem in other languages, so you can compare and contrast, including Clojure, Haskell (Frege), Java, JavaScript (Nashorn), Ruby (JRuby), Python (Jython), Kotlin, Lua (Luaj), Prolog (tuProlog), and Scala.


Other examples

To wrap up, let's look at solving a few more examples (using Choco). We'll solve some of the examples from an interesting blog on the history of Cryptarithmetic problems:

  • ABCD * 4 = DCBA
  • AA + BB + CC = ABC
  • HALF + HALF = WHOLE
  • HALF + FIFTH + TENTH + TENTH + TENTH = WHOLE

Here is the code:

new Model("ABCD*4=DCBA").with {
def (A, D) = ['A', 'D'].collect { intVar(it, 1, 9) }
def (B, C) = ['B', 'C'].collect { intVar(it, 0, 9) }
def R = (0..2).collect { intVar(0, 9) }

allDifferent(A, B, C, D).post()
R[2].add(A.mul(4)).eq(D).post()
R[1].add(B.mul(4)).eq(C.add(R[2].mul(10))).post()
R[0].add(C.mul(4)).eq(B.add(R[1].mul(10))).post()
D.mul(4).eq(A.add(R[0].mul(10))).post()
solver.findAllSolutions().each {
println "$name: ${pretty(it, [A, B, C, D, ' * 4 = ', D, C, B, A])}\n$it\n"
}
}

new Model("AA+BB+CC=ABC").with {
def (A, B, C) = ['A', 'B', 'C'].collect { intVar(it, 1, 9) }
allDifferent(A, B, C).post()
A.mul(11).add(B.mul(11).add(C.mul(11))).eq(A.mul(100).add(B.mul(10)).add(C)).post()
solver.findAllSolutions().each {
println "$name: ${pretty(it, [A, A, ' + ', B, B, ' + ', C, C, ' = ', A, B, C])}\n$it\n"
}
}

new Model("HALF+HALF=WHOLE").with {
    def (H, W) = ['H', 'W'].collect { intVar(it, 1, 9) }
    def (A, E, F, L, O) = ['A', 'E', 'F', 'L', 'O'].collect { intVar(it, 0, 9) }
    allDifferent(H, W, A, E, F, L, O).post()
    IntVar[] ALL = [
        H, A, L, F,
        W, H, O, L, E]
    int[] COEFFS = [
        2000, 200, 20, 2,
        -10000, -1000, -100, -10, -1]
    scalar(ALL, COEFFS, "=", 0).post()
    solver.findAllSolutions().each {
        println "$name: ${pretty(it, [H, A, L, F, ' + ', H, A, L, F, ' = ', W, H, O, L, E])}\n$it\n"
    }
}
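This puzzle is also small enough to brute-force directly, which makes another useful cross-check on the scalar-constraint model. A plain-Java sketch (class name illustrative) that enumerates HALF and checks the shared letters and distinctness:

```java
import java.util.ArrayList;
import java.util.List;

public class HalfHalf {
    // Find all values of HALF with HALF + HALF = WHOLE, each letter a distinct digit
    static List<Integer> solve() {
        List<Integer> halves = new ArrayList<>();
        for (int h = 1; h <= 9; h++)
            for (int a = 0; a <= 9; a++)
                for (int l = 0; l <= 9; l++)
                    for (int f = 0; f <= 9; f++) {
                        int half = 1000 * h + 100 * a + 10 * l + f;
                        int whole = 2 * half;
                        if (whole < 10000) continue;          // WHOLE must be 5 digits
                        int w = whole / 10000, h2 = whole / 1000 % 10,
                            o = whole / 100 % 10, l2 = whole / 10 % 10, e = whole % 10;
                        if (h2 != h || l2 != l) continue;     // shared letters must match
                        long seen = 0;                        // distinctness via a digit bitset
                        boolean ok = true;
                        for (int d : new int[]{h, a, l, f, w, o, e}) {
                            if ((seen & (1L << d)) != 0) { ok = false; break; }
                            seen |= 1L << d;
                        }
                        if (ok) halves.add(half);
                    }
        return halves;
    }

    public static void main(String[] args) {
        solve().forEach(half -> System.out.println(half + " + " + half + " = " + 2 * half));
    }
}
```

It finds the same three solutions as the solver output below (9604, 9703, and 9802).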

new Model("HALF+FIFTH+TENTH+TENTH+TENTH=WHOLE").with {
    def (H, F, T, W) = ['H', 'F', 'T', 'W'].collect { intVar(it, 1, 9) }
    def (A, L, I, E, N, O) = ['A', 'L', 'I', 'E', 'N', 'O'].collect { intVar(it, 0, 9) }
    allDifferent(H, F, T, W, A, L, I, E, N, O).post()
    IntVar[] ALL = [
        H, A, L, F,
        F, I, F, T, H,
        T, E, N, T, H,
        T, E, N, T, H,
        T, E, N, T, H,
        W, H, O, L, E]
    int[] COEFFS = [
        1000, 100, 10, 1,
        10000, 1000, 100, 10, 1,
        10000, 1000, 100, 10, 1,
        10000, 1000, 100, 10, 1,
        10000, 1000, 100, 10, 1,
        -10000, -1000, -100, -10, -1]
    scalar(ALL, COEFFS, "=", 0).post()
    solver.findAllSolutions().each {
        def parts = [H, A, L, F, '+', F, I, F, T, H, '+', T, E, N, T, H, '+',
                     T, E, N, T, H, '+', T, E, N, T, H, '=', W, H, O, L, E]
        println "$name: ${pretty(it, parts)}\n$it\n"
    }
}

// helper method to print solutions
def pretty(model, parts) {
    parts.collect { p -> p instanceof IntVar ? model.getIntVal(p) : p }.join()
}

which has this output:

ABCD*4=DCBA: 2178 * 4 = 8712
Solution: A=2, D=8, B=1, C=7, IV_1=3, IV_2=3, IV_3=0, (A*4)=8, sum_exp_4=8, (B*4)=4, ..., 

AA+BB+CC=ABC: 11 + 99 + 88 = 198
Solution: A=1, B=9, C=8, (A*11)=11, (B*11)=99, (C*11)=88, ..., 

HALF+HALF=WHOLE: 9604 + 9604 = 19208
Solution: H=9, W=1, A=6, E=8, F=4, L=0, O=2, 

HALF+HALF=WHOLE: 9703 + 9703 = 19406
Solution: H=9, W=1, A=7, E=6, F=3, L=0, O=4, 

HALF+HALF=WHOLE: 9802 + 9802 = 19604
Solution: H=9, W=1, A=8, E=4, F=2, L=0, O=6, 

HALF+FIFTH+TENTH+TENTH+TENTH=WHOLE: 6701+14126+25326+25326+25326=96805
Solution: H=6, F=1, T=2, W=9, A=7, L=0, I=4, E=5, N=3, O=8, 

You should see the common patterns used for solving these puzzles.

Further Information

Conclusion

We have looked at using Groovy and a few constraint programming libraries to solve cryptarithmetic puzzles. Why not try solving some of your own puzzles?


Groovy List Processing Cheat Sheet

by paulk


Posted on Sunday August 28, 2022 at 08:46AM in Technology


Declaring lists

Several styles are supported for declaring lists:

var pets    = ['cat', 'canary', 'dog', 'fish', 'gerbil']  // idiomatic Groovy
var nums    = [1, 2, 3, 4] as LinkedList                  // use 'as' for other kinds of list
var primes  = List.of(2, 3, 5, 7)                         // Java 9+ style (immutable)
var range   = 1..4                                        // a range is also a list (immutable)
var bigNums = [1000, 2000].asUnmodifiable()               // unmodifiable (backed by original)
var negNums = [-100, -200].asImmutable()                  // immutable (backed by copy)

List elements and properties

Java methods for accessing list elements and list properties:

assert !pets.isEmpty()
assert pets.size() == 5
assert pets.get(0) == 'cat'
assert pets.contains('dog')
assert pets.containsAll('cat', 'dog')
pets.forEach { assert it.size() > 2 }
assert ['a', 'b', 'a'].indexOf('a') == 0
assert ['a', 'b', 'a'].lastIndexOf('a') == 2

Groovy extensions for accessing list elements and list properties:

assert pets[0] == 'cat'
assert pets?[0] == 'cat' // safe indexing returns null if pets was null instead of NPE
assert pets.first() == 'cat'
assert pets.head() == 'cat'
assert pets[-1] == 'gerbil'
assert pets[1..3] == ['canary', 'dog', 'fish']
assert pets[3..1] == ['fish', 'dog', 'canary'] // reverse range
assert pets[1, 3, 3] == ['canary', 'fish', 'fish'] // arbitrary collection
assert pets[0..1, [3, 3]] == ['cat', 'canary', 'fish', 'fish'] // nested collections
assert [1, 2, 3, 1].count(1) == 2
assert [1, 2, 3, 4].min() == 1
assert [1, 2, 3, 4].max() == 4
assert [1, 2, 3, 4].sum() == 10
assert [1, 2, 3, 4].average() == 2.5
[1, 2, 3, 4].eachWithIndex{ val, idx -> assert val == idx + 1 }
def cpets = pets[0..1]
assert cpets == ['cat', 'canary']
assert pets.findAll { it =~ /c.*/ } == cpets
assert pets.find { it =~ /c.*/ } == cpets[0]
assert pets.grep(~/c.*/) == cpets
assert cpets.min { it.size() } == 'cat' // cpet with smallest size name
assert cpets.max { it.size() } == 'canary' // cpet with largest name
assert pets.groupBy { it.size() } == [3: ['cat', 'dog'], 4: ['fish'], 6: ['canary', 'gerbil']]
assert pets.countBy { it.size() } == [3: 2, 4: 1, 6: 2]
assert pets.sum{ it.size() } == 22 // total size of all pet names
assert pets.average{ it.size() } == 4.4 // average size of pet names
assert pets.indices == 0..4
assert pets.findIndexValues { it.size() == 6 } == [1, 4]
assert cpets.indexed() == [0: 'cat', 1: 'canary']
assert cpets.withIndex()*.toList() == [['cat', 0], ['canary', 1]]
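For comparison, the groupBy and countBy extensions above can be approximated in plain Java with stream collectors. A minimal sketch (class and method names are illustrative):

```java
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

public class GroupPets {
    static final List<String> PETS = List.of("cat", "canary", "dog", "fish", "gerbil");

    // like Groovy's pets.groupBy { it.size() }
    static Map<Integer, List<String>> bySize() {
        return PETS.stream().collect(groupingBy(String::length));
    }

    // like Groovy's pets.countBy { it.size() }
    static Map<Integer, Long> countBySize() {
        return PETS.stream().collect(groupingBy(String::length, counting()));
    }

    public static void main(String[] args) {
        System.out.println(bySize());      // {3=[cat, dog], 4=[fish], 6=[canary, gerbil]}
        System.out.println(countBySize()); // {3=2, 4=1, 6=2}
    }
}
```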

Modifying mutable lists

Methods from Java for modifying lists:

pets.remove(2);                        assert pets == ['cat', 'canary', 'fish', 'gerbil']        // remove by index
pets.remove('fish');                   assert pets == ['cat', 'canary', 'gerbil']                // remove by element
pets.removeIf(p -> p.startsWith("c")); assert pets == ['gerbil']                                 // remove by condition
pets.clear();                          assert pets.isEmpty()                                     // make empty
pets.add("kangaroo");                  assert pets == ['kangaroo']                               // add element
pets.add(0, "koala");                  assert pets == ['koala', 'kangaroo']                      // add element at index position
pets.addAll(['ant', 'bee']);           assert pets == ['koala', 'kangaroo', 'ant', 'bee']        // add collection
pets.addAll(1, ['ant']);               assert pets == ['koala', 'ant', 'kangaroo', 'ant', 'bee'] // add collection at index
pets.removeAll(['ant', 'flea']);       assert pets == ['koala', 'kangaroo', 'bee']               // remove from collection
pets.retainAll(['bee', 'koala']);      assert pets == ['koala', 'bee']                           // retain from collection
pets.set(0, "zebra");                  assert pets == ['zebra', 'bee']                           // set element at index
pets.replaceAll(String::toUpperCase);  assert pets == ['ZEBRA', 'BEE']                           // replace elements by unary operator
pets.sort(Comparator.naturalOrder());  assert pets == ['BEE', 'ZEBRA']                           // sort

Groovy extensions for modifying lists:

pets << 'rock';                     assert pets == ['BEE', 'ZEBRA', 'rock']                           // leftShift append
pets += ['rabbit', 'rock', 'hare']; assert pets == ['BEE', 'ZEBRA', 'rock', 'rabbit', 'rock', 'hare'] // append collection
pets.unique();                      assert pets == ['BEE', 'ZEBRA', 'rock', 'rabbit', 'hare']         // remove duplicates
pets.sort{ it.size() };             assert pets == ['BEE', 'rock', 'hare', 'ZEBRA', 'rabbit']         // sort by size
pets[0] = 'ant';                    assert pets == ['ant', 'rock', 'hare', 'ZEBRA', 'rabbit']         // replace element by index
pets[1..3] = ['snake', 'SNAKE'];    assert pets == ['ant', 'snake', 'SNAKE', 'rabbit']                // replace range of elements
pets.unique{ it.toLowerCase() };    assert pets == ['ant', 'snake', 'rabbit']                         // remove duplicates ignoring case
pets.reverse(true);                 assert pets == ['rabbit', 'snake', 'ant']                         // flip
pets.shuffle();                     assert pets.size() == 3                                           // shuffle elements; resulting order will vary
def dice = [1, 2, 3, 4, 5, 6]
dice.removeAt(2); assert dice == [1, 2, 4, 5, 6]
dice.removeElement(2); assert dice == [1, 4, 5, 6]
dice.removeLast(); assert dice == [1, 4, 5]
dice.swap(0, 2); assert dice == [5, 4, 1]

Additional list functionality

Methods and operators which return new lists or values:

assert [1, 2, 3] + [1] == [1, 2, 3, 1]
assert [1, 2, 3, 1] - [1] == [2, 3]
assert [1, 2, 3] * 2 == [1, 2, 3, 1, 2, 3]
assert [1, [2, 3]].flatten() == [1, 2, 3]
assert [1, 2, 3].disjoint([4, 5, 6])
assert [1, 2, 3].intersect([4, 3, 1]) == [1, 3]
assert [1, 2, 3].collect { it + 3 } == [4, 5, 6]
assert [1, 2, 3, 4].collect { 1 } == [1, 1, 1, 1]
assert [4, 2, 1, 3].findAll { it % 2 == 0 } == [4, 2]
assert [1, 2, 3, 4].take(3) == [1, 2, 3]
assert [1, 2, 3, 4].takeRight(3) == [2, 3, 4]
assert [1, 2, 3, 4].takeWhile{ it < 3 } == [1, 2]
assert [1, 2, 3, 4].drop(2) == [3, 4]
assert [1, 2, 3, 4].dropRight(2) == [1, 2]
assert [1, 2, 3, 4].dropWhile{it < 3 } == [3, 4]
assert [1, 2, 3, 4].join('-') == '1-2-3-4'
assert [1, 2, 3, 4].tail() == [2, 3, 4]
assert [1, 2, 3, 4].init() == [1, 2, 3]
assert [1, 2, 3, 4].tails() == [[1, 2, 3, 4], [2, 3, 4], [3, 4], [4], []]
assert [1, 2, 3, 4].inits() == [[1, 2, 3, 4], [1, 2, 3], [1, 2], [1], []]
assert [1, 2, 3, 4].reverse() == [4, 3, 2, 1]
assert [1, 2, 3, 1].toUnique() == [1, 2, 3]
assert [1, 2, 3, 1].toSorted() == [1, 1, 2, 3]
assert [1, 2, 3, 4].collect { it * 2 } == [2, 4, 6, 8]
assert [[1, 2], [3, 4]].collectNested { it * 2 } == [[2, 4], [6, 8]]
def squaresAndCubesOfEvens = { it % 2 ? [] : [it**2, it**3] }
assert [1, 2, 3, 4].collectMany(squaresAndCubesOfEvens) == [4, 8, 16, 64]
assert [1, 2, 3, 4].any { it > 3 }
assert [1, 2, 3, 4].every { it < 5 }
assert ![1, 2, 3, 4].every { it > 3 }
assert [1, 2, 3, 4].find { it > 2 } == 3
assert [1, 2, 3, 4].findAll { it > 2 } == [3, 4]
assert [1, 2, 3, 4].findIndexOf { it > 2 } == 2
assert [1, 2, 3, 1].findLastIndexOf { it > 2 } == 2
assert [1, 2, 3, 4].inject { acc, i -> acc + i } == 10
assert (1..10).collate(3) == [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
assert (1..10).chop(1, 3, 2, -1) == [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10]]
assert [1, 2, 3].permutations().toList() == [
    [1, 2, 3], [3, 2, 1], [2, 1, 3], [3, 1, 2], [1, 3, 2], [2, 3, 1]
]
def matrix = [['a', 'b'], [ 1 , 2 ]]
assert matrix.transpose() == [ ['a', 1], ['b', 2] ]
assert matrix.combinations() == [ ['a', 1], ['b', 1], ['a', 2], ['b', 2] ]
assert [1, 2, 3].subsequences()*.toList() == [[1], [1, 2, 3], [2], [2, 3], [1, 2], [3], [1, 3]]
def answers = [1, 2, 3].withDefault{ 42 }
assert answers[2] == 3 && answers[99] == 42
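Groovy's collate has no direct single-method JDK equivalent, so for comparison here is a minimal plain-Java sketch of the same chunking idea (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class Collate {
    // Partition a list into consecutive chunks of (at most) the given size
    static <T> List<List<T>> collate(List<T> list, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < list.size(); i += size)
            chunks.add(list.subList(i, Math.min(i + size, list.size())));
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> nums = IntStream.rangeClosed(1, 10).boxed().toList();
        System.out.println(collate(nums, 3)); // [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
    }
}
```

Note that subList returns views onto the original list; copy the chunks if the source list will later be modified.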

GINQ processing

Groovy also supports language integrated query support to process lists:

// squares of odd numbers between 1 and 5
assert [1, 9, 25] == GQL {
    from n in 1..5 where n % 2 != 0 select n ** 2
}

// group pets by name size
assert ["3:[cat, dog]", "4:[fish]", "6:[canary, gerbil]"] == GQL {
    from p in pets
    groupby p.size() as size
    select size, agg(p) as names
}*.with{ "$it.size:$it.names" }
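For comparison, the first GINQ query maps directly onto the JDK stream API. A plain-Java sketch of the same query (class and method names are illustrative):

```java
import java.util.List;
import java.util.stream.IntStream;

public class OddSquares {
    // squares of odd numbers between 1 and 5, stream-style
    static List<Integer> oddSquares() {
        return IntStream.rangeClosed(1, 5)  // from n in 1..5
                .filter(n -> n % 2 != 0)    // where n % 2 != 0
                .map(n -> n * n)            // select n ** 2
                .boxed().toList();
    }

    public static void main(String[] args) {
        System.out.println(oddSquares()); // [1, 9, 25]
    }
}
```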

Stream methods

Useful stream methods (it is worthwhile comparing some of the examples to earlier non-stream variants):

pets = ['cat', 'canary', 'dog', 'fish', 'gerbil']
assert pets.stream().filter(p -> p.size() == 3).map(String::toUpperCase).toList() == ['CAT', 'DOG']
assert pets.stream().map(p -> p.size()).distinct().sorted().toList() == [3, 4, 6] // ordered pet name sizes
assert nums.stream().reduce{ a, b -> a + b }.get() == 10
assert (1..10).stream().skip(3).limit(5).filter(i -> i % 2 == 0).map(i -> i ** 2).toList() == [16, 36, 64]
assert [1, 2, 3, 4].stream().flatMap(i -> i % 2 ? Stream.empty() : Stream.of(i**2, i**3)).toList() == [4, 8, 16, 64]
assert pets.stream().collect(Collectors.groupingBy(p -> p.size())) == [3:['cat', 'dog'], 4:['fish'], 6:['canary', 'gerbil']]
assert [1, 2, 3, 4].stream().map(Integer::toString).collect(Collectors.joining('-')) == '1-2-3-4'
Arrays.stream(0..9 as int[]).summaryStatistics().with {
    assert sum == 45 && min == 0 && max == 9 && average == 4.5 && count == 10
}
assert pets.stream().allMatch(w -> w ==~ /.*[aeiou].*/) // all pet names contain a vowel

GPars

Before looking at GPars, it is worth looking at parallel stream processing:

// calculate squares of odd numbers from input list
assert (1..5).parallelStream().filter{ it % 2 != 0 }.map(n -> n ** 2).toList() == [1, 9, 25]

GPars was designed to provide similar functionality long before streams processing was available. It still has some useful features.

Groovy has several tricks for removing the outer "withPool" clauses but we'll use the longhand here. Here are two GPars variations of the above streams example:

GParsPool.withPool {
    assert (1..5).findAllParallel{ it % 2 }.collectParallel{ it ** 2 } == [1, 9, 25]
    assert (1..5).parallel.filter{ it % 2 }.map{ it ** 2 }.collection == [1, 9, 25]
}

Or using (JEP 425) virtual threads:

GParsExecutorsPool.withExistingPool(Executors.newVirtualThreadPerTaskExecutor()) {
    assert (1..5).findAllParallel{ it % 2 }.collectParallel{ it ** 2 } == [1, 9, 25]
}

Other libraries

There are numerous list-related libraries on the JVM. We'll look at just a few.

Eclipse collections

Eclipse Collections comes with many container types including immutable collections, primitive collections, bimaps, multimaps and bags as well as numerous utility methods. It focuses on reduced memory footprint and efficient containers. It might be of particular interest if you need primitive collections, immutable collections or some more exotic collection types like bags or bidirectional maps. Here are just a few examples:

var certainties = Lists.immutable.of('death', 'taxes')
assert certainties.reduce{ a, b -> "$a & $b" }.get() == 'death & taxes'
var numBag = Bags.immutable.with('One', 'One', 'Two', 'Three')
assert numBag.toMapOfItemToCount() == [One:2, Two:1, Three:1]
var biMap = BiMaps.immutable.with(6, "six", 2, "two")
assert biMap.inverse().six == 6

Guava

Guava provides a number of extensions to the JDK collections ecosystem. In particular, it has immutable collections, new collection types like multisets and bidirectional maps and various powerful extensions and utilities. Here are a few examples:

var set = TreeMultiset.create([1, 2, 3])
assert set == TreeMultiset.create([3, 2, 1])
set.addAll([1, 3, 5])
assert set.size() == 6 && set.elementSet().size() == 4
assert set.toList() == [1, 1, 2, 3, 3, 5]
var bimap = HashBiMap.create()
bimap.five = 5
assert bimap.inverse()[5] == 'five'

Apache Commons Collections

The Apache Commons Collections library extends upon the JDK collections framework adding some new types like bidirectional maps and bags as well as providing many comparator and iterator implementations. The library was designed to fill gaps in the JDK offerings and while some of those gaps in the JDK have now been filled by the JDK itself, Commons Collections still contains much useful functionality. Here are a few examples:

var six = [six: 6] as TreeBidiMap
assert six.inverseBidiMap() == [6: 'six']
var bag = new HashBag(['one'] * 6)
bag.remove('one', 2)
assert bag.getCount('one') == 4

Further Information

Conclusion

We have looked at the more common methods for list processing with Groovy and a few other useful libraries.



Matrix calculations with Groovy, Apache Commons Math, ojAlgo, Nd4j and EJML

by paulk


Posted on Thursday August 18, 2022 at 01:41PM in Technology


This blog looks at performing matrix calculations with Groovy using various libraries: Apache Commons Math, ojAlgo, EJML, and Nd4j (part of Eclipse Deeplearning4j). We'll also take a quick look at using the incubating Vector API for matrix calculations (JEPs 338, 414, 417, 426).

Fibonacci

The Fibonacci sequence has origins in India centuries earlier but is named after the Italian author of the publication, The Book of Calculation, published in 1202. In that publication, Fibonacci proposed the sequence as a means to calculate the growth of idealized (biologically unrealistic) rabbit populations. He proposed that a newly born breeding pair of rabbits are put in a field; each breeding pair mates at the age of one month, and at the end of their second month they always produce another pair of rabbits; and rabbits never die, but continue breeding forever. Fibonacci posed the puzzle: how many pairs will there be in one year? The sequence goes like this:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233

We can solve this problem using matrices. If we multiply the matrix [[1, 1], [1, 0]] by itself n times, we get the matrix [[F(n+1), F(n)], [F(n), F(n-1)]], where F(n) is the nth Fibonacci number. This is an operation known as matrix exponentiation. Let's explore this problem using four of the most popular and maintained matrix libraries.
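To make the matrix-exponentiation idea concrete before reaching for a library, here is a minimal plain-Java sketch with a hand-rolled 2x2 multiply (class and method names are illustrative):

```java
public class FibMatrix {
    // Multiply two 2x2 matrices of longs
    static long[][] mult(long[][] a, long[][] b) {
        return new long[][]{
            {a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]},
            {a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]}
        };
    }

    // [[1,1],[1,0]]^n = [[F(n+1),F(n)],[F(n),F(n-1)]], so F(n) is the off-diagonal entry
    static long fib(int n) {
        long[][] m = {{1, 1}, {1, 0}}, p = {{1, 0}, {0, 1}};
        for (int i = 0; i < n; i++) p = mult(p, m);
        return p[0][1];
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 13; n++) System.out.print(fib(n) + " ");
        // 1 1 2 3 5 8 13 21 34 55 89 144 233
    }
}
```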

Apache Commons Math

Let's explore solving this problem using Apache Commons Math. Apache Commons Math is a library of lightweight, self-contained mathematics and statistics components. Matrices are part of the linear algebra part of this library and for that context, matrices of double values are relevant. So, we'll represent our Fibonacci numbers as double values.

double[][] data = [[1d, 1d], [1d, 0d]]
def m = MatrixUtils.createRealMatrix(data)
println m * m * m
println m**6

Commons Math has a factory method for creating matrices from double arrays. The names of its methods for multiplication and exponentiation happen to align with the methods Groovy uses for operator overloading, namely multiply and power, so we can use Groovy's convenient shorthands.

When we run the script, the output looks like this:

Array2DRowRealMatrix{{3.0,2.0},{2.0,1.0}}
Array2DRowRealMatrix{{13.0,8.0},{8.0,5.0}}

We could go a little further and print the values from the matrix, but the result is clear enough. We see the values in the Fibonacci sequence appearing in the output.

EJML

EJML (Efficient Java Matrix Library) is a linear algebra library for manipulating real, complex, dense, and sparse matrices. It is also a 100% Java solution. It has some novel features including import from Matlab and support for semirings (GraphBLAS) which can be used for graph algorithms where a sparse matrix may be used to represent a graph as an adjacency matrix or incidence matrix.

We can do the same calculation using EJML:

def m = new SimpleMatrix([[1d, 1d], [1d, 0d]] as double[][])
def ans = m.mult(m).mult(m)
println ans
6.times { ans = ans.mult(m) }
println ans

The name of the multiply method differs from the one where we can use automatic operator overloading shorthands, so we just call the method that EJML provides. See a little later under language extensibility on how we could in fact add support to use the same shorthands as we saw for Commons Math.

EJML doesn't have an exponentiation method but we just call mult the requisite number of times to achieve the same effect. Note that we bumped the number of iterations for the second matrix, to reveal the next bunch of elements from the Fibonacci sequence.
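Repeated multiplication like this needs n-1 multiplies for m^n. When a library lacks a power method, exponentiation by squaring gets the same result in O(log n) multiplies. A plain-Java sketch of the idea for 2x2 matrices (class and method names are illustrative):

```java
public class MatPow {
    static double[][] mult(double[][] a, double[][] b) {
        return new double[][]{
            {a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]},
            {a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]}
        };
    }

    // Compute m^n by repeatedly squaring m and multiplying in the bits of n
    static double[][] pow(double[][] m, int n) {
        double[][] result = {{1, 0}, {0, 1}};  // identity
        while (n > 0) {
            if ((n & 1) == 1) result = mult(result, m);
            m = mult(m, m);
            n >>= 1;
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] m = {{1, 1}, {1, 0}};
        System.out.println(java.util.Arrays.deepToString(pow(m, 9)));
        // [[55.0, 34.0], [34.0, 21.0]]
    }
}
```

The m^9 result matches the second EJML matrix above.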

Running this script has this output:

Type = DDRM , rows = 2 , cols = 2
 3.0000E+00  2.0000E+00 
 2.0000E+00  1.0000E+00 

Type = DDRM , rows = 2 , cols = 2
 5.5000E+01  3.4000E+01 
 3.4000E+01  2.1000E+01

The first matrix has the same number as previously, and the second reveals the next numbers in the Fibonacci sequence (21, 34, and 55).

Nd4j

Nd4j brings functionality familiar to Python users to the JVM. It contains a mix of numpy operations and tensorflow/pytorch operations. Nd4j makes use of native backends to allow it to work on different platforms and provide efficient operation when scaling up.

The code for our Fibonacci solution is very similar to EJML:

def m = Nd4j.create([[1, 1], [1, 0]] as int[][])
def ans = m.mmul(m).mmul(m)
println ans
9.times { ans = ans.mmul(m) }
println ans

One feature that is different to the previous two libraries is that Nd4j supports integer matrices (as well as doubles and numerous other numerical and related types).

Running the script gives the following output:

...
[main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
...
[main] INFO org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory - Binary level Generic x86 optimization level AVX/AVX2
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
...
[[         3,         2], 
 [         2,         1]]
[[       233,       144], 
 [       144,        89]]

Again, the first matrix is the same as we have seen previously, the second has been bumped along the Fibonacci sequence by three more elements.

ojAlgo

The next library we'll look at is ojAlgo (oj! Algorithms). It is an open source all-Java offering for mathematics, linear algebra and optimisation, supporting data science, machine learning and scientific computing. It claims to be the fastest pure-Java linear algebra library available and the project website provides links to numerous benchmarks backing that claim.

Here is the code for our Fibonacci example:

def factory = Primitive64Matrix.FACTORY
def m = factory.rows([[1d, 1d], [1d, 0d]] as double[][])
println m * m * m
println m**15

We can see it supports the same operator overloading features we saw for Commons Math.

When we run the script, it has the following output:

org.ojalgo.matrix.Primitive64Matrix < 2 x 2 >
{ { 3.0,	2.0 },
{ 2.0,	1.0 } }
org.ojalgo.matrix.Primitive64Matrix < 2 x 2 >
{ { 987.0,	610.0 },
{ 610.0,	377.0 } }

As expected, the first matrix is as we've seen before, while the second reveals the next three numbers in the sequence.

Exploring the Vector API and EJML

From JDK16, various versions (JEPs 338, 414, 417, 426) of the Vector API have been available as an incubating preview feature. The HotSpot compiler already had some minimal optimisations that can leverage vector hardware instructions, but the Vector API expands the scope of possible optimisations considerably. We could look at writing our own code that might make use of the Vector API and perhaps perform our matrix multiplications ourselves. But, one of the libraries has already done just that, so we'll explore that path.

The main contributor to the EJML library has published a repo for the purposes of very early prototyping and benchmarking. We'll use the methods from one of its classes to explore use of the vector API for our Fibonacci example. The MatrixMultiplication class has three methods: mult_ikj_simple is coded in the way any of us might write a multiplication method as a first pass from its definition without any attempts at optimisation, mult_ikj is coded in a highly-optimised fashion and corresponds to the code EJML would normally use, and mult_ikj_vector uses the Vector API. Note, you can think of these methods as "one layer down" from the mult method we called in the previous example, i.e. the mult method we used previously would be calling one of these under the covers. That's why we pass the internal "matrix" representation instead of our SimpleMatrix instance.

For our little calculations, the optimisations offered by the Vector API would not be expected to be huge. However, we'll do our calculation for generating the matrix we did as a first step for all of the libraries and we'll do it in a loop with 1000 iterations for each of the three methods (mult_ikj_simple, mult_ikj, and mult_ikj_vector). The code looks like this:

def m = new SimpleMatrix([[1, 1], [1, 0]] as double[][])
double[] expected = [3, 2, 2, 1]
def step1, result
long t0 = System.nanoTime()
1000.times {
    step1 = new SimpleMatrix(2, 2)
    result = new SimpleMatrix(2, 2)
    MatrixMultiplication.mult_ikj_simple(m.matrix, m.matrix, step1.matrix)
    MatrixMultiplication.mult_ikj_simple(step1.matrix, m.matrix, result.matrix)
    assert result.matrix.data == expected
}
long t1 = System.nanoTime()
1000.times {
    step1 = new SimpleMatrix(2, 2)
    result = new SimpleMatrix(2, 2)
    MatrixMultiplication.mult_ikj(m.matrix, m.matrix, step1.matrix)
    MatrixMultiplication.mult_ikj(step1.matrix, m.matrix, result.matrix)
    assert result.matrix.data == expected
}
long t2 = System.nanoTime()
1000.times {
    step1 = new SimpleMatrix(2, 2)
    result = new SimpleMatrix(2, 2)
    MatrixMultiplication.mult_ikj_vector(m.matrix, m.matrix, step1.matrix)
    MatrixMultiplication.mult_ikj_vector(step1.matrix, m.matrix, result.matrix)
    assert result.matrix.data == expected
}
long t3 = System.nanoTime()
printf "Simple:    %6.2f ms\n", (t1 - t0) / 1000_000
printf "Optimized: %6.2f ms\n", (t2 - t1) / 1000_000
printf "Vector:    %6.2f ms\n", (t3 - t2) / 1000_000

This example was run on JDK16 with the following VM options: "--enable-preview --add-modules jdk.incubator.vector".

The output looks like this:

WARNING: Using incubator modules: jdk.incubator.vector
Simple:    116.34 ms
Optimized:  34.91 ms
Vector:     21.94 ms

We can see here that we have some improvement even for our trivial little calculation. Certainly, for bigger problems, the benefit of using the Vector API could be quite substantial.

We should give a big disclaimer here. This little microbenchmark using a loop of 1000 will give highly variable results and was just done to give a very simple performance comparison. For a more predictable comparison, consider running the jmh benchmark in the aforementioned repo. And you may wish to wait until the Vector API is out of preview before relying on it for any production code - but by all means, consider trying it out now.

Leslie Matrices

Earlier, we described the Fibonacci sequence as being for unrealistic rabbit populations, where rabbits never died and continued breeding forever. It turns out that Fibonacci matrices are a special case of a more generalised model which can model realistic rabbit populations (among other things). These are Leslie matrices. For Leslie matrices, populations are divided into classes, and we keep track of birth rates and survival rates over a particular period for each class. We store this information in a matrix in a special form. The populations for each class for the next period can be calculated from those for the current period through multiplication by the Leslie matrix.

This technique can be used for animal populations or human population calculations. A Leslie matrix can help you find out if there will be enough GenX, Millennials, and GenZ tax payers to support an aging and soon-to-retire baby boomer population. Sophisticated animal models might track populations for an animal and for its predators or its prey. The survival and birth rates might be adjusted based on such information. Given that only females give birth, Leslie models are often done only in terms of the female population, with the total population extrapolated from that.
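The core of the model is a single matrix-vector multiplication per period: p(t+1) = L * p(t). A minimal plain-Java sketch of one projection step, using the kangaroo numbers from the example below (class and method names are illustrative):

```java
import java.util.Arrays;

public class Leslie {
    // One projection step: p(t+1) = L * p(t)
    static double[] step(double[][] L, double[] p) {
        double[] next = new double[L.length];
        for (int i = 0; i < L.length; i++)
            for (int j = 0; j < p.length; j++)
                next[i] += L[i][j] * p[j];
        return next;
    }

    public static void main(String[] args) {
        double[][] L = {
            {0,   2.3, 0.4},   // birth rates B1 B2 B3
            {0.6, 0,   0},     // survival rate S1 into the middle class
            {0,   0.3, 0}      // survival rate S2 into the oldest class
        };
        double[] p = {400, 400, 400};
        long[] rounded = Arrays.stream(step(L, p)).mapToLong(Math::round).toArray();
        System.out.println(Arrays.toString(rounded)); // [1080, 240, 120]
    }
}
```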

We'll show an example for a kangaroo population based on this video tutorial for Leslie matrices. It can help us find out if the kangaroo population is under threat (perhaps drought, fires or floods have impacted their feeding habitat) or if favorable conditions are leading to overpopulation.

Following that example, we divide kangaroos into 3 population classes: ages 0 to 3, 3 to 6, and 6 to 9. We are going to look at the population every three years. The 0-3 year olds birthrate (B1) is 0 since they are pre-fertile. The most fertile 3-6 year olds birthrate (B2) is 2.3. The old roos (6-9) have a birthrate (B3) of 0.4. We assume no kangaroos survive past 9 years old. 60% (S1) of the young kangaroos survive to move into the next age group. 30% (S2) of the middle-aged kangaroos survive into old age. Initially, we have 400 kangaroos in each age group.

Here is what the code looks like for this model:

double[] init = [400,   // 0..<3
                 400,   // 3..<6
                 400]   // 6..9
def p0 = MatrixUtils.createRealVector(init)
println "Initial populations: $p0"

double[][] data = [
    [0  , 2.3, 0.4],   // B1 B2 B3
    [0.6, 0  , 0  ],   // S1 0  0
    [0  , 0.3, 0  ]    // 0  S2 0
]
def L = MatrixUtils.createRealMatrix(data)
def p1 = L.operate(p0)
println "Population after 1 round: $p1"

def p2 = L.operate(p1)
println "Population after 2 rounds: $p2"

def L10 = L ** 10
println "Population after 10 rounds: ${L10.operate(p0).toArray()*.round()}"

This code produces the following output:

Initial populations: {400; 400; 400}
Population after 1 round: {1,080; 240; 120}
Population after 2 rounds: {600; 648; 72}
Population after 10 rounds: [3019, 2558, 365]

After the first round, we see many young roos but a worrying drop off in the older age groups. After the second round, only the oldest age group looks worryingly small. However, with the healthy numbers in the young generation, we can see that after 10 generations the overall population is indeed not at risk. In fact, overpopulation might become a problem.

Encryption with matrices

An early technique to encrypt a message was the Caesar cipher. It substitutes letters in the alphabet by the letter shifted a certain amount along in the alphabet, e.g. "IBM" becomes "HAL" if shifting to the previous letter and "VMS" becomes "WNT" if shifting one letter forward in the alphabet. This kind of cipher can be broken by looking at frequency analysis of letters or pattern words.

The Hill cipher improves upon the Caesar cipher by factoring multiple letters into each letter of the encrypted text. Using matrices made it practical to look at three or more symbols at once. In general, an N-by-N matrix (the key) is multiplied by an encoding of the message in matrix form. The result is a matrix representation (encoded form) of the encrypted text. We use the inverse matrix of our key to decrypt our message.

We need to have a way to convert our text message to and from a matrix. A common scheme is to encode A as 1, B as 2, and so on. We'll just use the ASCII value for each character. We define encode and decode methods to do this:

double[][] encode(String s) { s.bytes*.intValue().collate(3) as double[][] }
String decode(double[] data) { data*.round() as char[] }

We'll define a 2-by-2 matrix as our key and use it to encrypt. We'll find the inverse of our key and use that to decrypt. If we wanted to, we could use a 3-by-3 key for improved security at the cost of more processing time.
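For a 2-by-2 key, the inverse the decryption step relies on has a simple closed form: swap the diagonal entries, negate the off-diagonal ones, and divide by the determinant. A plain-Java sketch using the same key as below (class and method names are illustrative):

```java
public class Inverse2x2 {
    // Inverse of a 2x2 matrix: swap the diagonal, negate the off-diagonal,
    // divide everything by the determinant
    static double[][] invert(double[][] m) {
        double det = m[0][0] * m[1][1] - m[0][1] * m[1][0];
        if (det == 0) throw new ArithmeticException("matrix is singular");
        return new double[][]{
            { m[1][1] / det, -m[0][1] / det},
            {-m[1][0] / det,  m[0][0] / det}
        };
    }

    public static void main(String[] args) {
        double[][] key = {{1, 3}, {-1, 2}};   // det = 1*2 - 3*(-1) = 5
        System.out.println(java.util.Arrays.deepToString(invert(key)));
        // [[0.4, -0.6], [0.2, 0.2]]
    }
}
```

A key is only usable if its determinant is non-zero, otherwise no inverse exists and the message can't be decrypted.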

Our code looks like this:

def message = 'GROOVY'
def m = new SimpleMatrix(encode(message))
println "Original: $message"

def enKey = new SimpleMatrix([[1, 3], [-1, 2]] as double[][])
def encrypted = enKey.mult(m)
println "Encrypted: ${decode(encrypted.matrix.data)}"

def deKey = enKey.invert()
def decrypted = deKey.mult(encrypted)
println "Decrypted: ${decode(decrypted.matrix.data)}"

When run, it has the following output:

Original: GROOVY
Encrypted: ĴŔŚWZc
Decrypted: GROOVY

This offers far more security than the Caesar cipher, however, given today's computing availability, Hill ciphers can still eventually be broken with sufficient brute force. For this reason, Hill ciphers are seldom used on their own for encryption but they are often used in combination with other techniques to add diffusion - strengthening the security offered by the other techniques.

Shape manipulation

Our final example looks at geometrically transforming shapes. To do this, we represent the points of the shape as vectors and multiply them using transforms represented as matrices. We need only worry about the corners, since we'll use Swing to draw our shape and it has a method for drawing a polygon by giving its corners.

First we'll use Groovy's SwingBuilder to set up our frame:

new SwingBuilder().edt {
    def frame = frame(title: 'Shapes', size: [420, 440], show: true, defaultCloseOperation: DISPOSE_ON_CLOSE) {
        //contentPane.background = Color.WHITE
        widget(new CustomPaintComponent())
    }
    frame.contentPane.background = Color.WHITE
}

We aren't really using much of SwingBuilder's functionality here, but if we wanted to add more functionality, SwingBuilder would make that task easier.

We will actually draw our shapes within a custom component. We'll define a few color constants, a drawPolygon method which, given a list of points, will draw those points as a polygon, a vectors method to convert a list of points (the corners) into vectors, and a transform method which is a factory method for creating a transform matrix.

Here is the code:

class CustomPaintComponent extends Component {
    static final Color violet = new Color(0x67, 0x27, 0x7A, 127)
    static final Color seaGreen = new Color(0x69, 0xCC, 0x67, 127)
    static final Color crystalBlue = new Color(0x06, 0x4B, 0x93, 127)

    static drawPolygon(Graphics g, List pts, boolean fill) {
        def poly = new Polygon().tap {
            pts.each {
                addPoint(*it.toRawCopy1D()*.round()*.intValue().collect { it + 200 })
            }
        }
        fill ? g.fillPolygon(poly) : g.drawPolygon(poly)
    }

    static List<Primitive64Matrix> vectors(List<Integer>... pts) {
        pts.collect{ factory.column(*it) }
    }

    static transform(List<Number>... lists) {
        factory.rows(lists as double[][])
    }

    void paint(Graphics g) {
        g.drawLine(0, 200, 400, 200)
        g.drawLine(200, 0, 200, 400)
        g.stroke = new BasicStroke(2)

        def triangle = vectors([-85, -150], [-145, -30], [-25, -30])
        g.color = seaGreen
        drawPolygon(g, triangle, true)
        // transform triangle ...

        def rectangle = vectors([0, -110], [0, -45], [95, -45], [95, -110])
        g.color = crystalBlue
        drawPolygon(g, rectangle, true)
        // transform rectangle ...

        def trapezoid = vectors([50, 50], [70, 100], [100, 100], [120, 50])
        g.color = violet
        drawPolygon(g, trapezoid, true)
        // transform trapezoid ...
    }
}

When we run this code we see our three shapes:

2022-08-18 00_02_11-Shapes.png

We can now add our transforms. We'll have one which rotates by 90 degrees anti-clockwise, another which enlarges a shape by 25%, and one that shrinks a shape by 25%. We can combine transforms simply by multiplying them together. We'll make two transformations of our triangle: we'll rotate in both cases, but we'll shrink one and enlarge the other. We apply a transform simply by multiplying each point by the transform matrix. Then we'll draw both transformed shapes as outlines rather than filled (to make it easier to distinguish the original and transformed versions). Here is the code:

def rot90 = transform([0, 1], [-1, 0])
def bigger = transform([1.25, 0], [0, 1.25])
def smaller = transform([0.75, 0], [0, 0.75])
def triangle_ = triangle.collect{ rot90 * bigger * it }
drawPolygon(g, triangle_, false)
triangle_ = triangle.collect{ rot90 * smaller * it }
drawPolygon(g, triangle_, false)
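The arithmetic behind these transforms is easy to check with plain Groovy lists, independent of the ojAlgo types (an illustrative sketch, not the Primitive64Matrix API used above). Note the rot90 matrix maps (x, y) to (y, -x), which appears anti-clockwise on screen because the y axis points downwards:

```groovy
// multiply a 2x2 matrix (list of rows) by a 2x1 point
def apply = { m, p ->
    [m[0][0] * p[0] + m[0][1] * p[1],
     m[1][0] * p[0] + m[1][1] * p[1]]
}

def rot90 = [[0, 1], [-1, 0]]   // same values as transform([0, 1], [-1, 0])
assert apply(rot90, [3, 4]) == [4, -3]

// applying the rotation twice gives a 180 degree rotation
assert apply(rot90, apply(rot90, [3, 4])) == [-3, -4]
```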

For our rectangle, we'll have one simple transform which flips the shape in the vertical axis, and a second transform which combines multiple changes in one. We could have split this into smaller transforms and then multiplied them together, but here they are all in one. We flip in the horizontal axis and then apply a shear. We then draw the transformed shapes as outlines:

def flipV = transform([1, 0], [0, -1])
def rectangle_ = rectangle.collect{ flipV * it }
drawPolygon(g, rectangle_, false)
def flipH_shear = transform([-1, 0.5], [0, 1])
rectangle_ = rectangle.collect{ flipH_shear * it }
drawPolygon(g, rectangle_, false)

For our trapezoid, we create a transform which rotates 45 degrees anti-clockwise (recall sin 45° = cos 45° ≈ 0.707). Then we create 6 transforms rotating at 45, 90, 135 degrees and so forth. We draw each transformed shape as an outline:

def rot45 = transform([0.707, 0.707], [-0.707, 0.707])
def trapezoid_
(1..6).each { z ->
    trapezoid_ = trapezoid.collect{ rot45 ** z * it }
    drawPolygon(g, trapezoid_, false)
}
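We can also sanity-check the rot45 values: eight successive 45-degree rotations should bring a shape back to (almost) where it started. Using plain Groovy lists again as a sketch, and remembering 0.707 is only an approximation of √2/2, so a little error accumulates:

```groovy
// 2x2 matrix multiplication over lists of rows
def matmul = { a, b ->
    (0..1).collect { i ->
        (0..1).collect { j -> a[i][0] * b[0][j] + a[i][1] * b[1][j] }
    }
}

def rot45 = [[0.707, 0.707], [-0.707, 0.707]]
def m = [[1, 0], [0, 1]]                  // identity
8.times { m = matmul(m, rot45) }          // rotate 8 x 45 = 360 degrees

assert (m[0][0] - 1).abs() < 0.01         // back to the identity...
assert m[0][1].abs() < 0.01               // ...give or take rounding error
```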

When we run the entire example, here is what the output looks like:

2022-08-18 00_00_48-Shapes.png

We can see here that matrix transforms give us powerful ways to manipulate images. We have used 2D images here, but the same techniques would apply to 3D shapes.

Language and tool extensibility

We saw earlier that some of the examples could make use of Groovy operator shorthand syntax, while others couldn't. Here is a summary of some common methods in the libraries:

Groovy operator    +       -          *          **
Groovy method      plus    minus      multiply   power
Commons Math       add     subtract   multiply   power
EJML               plus    minus      mult       -
Nd4j               add     sub        mmul       -
ojAlgo             add     subtract   multiply   power

Where the library used the same name as Groovy's method, the shorthand could be used.
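This operator-to-method mapping is plain Groovy and easy to verify: the operator form and the named method are interchangeable, and any class supplying a suitably named method picks up the operator automatically. A minimal sketch (the Scale class here is made up purely for illustration):

```groovy
assert 2 + 3 == 2.plus(3)
assert 7 - 4 == 7.minus(4)
assert 2 ** 10 == 2.power(10)

// a class defining multiply() gets the '*' operator for free
class Scale {
    int factor
    Scale multiply(Scale other) { new Scale(factor: factor * other.factor) }
}
assert (new Scale(factor: 2) * new Scale(factor: 5)).factor == 10
```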

Groovy offers numerous extensibility features. We won't go into all of the details but instead refer readers to the History of Groovy paper which gives more details.

In summary, that paper defined the following extensions for Commons Math matrices:

RealMatrix.metaClass {
    plus << { RealMatrix ma -> delegate.add(ma) }
    plus << { double d -> delegate.scalarAdd(d) }
    multiply { double d -> delegate.scalarMultiply(d) }
    bitwiseNegate { -> new LUDecomposition(delegate).solver.inverse }
    constructor = { List l -> MatrixUtils.createRealMatrix(l as double[][]) }
}

This fixes some of the method name discrepancies in the above table. We can now use the operator shorthand for matrix and scalar addition, as well as scalar multiplication. We can also use the ~ (bitwiseNegate) operator when finding the inverse. The mention of double[][] during matrix construction is now also gone.
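The same metaClass technique works on any class, so here is a self-contained sketch of the pattern on a made-up Money class (purely illustrative, not the Commons Math API):

```groovy
class Money {
    BigDecimal amount
}

// add operator methods dynamically via the metaClass;
// inside the closure, 'delegate' is the receiver instance
Money.metaClass.plus = { Money other -> new Money(amount: delegate.amount + other.amount) }
Money.metaClass.multiply = { Number n -> new Money(amount: delegate.amount * n) }

def a = new Money(amount: 3.50)
def b = new Money(amount: 1.25)
assert (a + b).amount == 4.75
assert (a * 2).amount == 7.0
```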

The paper goes on to discuss how to automatically add the necessary imports for a matrix library and provide aliases if needed. The imports aren't shown for the code listings in this blog but are in the listings in the sample code repo. In any case, the imports can be "configured away" as the paper discusses. The end result is that our code, in its entirety, can look like this:

var m = [[1, 1], [1, 0]] as Matrix
m**6

The paper also discusses tooling extensibility, in particular the visualisation aspects of the GroovyConsole. It shows how to define an output transform which renders any matrix result using the jlatexmath library. So instead of seeing "Array2DRowRealMatrix{{13.0,8.0},{8.0,5.0}}", users will see a graphical rendition of the matrix. The final end-user experience when using the GroovyConsole looks like this:

2022-08-18 00_28_04-bin.png

When using Jupyter-style environments, other pretty output styles may be supported.

Conclusion

We have examined a number of simple applications of matrices using the Groovy programming language and the Apache Commons Math, ojAlgo, Nd4j, and EJML libraries. You should now be convinced that using matrices on the JVM isn't hard and that you have numerous options. We also caught a brief glimpse of the Vector API, which looks like an exciting addition to the JVM (hopefully arriving soon in non-preview mode).



Life on Mars: Units of Measurement systems, Groovy, and domain specific languages (DSLs)

by paulk


Posted on Saturday August 13, 2022 at 06:31AM in Technology


The Mars Climate Orbiter was launched in 1998 as part of a multi-faceted Mars exploration program. It was lost due to a trajectory calculation error when nearing Mars. An investigation attributed the failure to a measurement mismatch between two software systems: metric units were used by NASA and US customary units were used by spacecraft builder Lockheed Martin.

Mars Climate Orbiter - image credit: wikipedia

If the system in question had been developed using a units of measurement system, perhaps the failure could have been avoided.

Units of measurement systems

All programming languages have types for representing numbers. As an example, we may have three integers, one representing a height, one a weight and one a temperature. We can write code to add those three integers together, but the result may have no useful meaning. We could start writing our own class hierarchies for each integer type but it can quickly become cumbersome to use the resulting classes. Units of measurement systems attempt to provide a range of commonplace units, ways to construct quantities from those units and ways to manipulate them. Manipulation might involve performing numerical calculation or conversions. Goals of such systems include the ability to provide runtime and/or compile-time type safety. So, we should fail early when trying to perform the earlier mentioned addition of three unrelated quantities.
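To make that concrete, here is a tiny hand-rolled sketch of the idea in plain Groovy (a hypothetical Quantity class, nothing like the full JSR 385 API, which models dimensions and conversions properly), showing the fail-early behaviour when mixing incompatible quantities:

```groovy
import groovy.transform.Immutable

@Immutable
class Quantity {
    BigDecimal value
    String unit    // a real system models dimensions, not bare strings

    Quantity plus(Quantity other) {
        if (unit != other.unit)
            throw new IllegalArgumentException("$other.unit is not compatible with $unit")
        new Quantity(value + other.value, unit)
    }
}

def height = new Quantity(1.85, 'm')
def growth = new Quantity(0.05, 'm')
assert (height + growth).value == 1.90

def weight = new Quantity(78.0, 'kg')
try {
    height + weight          // fails early, as we'd hope
    assert false
} catch (IllegalArgumentException expected) {
    assert expected.message == 'kg is not compatible with m'
}
```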

Units of measurement systems aren't new. There are existing systems in F# and Swift, and Java has had several versions (and earlier attempts) around standardising such a system, including:

  • JSR 275: Units Specification (rejected earlier attempt)
  • JSR 363: Units of Measurement API
  • JSR 385: Units of Measurement API 2.0

There are also existing Java libraries like JScience which have been developed, and shown early promise, but now don't seem to be actively maintained. While the jury is still out about whether units of measurement systems will spread further into mainstream programming, now seems like a great time to check the status. The JSR 385 maintenance release was approved last year and the latest version of the reference implementation was released earlier this year.

JSR 385: Units of Measurement API 2.0

The first thing we need to do is bring in our dependencies (shown for Gradle - other options). The main one is the reference implementation (it brings in the javax.measure API transitively):

implementation 'tech.units:indriya:2.1.3'

JSR 385 is extensible. We'll also bring in some units from the Unicode CLDR units, e.g. MILE:

implementation 'systems.uom:systems-unicode:2.1'
implementation 'systems.uom:systems-quantity:2.1'

Let's follow on with the theme of visiting Mars. We can create variables for the mass and diameter of Mars as follows:

var mass = Quantities.getQuantity(6.39E23, KILO(GRAM))
var diameter = Quantities.getQuantity(6_779, KILOMETER)
println mass
println diameter

JSR 385 has metric prefix qualifiers like MICRO, MILLI, CENTI, KILO, MEGA, GIGA and many more. The CLDR units also define some commonly used units like KILOMETER. We could choose either here.

When we run this script, it has the following output:

639000000000000000000000 kg
6779 km

If we try to compare or add those two values, we'll see an error. Groovy has both static and dynamic natures. Using dynamic code like this:

println mass > diameter

We'll see a runtime error like this:

javax.measure.IncommensurableException: km is not compatible with kg

Or, with TypeChecked or CompileStatic in play for a statement like this:

println mass.add(diameter)

We'd see a compile-time error like this:

[Static type checking] - Cannot call tech.units.indriya.ComparableQuantity#add(javax.measure.Quantity<javax.measure.quantity.Mass>)
with arguments [tech.units.indriya.ComparableQuantity<javax.measure.quantity.Length>]

If for some strange reason we did want to compare or perform calculations between incommensurable types, we can explicitly get the value:

assert mass.value > diameter.value

This escape hatch takes away a layer of type safety but requires explicit work to do so. You would rarely want to do this.

JSR 385 also supports ranges and conversions. We can look at the minimum and maximum temperatures on Mars:

var minTemp = Quantities.getQuantity(-128, CELSIUS)
var maxTemp = Quantities.getQuantity(70, FAHRENHEIT)
println minTemp
println minTemp.to(FAHRENHEIT)
println maxTemp
println maxTemp.to(CELSIUS)
println QuantityRange.of(minTemp, maxTemp)

It's quite a bit colder than Earth! When run, this script has the following output:

-128 ℃
-198.400 ((K*5)/9)+459.67
70 ((K*5)/9)+459.67
21.1111111111111111111111111111111 ℃
min= -128 ℃, max= 70 ((K*5)/9)+459.67

In case you're wondering about the strange looking unit display for Fahrenheit temperatures, the definition of that unit we are using defines degrees Fahrenheit via a formula calculated from the temperature in kelvin.
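As a quick arithmetic cross-check of those conversions (plain Groovy, using the familiar Celsius/Fahrenheit formulas rather than the library's Kelvin-based transform):

```groovy
// -128 °C in Fahrenheit, matching the -198.400 shown above
assert -128 * 9 / 5 + 32 == -198.4

// 70 °F back to Celsius, matching the 21.111... shown above
assert (((70 - 32) * 5 / 9) as double).round(2) == 21.11
```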

We'd see the same thing if using the MILE unit:

println diameter.to(MILE)

Which shows us that the diameter of Mars is a little over 4200 miles:

4212.275312176886980036586335798934 (m*1609344)/1000

Adding some metaprogramming

Groovy has various features which allow methods to be (apparently) added to classes. We'll use extension methods. This technique involves writing static methods in a helper class using certain conventions, where the first parameter in each method is the target of the extension. Groovy code referencing instances of the target class can then call such methods as if they existed on the target class; in reality, the Groovy compiler or runtime funnels the call through the helper class. For us, it means we will have methods like getMeters() on the Number class which, using Groovy's shorthand for property notation, allows for very compact quantity definitions like 5.meters. We'll also add some methods to allow Groovy's normal operator overloading syntax to apply:

class UomExtensions {
    static Quantity<Length> getCentimeters(Number num) { Quantities.getQuantity(num, CENTI(METRE)) }

    static Quantity<Length> getMeters(Number num) { Quantities.getQuantity(num, METRE) }

    static Quantity<Length> getKilometers(Number num) { Quantities.getQuantity(num, KILO(METRE)) }

    static Quantity<Length> getCm(Number num) { getCentimeters(num) }

    static Quantity<Length> getM(Number num) { getMeters(num) }

    static Quantity<Length> getKm(Number num) { getKilometers(num) }

    static Quantity<Mass> getKilograms(Number num) { Quantities.getQuantity(num, KILO(GRAM)) }

    static Quantity<Mass> getKgs(Number num) { getKilograms(num) }

    static Quantity<Time> getHours(Number num) { Quantities.getQuantity(num, HOUR) }

    static Quantity<Time> getSeconds(Number num) { Quantities.getQuantity(num, SECOND) }

    static Quantity<Time> getHr(Number num) { getHours(num) }

    static Quantity<Time> getS(Number num) { getSeconds(num) }

    static Quantity<Speed> div(Quantity<Length> q, Quantity<Time> divisor) { q.divide(divisor) as Quantity<Speed> }

    static <Q> Quantity<Q> div(Quantity<Q> q, Number divisor) { q.divide(divisor) }

    static <Q> Quantity<Q> plus(Quantity<Q> q, Quantity<Q> addend) { q.add(addend) }

    static <Q> Quantity<Q> minus(Quantity<Q> q, Quantity<Q> subtrahend) { q.subtract(subtrahend) }
}

Note that we have longer and shorter versions of many of the methods, e.g. kgs and kilograms, m and meters. We didn't need a method for multiply since it already uses the name Groovy expects.
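If you want to experiment with this property-style shorthand without setting up an extension module (which needs a descriptor file on the classpath), the metaClass offers a quick way to get the same effect in a single script. A toy sketch with plain numbers standing in for the Quantity types:

```groovy
// toy stand-ins: plain numbers of meters rather than Quantity<Length>
Integer.metaClass.getMeters = { -> delegate }
Integer.metaClass.getKm = { -> delegate * 1000 }

assert 5.meters == 5
assert 2.km == 2000
assert 1500.meters + 500.meters == 2.km
```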

Now we can write very short definitions to declare or compare times and lengths:

def s = 1.s
assert 1000.meters == 1.km && 1.m == 100.cm

We can also declare variables for acceleration due to gravity on Earth and Mars. Gravity is a lot less on Mars:

var gMars = 3.7.m/s/s
var gEarth = 9.8.m/s/s
assert gEarth.toString() == '9.8 m/s²'
assert gEarth > gMars

We can also use the operator overloading in calculations (here showing that the Earth has a diameter that is between 1.8 and 2 times bigger than that of Mars):

var diameterEarth = 12_742.kilometers
assert diameter + diameter > diameterEarth
assert diameterEarth - diameter < diameter
assert diameter * 1.8 < diameterEarth

Even though we have more compact expressions, the same data types are in play that we saw previously. They're just a little nicer to type.

A dynamic DSL for controlling a Mars Rover

Let's now look at how you could write a little Domain-Specific-Language (DSL) to control a Mars rover robot.

Mars Rover Selfie - image credit: wikipedia

First, we'll write a Direction enum as part of our robot domain model:

enum Direction {
    left, right, forward, backward
}

There are many ways to write DSLs in Groovy. We'll use a little trick where the verbs are represented as keys in a map. Our DSL then looks like this:

def move(Direction dir) {
    [by: { Quantity<Length> dist ->
        [at: { Quantity<Speed> speed ->
            println "robot moved $dir by $dist at $speed"
        }]
    }]
}

Here the implementation is just going to print out a message indicating all of the values it is processing. The real robot would send signals to the rover's robotic subsystems.
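The map-of-closures trick is independent of the measurement types, so here it is reduced to its essentials with plain values (a standalone sketch, with ints standing in for the quantities):

```groovy
// each 'verb' returns a map whose value is a closure yielding the next map;
// Groovy lets you invoke a closure stored in a map using method-call syntax
def move(String dir) {
    [by: { int dist ->
        [at: { int speed ->
            "moved $dir by $dist at $speed"
        }]
    }]
}

assert move('left').by(2).at(5) == 'moved left by 2 at 5'

// Groovy's command chain syntax lets the same call read like English
def result = move 'left' by 2 at 5
assert result == 'moved left by 2 at 5'
```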

Our script for controlling the rover now looks like this:

move right by 2.m at 5.cm/s

Which when run gives this output:

robot moved right by 2 m at 5 cm/s

As we saw earlier, this is backed by our JSR 385 types. We'll certainly get fail-early runtime errors if there are any calculations involving mismatched types.

If we enable static typing, some additional errors will be detected at compile time, but because of the very dynamic style of our DSL implementation, not all runtime errors are reflected in the typing information. If we want, we can change our DSL implementation to use richer types which will support better static type checking. We'll look at one way to do that next.

A type-rich DSL for the Rover

Now, instead of using our nested map style we saw previously, we create several richly-typed helper classes and define our move method in terms of those classes:

class MoveHolder {
    Direction dir

    ByHolder by(Quantity<Length> dist) {
        new ByHolder(dist: dist, dir: dir)
    }
}

class ByHolder {
    Quantity<Length> dist
    Direction dir

    void at(Quantity<Speed> speed) {
        println "robot moved $dir by $dist at $speed"
    }
}

static MoveHolder move(Direction dir) {
    new MoveHolder(dir: dir)
}

While our DSL implementation has changed, the robot scripts remain the same:

move right by 2.m at 5.cm/s

Indeed, thanks to Groovy's dynamic nature, we can still run the same script and will notice no change.

If however, we enable static checking and have a script with an error like this:

move forward by 2.kgs

We'll now see a compile-time error:

[Static type checking] - Cannot call MoveHolder#by(javax.measure.Quantity<javax.measure.quantity.Length>) with arguments [javax.measure.Quantity<javax.measure.quantity.Mass>]

It is great to get this additional earlier feedback on script errors, so you may wonder why we don't write our DSL implementations like this all of the time. Actually, both the dynamic and static flavors of our DSL can be useful at different times. When prototyping our DSL, deciding on all the nouns and verbs that we should be using to control our robot, the dynamic flavored style can be much quicker to write, especially during early iterations which might evolve and change rapidly. Once the DSL language has been locked down, we can invest in adding the richer types. In the rover scenario, it might also be the case that the rover itself has limited power and so may not want to perform additional type checking steps. We might run all scripts through a type checker back at mission control before sending them to the rover, where they can be enacted in dynamic mode.

Adding custom type checking

There is one additional language feature of Groovy we haven't mentioned. Groovy's type checking mechanism is extensible, so we'll have a look at using that feature here. The rover's speed is rather limited, "In the case of exploring Mars, however, speed isn't the most relevant quality. It's about the journey and the destinations along the way. The slow pace is energy-efficient...". Let's look at limiting the speed to avoid unsafe or energy wasting movement.

We could put early defensive checks in our DSL implementation to detect undesirable manoeuvres but we can also use type checking extensions for certain kinds of errors. Groovy in fact has its own DSL for writing such extensions. That's a topic for its own blog but here's what the code looks like:

afterMethodCall { call ->
    def method = getTargetMethod(call)
    if (method.name != 'at') return
    if (call.arguments.size() != 1) return
    def arg = call.arguments[0]
    if (arg !instanceof BinaryExpression) return
    def left = arg.leftExpression
    if (left !instanceof PropertyExpression) return
    def obj = left.objectExpression
    if (obj !instanceof ConstantExpression) return
    if (obj.value > 5) {
        addStaticTypeError("Speed of $obj.value is too fast!", call)
        handled = true
    }
}

This is only a partial implementation; it makes numerous assumptions. We could remove those assumptions by adding more code, but for now we'll keep this simplified version.

So, now the following script (with the above type checking extension applied) compiles fine:

move right by 2.m at 5.cm/s

But this script fails:

move right by 2.m at 6.cm/s

The error message is:

[Static type checking] - Speed of 6 is too fast!

Conclusion

We have looked at using the JSR 385 javax.measure API using Groovy and added some DSL examples to make using the API a little nicer.


Natural Language Processing with Groovy, OpenNLP, CoreNLP, Nlp4j, Datumbox, Smile, Spark NLP, DJL and TensorFlow

by paulk


Posted on Sunday August 07, 2022 at 07:34AM in Technology


Natural Language Processing is certainly a large and sometimes complex topic with many aspects. Some of those aspects deserve entire blogs in their own right. For this blog, we will briefly look at a few simple use cases illustrating where you might be able to use NLP technology in your own project.

Language Detection

Knowing what language some text represents can be a critical first step to subsequent processing. Let's look at how to predict the language using a pre-built model and Apache OpenNLP. Here, ResourceHelper is a utility class used to download and cache the model. The first run may take a little while as it downloads the model. Subsequent runs should be fast. Here we are using a well-known model referenced in the OpenNLP documentation.

def helper = new ResourceHelper('https://dlcdn.apache.org/opennlp/models/langdetect/1.8.3/')
def model = new LanguageDetectorModel(helper.load('langdetect-183'))
def detector = new LanguageDetectorME(model)

[spa: 'Bienvenido a Madrid', fra: 'Bienvenue à Paris',
 dan: 'Velkommen til København', bul: 'Добре дошли в София'
].each { k, v ->
    assert detector.predictLanguage(v).lang == k
}

The LanguageDetectorME class lets us predict the language. In general, the predictor may not be accurate on small samples of text but it was good enough for our example. We've used the language code as the key in our map and we check that against the predicted language.

A more complex scenario is training your own model. Let's look at how to do that with Datumbox. Datumbox has a pre-trained models zoo but its language detection model didn't seem to work well for the small snippets in the next example, so we'll train our own model. First, we'll define our datasets:

def datasets = [
    English   : getClass().classLoader.getResource("training.language.en.txt").toURI(),
    French    : getClass().classLoader.getResource("training.language.fr.txt").toURI(),
    German    : getClass().classLoader.getResource("training.language.de.txt").toURI(),
    Spanish   : getClass().classLoader.getResource("training.language.es.txt").toURI(),
    Indonesian: getClass().classLoader.getResource("training.language.id.txt").toURI()
]

The German (de) training dataset comes from the Datumbox examples. The training datasets for the other languages are from Kaggle.

We set up the training parameters needed by our algorithm:

def trainingParams = new TextClassifier.TrainingParameters(
    numericalScalerTrainingParameters: null,
    featureSelectorTrainingParametersList: [new ChisquareSelect.TrainingParameters()],
    textExtractorParameters: new NgramsExtractor.Parameters(),
    modelerTrainingParameters: new MultinomialNaiveBayes.TrainingParameters()
)

We'll use a Naïve Bayes model with Chisquare feature selection.
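To get a feel for what the multinomial Naïve Bayes modeler is doing under the covers, here is a toy word-count version in plain Groovy (equal priors, Laplace smoothing, and made-up two-language training data; Datumbox's real implementation is far more sophisticated and adds the Chisquare feature selection step):

```groovy
def train = [
    English: 'welcome to london good morning thank you very much',
    French : 'bienvenue à paris bonjour merci beaucoup à bientôt'
]

// per-language word counts and totals
def counts = train.collectEntries { lang, text -> [lang, text.tokenize().countBy { it }] }
def totals = counts.collectEntries { lang, c -> [lang, c.values().sum()] }
def vocab = train.values().join(' ').tokenize().toSet().size()

// score each language by summed log-probabilities (Laplace smoothed)
def classify = { String text ->
    train.keySet().max { lang ->
        text.tokenize().sum { w ->
            Math.log((counts[lang].getOrDefault(w, 0) + 1) / (totals[lang] + vocab))
        }
    }
}

assert classify('good morning london') == 'English'
assert classify('bonjour paris') == 'French'
```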

Next we create our algorithm, train it with our training dataset, and then validate it against the training dataset. We'd normally want to split the data into training and testing datasets, to give us a more accurate statistic of the accuracy of our model. But for simplicity, while still illustrating the API, we'll train and validate with our entire dataset:

def config = Configuration.configuration
def classifier = MLBuilder.create(trainingParams, config)
classifier.fit(datasets)
def metrics = classifier.validate(datasets)
println "Classifier Accuracy (using training data): $metrics.accuracy"

When run, we see the following output:

Classifier Accuracy (using training data): 0.9975609756097561

Our test dataset will consist of some hard-coded illustrative phrases. Let's use our model to predict the language for each phrase:

['Bienvenido a Madrid', 'Bienvenue à Paris', 'Welcome to London',
 'Willkommen in Berlin', 'Selamat Datang di Jakarta'
].each { txt ->
    def r = classifier.predict(txt)
    def predicted = r.YPredicted
    def probability = sprintf '%4.2f', r.YPredictedProbabilities.get(predicted)
    println "Classifying: '$txt', Predicted: $predicted, Probability: $probability"
}

When run, it has this output:

Classifying: 'Bienvenido a Madrid',  Predicted: Spanish,  Probability: 0.83
Classifying: 'Bienvenue à Paris',  Predicted: French,  Probability: 0.71
Classifying: 'Welcome to London',  Predicted: English,  Probability: 1.00
Classifying: 'Willkommen in Berlin',  Predicted: German,  Probability: 0.84
Classifying: 'Selamat Datang di Jakarta',  Predicted: Indonesian,  Probability: 1.00

Given these phrases are very short, it is nice to get them all correct, and the probabilities all seem reasonable for this scenario.

Parts of Speech

Parts of speech (POS) analysers examine each part of a sentence (the words and potentially punctuation) in terms of the role they play in a sentence. A typical analyser will assign or annotate words with their role like identifying nouns, verbs, adjectives and so forth. This can be a key early step for tools like the voice assistants from Amazon, Apple and Google.

We'll start by looking at a perhaps lesser-known library, Nlp4j, before looking at some others. In fact, there are multiple Nlp4j libraries. We'll use the one from nlp4j.org, which seems to be the most active and recently updated.

This library uses the Stanford CoreNLP library under the covers for its English POS functionality. The library has the concept of documents, and annotators that work on documents. Once annotated, we can print out all of the discovered words and their annotations:

var doc = new DefaultDocument()
doc.putAttribute('text', 'I eat sushi with chopsticks.')
var ann = new StanfordPosAnnotator()
ann.setProperty('target', 'text')
ann.annotate(doc)
println doc.keywords.collect{ k -> "${k.facet - 'word.'}(${k.str})" }.join(' ')

When run, we see the following output:

PRP(I) VBP(eat) NN(sushi) IN(with) NNS(chopsticks) .(.)

The annotations, also known as tags or facets, for this example are as follows:

PRP    Personal pronoun
VBP    Present tense verb
NN     Noun, singular
IN     Preposition
NNS    Noun, plural

The documentation for the libraries we are using gives a more complete list of such annotations.

A nice aspect of this library is support for other languages, in particular, Japanese. The code is very similar but uses a different annotator:

doc = new DefaultDocument()
doc.putAttribute('text', '私は学校に行きました。')
ann = new KuromojiAnnotator()
ann.setProperty('target', 'text')
ann.annotate(doc)
println doc.keywords.collect{ k -> "${k.facet}(${k.str})" }.join(' ')

When run, we see the following output:

名詞(私) 助詞(は) 名詞(学校) 助詞(に) 動詞(行き) 助動詞(まし) 助動詞(た) 記号(。)

Before progressing, we'll highlight the result visualization capabilities of the GroovyConsole. This feature lets us write a small Groovy script which converts results to any swing component. In our case we'll convert lists of annotated strings to a JLabel component containing HTML including colored annotation boxes. The details aren't included here but can be found in the repo. We need to copy that file into our ~/.groovy folder and then enable script visualization as shown here:

Screenshot from 2022-08-04 21-57-35.png

Then we should see the following when running the script:

Screenshot from 2022-08-04 21-59-47.png

The visualization is purely optional but adds a nice touch. If using Groovy in notebook environments like Jupyter/BeakerX, there might be visualization tools in those environments too.

Let's look at a larger example using the Smile library.

First, the sentences that we'll examine:

def sentences = [
    'Paul has two sisters, Maree and Christine.',
    'No wise fish would go anywhere without a porpoise',
    'His bark was much worse than his bite',
    'Turn on the lights to the main bedroom',
    "Light 'em all up",
    'Make it dark downstairs'
]

A couple of those sentences might seem a little strange but they are selected to show off quite a few of the different POS tags.

Smile has a tokenizer class which splits a sentence into words. It handles numerous cases like contractions and abbreviations ("e.g.", "'tis", "won't"). Smile also has a POS class based on the hidden Markov model and a built-in model is used for that class. Here is our code using those classes:

def tokenizer = new SimpleTokenizer(true)
sentences.each {
    def tokens = Arrays.stream(tokenizer.split(it)).toArray(String[]::new)
    def tags = HMMPOSTagger.default.tag(tokens)*.toString()
    println tokens.indices.collect { tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}

We run the tokenizer for each sentence. Each token is then displayed directly or with its tag if it has one.
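A toy tokenizer illustrating the contraction handling can be sketched with a single regex (nowhere near as capable as Smile's SimpleTokenizer, but it shows the idea):

```groovy
// keep contractions like won't together, split other punctuation off
def tokenize = { String text ->
    text.findAll(/\w+'\w+|\w+|[^\w\s]/)
}

assert tokenize("His bark was much worse than his bite") ==
        ['His', 'bark', 'was', 'much', 'worse', 'than', 'his', 'bite']
assert tokenize("She won't mind, he said.") ==
        ['She', "won't", 'mind', ',', 'he', 'said', '.']
```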

Running the script gives this output (rendered in the GroovyConsole as the colored-box visualization):

NNP(Paul) VBZ(has) CD(two) NNS(sisters) , NNP(Maree) CC(and) NNP(Christine) .
DT(No) JJ(wise) NN(fish) MD(would) VB(go) RB(anywhere) IN(without) DT(a) NN(porpoise)
PRP$(His) NN(bark) VBD(was) RB(much) JJR(worse) IN(than) PRP$(his) NN(bite)
VB(Turn) IN(on) DT(the) NNS(lights) TO(to) DT(the) JJ(main) NN(bedroom)
NNP(Light) PRP('em) RB(all) RB(up)
VB(Make) PRP(it) JJ(dark) NN(downstairs)

[Note: the scripts in the repo just print to stdout, which is perfect when using the command-line or IDEs. The visualization in the GroovyConsole kicks in only for the actual result. So, if you are following along at home and want to use the GroovyConsole, you'd change the each to collect and remove the println, and you should be good for visualization.]

The OpenNLP code is very similar:

def tokenizer = SimpleTokenizer.INSTANCE
sentences.each {
    String[] tokens = tokenizer.tokenize(it)
    def posTagger = new POSTaggerME('en')
    String[] tags = posTagger.tag(tokens)
    println tokens.indices.collect { tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}

OpenNLP allows you to supply your own POS model but downloads a default one if none is specified.

When the script is run, it has this output (again rendered as a visualization in the GroovyConsole):

PROPN(Paul) VERB(has) NUM(two) NOUN(sisters) PUNCT(,) PROPN(Maree) CCONJ(and) PROPN(Christine) PUNCT(.)
DET(No) ADJ(wise) NOUN(fish) AUX(would) VERB(go) ADV(anywhere) ADP(without) DET(a) NOUN(porpoise)
PRON(His) NOUN(bark) AUX(was) ADV(much) ADJ(worse) ADP(than) PRON(his) NOUN(bite)
VERB(Turn) ADP(on) DET(the) NOUN(lights) ADP(to) DET(the) ADJ(main) NOUN(bedroom)
NOUN(Light) PUNCT(') NOUN(em) ADV(all) ADP(up)
VERB(Make) PRON(it) ADJ(dark) NOUN(downstairs)

The observant reader may have noticed some slight differences in the tags used in this library. They are essentially the same but using slightly different names. This is something to be aware of when swapping between POS libraries or models. Make sure you look up the documentation for the library/model you are using to understand the available tag types.

Entity Detection

Named entity recognition (NER) seeks to identify and classify named entities in text. Categories of interest might be persons, organizations, locations, dates, etc. It is another technology used in many fields of NLP.

We'll start with our sentences to analyse:

String[] sentences = [
    "A commit by Daniel Sun on December 6, 2020 improved Groovy 4's language integrated query.",
    "A commit by Daniel on Sun., December 6, 2020 improved Groovy 4's language integrated query.",
    'The Groovy in Action book by Dierk Koenig et. al. is a bargain at $50, or indeed any price.',
    'The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.',
    'I saw Ms. May Smith waving to June Jones.',
    'The parcel was passed from May to June.',
    'The Mona Lisa by Leonardo da Vinci has been on display in the Louvre, Paris since 1797.'
]

We'll use some well-known models, focusing on the person, money, date, time, and location models:

def base = 'http://opennlp.sourceforge.net/models-1.5'
def modelNames = ['person', 'money', 'date', 'time', 'location']
def finders = modelNames.collect { model ->
    new NameFinderME(DownloadUtil.downloadModel(new URL("$base/en-ner-${model}.bin"), TokenNameFinderModel))
}

We'll now tokenize our sentences:

def tokenizer = SimpleTokenizer.INSTANCE
sentences.each { sentence ->
    String[] tokens = tokenizer.tokenize(sentence)
    Span[] tokenSpans = tokenizer.tokenizePos(sentence)
    def entityText = [:]
    def entityPos = [:]
    finders.indices.each { fi ->
        // could be made smarter by looking at probabilities and overlapping spans
        Span[] spans = finders[fi].find(tokens)
        spans.each { span ->
            def se = span.start..<span.end
            def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end)
            entityPos[span.start] = pos
            entityText[span.start] = "$span.type(${sentence[pos]})"
        }
    }
    entityPos.keySet().sort().reverseEach {
        def pos = entityPos[it]
        def (from, to) = [pos.from, pos.to + 1]
        sentence = sentence[0..<from] + entityText[it] + sentence[to..-1]
    }
    println sentence
}
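One detail worth calling out in the code above is the reverseEach: substitutions are applied right-to-left so that the character offsets of earlier entities remain valid after each replacement. Here's the trick in isolation, with hypothetical hard-coded spans rather than real NameFinderME output:

```groovy
def sentence = 'I saw May Smith waving to June Jones.'
// hypothetical pre-computed spans: [start, end (exclusive), type]
def spans = [[6, 15, 'person'], [26, 36, 'person']]

// replace right-to-left so earlier offsets aren't invalidated
spans.reverseEach { start, end, type ->
    sentence = sentence[0..<start] + "$type(${sentence[start..<end]})" + sentence[end..-1]
}
assert sentence == 'I saw person(May Smith) waving to person(June Jones).'
```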

And when visualized, it shows this:

A commit by person(Daniel Sun) on date(December 6, 2020) improved Groovy 4's language integrated query.
A commit by person(Daniel) on Sun., date(December 6, 2020) improved Groovy 4's language integrated query.
The Groovy in Action book by person(Dierk Koenig) et. al. is a bargain at money($50), or indeed any price.
The conference wrapped up date(yesterday) at time(5:30 p.m.) in location(Copenhagen), location(Denmark).
I saw Ms. person(May Smith) waving to person(June Jones).
The parcel was passed from date(May to June).
The Mona Lisa by person(Leonardo da Vinci) has been on display in the Louvre, location(Paris) since date(1797).

We can see here that most examples have been categorized as we might expect. We'd have to improve our model for it to do a better job on the "May to June" example.

Scaling Entity Detection

We can also run our named entity detection algorithms on platforms like Spark NLP which adds NLP functionality to Apache Spark. We'll use glove_100d embeddings and the onto_100 NER model.

var assembler = new DocumentAssembler(inputCol: 'text', outputCol: 'document', cleanupMode: 'disabled')

var tokenizer = new Tokenizer(inputCols: ['document'] as String[], outputCol: 'token')

var embeddings = WordEmbeddingsModel.pretrained('glove_100d').tap {
    inputCols = ['document', 'token'] as String[]
    outputCol = 'embeddings'
}

var model = NerDLModel.pretrained('onto_100', 'en').tap {
    inputCols = ['document', 'token', 'embeddings'] as String[]
    outputCol = 'ner'
}

var converter = new NerConverter(inputCols: ['document', 'token', 'ner'] as String[], outputCol: 'ner_chunk')

var pipeline = new Pipeline(stages: [assembler, tokenizer, embeddings, model, converter] as PipelineStage[])

var spark = SparkNLP.start(false, false, '16G', '', '', '')

var text = [
    "The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris."
]
var data = spark.createDataset(text, Encoders.STRING()).toDF('text')

var pipelineModel = pipeline.fit(data)

var transformed = pipelineModel.transform(data)
transformed.show()

use(SparkCategory) {
    transformed.collectAsList().each { row ->
        def res = row.text
        def chunks = row.ner_chunk.reverseIterator()
        while (chunks.hasNext()) {
            def chunk = chunks.next()
            int begin = chunk.begin
            int end = chunk.end
            def entity = chunk.metadata.get('entity').get()
            res = res[0..<begin] + "$entity($chunk.result)" + res[end<..-1]
        }
        println res
    }
}

We won't go into all of the details here. In summary, the code sets up a pipeline that transforms our input sentences, via a series of steps, into chunks, where each chunk corresponds to a detected entity. Each chunk has a start and ending position, and an associated tag type.

This may not seem like it is much different to our earlier examples, but if we had large volumes of data and we were running in a large cluster, the work could be spread across worker nodes within the cluster.

Here we have used a utility SparkCategory class which makes accessing the information in Spark Row instances a little nicer in terms of Groovy shorthand syntax. We can use row.text instead of row.get(row.fieldIndex('text')). Here is the code for this utility class:

class SparkCategory {
    static get(Row r, String field) { r.get(r.fieldIndex(field)) }
}

If doing more than this simple example, the use of SparkCategory could be made implicit through various standard Groovy techniques.
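
To see the category mechanism in isolation, here is a self-contained sketch which stands in for the Spark case: a hypothetical Holder class plays the role of Spark's Row, and a category class with a static get(Holder, String) method enables the property-style shorthand:

```groovy
// A minimal sketch of the Groovy category mechanism, using a hypothetical
// Holder class in place of Spark's Row.
class Holder {
    private map = [text: 'some text', score: 42]
    Object get(int i) { map.values().toList()[i] }
    int fieldIndex(String name) { map.keySet().toList().indexOf(name) }
}

class HolderCategory {
    // provides property-style access, e.g. holder.text
    static get(Holder h, String field) { h.get(h.fieldIndex(field)) }
}

use(HolderCategory) {
    def row = new Holder()
    assert row.text == 'some text'   // shorthand for row.get(row.fieldIndex('text'))
    assert row.score == 42
}
```

The category methods are only in effect inside the use block, which keeps the shorthand from leaking into unrelated code.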

When we run our script, we see the following output:

22/08/07 12:31:39 INFO SparkContext: Running Spark version 3.3.0
...
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
...
onto_100 download started this may take some time.
Approximate size to download 13.5 MB
...
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|          embeddings|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The Mona Lisa is ...|[{document, 0, 98...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 12, T...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
PERSON(The Mona Lisa) is a DATE(16th century) oil painting created by PERSON(Leonardo). It's held at the FAC(Louvre) in GPE(Paris).

The result has the following visualization:

The Mona Lisa [PERSON] is a 16th century [DATE] oil painting created by Leonardo [PERSON]. It's held at the Louvre [FAC] in Paris [GPE].

Here FAC is facility (buildings, airports, highways, bridges, etc.) and GPE is Geo-Political Entity (countries, cities, states, etc.).

Sentence Detection

Detecting sentences in text might seem like a simple task at first, but there are numerous special cases.

Consider the following text:

def text = '''
The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951. It describes a method for
measuring the amount of protein (even as small as 0.2 γ, were γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf. Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings. He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.
'''

There are full stops at the end of each sentence (though in general, sentences could also end with other punctuation such as exclamation marks and question marks). There are also full stops and decimal points in abbreviations, URLs, decimal numbers and so forth. Sentence detection algorithms might have some hard-coded special cases, such as abbreviations like "Dr." and "Ms.", or punctuation within an emoticon, and may also use some heuristics. In general, they might also be trained with examples like the one above.
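
To see why naive approaches struggle, here is a deliberately simplistic splitter (not OpenNLP) which breaks on sentence-ending punctuation followed by whitespace; the abbreviations trip it up:

```groovy
// A deliberately naive sentence splitter: split wherever '.', '!' or '?'
// is followed by whitespace. Abbreviations defeat it immediately.
def naiveSplit = { String s -> s.split(/(?<=[.!?])\s+/).toList() }

def sample = 'Dr. Smith arrived at 5 p.m. sharp. He left early.'
def pieces = naiveSplit(sample)
// Two real sentences, but the naive splitter found four "sentences":
assert pieces == ['Dr.', 'Smith arrived at 5 p.m.', 'sharp.', 'He left early.']
```

Trained models like OpenNLP's avoid most of these false splits.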

Here is some code for OpenNLP for detecting sentences in the above:

def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')
def model = new SentenceModel(helper.load('en-sent'))
def detector = new SentenceDetectorME(model)
def sentences = detector.sentDetect(text)
assert text.count('.') == 28
assert sentences.size() == 4
println "Found ${sentences.size()} sentences:\n" + sentences.join('\n\n')

It has the following output:

Downloading en-sent
Found 4 sentences:
The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951.

It describes a method for
measuring the amount of protein (even as small as 0.2 γ, were γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf.

Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings.

He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.

We can see that it handled all of the tricky cases in the example.

Relationship Extraction with Triples

The next step after detecting named entities and the various parts of speech of certain words is to explore relationships between them. This is often done in the form of subject-predicate-object triplets. In our earlier NER example, for the sentence "The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.", we found various date, time and location named entities.
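
Structurally, a triple is just three related strings. A sketch of one possible representation (this is illustrative only; it is not MinIE's API):

```groovy
import groovy.transform.Canonical

// An illustrative subject-predicate-object triple (not MinIE's representation).
@Canonical
class Triple {
    String subject, predicate, object
}

def triples = [
    new Triple('conference', 'wrapped up yesterday at', '5:30 p.m.'),
    new Triple('conference', 'wrapped up yesterday in', 'Copenhagen'),
    new Triple('conference', 'wrapped up', 'yesterday')
]

// Extractions sharing a subject can be grouped to collect its relationships:
def bySubject = triples.groupBy { it.subject }
assert bySubject['conference'].size() == 3
```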

We can extract triples using the MinIE library (which in turn uses the Stanford CoreNLP library) with the following code:

def parser = CoreNLPUtils.StanfordDepNNParser()
sentences.each { sentence ->
    def minie = new MinIE(sentence, parser, MinIE.Mode.SAFE)

    println "\nInput sentence: $sentence"
    println '============================='
    println 'Extractions:'
    for (ap in minie.propositions) {
        println "\tTriple: $ap.tripleAsString"
        def attr = ap.attribution.attributionPhrase ? ap.attribution.toStringCompact() : 'NONE'
        println "\tFactuality: $ap.factualityAsString\tAttribution: $attr"
        println '\t----------'
    }
}

The output for the previously mentioned sentence is shown below:

Input sentence: The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.
=============================
Extractions:
        Triple: "conference"    "wrapped up yesterday at"       "5:30 p.m."
        Factuality: (+,CT)      Attribution: NONE
        ----------
        Triple: "conference"    "wrapped up yesterday in"       "Copenhagen"
        Factuality: (+,CT)      Attribution: NONE
        ----------
        Triple: "conference"    "wrapped up"    "yesterday"
        Factuality: (+,CT)      Attribution: NONE

We can now piece together the relationships between the earlier entities we detected.

There was also a problematic case amongst the earlier NER examples, "The parcel was passed from May to June.". Using the previous model, "May to June" was detected as a date. Let's explore that using CoreNLP's triple extraction directly. We won't show the source code here, but CoreNLP supports both simple and more powerful approaches to solving this problem. The output for the sentence in question using the more powerful technique is:

Sentence #7: The parcel was passed from May to June.
root(ROOT-0, passed-4)
det(parcel-2, The-1)
nsubj:pass(passed-4, parcel-2)
aux:pass(passed-4, was-3)
case(May-6, from-5)
obl:from(passed-4, May-6)
case(June-8, to-7)
obl:to(passed-4, June-8)
punct(passed-4, .-9)

Triples:
1.0	parcel	was	passed
1.0	parcel	was passed to	June
1.0	parcel	was	passed from May to June
1.0	parcel	was passed from	May

We can see that this has done a better job of piecing together what entities we have and their relationships.

Sentiment Analysis

Sentiment analysis is an NLP technique used to determine whether data is positive, negative, or neutral. Stanford CoreNLP has default models it uses for this purpose:

def doc = new Document('''
StanfordNLP is fantastic!
Groovy is great fun!
Math can be hard!
''')
for (sent in doc.sentences()) {
    println "${sent.toString().padRight(40)} ${sent.sentiment()}"
}

Which has the following output:

[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.6 sec].
[main] INFO edu.stanford.nlp.sentiment.SentimentModel - Loading sentiment model edu/stanford/nlp/models/sentiment/sentiment.ser.gz ... done [0.1 sec].
StanfordNLP is fantastic!                POSITIVE
Groovy is great fun!                     VERY_POSITIVE
Math can be hard!                        NEUTRAL

We can also train our own. Let's start with two datasets:

def datasets = [
    positive: getClass().classLoader.getResource("rt-polarity.pos").toURI(),
    negative: getClass().classLoader.getResource("rt-polarity.neg").toURI()
]

We'll first use Datumbox which, as we saw earlier, requires training parameters for our algorithm:

def trainingParams = new TextClassifier.TrainingParameters(
    numericalScalerTrainingParameters: null,
    featureSelectorTrainingParametersList: [new ChisquareSelect.TrainingParameters()],
    textExtractorParameters: new NgramsExtractor.Parameters(),
    modelerTrainingParameters: new MultinomialNaiveBayes.TrainingParameters()
)

We now create our algorithm, train it with our training dataset, and, for illustrative purposes, validate against the training dataset:

def config = Configuration.configuration
TextClassifier classifier = MLBuilder.create(trainingParams, config)
classifier.fit(datasets)
def metrics = classifier.validate(datasets)
println "Classifier Accuracy (using training data): $metrics.accuracy"

The output is shown here:

[main] INFO com.datumbox.framework.core.common.dataobjects.Dataframe$Builder - Dataset Parsing positive class
[main] INFO com.datumbox.framework.core.common.dataobjects.Dataframe$Builder - Dataset Parsing negative class
...
Classifier Accuracy (using training data): 0.8275959103273615

Now we can test our model against several sentences:

['Datumbox is divine!', 'Groovy is great fun!', 'Math can be hard!'].each {
    def r = classifier.predict(it)
    def predicted = r.YPredicted
    def probability = sprintf '%4.2f', r.YPredictedProbabilities.get(predicted)
    println "Classifying: '$it', Predicted: $predicted, Probability: $probability"
}

Which has this output:

...
[main] INFO com.datumbox.framework.applications.nlp.TextClassifier - predict()
...
Classifying: 'Datumbox is divine!',  Predicted: positive,  Probability: 0.83
Classifying: 'Groovy is great fun!',  Predicted: positive,  Probability: 0.80
Classifying: 'Math can be hard!',  Predicted: negative,  Probability: 0.95

We can do the same thing but with OpenNLP. First, we collect our input data. OpenNLP is expecting it in a single dataset with tagged examples:

def trainingCollection = datasets.collect { k, v ->
    new File(v).readLines().collect { "$k $it".toString() }
}.sum()

Now we'll train two models: one uses naive Bayes, the other maximum entropy (maxent).

def variants = [
    Maxent    : new TrainingParameters(),
    NaiveBayes: new TrainingParameters((CUTOFF_PARAM): '0', (ALGORITHM_PARAM): NAIVE_BAYES_VALUE)
]
def models = [:]
variants.each { key, trainingParams ->
    def trainingStream = new CollectionObjectStream(trainingCollection)
    def sampleStream = new DocumentSampleStream(trainingStream)
    println "\nTraining using $key"
    models[key] = DocumentCategorizerME.train('en', sampleStream, trainingParams, new DoccatFactory())
}

Now we run sentiment predictions on our sample sentences using both variants:

def w = sentences*.size().max()

variants.each { key, params ->
    def categorizer = new DocumentCategorizerME(models[key])
    println "\nAnalyzing using $key"
    sentences.each {
        def result = categorizer.categorize(it.split('[ !]'))
        def category = categorizer.getBestCategory(result)
        def prob = sprintf '%4.2f', result[categorizer.getIndex(category)]
        println "${it.padRight(w)} $category ($prob)"
    }
}

When we run this we get:

Training using Maxent ...done.
...

Training using NaiveBayes ...done.
...

Analyzing using Maxent
OpenNLP is fantastic! positive (0.64)
Groovy is great fun!  positive (0.74)
Math can be hard!     negative (0.61)

Analyzing using NaiveBayes
OpenNLP is fantastic! positive (0.72)
Groovy is great fun!  positive (0.81)
Math can be hard!     negative (0.72)

The models here appear to have lower probability levels compared to the model we trained for Datumbox. We could try tweaking the training parameters further if this was a problem. We'd probably also need a bigger testing set to convince ourselves of the relative merits of each model. Some models can be over-trained on small datasets and perform very well with data similar to their training datasets but perform much worse for other data.
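
To build some intuition for what the naive Bayes variant is doing under the covers, here is a tiny hand-rolled scoring sketch (illustrative only; OpenNLP's implementation is far more sophisticated). It scores a sentence for a class as P(class) multiplied by the product of smoothed word likelihoods:

```groovy
// Hand-rolled naive Bayes scoring (illustrative; not OpenNLP's implementation).
// Score is P(class) * product of add-one-smoothed P(word|class).
def train = [
    positive: ['great fun', 'fantastic fun', 'great stuff'],
    negative: ['hard work', 'boring work']
]

def wordCounts = [:].withDefault { [:].withDefault { 0 } }
def docCounts = [:].withDefault { 0 }
train.each { cls, examples ->
    examples.each { doc ->
        docCounts[cls]++
        doc.split(' ').each { w -> wordCounts[cls][w]++ }
    }
}
def vocabSize = wordCounts.values()*.keySet().flatten().toSet().size()

def score = { String cls, String sentence ->
    def total = wordCounts[cls].values().sum()
    sentence.split(' ').inject(docCounts[cls] / docCounts.values().sum()) { p, w ->
        p * (wordCounts[cls][w] + 1) / (total + vocabSize)   // add-one smoothing
    }
}

assert score('positive', 'great fun') > score('negative', 'great fun')
assert score('negative', 'hard work') > score('positive', 'hard work')
```

Real implementations work in log space to avoid underflow and handle tokenization and feature cutoffs, but the underlying probability model is the same.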

Universal Sentence Encoding

This example is inspired by the UniversalSentenceEncoder example in the DJL examples module. It looks at using the universal sentence encoder model from TensorFlow Hub via the Deep Java Library (DJL) API.

First we define a translator. The Translator interface allows us to specify pre- and post-processing functionality.

class MyTranslator implements NoBatchifyTranslator<String[], double[][]> {
    @Override
    NDList processInput(TranslatorContext ctx, String[] raw) {
        var factory = ctx.NDManager
        var inputs = new NDList(raw.collect(factory::create))
        new NDList(NDArrays.stack(inputs))
    }

    @Override
    double[][] processOutput(TranslatorContext ctx, NDList list) {
        long numOutputs = list.singletonOrThrow().shape.get(0)
        NDList result = []
        for (i in 0..<numOutputs) {
            result << list.singletonOrThrow().get(i)
        }
        result*.toFloatArray() as double[][]
    }
}

Here, we manually pack our input sentences into the required n-dimensional data types, and extract our output calculations into a 2D double array.

Next, we create our predict method by first defining the criteria for our prediction algorithm. We are going to use our translator, use the TensorFlow engine, use a predefined sentence encoder model from the TensorFlow Hub, and indicate that we are creating a text embedding application:

def predict(String[] inputs) {
    String modelUrl = "https://storage.googleapis.com/tfhub-modules/google/universal-sentence-encoder/4.tar.gz"

    Criteria<String[], double[][]> criteria = Criteria.builder()
        .optApplication(Application.NLP.TEXT_EMBEDDING)
        .setTypes(String[], double[][])
        .optModelUrls(modelUrl)
        .optTranslator(new MyTranslator())
        .optEngine("TensorFlow")
        .optProgress(new ProgressBar())
        .build()

    try (var model = criteria.loadModel()
         var predictor = model.newPredictor()) {
        predictor.predict(inputs)
    }
}

Next, let's define our input strings:

String[] inputs = [
    "Cycling is low impact and great for cardio",
    "Swimming is low impact and good for fitness",
    "Pilates is good for fitness and flexibility",
    "Weights are good for strength and fitness",
    "Orchids can be tricky to grow",
    "Sunflowers are fun to grow",
    "Radishes are easy to grow",
    "The taste of radishes grows on you after a while",
]
var k = inputs.size()

Now, we'll use our predictor method to calculate the embeddings for each sentence. We'll print out the embeddings and also calculate the dot product of the embeddings. The dot product (the same as the inner product for this case) reveals how related the sentences are.

var embeddings = predict(inputs)

var z = new double[k][k]
for (i in 0..<k) {
    println "Embedding for: ${inputs[i]}\n${embeddings[i]}"
    for (j in 0..<k) {
        z[i][j] = dot(embeddings[i], embeddings[j])
    }
}
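
The dot helper used above isn't shown in the snippet; a minimal version for plain double arrays (an assumption of how it might be implemented) could be:

```groovy
// A minimal dot product helper for the embedding arrays (an assumed
// implementation of the dot call used above).
double dot(double[] a, double[] b) {
    assert a.length == b.length
    double sum = 0
    for (i in 0..<a.length) {
        sum += a[i] * b[i]
    }
    sum
}

assert dot([1d, 2d, 3d] as double[], [4d, 5d, 6d] as double[]) == 32d
```

Since the universal sentence encoder produces approximately unit-length vectors, the dot product here behaves much like cosine similarity.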

Finally, we'll use the Heatmap class from Smile to present a nice display highlighting what the data reveals:

new Heatmap(inputs, inputs, z, Palette.heat(20).reverse()).canvas().with {
    title = 'Semantic textual similarity'
    setAxisLabels('', '')
    window()
}

The output shows us the embeddings:

Loading:     100% |========================================|
2022-08-07 17:10:43.212697: ... This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
...
2022-08-07 17:10:52.589396: ... SavedModel load for tags { serve }; Status: success: OK...
...
Embedding for: Cycling is low impact and great for cardio
[-0.02865048497915268, 0.02069241739809513, 0.010843578726053238, -0.04450441896915436, ...]
...
Embedding for: The taste of radishes grows on you after a while
[0.015841705724596977, -0.03129228577017784, 0.01183396577835083, 0.022753292694687843, ...]

The embeddings are an indication of similarity. Two sentences with similar meaning typically have similar embeddings.

The displayed graphic is shown below:

2022-08-06 22_18_05-Smile Plot 1.png

This graphic shows that our first four sentences are somewhat related, as are the last four sentences, but that there is minimal relationship between those two groups.

More information

Further examples can be found in the related repos:

https://github.com/paulk-asert/groovy-data-science/blob/master/subprojects/LanguageProcessing

https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/LanguageProcessingSparkNLP

https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/LanguageProcessingDjl

Conclusion

We have looked at a range of NLP examples using various NLP libraries. Hopefully you can see some cases where you could use additional NLP technologies in some of your own applications.



Detecting objects with Groovy, the Deep Java Library (DJL), and Apache MXNet

by paulk


Posted on Monday August 01, 2022 at 11:52AM in Technology


This blog post looks at using Apache Groovy with the Deep Java Library (DJL), backed by the Apache MXNet engine, to detect objects within an image. (Apache MXNet is an incubating project at the ASF.)

Deep Learning

Deep learning falls under the branches of machine learning and artificial intelligence. It involves multiple layers (hence the "deep") of an artificial neural network. There are lots of ways to configure such networks and the details are beyond the scope of this blog post, but we can give some basic details. We will have four input nodes corresponding to the measurements of our four characteristics. We will have three output nodes corresponding to each possible class (species). We will also have one or more additional layers in between.

deep_network.png

Each node in this network mimics to some degree a neuron in the human brain. Again, we'll simplify the details. Each node has multiple inputs, which are given a particular weight, as well as an activation function which will determine whether our node "fires". Training the model is a process which works out what the best weights should be.

deep_node.png
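
The weighted-sum-plus-activation behaviour of a single node can be sketched in a few lines (a toy step activation for illustration, not the functions used in real networks):

```groovy
// A toy artificial neuron: a weighted sum of the inputs plus a bias,
// passed through a step activation function to decide whether it "fires".
def neuron = { List inputs, List weights, double bias ->
    def weightedSum = [inputs, weights].transpose().sum { a, w -> a * w } + bias
    weightedSum > 0 ? 1 : 0
}

assert neuron([1d, 0d], [0.6d, 0.4d], -0.5d) == 1   // 0.6 - 0.5 > 0, fires
assert neuron([0d, 1d], [0.6d, 0.4d], -0.5d) == 0   // 0.4 - 0.5 < 0, doesn't fire
```

Training adjusts the weights and bias; modern networks use smooth activations (such as ReLU or sigmoid) so those adjustments can be computed by gradient descent.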

Deep Java Library (DJL) & Apache MXNet

Rather than writing your own neural networks, libraries such as DJL provide high-level abstractions which automate to some degree the creation of the necessary neural network layers. DJL is engine agnostic, so it's capable of supporting different backends including Apache MXNet, PyTorch, TensorFlow and ONNX Runtime. We'll use the default engine which for our application (at the time of writing) is Apache MXNet.

Apache MXNet provides the underlying engine. It has support for imperative and symbolic execution, distributed training of your models using multi-gpu or multi-host hardware, and multiple language bindings. Groovy is fully compatible with the Java binding.

Using DJL with Groovy

Groovy uses the Java binding. Consider looking at the DJL beginner tutorials for Java - they will work almost unchanged for Groovy.

For our example, the first thing we need to do is download the image we want to run the object detection model on:

Path tempDir = Files.createTempDirectory("resnetssd")
def imageName = 'dog-ssd.jpg'
Path localImage = tempDir.resolve(imageName)
def url = new URL("https://s3.amazonaws.com/model-server/inputs/$imageName")
DownloadUtils.download(url, localImage, new ProgressBar())
Image img = ImageFactory.instance.fromFile(localImage)

It happens to be a well-known, already-available image. We'll store a local copy of the image in a temporary directory, and we'll use a utility class that comes with DJL to provide a nice progress bar while the image is downloading. DJL provides its own image classes, so we'll create an instance using the appropriate class from the downloaded image.

Next we want to configure our neural network layers:

def criteria = Criteria.builder()
    .optApplication(Application.CV.OBJECT_DETECTION)
    .setTypes(Image, DetectedObjects)
    .optFilter("backbone", "resnet50")
    .optEngine(Engine.defaultEngineName)
    .optProgress(new ProgressBar())
    .build()

DJL supports numerous model applications including image classification, word recognition, sentiment analysis, linear regression, and others. We'll select object detection. This kind of application looks for the bounding box of known objects within an image. The types configuration option identifies that our input will be an image and the output will be detected objects. The filter option indicates that we will be using ResNet-50 (a convolutional neural network 50 layers deep, often used as a backbone for many computer vision tasks). We set the engine to be the default engine, which happens to be Apache MXNet. We also configure an optional progress bar to provide feedback while our model is running.

Now that we have our configuration sorted, we'll use it to load a model and then use the model to make object predictions:

def detection = criteria.loadModel().withCloseable { model ->
    model.newPredictor().predict(img)
}
detection.items().each { println it }
img.drawBoundingBoxes(detection)

For good measure, we'll draw the bounding boxes into our image.

Next, we save our image into a file and display it using Groovy's SwingBuilder.

Path imageSaved = tempDir.resolve('detected.png')
imageSaved.withOutputStream { os -> img.save(os, 'png') }
def saved = ImageIO.read(imageSaved.toFile())
new SwingBuilder().edt {
    frame(title: "$detection.numberOfObjects detected objects",
          size: [saved.width, saved.height],
          defaultCloseOperation: DISPOSE_ON_CLOSE,
          show: true) { label(icon: imageIcon(image: saved)) }
}

Building and running our application

Our code is stored in a source file called ObjectDetect.groovy.

We used Gradle for our build file:

apply plugin: 'groovy'
apply plugin: 'application'

repositories {
    mavenCentral()
}

application {
    mainClass = 'ObjectDetect'
}

dependencies {
    implementation "ai.djl:api:0.18.0"
    implementation "org.apache.groovy:groovy:4.0.4"
    implementation "org.apache.groovy:groovy-swing:4.0.4"
    runtimeOnly "ai.djl:model-zoo:0.18.0"
    runtimeOnly "ai.djl.mxnet:mxnet-engine:0.18.0"
    runtimeOnly "ai.djl.mxnet:mxnet-model-zoo:0.18.0"
    runtimeOnly "ai.djl.mxnet:mxnet-native-auto:1.8.0"
    runtimeOnly "org.apache.groovy:groovy-nio:4.0.4"
    runtimeOnly "org.slf4j:slf4j-jdk14:1.7.36"
}

We run the application with the gradle run task:

paulk@pop-os:/extra/projects/groovy-data-science$ ./gradlew DLMXNet:run
> Task :DeepLearningMxnet:run
Downloading: 100% |████████████████████████████████████████| dog-ssd.jpg
Loading:     100% |████████████████████████████████████████|
...
class: "car", probability: 0.99991, bounds: [x=0.611, y=0.137, width=0.293, height=0.160]
class: "bicycle", probability: 0.95385, bounds: [x=0.162, y=0.207, width=0.594, height=0.588]
class: "dog", probability: 0.93752, bounds: [x=0.168, y=0.350, width=0.274, height=0.593]

The displayed image looks like this:
2022-08-01 21_28_33-3 detected objects.png

Further Information

The full source code can be found in the following repo:
https://github.com/paulk-asert/groovy-data-science/subprojects/DeepLearningMxnet

Conclusion

We have examined using Apache Groovy, DJL and Apache MXNet to detect objects within an image. We used a rich deep learning model, but we didn't need to get into the details of the model or its neural network layers. DJL and Apache MXNet did the heavy lifting for us. Groovy provided a simple coding experience for building our application.


Working with SQL databases with Groovy and GraalVM

by paulk


Posted on Friday July 29, 2022 at 02:07PM in Technology


During the week, there was an interesting video and blog post on the latest GraalVM 22.2 Release. The release has numerous new features and improvements including:

  • smaller native executables
  • the ability to generate heap dumps in native executables
  • experimental native image debugging within IntelliJ IDEA
  • the ability to embed a Software Bill of Materials (SBOM) into the executable for improved security (when using GraalVM Enterprise)
  • native metadata integration.

This blog looks at the last of these. We'll use the running example of the H2 database which the video discusses.

Native Metadata

For anyone who has used GraalVM, they will know that frequently certain information must be given to the native compiler. Certain classes can be initialized at build time, others should be initialized at runtime. If accessing certain kinds of resources, knowledge of those resources must be given to the compiler. Parts of the application which might be invoked through reflection or involve serialization, might not be deemed reachable and won't automatically be included by the compiler.

Each library that is being used within an application will have its own set of classes and resources which will commonly need to be dealt with by anyone using that library. The Native Metadata repository keeps a shared copy of this information on a per-library basis. Once someone has populated the metadata, other projects using the same library can get that information automatically. We'll look more at metadata integration shortly, but first, let's look at our database application.

Working with SQL in Groovy

The application creates and then populates a customer database with four customers. It then prints them out:

import groovy.sql.Sql
import groovy.transform.CompileStatic

@CompileStatic
class H2Demo {
    static void main(args) {
        Sql.withInstance('jdbc:h2:./data/test') { sql ->
            sql.execute 'DROP TABLE IF EXISTS customers'
            sql.execute 'CREATE TABLE customers(id INTEGER AUTO_INCREMENT, name VARCHAR)'
            for (cust in ['Lord Archimonde', 'Arthur', 'Gilbert', 'Grug']) {
                sql.executeInsert "INSERT INTO customers(name) VALUES $cust"
            }
            println sql.rows('SELECT * FROM customers').join('\n')
        }
    }
}

Groovy's Sql class makes this relatively easy. The withInstance method will create a database connection and close it when finished. The executeInsert method uses a Groovy interpolated String (GString) which creates a prepared statement under the covers.
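
We can see why the GString form is safe by peeking at how a GString keeps its literal parts and interpolated values separate, which is what lets Sql bind the values as prepared statement parameters rather than concatenating strings:

```groovy
// A GString keeps the literal text and the interpolated values separate,
// which lets groovy.sql.Sql bind the values as prepared statement
// parameters instead of concatenating strings.
def cust = "O'Brien"
def stmt = "INSERT INTO customers(name) VALUES $cust"

assert stmt instanceof GString
assert stmt.strings[0] == 'INSERT INTO customers(name) VALUES '
assert stmt.values.toList() == ["O'Brien"]
```

Had we passed a plain concatenated String instead, the quote in "O'Brien" would have broken the statement (or worse, opened the door to SQL injection).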

Configuring our native build

Here is our build file:

plugins {
    id 'application'
    id 'groovy'
    id 'org.graalvm.buildtools.native'
}

application {
    mainClass = 'H2Demo'
}

repositories {
    mavenCentral()
}

dependencies {
    implementation 'com.h2database:h2:2.1.210'
    implementation 'org.apache.groovy:groovy:4.0.4'
    implementation 'org.apache.groovy:groovy-sql:4.0.4'
}

graalvmNative {
    agent {
        defaultMode = 'standard'
    }
    metadataRepository {
        enabled = true
    }
    binaries {
        main {
            buildArgs.addAll(
                // '-H:IncludeSBOM=cyclonedx',
                '--report-unsupported-elements-at-runtime',
                '--initialize-at-run-time=groovy.grape.GrapeIvy,org.h2.store.fs.niomem.FileNioMemData',
                '--initialize-at-build-time',
                '--no-fallback',
            )
        }
    }
}

We make use of the GraalVM native build plugin. We define our dependencies on Groovy and H2. We can also supply any needed parameters to the native compiler. Importantly, we enable integration with the metadata repository.

When we run the build, it will automatically create the native app for us:

paulk@pop-os:/extra/projects/groovy-graalvm-h2$ ./gradlew clean nativeRun
...
> Task :nativeCompile
[native-image-plugin] Using executable path: /extra/devtools/graalvm-ce-java17-22.2.0/bin/native-image
========================================================================================================================
GraalVM Native Image: Generating 'H2Demo' (executable)...
========================================================================================================================
...
[1/7] Initializing...                                                                                    (5.3s @ 0.26GB)
 Version info: 'GraalVM 22.2.0 Java 17 CE'
 Java version info: '17.0.4+8-jvmci-22.2-b06'
 C compiler: gcc (linux, x86_64, 11.2.0)
 Garbage collector: Serial GC
 1 user-specific feature(s)
 - com.oracle.svm.polyglot.groovy.GroovyIndyInterfaceFeature
[2/7] Performing analysis...  [************]                                                            (51.7s @ 1.82GB)
  10,597 (90.60%) of 11,697 classes reachable
  17,002 (64.13%) of 26,510 fields reachable
  58,165 (63.45%) of 91,666 methods reachable
     393 classes,   100 fields, and 2,057 methods registered for reflection
      65 classes,    74 fields, and    55 methods registered for JNI access
       4 native libraries: dl, pthread, rt, z
[3/7] Building universe...                                                                               (8.0s @ 4.02GB)
[4/7] Parsing methods...      [**]                                                                       (4.8s @ 3.85GB)
[5/7] Inlining methods...     [***]                                                                      (3.0s @ 1.72GB)
[6/7] Compiling methods...    [******]                                                                  (38.0s @ 3.63GB)
[7/7] Creating image...                                                                                  (5.9s @ 1.70GB)
  26.65MB (46.64%) for code area:    38,890 compilation units
  28.04MB (49.05%) for image heap:  359,812 objects and 66 resources
   2.46MB ( 4.31%) for other data
  57.15MB in total
------------------------------------------------------------------------------------------------------------------------
Top 10 packages in code area:                               Top 10 object types in image heap:
   1.48MB sun.security.ssl                                     5.85MB byte[] for code metadata
   1.06MB java.util                                            2.82MB java.lang.String
 979.43KB java.lang.invoke                                     2.78MB java.lang.Class
 758.29KB org.apache.groovy.parser.antlr4                      2.47MB byte[] for general heap data
 723.92KB com.sun.crypto.provider                              2.04MB byte[] for java.lang.String
 588.57KB org.h2.table                                       910.68KB com.oracle.svm.core.hub.DynamicHubCompanion
 582.06KB org.h2.command                                     764.95KB java.util.HashMap$Node
 494.23KB org.codehaus.groovy.classgen                       761.53KB java.lang.Object[]
 476.03KB c.s.org.apache.xerces.internal.impl.xs.traversers  715.65KB byte[] for embedded resources
 468.69KB java.lang                                          584.75KB java.util.HashMap$Node[]
  18.87MB for 370 more packages                                8.28MB for 2535 more object types
------------------------------------------------------------------------------------------------------------------------
                        3.9s (3.2% of total time) in 30 GCs | Peak RSS: 6.22GB | CPU load: 6.48
------------------------------------------------------------------------------------------------------------------------
Produced artifacts:
 /extra/projects/groovy-graalvm-h2/build/native/nativeCompile/H2Demo (executable)
 /extra/projects/groovy-graalvm-h2/build/native/nativeCompile/H2Demo.build_artifacts.txt (txt)
========================================================================================================================
Finished generating 'H2Demo' in 2m 1s.
    [native-image-plugin] Native Image written to: /extra/projects/groovy-graalvm-h2/build/native/nativeCompile

> Task :nativeRun
[ID:1, NAME:Lord Archimonde]
[ID:2, NAME:Arthur]
[ID:3, NAME:Gilbert]
[ID:4, NAME:Grug]

Checking the native image speed

We can also check the speed once the native image is built:

paulk@pop-os:/extra/projects/groovy-graalvm-h2$ time build/native/nativeCompile/H2Demo
[ID:1, NAME:Lord Archimonde]
[ID:2, NAME:Arthur]
[ID:3, NAME:Gilbert]
[ID:4, NAME:Grug]

real	0m0.027s
user	0m0.010s
sys	0m0.011s

More information

Check out the full source code from the repo: https://github.com/paulk-asert/groovy-graalvm-h2.

Conclusion

We have looked at a simple H2 database application and the steps involved in creating a native application with Groovy and GraalVM.


Reading and Writing CSV files with Groovy

by paulk


Posted on Monday July 25, 2022 at 02:26PM in Technology


In this post, we'll look at reading and writing CSV files using Groovy.

Aren't CSV files just text files?

For simple cases, we can treat CSV files no differently than we would other text files. Suppose we have the following data that we would like to write to a CSV file:

def data = [
    ['place', 'firstname', 'lastname', 'team'],
    ['1', 'Lorena', 'Wiebes', 'Team DSM'],
    ['2', 'Marianne', 'Vos', 'Team Jumbo Visma'],
    ['3', 'Lotte', 'Kopecky', 'Team SD Worx']
]

Groovy uses File or Path objects similar to Java. We'll use a File object here and, for our purposes, a temporary file, since we are only going to read it back in and check it against our data. Here is how to create a temporary file:

def file = File.createTempFile('FemmesStage1Podium', '.csv')

Writing our CSV (in this simple example) is as simple as joining the data with commas and the lines with line separator character(s):

file.text = data*.join(',').join(System.lineSeparator())

Here we "wrote" the entire file contents in one go but there are options for writing a line or character or byte at a time.

Reading the data in is just as simple. We read the lines and split on commas:

assert file.readLines()*.split(',') == data
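For comparison, the same naive write-then-read roundtrip can be sketched in plain Java with java.nio (the file name mirrors the Groovy example; this is just an illustrative sketch, not code from the original post):

```java
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class CsvRoundTrip {
    // Join each row with commas, write one line per row, then read back and split
    static List<List<String>> roundTrip(List<List<String>> data) throws Exception {
        Path file = Files.createTempFile("FemmesStage1Podium", ".csv");
        List<String> lines = data.stream()
                .map(row -> String.join(",", row))
                .collect(Collectors.toList());
        Files.write(file, lines);
        return Files.readAllLines(file).stream()
                .map(line -> Arrays.asList(line.split(",")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        try {
            List<List<String>> data = List.of(
                    List.of("place", "firstname", "lastname", "team"),
                    List.of("1", "Lorena", "Wiebes", "Team DSM"));
            if (!roundTrip(data).equals(data)) throw new AssertionError("roundtrip failed");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```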

In general, we might want to further process the data. Groovy provides nice options for this too. Suppose we have the following existing CSV file:
HommesOverall.png
We can read in the file and select various columns of interest with code like below:

def file = new File('HommesStageWinners.csv')
def rows = file.readLines().tail()*.split(',')
int total = rows.size()
Set names = rows.collect { it[1] + ' ' + it[2] }
Set teams = rows*.getAt(3)
Set countries = rows*.getAt(4)
String result = "Across $total stages, ${names.size()} riders from " +
    "${teams.size()} teams and ${countries.size()} countries won stages."
assert result == 'Across 21 stages, 15 riders from 10 teams and 9 countries won stages.'

Here, the tail() method skips over the header line. Column 0 contains the stage number which we ignore. Column 1 contains the first name, column 2 the last name, column 3 the team, and column 4 the country of the rider. We store away the full names, teams and countries in sets to remove duplicates. We then create an overall result message using the size of those sets.

While the coding was fairly simple for this example, hand-processing CSV files in this fashion isn't recommended. The details of CSV can quickly get messy. What if the values themselves contain commas or newlines? Perhaps we could surround such values in double quotes, but then what if a value contains a double quote? And so on. For these reasons, using a CSV library is recommended.
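The pitfall is easy to demonstrate. Splitting naively on commas mis-handles a quoted field that itself contains a comma (a small sketch with made-up data):

```java
public class NaiveSplitPitfall {
    public static void main(String[] args) {
        // Three logical fields, but the middle one contains a quoted comma
        String line = "1,\"Lloyd, Jr\",Team SD Worx";
        String[] parts = line.split(",");
        // Naive splitting yields 4 pieces instead of the expected 3
        if (parts.length != 4) throw new AssertionError("expected naive split to break");
    }
}
```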

We'll look at three shortly, but first let's summarise some of the highlights of the tour by looking at multiple winners. Here is some code which summarises our CSV data:

def byValueDesc = { -it.value }
def bySize = { k, v -> [k, v.size()] }
def isMultiple = { it.value > 1 }
def multipleWins = { Closure select -> rows
    .groupBy(select)
    .collectEntries(bySize)
    .findAll(isMultiple)
    .sort(byValueDesc)
    .entrySet()
    .join(', ')
}
println 'Multiple wins by country:\n' + multipleWins{ it[4] }
println 'Multiple wins by rider:\n' + multipleWins{ it[1] + ' ' + it[2] }
println 'Multiple wins by team:\n' + multipleWins{ it[3] }

This summary has nothing in particular to do with CSV files but is included in honour of the great riding during the tour! Here's the output:

MultipleWins.png
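The same groupBy/count/filter/sort pipeline can be sketched with Java streams for comparison (the sample rows here are hypothetical, not the full dataset from the post):

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

public class MultipleWins {
    // Group rows by a key, count group sizes, keep counts > 1, sort descending by count
    static String multipleWins(List<String[]> rows, Function<String[], String> select) {
        return rows.stream()
                .collect(Collectors.groupingBy(select, Collectors.counting()))
                .entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        // hypothetical sample rows: stage, first, last, team, country
        List<String[]> rows = List.of(
                new String[]{"1", "Yves", "Lampaert", "Quick-Step", "Belgium"},
                new String[]{"2", "Fabio", "Jakobsen", "Quick-Step", "Netherlands"},
                new String[]{"3", "Dylan", "Groenewegen", "BikeExchange", "Netherlands"});
        if (!multipleWins(rows, r -> r[4]).equals("Netherlands=2"))
            throw new AssertionError();
    }
}
```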

Okay, now let's look at our three CSV libraries.

Commons CSV

The Apache Commons CSV library makes writing and parsing CSV files easier. Here is the code for writing our CSV which makes use of the CSVPrinter class:

file.withWriter { w ->
    new CSVPrinter(w, CSVFormat.DEFAULT).printRecords(data)
}

And here is the code for reading it back in which uses the RFC4180 parser factory singleton:

file.withReader { r ->
    assert RFC4180.parse(r).records*.toList() == data
}

There are other singleton factories for tab-separated values and other common formats and builders to let you set a whole variety of options like escape characters, quote options, whether to use an enum to define header names, and whether to ignore empty lines or nulls.

For our more elaborate example, we have a tiny bit more work to do. We'll use the builder to tell the parser to skip the header row. We could have chosen to use the tail() trick we used earlier but we decided to use the parser features instead. The code would look like this:

file.withReader { r ->
    def rows = RFC4180.builder()
        .setHeader()
        .setSkipHeaderRecord(true)
        .build()
        .parse(r)
        .records
    assert rows.size() == 21
    assert rows.collect { it.firstname + ' ' + it.lastname }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

You can see here that we have used column names rather than column numbers during our processing. Using column names is another advantage of using the CSV library; it would be quite a lot of work to do that aspect by hand. Also note that, for simplicity, we didn't create the entire result message as in the earlier example. Instead, we just checked the size of all of the relevant sets that we calculated previously.

OpenCSV

The OpenCSV library handles the messy CSV details when needed but doesn't get in the way for simple cases. For our first example, the CSVReader and CSVWriter classes will be suitable. Here is the code for writing our CSV file in the same way as earlier:

file.withWriter { w ->
    new CSVWriter(w).writeAll(data.collect{ it as String[] })
}

And here is the code for reading data:

file.withReader { r ->
    assert new CSVReader(r).readAll() == data
}

If we look at the produced file, it is already a little fancier than earlier with double quotes around all data:

FemmesPodiumStage1.png

If we want to do more elaborate processing, the CSVReaderHeaderAware class is aware of the initial header row and its column names. Here is our more elaborate example which processes some of the data further:

file.withReader { r ->
    def rows = []
    def reader = new CSVReaderHeaderAware(r)
    while ((next = reader.readMap())) rows << next
    assert rows.size() == 21
    assert rows.collect { it.firstname + ' ' + it.lastname }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

You can see here that we have again used column names rather than column numbers during our processing. For simplicity, we followed the same style as in the Commons CSV example and just checked the size of all of the relevant sets that we calculated previously.

OpenCSV also supports transforming CSV files into JavaBean instances. First, we define our target class (or annotate an existing domain class):

class Cyclist {
    @CsvBindByName(column = 'firstname')
    String first
    @CsvBindByName(column = 'lastname')
    String last
    @CsvBindByName
    String team
    @CsvBindByName
    String country
}

For two of the columns, we've indicated that the column name in the CSV file doesn't match our class property. The annotation attribute caters for that scenario.

Then, we can use this code to convert our CSV file into a list of domain objects:

file.withReader { r ->
    List<Cyclist> rows = new CsvToBeanBuilder(r).withType(Cyclist).build().parse()
    assert rows.size() == 21
    assert rows.collect { it.first + ' ' + it.last }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

OpenCSV has many options we didn't show. When writing files, you can specify the separator and quote characters; when reading, you can specify column positions and types, and validate the data.

Jackson Databind CSV

The Jackson Databind library supports the CSV format (as well as many others).

Writing CSV files from existing data is simple, as shown here for our running example:

file.withWriter { w ->
    new CsvMapper().writeValue(w, data)
}

This writes the data into our temporary file as we saw with the previous examples. One minor difference is that, by default, only values containing spaces are double quoted, but as with the other libraries there are many configuration options to tweak such settings.

Reading the data can be achieved using the following code:

def mapper = new CsvMapper().readerForListOf(String).with(CsvParser.Feature.WRAP_AS_ARRAY)
file.withReader { r ->
    assert mapper.readValues(r).readAll() == data
}

Our more elaborate example is done in a similar way:

def schema = CsvSchema.emptySchema().withHeader()
def mapper = new CsvMapper().readerForMapOf(String).with(schema)
file.withReader { r ->
    def rows = mapper.readValues(r).readAll()
    assert rows.size() == 21
    assert rows.collect { it.firstname + ' ' + it.lastname }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

Here, we tell the library to make use of our header row and store each row of data in a map.

Jackson Databind also supports writing to classes including JavaBeans as well as records. Let's create a record to hold our cyclist information:

@JsonCreator
record Cyclist(
    @JsonProperty('stage') int stage,
    @JsonProperty('firstname') String first,
    @JsonProperty('lastname') String last,
    @JsonProperty('team') String team,
    @JsonProperty('country') String country) {
    String full() { "$first $last" }
}

Note again that we can indicate where our record component names don't match the names used in the CSV file; we simply supply the alternate name when specifying the property. There are other options, like indicating that a field is required or giving its column position, but we don't need those for our example. We've also added a full() helper method to return the full name of the cyclist.

Groovy will use native records on platforms that support it (JDK16+) or emulated records on earlier platforms.

Now we can write our code for record deserialization:

def schema = CsvSchema.emptySchema().withHeader()
def mapper = new CsvMapper().readerFor(Cyclist).with(schema)
file.withReader { r ->
    List<Cyclist> records = mapper.readValues(r).readAll()
    assert records.size() == 21
    assert records*.full().toSet().size() == 15
    assert records*.team.toSet().size() == 10
    assert records*.country.toSet().size() == 9
}

Conclusion

We have looked at writing and reading CSV files to Strings and domain classes and records. We had a look at handling simple cases by hand and also looked at the OpenCSV, Commons CSV and Jackson Databind CSV libraries.

Code for these examples:
https://github.com/paulk-asert/CsvGroovy

Code for other examples of using Groovy for Data Science:
https://github.com/paulk-asert/groovy-data-science




Groovy release train: 4.0.4, 3.0.12, 2.5.18

by paulk


Posted on Sunday July 24, 2022 at 12:55PM in Technology


It's been a productive time for the Apache Groovy project recently. We recently released versions 4.0.4, 3.0.12 and 2.5.18 with 42, 21 and 15 fixes and improvements respectively. Here are two quick highlights before getting into more details about the 4.0.4 release.

Eric Milles has been interacting for many months with the team from the hephaestus project, in particular Stefanos Chaliasos and Thodoris Sotiropoulos. You can think of hephaestus as a fuzzing tool for type checkers, and they have been putting Groovy's static compiler through its paces, finding plenty of edge cases for us to assess. We still have some work to do but we have made significant improvements and would welcome any feedback. If you're interested, consider diving further into the research behind hephaestus.

We've also had some great contributions from Sandip Chitale for Groovy's Object Browser. You can access this in a number of ways, including the :inspect command in groovysh, or in the GroovyConsole via the Script->Inspect Last or Script->Inspect Variables menu items. It's also hooked into the AST Browser if you're exploring code produced by the Groovy compiler.

2022-07-24 22_23_16-Support launching of ObjectExplore when property rows are double clic… by sandip.png

Please find more details about the 4.0.4 release below.


Dear community,


The Apache Groovy team is pleased to announce version 4.0.4 of Apache Groovy.
Apache Groovy is a multi-faceted programming language for the JVM.
Further details can be found at the https://groovy.apache.org website.

This release is a maintenance release of the GROOVY_4_0_X branch.
It is strongly encouraged that all users using prior
versions on this branch upgrade to this version.

This release includes 42 bug fixes/improvements as outlined in the changelog:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318123&version=12351811

Sources, convenience binaries, downloadable documentation and an SDK
bundle can be found at: https://groovy.apache.org/download.html
We recommend you verify your installation using the information on that page.

Jars are also available within the major binary repositories.

We welcome your help and feedback and in particular want
to thank everyone who contributed to this release.

For more information on how to report problems, and to get involved,
visit the project website at https://groovy.apache.org/

Best regards,

The Apache Groovy team.




Comparators and Sorting in Groovy

by paulk


Posted on Thursday July 21, 2022 at 03:51PM in Technology


This blog post is inspired by the Comparator examples in the excellent Collections Refuelled talk and blog by Stuart Marks. That blog from 2017 highlights improvements in the Java collections library in Java 8 and 9, including numerous Comparator improvements. It is now 5 years old but still highly recommended for anyone using the Java collections library.

Rather than have a Student class as per the original blog example, we'll have a Celebrity class (and later record) which has the same first and last name fields and an additional age field. We'll compare initially by last name with nulls before non-nulls and then by first name and lastly by age.

As with the original blog, we'll cater for nulls, e.g. a celebrity known by a single name.

The Java comparator story recap

Our Celebrity class, if we wrote it in Java, would look something like:

public class Celebrity {                    // Java
    private String firstName;
    private String lastName;
    private int age;

    public Celebrity(String firstName, int age) {
        this(firstName, null, age);
    }

    public Celebrity(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }

    public String getFirstName() {
        return firstName;
    }

    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public void setLastName(String lastName) {
        this.lastName = lastName;
    }

    @Override
    public String toString() {
        return "Celebrity{" +
                "firstName='" + firstName +
                (lastName == null ? "" : "', lastName='" + lastName) +
                "', age=" + age +
                '}';
    }
}

It would look much nicer as a Java record (JDK16+) but we'll keep with the spirit of the original blog example for now. This is fairly standard boilerplate and in fact was mostly generated by IntelliJ IDEA. The only slightly interesting aspect is that we tweaked the toString method to not display null last names.

On JDK 8 with the old-style comparator coding, a main application which created and sorted some celebrities might look like this:

import java.util.ArrayList;            // Java
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        List<Celebrity> celebrities = new ArrayList<>();
        celebrities.add(new Celebrity("Cher", "Wang", 63));
        celebrities.add(new Celebrity("Cher", "Lloyd", 28));
        celebrities.add(new Celebrity("Alex", "Lloyd", 47));
        celebrities.add(new Celebrity("Alex", "Lloyd", 37));
        celebrities.add(new Celebrity("Cher", 76));
        Collections.sort(celebrities, (c1, c2) -> {
            String f1 = c1.getLastName();
            String f2 = c2.getLastName();
            int r1;
            if (f1 == null) {
                r1 = f2 == null ? 0 : -1;
            } else {
                r1 = f2 == null ? 1 : f1.compareTo(f2);
            }
            if (r1 != 0) {
                return r1;
            }
            int r2 = c1.getFirstName().compareTo(c2.getFirstName());
            if (r2 != 0) {
                return r2;
            }
            return Integer.compare(c1.getAge(), c2.getAge());
        });
        System.out.println("Celebrities:");
        celebrities.forEach(System.out::println);
    }
}

When we run this example, the output looks like this:

Celebrities:
Celebrity{firstName='Cher', age=76}
Celebrity{firstName='Alex', lastName='Lloyd', age=37}
Celebrity{firstName='Alex', lastName='Lloyd', age=47}
Celebrity{firstName='Cher', lastName='Lloyd', age=28}
Celebrity{firstName='Cher', lastName='Wang', age=63}

As pointed out in the original blog, this code is rather tedious and error-prone and can be improved greatly with comparator improvements in JDK8:

import java.util.Arrays;             // Java
import java.util.List;

import static java.util.Comparator.comparing;
import static java.util.Comparator.naturalOrder;
import static java.util.Comparator.nullsFirst;

public class Main {
    public static void main(String[] args) {
        List<Celebrity> celebrities = Arrays.asList(
                new Celebrity("Cher", "Wang", 63),
                new Celebrity("Cher", "Lloyd", 28),
                new Celebrity("Alex", "Lloyd", 47),
                new Celebrity("Alex", "Lloyd", 37),
                new Celebrity("Cher", 76));
        celebrities.sort(comparing(Celebrity::getLastName, nullsFirst(naturalOrder())).
                thenComparing(Celebrity::getFirstName).thenComparing(Celebrity::getAge));
        System.out.println("Celebrities:");
        celebrities.forEach(System.out::println);
    }
}

The original blog also points out the convenience factory methods from JDK9 for list creation which you might be tempted to consider here. For our case, we will be sorting in place, so the immutable lists returned by those methods won't help us here but Arrays.asList isn't much longer than List.of and works well for this example.

As well as being much shorter, the comparing and thenComparing methods and built-in comparators like nullsFirst and naturalOrder allow for far simpler composability. The sort method on ArrayList is also more efficient than the sort used by the Collections.sort method on earlier JDKs. The output when running the example is the same as previously.

The Groovy comparator story

At about the same time that Java was evolving its comparator story, Groovy added some complementary features to tackle many of the same problems. We'll look at some of those features and also at how the JDK improvements we saw above can be used instead if preferred.

First off, let's create a Groovy Celebrity record:

@Sortable(includes = 'last,first,age')
@ToString(ignoreNulls = true, includeNames = true)
record Celebrity(String first, String last = null, int age) {}

And create our list of celebrities:

var celebrities = [
    new Celebrity("Cher", "Wang", 63),
    new Celebrity("Cher", "Lloyd", 28),
    new Celebrity("Alex", "Lloyd", 47),
    new Celebrity("Alex", "Lloyd", 37),
    new Celebrity(first: "Cher", age: 76)
]

The record definition is nice and concise. It would look good in recent Java versions too. A nice aspect of the Groovy solution is that it will provide emulated records on earlier JDKs and it also has some nice declarative transforms to tweak the record definition. We could leave off the @ToString annotation and we'd get a standard record-style toString. Or we could add a toString method to our record definition similar to what was done in the Java example. Using @ToString allows us to remove null last names from the toString in a declarative way. We'll cover the @Sortable annotation a little later.

First off, Groovy's spaceship operator <=> allows us to write a nice compact version of the "tedious" code in the first Java version. It looks like this:

celebrities.sort { c1, c2 ->
    c1.last <=> c2.last ?: c1.first <=> c2.first ?: c1.age <=> c2.age
}
println 'Celebrities:\n' + celebrities.join('\n')

And the output looks like this:

Celebrities:
Celebrity(first:Cher, age:76)
Celebrity(first:Alex, last:Lloyd, age:37)
Celebrity(first:Alex, last:Lloyd, age:47)
Celebrity(first:Cher, last:Lloyd, age:28)
Celebrity(first:Cher, last:Wang, age:63)

We'd have a tiny bit more work to do if we wanted nulls last but the defaults work well for the example at hand.

We can alternatively, make use of the "new in JDK8" methods mentioned earlier:

celebrities.sort(comparing(Celebrity::last, nullsFirst(naturalOrder())).
    thenComparing(c -> c.first).thenComparing(c -> c.age))

But this is where we should come back and further explain the @Sortable annotation. That annotation is associated with an Abstract Syntax Tree (AST) transformation, or just transform for short, which provides us with an automatic compareTo method that takes into account the record's properties (and likewise if it was a class). Since we provided an includes annotation attribute with a list of property names, the order of those names determines the priority of the properties used in the comparator. We could equally include just some of the names in that list, or alternatively provide an excludes annotation attribute and just mention the properties we don't want included.
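Conceptually, the comparator that @Sortable derives from includes='last,first,age' corresponds to a comparing/thenComparing chain in that priority order. A plain-Java sketch of the equivalent chain (the nested record here is our own illustration, not generated code):

```java
import java.util.*;
import static java.util.Comparator.*;

public class SortableEquivalent {
    record Celebrity(String first, String last, int age) { }

    // Priority order mirrors includes = 'last,first,age', with nulls sorting first
    static final Comparator<Celebrity> BY_LAST_FIRST_AGE =
            comparing(Celebrity::last, nullsFirst(naturalOrder()))
                    .thenComparing(Celebrity::first)
                    .thenComparingInt(Celebrity::age);

    public static void main(String[] args) {
        List<Celebrity> list = new ArrayList<>(List.of(
                new Celebrity("Cher", "Wang", 63),
                new Celebrity("Alex", "Lloyd", 47)));
        list.add(new Celebrity("Cher", null, 76)); // List.of disallows nulls, so add separately
        list.sort(BY_LAST_FIRST_AGE);
        if (list.get(0).last() != null) throw new AssertionError("nulls should sort first");
    }
}
```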

It also adds Comparable<Celebrity> to the list of implemented interfaces for our record. So, what does all this mean? It means we can just write:

celebrities.sort()

The transform associated with the @Sortable annotation also provides some additional comparators for us. To sort by age, we can use one of those comparators:

celebrities.sort(Celebrity.comparatorByAge())

Which gives this output:

Celebrities:
Celebrity(first:Cher, last:Lloyd, age:28)
Celebrity(first:Alex, last:Lloyd, age:37)
Celebrity(first:Alex, last:Lloyd, age:47)
Celebrity(first:Cher, last:Wang, age:63)
Celebrity(first:Cher, age:76)

In addition to the sort method, Groovy provides a toSorted method which sorts a copy of the list, leaving the original unchanged. So, to create a list sorted by first name we can use this code:

var celebritiesByFirst = celebrities.toSorted(Celebrity.comparatorByFirst())

Which if output in a similar way to previous examples gives:

Celebrities:
Celebrity(first:Alex, last:Lloyd, age:37)
Celebrity(first:Alex, last:Lloyd, age:47)
Celebrity(first:Cher, last:Lloyd, age:28)
Celebrity(first:Cher, last:Wang, age:63)
Celebrity(first:Cher, age:76)

If you are a fan of functional style programming, you might consider using List.of to define the original list and then only toSorted method calls in further processing.
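Groovy's toSorted corresponds to a non-mutating pipeline in plain Java, where stream().sorted() leaves the source list untouched. A minimal sketch:

```java
import java.util.*;
import java.util.stream.*;

public class NonMutatingSort {
    public static void main(String[] args) {
        List<Integer> original = List.of(3, 1, 2);   // immutable, like List.of in the text
        List<Integer> sorted = original.stream().sorted().collect(Collectors.toList());
        if (!sorted.equals(List.of(1, 2, 3))) throw new AssertionError();
        if (!original.equals(List.of(3, 1, 2))) throw new AssertionError(); // original unchanged
    }
}
```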

Mixing in some language integrated queries

Groovy also has a GQuery (aka GINQ) capability which provides a SQL inspired DSL for working with collections. We can use GQueries to examine and order our collection. Here is an example:

println GQ {
    from c in celebrities
    select c.first, c.last, c.age
}

Which has this output:

+-------+-------+-----+
| first | last  | age |
+-------+-------+-----+
| Cher  |       | 76  |
| Alex  | Lloyd | 37  |
| Alex  | Lloyd | 47  |
| Cher  | Lloyd | 28  |
| Cher  | Wang  | 63  |
+-------+-------+-----+

In this case, it's using the natural ordering which @Sortable gives us.

Or we can sort by age:

println GQ {
    from c in celebrities
    orderby c.age
    select c.first, c.last, c.age
}

Which has this output:

+-------+-------+-----+
| first | last  | age |
+-------+-------+-----+
| Cher  | Lloyd | 28  |
| Alex  | Lloyd | 37  |
| Alex  | Lloyd | 47  |
| Cher  | Wang  | 63  |
| Cher  |       | 76  |
+-------+-------+-----+

Or we can sort by last name descending and then age:

println GQ {
    from c in celebrities
    orderby c.last in desc, c.age
    select c.first, c.last, c.age
}

Which has this output:

+-------+-------+-----+
| first | last  | age |
+-------+-------+-----+
| Cher  | Wang  | 63  |
| Cher  | Lloyd | 28  |
| Alex  | Lloyd | 37  |
| Alex  | Lloyd | 47  |
| Cher  |       | 76  |
+-------+-------+-----+

Conclusion

We have seen a little example of using comparators in Groovy. All the great JDK capabilities are available as well as the spaceship operator, the sort and toSorted methods, and the @Sortable AST transformation.


Testing your Java with Groovy, Spock, JUnit5, Jacoco, Jqwik and Pitest

by paulk


Posted on Friday July 15, 2022 at 08:26AM in Technology


This blog post covers a common scenario in the Groovy community: projects that use Java for their production code and Groovy for their tests. This can be a low-risk way for Java shops to try out and become more familiar with Groovy. We'll write our initial tests using the Spock testing framework, and we'll use JUnit5 later with our jqwik tests. You can usually use your favorite Java testing libraries when you switch to Groovy.

The system under test

For illustrative purposes, we will test a Java mathematics utility function sumBiggestPair. Given three numbers, it finds the two biggest and then adds them up. An initial stab at the code for this might look something like this:

public class MathUtil {

    public static int sumBiggestPair(int a, int b, int c) {
        int op1 = a;
        int op2 = b;
        if (c > a) {
            op1 = c;
        } else if (c > b) {
            op2 = c;
        }
        return op1 + op2;
    }

    private MathUtil() {}
}

Testing with Spock

An initial test could look like this:

class MathUtilSpec extends Specification {
    def "sum of two biggest numbers"() {
        expect:
        MathUtil.sumBiggestPair(2, 5, 3) == 8
    }
}

When we run this test, all tests pass:

test result success image
But if we look at the coverage report, generated with Jacoco, we see that our test hasn't covered all lines of code:

incomplete coverage image

We'll swap to use Spock's data-driven feature and include an additional testcase:

def "sum of two biggest numbers"(int a, int b, int c, int d) {
    expect:
    MathUtil.sumBiggestPair(a, b, c) == d

    where:
    a | b | c | d
    2 | 5 | 3 | 8
    5 | 2 | 3 | 8
}

We can check our coverage again:

2022-07-14 22_35_22-MathUtil.java.png

That is a little better. We now have 100% line coverage but not 100% branch coverage. Let's add one more testcase:

def "sum of two biggest numbers"(int a, int b, int c, int d) {
    expect:
    MathUtil.sumBiggestPair(a, b, c) == d

    where:
    a | b | c | d
    2 | 5 | 3 | 8
    5 | 2 | 3 | 8
    5 | 4 | 1 | 9
}

And now we can see that we have reached 100% line coverage and 100% branch coverage:

2022-07-14 22_33_45-MathUtil.java.png

At this point, we might be very confident in our code and ready to ship it to production. Before we do, we'll add one more testcase:

def "sum of two biggest numbers"(int a, int b, int c, int d) {
    expect:
    MathUtil.sumBiggestPair(a, b, c) == d

    where:
    a | b | c | d
    2 | 5 | 3 | 8
    5 | 2 | 3 | 8
    5 | 4 | 1 | 9
    3 | 2 | 6 | 9
}

When we re-run our tests, we discover that the last testcase failed!:

2022-07-14 22_49_23-Test results - MathUtilSpec.png

And examining the testcase, we can indeed see that there is a flaw in our algorithm: the else branch doesn't cater for the case where c is greater than both a and b!

2022-07-14 22_49_58-Test results - MathUtilSpec.png

We succumbed to faulty expectations of what 100% coverage would give us.

A 100% code coverage example

The good news is that we can fix this. Here is an updated algorithm:

public static int sumBiggestPair(int a, int b, int c) {
    int op1 = a;
    int op2 = b;
    if (c > Math.min(a, b)) {
        op1 = c;
        op2 = Math.max(a, b);
    }
    return op1 + op2;
}

With this new algorithm, all 4 testcases now pass and we again have 100% line and branch coverage.

> Task :SumBiggestPairPitest:test
 Test sum of two biggest numbers [Tests: 4/4/0/0] [Time: 0.317 s]
 Test util.MathUtilSpec [Tests: 4/4/0/0] [Time: 0.320 s]
 Test Gradle Test Run :SumBiggestPairPitest:test [Tests: 4/4/0/0]

But haven't we been here before? How can we be sure there isn't some additional test cases that might reveal another flaw in our algorithm? We could keep writing lots more testcases but we'll look at two other techniques that can help.
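One cheap reassurance, before reaching for heavier tooling, is to compare the algorithm against an obviously-correct reference (sort the three values, add the top two) over an exhaustive small range. This is our own sketch, not from the original post:

```java
import java.util.Arrays;

public class SumBiggestPairCheck {
    // The corrected algorithm from the text
    static int sumBiggestPair(int a, int b, int c) {
        int op1 = a, op2 = b;
        if (c > Math.min(a, b)) {
            op1 = c;
            op2 = Math.max(a, b);
        }
        return op1 + op2;
    }

    // Reference implementation: sort and add the top two
    static int reference(int a, int b, int c) {
        int[] v = {a, b, c};
        Arrays.sort(v);
        return v[1] + v[2];
    }

    public static void main(String[] args) {
        // Exhaustive check over a small range of inputs
        for (int a = -5; a <= 5; a++)
            for (int b = -5; b <= 5; b++)
                for (int c = -5; c <= 5; c++)
                    if (sumBiggestPair(a, b, c) != reference(a, b, c))
                        throw new AssertionError(a + "," + b + "," + c);
    }
}
```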

Mutation testing with Pitest

An interesting but not widely used technique is mutation testing. It probably deserves to be more widely used. It can test the quality of a testsuite but has the drawback of sometimes being quite resource intensive. It modifies (mutates) production code and re-runs your testsuite. If your test suite still passes with modified code, it possibly indicates that your testsuite is lacking sufficient coverage. Earlier, we had an algorithm with a flaw and our testsuite didn't initially pick it up. You can think of mutation testing as adding a deliberate flaw and seeing whether your testsuite is good enough to detect that flaw.
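The idea can be sketched by hand: introduce a deliberate "mutant" (here, Math.max changed to Math.min in the corrected algorithm) and check whether the four testcases from our Spock data table detect ("kill") it. This is an illustrative sketch of the concept, not how Pitest itself works internally:

```java
public class MutantDemo {
    interface F { int apply(int a, int b, int c); }

    // The corrected algorithm from the text
    static int original(int a, int b, int c) {
        int op1 = a, op2 = b;
        if (c > Math.min(a, b)) { op1 = c; op2 = Math.max(a, b); }
        return op1 + op2;
    }

    // A hand-made mutant: Math.max replaced with Math.min
    static int mutant(int a, int b, int c) {
        int op1 = a, op2 = b;
        if (c > Math.min(a, b)) { op1 = c; op2 = Math.min(a, b); }
        return op1 + op2;
    }

    // The four testcases from the Spock data table: a, b, c, expected
    static boolean killed(F f) {
        int[][] cases = {{2, 5, 3, 8}, {5, 2, 3, 8}, {5, 4, 1, 9}, {3, 2, 6, 9}};
        for (int[] t : cases)
            if (f.apply(t[0], t[1], t[2]) != t[3]) return true; // suite detects the change
        return false;
    }

    public static void main(String[] args) {
        if (killed(MutantDemo::original)) throw new AssertionError("suite should pass original");
        if (!killed(MutantDemo::mutant)) throw new AssertionError("suite should kill the mutant");
    }
}
```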

If you're a fan of test-driven development (TDD), it espouses the rule that not a single line of production code should be added unless a failing test forces that line to be added. A corollary is that if you change a single line of production code in any meaningful way, some test should fail.

So, let's have a look at what mutation testing says about our initial flawed algorithm. We'll use Pitest (also known as PIT). We'll go back to our initial algorithm and the point where we erroneously thought we had 100% coverage. When we run Pitest, we get the following result:

2022-07-14 23_57_17-index.html.png

And looking at the code we see:

2022-07-14 23_53_51-MathUtil.java.html.png

With output including some statistics:

================================================================================
- Statistics
================================================================================
>> Line Coverage: 7/8 (88%)
>> Generated 6 mutations Killed 4 (67%)
>> Mutations with no coverage 0. Test strength 67%
>> Ran 26 tests (4.33 tests per mutation)

What is this telling us? Pitest mutated our code in ways that you might expect to break it, but our testsuite passed (survived) in a couple of instances. That means one of two things: either there are multiple valid implementations of our algorithm and Pitest found one of those equivalent solutions, or our testsuite is lacking some key testcases. In our case, we know that the testsuite was insufficient.

Let's run it again but this time with all of our tests and the corrected algorithm.

Pitest results for the corrected MathUtil.java

The output when running the test has also changed slightly:

================================================================================
- Statistics
================================================================================
>> Line Coverage: 6/7 (86%)
>> Generated 4 mutations Killed 3 (75%)
>> Mutations with no coverage 0. Test strength 75%
>> Ran 25 tests (6.25 tests per mutation)

The warnings from Pitest have reduced but not gone away completely, and our test strength has gone up but is still not 100%. We are in better shape than before, but should we be concerned?

It turns out in this case, we don't need to worry (too much). As an example, an equally valid algorithm for our function under test would replace the conditional with "c >= Math.min(a, b)". Note the greater-than-or-equals operator rather than just greater-than. For this algorithm, a different path would be taken when c equals a or b, but the end result would be the same. So, that would be an inconsequential or equivalent mutation. In such a case, there may be no additional testcase that we can write to keep Pitest happy. We have to be aware of this possible outcome when using this technique.
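To make this concrete, here is a Java sketch of what such an equivalent pair might look like. The exact body of sumBiggestPair isn't shown in this excerpt, so treat this implementation as an assumption consistent with the conditional quoted above:

```java
public class MathUtil {

    // Assumed implementation: sum the two biggest of the three values
    public static int sumBiggestPair(int a, int b, int c) {
        int total = a + b;
        if (c > Math.min(a, b)) {
            total = Math.max(a, b) + c;
        }
        return total;
    }

    // Equivalent mutant: when c ties with the smaller of a and b,
    // a different branch is taken but the returned total is identical
    public static int sumBiggestPairMutant(int a, int b, int c) {
        int total = a + b;
        if (c >= Math.min(a, b)) {
            total = Math.max(a, b) + c;
        }
        return total;
    }
}
```

No testcase can tell these two methods apart, which is why such a surviving mutant doesn't necessarily indicate a missing test.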

Finally, let's look at our build file that ran Spock, Jacoco and Pitest:

plugins {
    id 'info.solidsoft.pitest' version '1.7.4'
}
apply plugin: 'groovy'

repositories {
    mavenCentral()
}

dependencies {
    implementation "org.apache.groovy:groovy-test-junit5:4.0.3"
    testImplementation("org.spockframework:spock-core:2.2-M3-groovy-4.0") {
        transitive = false
    }
}

pitest {
    junit5PluginVersion = '1.0.0'
    pitestVersion = '1.9.2'
    timestampedReports = false
    targetClasses = ['util.*']
}

tasks.named('test') {
    useJUnitPlatform()
}

The astute reader might note some subtle hints which show that the latest Spock versions run on top of the JUnit 5 platform.

Using Property-based Testing

Property-based testing is another technology which probably deserves much more attention. Here we'll use jqwik which runs on top of JUnit5 but you might also like to consider Genesis which provides random generators and especially targets Spock.

Earlier, we looked at writing more tests to make our coverage stronger. Property-based testing can often lead to writing fewer tests. Instead, we generate many random tests automatically and see whether certain properties hold.

Previously, we fed in the inputs and the expected output. For property-based testing, the inputs are typically randomly generated values, so we don't know the expected output in advance. Instead of testing directly against some known output, we just check various properties of the answer.
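Before reaching for a library, the core idea can be hand-rolled in a few lines of plain Java. This hypothetical sketch (not from the original post) generates random inputs and checks properties of a known-good function, Collections.sort:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SortPropertyCheck {
    public static void main(String[] args) {
        Random random = new Random(42);
        for (int i = 0; i < 1000; i++) {
            // generate a random input list
            List<Integer> input = new ArrayList<>();
            int size = random.nextInt(20);
            for (int j = 0; j < size; j++) input.add(random.nextInt(100));

            List<Integer> sorted = new ArrayList<>(input);
            Collections.sort(sorted);

            // property 1: output is in non-decreasing order
            for (int j = 1; j < sorted.size(); j++) {
                if (sorted.get(j - 1) > sorted.get(j)) throw new AssertionError("not ordered");
            }
            // property 2: output is a permutation of the input
            List<Integer> copy = new ArrayList<>(input);
            for (Integer x : sorted) {
                if (!copy.remove(x)) throw new AssertionError("not a permutation");
            }
            if (!copy.isEmpty()) throw new AssertionError("not a permutation");
        }
        System.out.println("1000 random cases passed");
    }
}
```

Libraries like jqwik add the important extras on top of this idea: input generators, edge-case injection, reproducible seeds, and shrinking of failing samples.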

As an example, here is a test we could use:

@Property
void "result should be bigger than any individual and smaller than sum of all"(
        @ForAll @IntRange(min = 0, max = 1000) Integer a,
        @ForAll @IntRange(min = 0, max = 1000) Integer b,
        @ForAll @IntRange(min = 0, max = 1000) Integer c) {
    def result = sumBiggestPair(a, b, c)
    assert [a, b, c].every { individual -> result >= individual }
    assert result <= a + b + c
}

The @ForAll annotations indicate places where jqwik will insert random values. The @IntRange annotation indicates that we want the random values to be contained between 0 and 1000.

Here we are checking that (at least for small positive numbers) adding the two biggest numbers should be greater than or equal to any individual number and should be less than or equal to adding all three of the numbers. These are necessary but insufficient properties to ensure our system works.

When we run this we see the following output in the logs:

                              |--------------------jqwik--------------------
tries = 1000                  | # of calls to property
checks = 1000                 | # of not rejected calls
generation = RANDOMIZED       | parameters are randomly generated
after-failure = PREVIOUS_SEED | use the previous seed
when-fixed-seed = ALLOW       | fixing the random seed is allowed
edge-cases#mode = MIXIN       | edge cases are mixed in
edge-cases#total = 125        | # of all combined edge cases
edge-cases#tried = 117        | # of edge cases tried in current run
seed = -311315135281003183    | random seed to reproduce generated values

So, we wrote 1 test and 1000 testcases were executed. The number of tests run is configurable; we won't go into the details here. This looks great at first glance. It turns out, however, that this particular property is not very discriminating in terms of the bugs it can find. This test passes for both our original flawed algorithm and the fixed one. Let's try a different property:

@Property
void "sum of any pair should not be greater than result"(
        @ForAll @IntRange(min = 0, max = 1000) Integer a,
        @ForAll @IntRange(min = 0, max = 1000) Integer b,
        @ForAll @IntRange(min = 0, max = 1000) Integer c) {
    def result = sumBiggestPair(a, b, c)
    assert [a + b, b + c, c + a].every { sumOfPair -> result >= sumOfPair }
}

If we calculate the biggest pair, then surely it must be greater than or equal to any arbitrary pair. Trying this on our flawed algorithm gives: 

org.codehaus.groovy.runtime.powerassert.PowerAssertionError:
    assert [a + b, b + c, c + a].every { sumOfPair -> result >= sumOfPair }
            | | |  | | |  | | |  |
            1 1 0  0 2 2  2 3 1  false

                              |--------------------jqwik--------------------
tries = 12                    | # of calls to property
checks = 12                   | # of not rejected calls
generation = RANDOMIZED       | parameters are randomly generated
after-failure = PREVIOUS_SEED | use the previous seed
when-fixed-seed = ALLOW       | fixing the random seed is allowed
edge-cases#mode = MIXIN       | edge cases are mixed in
edge-cases#total = 125        | # of all combined edge cases
edge-cases#tried = 2          | # of edge cases tried in current run
seed = 4830696361996686755    | random seed to reproduce generated values

Shrunk Sample (6 steps)
-----------------------
  arg0: 1
  arg1: 0
  arg2: 2

Original Sample
---------------
  arg0: 247
  arg1: 32
  arg2: 267

Original Error
--------------
org.codehaus.groovy.runtime.powerassert.PowerAssertionError:
    assert [a + b, b + c, c + a].every { sumOfPair -> result >= sumOfPair }
            | | |  | | |  | | |  |
            | | 32 32| 267| | |  false
            | 279    299  | | 247
            247           | 514
                          267

Not only did it find a case which highlighted the flaw, but it shrunk it down to a very simple example. On our fixed algorithm, the 1000 tests pass!

The previous property can be refactored a little: calculate all three pair sums, then compare the result against their maximum. This simplifies the condition somewhat:

@Property
void "result should be the same as alternative oracle implementation"(
        @ForAll @IntRange(min = 0, max = 1000) Integer a,
        @ForAll @IntRange(min = 0, max = 1000) Integer b,
        @ForAll @IntRange(min = 0, max = 1000) Integer c) {
    assert sumBiggestPair(a, b, c) == [a+b, a+c, b+c].max()
}

This approach, where an alternative implementation is used, is known as a test oracle. The alternative implementation might be less efficient, so not ideal for production code, but fine for testing. When revamping or replacing some software, the oracle might be the existing system. When run on our fixed algorithm, we again have 1000 testcases passing.
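The oracle pattern is easy to express in plain Java too. Here is a hypothetical sketch (the sumBiggestPair body is assumed, as it isn't shown in this excerpt) using a slower but obviously correct implementation as the oracle:

```java
import java.util.stream.IntStream;

public class OracleCheck {

    // assumed implementation of the function under test
    static int sumBiggestPair(int a, int b, int c) {
        int total = a + b;
        if (c > Math.min(a, b)) {
            total = Math.max(a, b) + c;
        }
        return total;
    }

    // oracle: less efficient but obviously correct -
    // sum all three values, then drop the smallest
    static int oracle(int a, int b, int c) {
        return a + b + c - IntStream.of(a, b, c).min().getAsInt();
    }

    public static void main(String[] args) {
        // exhaustively compare both implementations over a small domain
        for (int a = 0; a <= 20; a++) {
            for (int b = 0; b <= 20; b++) {
                for (int c = 0; c <= 20; c++) {
                    if (sumBiggestPair(a, b, c) != oracle(a, b, c)) {
                        throw new AssertionError("mismatch for " + a + ", " + b + ", " + c);
                    }
                }
            }
        }
        System.out.println("implementation agrees with oracle");
    }
}
```

The oracle deliberately trades efficiency for obviousness; its only job is to be trivially correct.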

Let's go one step further and remove our @IntRange boundaries on the Integers:

@Property
void "result should be the same as alternative oracle implementation"(
        @ForAll Integer a, @ForAll Integer b, @ForAll Integer c) {
    assert sumBiggestPair(a, b, c) == [a+b, a+c, b+c].max()
}

When we run the test now, we might be surprised:

  org.codehaus.groovy.runtime.powerassert.PowerAssertionError:
    assert sumBiggestPair(a, b, c) == [a+b, a+c, b+c].max()
           |              |  |  |  |   |||  |||  |||  |
           -2147483648    0  1  |  |   0|1  0||  1||  2147483647
                                |  |    1    ||   |2147483647
                                |  false     ||   -2147483648
                                2147483647   |2147483647
                                             2147483647
Shrunk Sample (13 steps)
------------------------
  arg0: 0
  arg1: 1
  arg2: 2147483647

It fails! Is this another bug in our algorithm? Possibly. But it could equally be a bug in our property test. Further investigation is warranted.

It turns out that our algorithm suffers from integer overflow when trying to add 1 to Integer.MAX_VALUE. Our test partially suffers from the same problem, but when we call max(), the negative overflowed value is discarded. There is no universally correct answer as to what should happen in this scenario, so we go back to the customer and check the real requirement. In this case, let's assume the customer is happy for the overflow to occur, since that is what would happen if performing the operation long-hand in Java. With that knowledge, we should fix our test so that it at least passes correctly when overflow occurs.
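The wrap-around behaviour is easy to demonstrate in isolation. This standalone snippet (not from the original post) shows both the overflow and the widening trick used in the fixes below:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int c = Integer.MAX_VALUE;

        // int arithmetic silently wraps around on overflow
        int wrapped = c + 1;
        System.out.println(wrapped);     // -2147483648, i.e. Integer.MIN_VALUE

        // widening one operand to long performs the addition in 64 bits
        long widened = (long) c + 1;
        System.out.println(widened);     // 2147483648
    }
}
```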

We have a number of options to fix this. We saw previously that we can use @IntRange. This is one way to "avoid" the problem, and there are a few similar approaches. For instance, we could use a more confined data type, e.g. Short:

@Property
void checkShort(@ForAll Short a, @ForAll Short b, @ForAll Short c) {
    assert sumBiggestPair(a, b, c) == [a+b, a+c, b+c].max()
}

Or we could use a customised provider method:

@Property
void checkIntegerConstrainedProvider(@ForAll('halfMax') Integer a,
                                     @ForAll('halfMax') Integer b,
                                     @ForAll('halfMax') Integer c) {
    assert sumBiggestPair(a, b, c) == [a+b, a+c, b+c].max()
}

@Provide
Arbitrary<Integer> halfMax() {
    int halfMax = Integer.MAX_VALUE >> 1
    return Arbitraries.integers().between(-halfMax, halfMax)
}

But rather than avoiding the problem, we could change our test so that it allows for the possibility of overflow within sumBiggestPair but doesn't compound the problem with its own overflow. For example, we could use longs to do the calculations within our test:

@Property
void checkIntegerWithLongCalculations(@ForAll Integer a, @ForAll Integer b, @ForAll Integer c) {
    def (al, bl, cl) = [a, b, c]*.toLong()
    assert sumBiggestPair(a, b, c) == [al+bl, al+cl, bl+cl].max().toInteger()
}

Finally, let's again look at our Gradle build file:

apply plugin: 'groovy'

repositories {
    mavenCentral()
}

dependencies {
    testImplementation project(':SumBiggestPair')
    testImplementation "org.apache.groovy:groovy-test-junit5:4.0.3"
    testImplementation "net.jqwik:jqwik:1.6.5"
}

test {
    useJUnitPlatform {
        includeEngines 'jqwik'
    }
}

More information

The examples in this blog post are excerpts from the following repo:

https://github.com/paulk-asert/property-based-testing

Versions used: Gradle 7.5, Groovy 4.0.3, jqwik 1.6.5, pitest 1.9.2, Spock 2.2-M3-groovy-4.0, Jacoco 0.8.8. Tested with JDK 8, 11, 17, 18.

There are many sites with valuable information about the technologies covered here. There are also some great books. Books on Spock include Spock: Up and Running, Java Testing with Spock, and Spocklight Notebook. Books on Groovy include Groovy in Action and Learning Groovy 3. If you want general information about using Java and Groovy together, consider Making Java Groovy. And there's a section on mutation testing in Practical Unit Testing with TestNG and Mockito. The most recent book on property testing targets the Erlang and Elixir languages.

Conclusion

We have looked at testing Java code using Groovy and Spock with some additional tools like Jacoco, jqwik and Pitest. Generally, using Groovy to test Java is a straightforward experience. Groovy also lends itself to writing testing DSLs which allow non-hard-core programmers to write very simple looking tests; but that's a topic for another blog!


Parsing JSON with Groovy

by paulk


Posted on Sunday July 10, 2022 at 02:00PM in Technology


Groovy has excellent support for processing a range of structured data formats like JSON, TOML, YAML, etc. This blog post looks at JSON.

There is quite good documentation on this topic as part of the Groovy documentation. There are also numerous online sources for more details including Groovy - JSON tutorial, Working with JSON in Groovy, and Groovy Goodness: Relax... Groovy Will Parse Your Wicked JSON to name just a few. This post does a quick summary and provides more setup information and details about various options.

Batteries included experience

If you have installed the Groovy installation zip (or .msi on Windows), you will have the `groovy-json` module which includes JsonSlurper, so the bulk of the examples shown here and in the other mentioned links should work out of the box.

JsonSlurper is the main class for parsing JSON.

JsonSlurper in GroovyConsole

This example shows parsing JSON embedded in a string but there are other methods for parsing files, URLs and other streams.

Another example using groovysh:

paulk@pop-os:~$ groovysh
Groovy Shell (4.0.3, JVM: 18.0.1)
Type ':help' or ':h' for help.
--------------------------------------------------------------------------------------------------
groovy:000> new groovy.json.JsonSlurper().parseText('{ "myList": [1, 3, 5] }').myList
===> [1, 3, 5]

Or using a Jupyter/BeakerX notebook:

JsonSlurper in Jupyter notebook

Similarly, if you point your IDE to your Groovy distribution, you should be able to run the examples directly.

Gradle

If you are using a build tool like Gradle, you may prefer to reference your dependencies from a dependency repository rather than having a locally installed distribution.

Suppose you have the following test using JsonSlurper in the file src/test/groovy/JsonTest.groovy:

import groovy.json.JsonSlurper
import org.junit.Test

class JsonTest {
    @Test
    void testJson() {
        def text = '{"person":{"name":"Guillaume","age":33,"pets":["dog","cat"]}}'
        def json = new JsonSlurper().parseText(text)
        assert json.person.pets.size() == 2
    }
}

You can reference the relevant Groovy dependencies, in our case groovy-json and groovy-test, in a build.gradle file like below:

apply plugin: 'groovy'

repositories {
    mavenCentral()
}

dependencies {
    testImplementation "org.apache.groovy:groovy-json:4.0.3" // for JsonSlurper
    testImplementation "org.apache.groovy:groovy-test:4.0.3" // for tests
}

Both these artifacts bring in the core groovy artifact transitively, so there's no need to reference that explicitly.

Running gradle test should run the tests and produce a report:

Test results for Class JsonTest.png

You can if you prefer, use the groovy-all artifact like this:

apply plugin: 'groovy'

repositories {
    mavenCentral()
}

dependencies {
    testImplementation "org.apache.groovy:groovy-all:4.0.3"
}

This artifact contains no jars but has all of the common Groovy modules as transitive dependencies.

[Note: In early Groovy 4 versions you may have needed to reference Groovy as a platform, e.g.: testImplementation platform("org.apache.groovy:groovy-all:4.0.1"). This is now only required when using the groovy-bom artifact.]

Maven

When using the Maven build tool, you would instead create a pom.xml like this and make use of two plugins: gmavenplus-plugin and maven-surefire-plugin:

<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>myGroupId</groupId>
    <artifactId>groovy-json-maven</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.gmavenplus</groupId>
                <artifactId>gmavenplus-plugin</artifactId>
                <version>1.13.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compileTests</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>3.0.0-M7</version>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.groovy</groupId>
            <artifactId>groovy-json</artifactId>
            <version>4.0.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.groovy</groupId>
            <artifactId>groovy-test</artifactId>
            <version>4.0.3</version>
        </dependency>
    </dependencies>
</project>

Alternatively, you could once again reference the groovy-all artifact as per this alternate build file:

<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>myGroupId</groupId>
    <artifactId>groovy-json-maven</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.gmavenplus</groupId>
                <artifactId>gmavenplus-plugin</artifactId>
                <version>1.13.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compileTests</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>3.0.0-M7</version>
                <dependencies>
                    <dependency>
                        <groupId>org.apache.maven.surefire</groupId>
                        <artifactId>surefire-junit47</artifactId>
                        <version>3.0.0-M7</version>
                    </dependency>
                </dependencies>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.groovy</groupId>
            <artifactId>groovy-all</artifactId>
            <type>pom</type>
            <version>4.0.3</version>
        </dependency>
    </dependencies>
</project>

When referencing the groovy-all artifact, we specify that it is a pom artifact using "<type>pom</type>". We also needed to configure the surefire plugin to use JUnit4: the groovy-all artifact also brings in JUnit5 support, and surefire would use that by default and not find our JUnit4 test.

Running the test should yield:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running JsonTest
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.36 s - in JsonTest

Advanced features

To cater for different scenarios, JsonSlurper is powered by several internal implementation classes. You don't access these classes directly but rather set a parser type when instantiating your slurper.

Type              When to use
CHAR_BUFFER       Default, least surprise parser with eager parsing of ints, dates, etc.
INDEX_OVERLAY     For REST calls, WebSocket messages, AJAX, inter-process communication. Fastest parser which uses indexes into some existing char buffer.
CHARACTER_SOURCE  For handling larger JSON files.
LAX               Allows comments and no quotes or single quotes in numerous situations.

Here is an example:

import groovy.json.JsonSlurper
import static groovy.json.JsonParserType.*

def slurper = new JsonSlurper(type: LAX)
def json = slurper.parseText('''{person:{'name':"Guillaume","age":33,"pets":["dog" /* ,"cat" */]}}''')
assert json.person.pets == ['dog']

Note the missing quotes for the person key, the single quotes for the name key, and that "cat" has been commented out. These changes wouldn't be allowed by a strict JSON parser.

Other JSON libraries

Groovy doesn't require you to use the groovy-json classes. You can use your favourite Java library with Groovy. You'll still benefit from many of Groovy's short-hand notations.

Here's an example using Gson:

@Grab('com.google.code.gson:gson:2.9.0')
import com.google.gson.JsonParser

def parser = new JsonParser()
def json = parser.parse('{"person":{"name":"Guillaume","age":33,"pets":["dog","cat"]}}')
assert json.person.pets*.asString == ['dog', 'cat']

Here's an example using the Jackson JSON support:

@Grab('com.fasterxml.jackson.core:jackson-databind:2.13.3')
import com.fasterxml.jackson.databind.ObjectMapper

def text = '{"person":{"name":"Guillaume","age":33,"pets":["dog","cat"]}}'
def json = new ObjectMapper().readTree(text)
assert json.person.pets*.asText() == ['dog', 'cat']

Integrated query

Groovy 4 also supports language integrated query syntax, known as GINQ or GQuery. We can use that with JSON too.

Suppose we have information in JSON format about fruits, their prices (per 100g) and the concentration of vitamin C (per 100g):

{
    "prices": [
        {"name": "Kakuda plum", "price": 13},
        {"name": "Camu camu", "price": 25},
        {"name": "Acerola cherries", "price": 39},
        {"name": "Guava", "price": 2.5},
        {"name": "Kiwifruit", "price": 0.4},
        {"name": "Orange", "price": 0.4}
    ],
    "vitC": [
        {"name": "Kakuda plum", "conc": 5300},
        {"name": "Camu camu", "conc": 2800},
        {"name": "Acerola cherries", "conc": 1677},
        {"name": "Guava", "conc": 228},
        {"name": "Kiwifruit", "conc": 144},
        {"name": "Orange", "conc": 53}
    ]
}

Now, suppose we are on a budget and want to select the most cost-effective fruits to buy to help us achieve our daily vitamin C requirements. We join the prices and vitC information and order by most cost-effective fruit. We’ll select the top 2 in case our first choice isn’t in stock when we go shopping. Our GQuery processing looks like this:

def jsonFile = new File('fruit.json')
def json = new JsonSlurper().parse(jsonFile)
assert GQ {
    from p in json.prices
    join c in json.vitC on c.name == p.name
    orderby c.conc / p.price in desc
    limit 2
    select p.name
}.toList() == ['Kakuda plum', 'Kiwifruit']

We can see that, for this data, Kakuda plums followed by Kiwifruit are our best choices.

Quick performance comparison

As a very crude measure of performance, JsonSlurper with all 4 parser types, as well as Gson and Jackson, was used to parse the timezone values from https://github.com/flowcommerce/json-reference and check that the current timezone in Brisbane is the same as the timezone in Sydney. The json file is by no means huge. It has just under 3000 lines and is under 60K in size. The best time from 4 runs (including compilation time) was recorded - definitely a micro-benchmark which shouldn't be taken too seriously, but it might be a rough guide. Just for fun, a native version of the Groovy JsonSlurper script with the type set to INDEX_OVERLAY was made using GraalVM. Its timings are included too.

$ time groovy GroovyJsonIndexOverlay.groovy
real    0m1.365s
user    0m4.157s
sys     0m0.145s

$ time groovy GroovyJsonCharacterSource.groovy
real    0m1.447s
user    0m4.472s
sys     0m0.174s

$ time groovy GroovyJsonLax.groovy
real    0m1.452s
user    0m4.338s
sys     0m0.171s

$ time groovy GroovyJson.groovy
real    0m1.383s
user    0m4.050s
sys     0m0.165s

$ time groovy Gson.groovy
real    0m1.814s
user    0m5.543s
sys     0m0.209s

$ time groovy Jackson.groovy
real    0m2.007s
user    0m6.332s
sys     0m0.208s

$ time ./groovyjsonindexoverlay
real    0m0.015s
user    0m0.011s
sys     0m0.004s

Summary

We have seen the basics of setting up our projects to parse JSON using Groovy and some of the numerous options available to use depending on the scenario. We also saw how to use other JSON libraries, utilize GQuery syntax during our processing, and looked at some very crude performance figures.