
Whiskey Clustering with Groovy and Apache Ignite

by paulk


Posted on Thursday October 27, 2022 at 11:13AM in Technology


In a previous blog post, we looked at using Apache Wayang (incubating) and Apache Spark to scale up the k-means clustering algorithm. Let's look at another useful technology for scaling up this problem, Apache Ignite. They recently released a new version, but earlier versions are also fine for our example. Before we start, a quick reminder of the problem.

Whiskey Clustering

This problem looks at the quest to find the perfect single-malt Scotch whiskey. The whiskies produced by 86 distilleries have been ranked by expert tasters according to 12 criteria (Body, Sweetness, Malty, Smoky, Fruity, etc.). We'll use a K-means algorithm to calculate the centroids.

whiskey_bottles.jpg

K-means is a standard data-science clustering technique. In our case, it groups whiskies with similar characteristics (according to the 12 criteria) into clusters. If we have a favourite whiskey, chances are we can find something similar by looking at other instances in the same cluster. If we are feeling like a change, we can look for a whiskey in some other cluster. The centroid is the notional "point" in the middle of the cluster. For us it reflects the typical measure of each criterion for a whiskey in that cluster.

Apache Ignite

Apache Ignite is a distributed database for high-performance computing with in-memory speed. It makes a cluster (or grid) of nodes appear like an in-memory cache.

This explanation drastically simplifies Ignite's feature set. Ignite can be used as:

  • an in-memory cache with special features like SQL querying and transactional properties
  • an in-memory data-grid with advanced read-through & write-through capabilities on top of one or more distributed databases
  • an ultra-fast and horizontally scalable in-memory database
  • a high-performance computing engine for custom or built-in tasks including machine learning

It is mostly this last capability that we will use. Ignite's Machine Learning API has purpose-built, cluster-aware machine learning and deep learning algorithms for Classification, Regression, Clustering, and Recommendation, among others. We'll use the distributed K-means Clustering algorithm from their library.

2022-10-27 21_17_19-Machine Learning _ Ignite Documentation.png

Implementation Details

Apache Ignite has special capabilities for reading data into the cache. We could use IgniteDataStreamer or IgniteCache.loadCache() and load data from files, stream sources, various database sources and so forth. This is particularly relevant when using a cluster.

For our little example, our data is in a relatively small CSV file and we will be using a single node, so we'll just read our data using Apache Commons CSV:

var file = getClass().classLoader.getResource('whiskey.csv').file as File
var rows = file.withReader {r -> RFC4180.parse(r).records*.toList() }
var data = rows[1..-1].collect{ it[2..-1]*.toDouble() } as double[][]

We'll configure our single node Ignite data cache using code (but we could place the details in a configuration file in more complex scenarios):

var cfg = new IgniteConfiguration(
    peerClassLoadingEnabled: true,
    discoverySpi: new TcpDiscoverySpi(
        ipFinder: new TcpDiscoveryMulticastIpFinder(
            addresses: ['127.0.0.1:47500..47509']
        )
    )
)

We'll create a few helper variables:

var features = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
    'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
var pretty = this.&sprintf.curry('%.4f')
var dist = new EuclideanDistance()
var vectorizer = new DoubleArrayVectorizer().labeled(FIRST)

Now we start the node, populate the cache, run our k-means algorithm, and print the result.

Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    var dataCache = ignite.createCache(new CacheConfiguration<Integer, double[]>(
        name: "TEST_${UUID.randomUUID()}",
        affinity: new RendezvousAffinityFunction(false, 10)))
    data.indices.each { int i -> dataCache.put(i, data[i]) }
    var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5)
    var mdl = trainer.fit(ignite, dataCache, vectorizer)
    println ">>> KMeans centroids:\n${features.join(', ')}"
    var centroids = mdl.centers*.all()
    centroids.each { c -> println c*.get().collect(pretty).join(', ') }
    dataCache.destroy()
}

Results

Here is the output:

[18:13:11]    __________  ________________
[18:13:11]   /  _/ ___/ |/ /  _/_  __/ __/
[18:13:11]  _/ // (7 7    // /  / / / _/
[18:13:11] /___/\___/_/|_/___/ /_/ /x___/
[18:13:11]
[18:13:11] ver. 2.14.0#20220929-sha1:951e8deb
[18:13:11] 2022 Copyright(C) Apache Software Foundation
...
[18:13:11] Configured plugins:
[18:13:11]   ^-- ml-inference-plugin 1.0.0
[18:13:14] Ignite node started OK (id=f731e4ab)
...
>>> Ignite grid started for data: 86 rows X 13 cols
>>> KMeans centroids
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
2.7037, 2.4444, 1.4074, 0.0370, 0.0000, 1.8519, 1.6667, 1.8519, 1.8889, 2.0370, 2.1481, 1.6667
1.8500, 1.9000, 2.0000, 0.9500, 0.1500, 1.1000, 1.5000, 0.6000, 1.5500, 1.7000, 1.3000, 1.5000
1.2667, 2.1333, 0.9333, 0.1333, 0.0000, 1.0667, 0.8000, 0.5333, 1.8000, 1.7333, 2.2667, 2.2667
3.6667, 1.5000, 3.6667, 3.3333, 0.6667, 0.1667, 1.6667, 0.5000, 1.1667, 1.3333, 1.1667, 0.1667
1.5000, 2.8889, 1.0000, 0.2778, 0.1667, 1.0000, 1.2222, 0.6111, 0.5556, 1.7778, 1.6667, 2.0000
[18:13:15] Ignite node stopped OK [uptime=00:00:00.663]

We can plot the centroid characteristics in a spider plot.

2022-10-27 20_42_01-Whiskey clusters with Apache Ignite.png


More Information

  • Repo containing the source code: WhiskeyIgnite
  • Repo containing similar examples using a variety of libraries including Apache Commons CSV, Weka, Smile, Tribuo and others: Whiskey
  • A similar example using Apache Spark directly but with a built-in parallelized k-means from the spark-mllib library rather than a hand-crafted algorithm: WhiskeySpark


Groovy Dates And Times Cheat Sheet

by paulk


Posted on Monday October 24, 2022 at 07:27AM in Technology


Java has had a Date class from the very beginning and Groovy supports using it and several related classes like Calendar. Throughout this blog post we refer to those classes as the legacy date classes. Groovy enhances the experience of using the legacy date classes with simpler mechanisms for formatting, parsing and extracting fields from the related classes.

Since Java 8, the JDK has included the JSR-310 Date Time API. We refer to these classes as the new date classes. The new date classes remove many limitations of the legacy date classes and bring some greatly appreciated additional consistency. Groovy provides similar enhancements for the new date classes too.

Groovy's enhancements for the legacy date classes are in the groovy-dateutil module (prior to Groovy 2.5, this functionality was built into the core module). The groovy-datetime module has the enhancements for the new date classes. You can include a dependency on this module in your build file or reference the groovy-all pom dependency. Both modules are part of a standard Groovy install.
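For example, with Gradle the module dependencies might look like the following sketch (the org.apache.groovy group applies to Groovy 4; the version shown is illustrative):

```groovy
dependencies {
    implementation 'org.apache.groovy:groovy-dateutil:4.0.6' // legacy date class enhancements
    implementation 'org.apache.groovy:groovy-datetime:4.0.6' // JSR-310 ("new") date class enhancements
}
```

For Groovy 2.5 and 3, use the org.codehaus.groovy group instead.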

The next few sections illustrate common date and time tasks and the code to perform them using the new and legacy classes with Groovy enhancements in numerous places.

Please note that some of the formatting commands are Locale dependent, so the output may vary slightly if you run these examples yourself.

Representing the current date/time

The legacy date classes have an abstraction which includes date and time. If you are only interested in one of those two aspects, you simply ignore the other aspect. The new date classes allow you to have date-only, time-only and date-time representations.

The examples create instances representing the current date and/or time. Various information is extracted from the instances and they are printed in various ways. Some of the examples use the SV macro which prints the name and string value of one or more variables.

task | java.time | legacy

current date and time

println LocalDateTime.now()      
println Instant.now()
2022-10-24T12:40:02.218130200
2022-10-24T02:40:02.223131Z
println new Date()               
println Calendar.instance.time
Mon Oct 24 12:40:02 AEST 2022
Mon Oct 24 12:40:02 AEST 2022
day of current year &
day of current month
println LocalDateTime.now().dayOfYear
println LocalDateTime.now().dayOfMonth
297
24
println Calendar.instance[DAY_OF_YEAR]
println Calendar.instance[DAY_OF_MONTH]
297
24
extract today's
year, month & day
var now = LocalDate.now()     // or LocalDateTime

println SV(now.year, now.monthValue, now.dayOfMonth)

(Y, M, D) = now[YEAR, MONTH_OF_YEAR, DAY_OF_MONTH]
println "Today is $Y $M $D"
now.year=2022, now.monthValue=10, now.dayOfMonth=24
Today is 2022 10 24
var now = Calendar.instance
(_E, Y, M, _WY, _WM, D) = now
println "Today is $Y ${M+1} $D"

(Y, M, D) = now[YEAR, MONTH, DAY_OF_MONTH]
println "Today is $Y ${M+1} $D"
Today is 2022 10 24
Today is 2022 10 24
alternatives to print today
println now.format("'Today is 'YYYY-MM-dd")
printf 'Today is %1$tY-%1$tm-%1$te%n', now
Today is 2022-10-24
Today is 2022-10-24
println now.format("'Today is 'YYYY-MM-dd")
printf 'Today is %1$tY-%1$tm-%1$te%n', now
Today is 2022-10-24
Today is 2022-10-24

extract parts of current time

now = LocalTime.now() // or LocalDateTime
println SV(now.hour, now.minute, now.second)
(H, M, S) = now[HOUR_OF_DAY, MINUTE_OF_HOUR,
SECOND_OF_MINUTE]
printf 'The time is %02d:%02d:%02d\n', H, M, S
now.hour=12, now.minute=40, now.second=2
The time is 12:40:02
(H, M, S) = now[HOUR_OF_DAY, MINUTE, SECOND]

println SV(H, M, S)
printf 'The time is %02d:%02d:%02d%n', H, M, S
H=12, M=40, S=2
The time is 12:40:02
alternatives to print time
println now.format("'The time is 'HH:mm:ss")
printf 'The time is %1$tH:%1$tM:%1$tS%n', now
The time is 12:40:02
The time is 12:40:02
println now.format("'The time is 'HH:mm:ss")
printf 'The time is %1$tH:%1$tM:%1$tS%n', now
The time is 12:40:02
The time is 12:40:02

Processing times

The new date classes have a LocalTime class specifically for representing time-only quantities. The legacy date classes don't have such a purpose-built abstraction; you essentially just ignore the day, month, and year parts of a date. The java.sql.Time class could be used as an alternative but rarely is. The Java documentation comparing the new date classes to their legacy equivalents talks about using GregorianCalendar with the date set to the epoch value of 1970-01-01 as an approximation of the LocalTime class. We'll follow that approach here to provide a comparison, but we strongly recommend upgrading to the new classes if you need to represent time-only values, or using the Joda-Time library on JDK versions prior to 8.

The examples look at representing a minute before and after midnight, and some times at which you might eat your meals. For the meals, as well as printing various values, we might be interested in calculating new times in terms of existing times, e.g. lunch and dinner are 7 hours apart.
task | java.time | legacy
one min after midnight
LocalTime.of(0, 1).with {
println format('HH:mm')
println format('hh:mm a')
println format('K:mm a')
}
00:01
12:01 am
0:01 am
Calendar.instance.with {
clear()
set(MINUTE, 1)
println format('HH:mm')
println format('hh:mm a')
println format('K:mm a')
}
00:01
12:01 am
0:01 am
one min before midnight
LocalTime.of(23, 59).with {
println format('HH:mm')
println format('hh:mm a')
println format('K:mm a')
}
23:59
11:59 pm
11:59 pm
Calendar.instance.with {
clear()
set(hourOfDay: 23, minute: 59)
println format('HH:mm')
println format('hh:mm a')
println format('K:mm a')
}
23:59
11:59 pm
11:59 pm
meal times
var breakfast = LocalTime.of(7, 30)
var lunch = LocalTime.parse('12:30')
assert lunch == LocalTime.parse('12:30.00 pm', 'hh:mm.ss a')
lunch.with { assert hour == 12 && minute == 30 }
var dinner = lunch.plusHours(7)
assert dinner == lunch.plus(7, ChronoUnit.HOURS)
assert Duration.between(lunch, dinner).toHours() == 7
assert breakfast.isBefore(lunch) // Java API
assert lunch < dinner // Groovy shorthand
assert lunch in breakfast..dinner
assert dinner.format('hh:mm a') == '07:30 pm'
assert dinner.format('k:mm') == '19:30'
assert dinner.format(FormatStyle.MEDIUM) == '7:30:00 pm'
assert dinner.timeString == '19:30:00'
var breakfast = Date.parse('hh:mm', '07:30')
var lunch = Calendar.instance.tap {
clear()
set(hourOfDay: 12, minute: 30)
}
assert lunch[HOUR_OF_DAY, MINUTE] == [12, 30]
var dinner = lunch.clone().tap { it[HOUR_OF_DAY] += 7 }
assert dinner == lunch.copyWith(hourOfDay: 19)
assert dinner.format('hh:mm a') == '07:30 pm'
assert dinner.format('k:mm') == '19:30'
assert dinner.time.timeString == '7:30:00 pm'
assert breakfast.before(lunch.time) // Java API
assert lunch < dinner // Groovy shorthand

Processing dates

To represent date-only information with the legacy date classes, you set the time aspects to zero, or simply ignore them. Alternatively, you could consider the less commonly used java.sql.Date class. The new date classes have the special LocalDate class for this purpose which we highly recommend.

The examples create dates for Halloween and Melbourne Cup day (a public holiday in the Australian state of Victoria). We look at various properties of those two dates.

task | java.time | legacy
holidays
var halloween22 = LocalDate.of(2022, 10, 31)
var halloween23 = LocalDate.parse('2023-Oct-31', 'yyyy-LLL-dd')
assert halloween22 == halloween23 - 365
assert halloween23 == halloween22.plusYears(1)
var melbourneCup22 = LocalDate.of(2022, 11, 1)
assert halloween22 < melbourneCup22
assert melbourneCup22 - halloween22 == 1
assert Period.between(halloween22, melbourneCup22).days == 1
assert ChronoUnit.DAYS.between(halloween22, melbourneCup22) == 1L
var days = []
halloween22.upto(melbourneCup22) {days << "$it.dayOfWeek" }
assert days == ['MONDAY', 'TUESDAY']
var hols = halloween22..melbourneCup22
assert hols.size() == 2
var halloween21 = Date.parse('dd/MM/yyyy', '31/10/2021')
var halloween22 = Date.parse('yyyy-MMM-dd', '2022-Oct-31')
assert halloween21 + 365 == halloween22
var melbourneCup22 = new GregorianCalendar(2022, 10, 1).time
assert melbourneCup22.dateString == '1/11/22' // AU Locale
assert halloween22 < melbourneCup22
assert melbourneCup22 - halloween22 == 1
assert melbourneCup22 == halloween22.copyWith(month: 10, date: 1)
var days = []
halloween22.upto(melbourneCup22) { days << it.format('EEEEE') }
assert days == ['Monday', 'Tuesday']
var hols = halloween22..melbourneCup22
assert hols.size() == 2

Processing date and time combinations

The new date classes use LocalDateTime to represent a quantity with both date and time aspects. Many of the methods seen earlier will also be applicable here.

The examples show creating and printing a representation of lunch on Melbourne Cup day.

task | java.time | legacy
holidays
var melbourneCupLunch = LocalDateTime.of(2022, 11, 1, 12, 30)
assert melbourneCupLunch.timeString == '12:30:00'
assert melbourneCupLunch.dateString == '2022-11-01'
assert melbourneCupLunch.dateTimeString == '2022-11-01T12:30:00'
assert melbourneCupLunch.toLocalDate() == melbourneCup22
assert melbourneCupLunch.toLocalTime() == lunch
assert melbourneCupLunch == melbourneCup22 << lunch
var melbourneCupLunch = new GregorianCalendar(2022, 10, 1, 12, 30).time
assert melbourneCupLunch.timeString == '12:30:00 pm' // Locale specific
assert melbourneCupLunch.dateString == '1/11/22' // Locale specific
assert melbourneCupLunch.dateTimeString == '1/11/22, 12:30:00 pm' // Locale specific
assert melbourneCupLunch.clearTime() == melbourneCup22

Processing zoned date and times

The legacy date classes have the concept of a TimeZone, predominantly used by the Calendar class. The new date classes have a similar concept but use the ZoneId, ZoneOffset, and ZonedDateTime classes (among others).

The examples show various properties of zones, and show that during breakfast on Melbourne Cup day, it would still be the night before (Halloween) in Los Angeles. They also show that those two zones are 18 hours apart at that time of the year.

task | java.time | legacy
holidays
var aet = ZoneId.of('Australia/Sydney')
assert aet.fullName == 'Australian Eastern Time' && aet.shortName == 'AET'
assert aet.offset == ZoneOffset.of('+11:00')
var melbCupBreakfastInAU = ZonedDateTime.of(melbourneCup22, breakfast, aet)
var melbCupBreakfast = LocalDateTime.of(melbourneCup22, breakfast)
assert melbCupBreakfastInAU == melbCupBreakfast << aet
var pst = ZoneId.of('America/Los_Angeles')
assert pst.fullName == 'Pacific Time' && pst.shortName == 'GMT-08:00'
var meanwhileInLA = melbCupBreakfastInAU.withZoneSameInstant(pst)
assert halloween22 == meanwhileInLA.toLocalDate()
assert aet.offset.hours - pst.offset.hours == 18
var aet = TimeZone.getTimeZone('Australia/Sydney')
assert aet.displayName == 'Australian Eastern Standard Time'
assert aet.observesDaylightTime()
var melbourneCupBreakfast = new GregorianCalendar(aet).tap {
set(year: 2022, month: 10, date: 1, hourOfDay: 7, minute: 30)
}
var pst = TimeZone.getTimeZone('America/Los_Angeles')
assert pst.displayName == 'Pacific Standard Time'
var meanwhileInLA = new GregorianCalendar(pst).tap {
setTimeInMillis(melbourneCupBreakfast.timeInMillis)
}
assert meanwhileInLA.time.format('MMM dd', pst) == halloween22.format('MMM dd')
assert aet.rawOffset / 3600000 - pst.rawOffset / 3600000 == 18

Other useful classes

The new date classes offer a few more useful classes. Here are some of the common ones:

  • OffsetDateTime - like ZonedDateTime but with just an offset from UTC rather than a full time-zone
  • Instant - like OffsetDateTime but tied to UTC
  • YearMonth - like a LocalDate but with no day component
  • MonthDay - like a LocalDate but with no year component
  • Period - used to represent periods of time, e.g. Period.ofDays(14), Period.ofYears(2); see also the LocalDate example above.
  • Duration - a time-based amount of time, e.g. Duration.ofSeconds(30), Duration.ofHours(7); see also the LocalTime example above.
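As a quick illustration of those last two bullets, here is a sketch in plain Java (which also runs as Groovy), reusing dates and times from the earlier examples:

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.Period;

public class Amounts {
    public static void main(String[] args) {
        // date-based amount: a fortnight after Melbourne Cup day
        LocalDate cup = LocalDate.of(2022, 11, 1);
        System.out.println(cup.plus(Period.ofDays(14)));     // 2022-11-15

        // time-based amount: dinner is 7 hours after lunch
        LocalTime lunch = LocalTime.of(12, 30);
        System.out.println(lunch.plus(Duration.ofHours(7))); // 19:30
    }
}
```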

Conversions

It is often useful to convert between the new and legacy classes. Some useful conversion methods are shown below (many of them are Groovy enhancements).

From | Conversion method/property
GregorianCalendar | toInstant(), toZonedDateTime(), from(ZonedDateTime)
Calendar | toInstant(), toZonedDateTime(), toOffsetDateTime(), toLocalDateTime(), toLocalDate(), toLocalTime(), toOffsetTime(), toDayOfWeek(), toYear(), toYearMonth(), toMonth(), toMonthDay(), zoneOffset, zoneId
Date | toInstant(), from(Instant), toZonedDateTime(), toOffsetDateTime(), toLocalDateTime(), toLocalDate(), toLocalTime(), toOffsetTime(), toDayOfWeek(), toYear(), toYearMonth(), toMonth(), toMonthDay(), zoneOffset, zoneId
ZonedDateTime, OffsetDateTime, LocalDateTime, LocalDate, LocalTime | toDate(), toCalendar()
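The standard (non-Groovy) methods among these round-trip as you'd expect; here's a small sketch in plain Java (also valid Groovy):

```java
import java.time.ZonedDateTime;
import java.util.Date;
import java.util.GregorianCalendar;

public class Conversions {
    public static void main(String[] args) {
        // legacy -> new: GregorianCalendar knows its own time zone
        GregorianCalendar cal = new GregorianCalendar(2022, 10, 1); // month 10 = November
        ZonedDateTime zdt = cal.toZonedDateTime();
        System.out.println(zdt.toLocalDate());           // 2022-11-01

        // new -> legacy: static from() factory methods
        Date date = Date.from(zdt.toInstant());
        GregorianCalendar back = GregorianCalendar.from(zdt);
        System.out.println(date.equals(back.getTime())); // true
    }
}
```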

SimpleDateFormat patterns

We saw several examples above using the format and parse methods. For the legacy date classes, numerous Groovy enhancements delegate to SimpleDateFormat. This class represents date/time formats using pattern strings: special letters represent time or date components, mixed with escaped literal text. The letters are often repeated to indicate the minimum field width for number components, and whether the full or an abbreviated form is used for other components.

As an example, for the U.S. locale and U.S. Pacific Time time zone, the following pattern:

yyyy.MM.dd G 'at' HH:mm:ss z
would apply to the following text:
2001.07.04 AD at 12:08:56 PDT
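That round-trip can be checked with a few lines of Java (Groovy's format and parse enhancements delegate to the same class); the locale and time zone are set explicitly so that AD and PDT render as shown:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class PatternDemo {
    public static void main(String[] args) throws Exception {
        var sdf = new SimpleDateFormat("yyyy.MM.dd G 'at' HH:mm:ss z", Locale.US);
        sdf.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));
        Date d = sdf.parse("2001.07.04 AD at 12:08:56 PDT");
        System.out.println(sdf.format(d)); // round-trips to the same text
    }
}
```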

Letter   Description   Example
G   Era designator   AD
y   Year   1996; 96
Y   Week year (similar to year but allotted by weeks; the first/last few days of a year might be allotted to finish/start the last/previous week)
M   Month in year (context sensitive)   July; Jul; 07
L   Month in year (standalone form)   July; Jul; 07
w   Week in year   27
W   Week in month   2
D   Day in year   189
d   Day in month   10
F   Day of week in month   2
E   Day name in week   Tuesday; Tue
u   Day number of week (1 = Monday, ..., 7 = Sunday)
a   Am/pm marker   PM
H   Hour in day (0-23)   0
k   Hour in day (1-24)   24
K   Hour in am/pm (0-11)   0
h   Hour in am/pm (1-12)   12
m   Minute in hour   30
s   Second in minute   55
S   Millisecond   978
z   Time zone   Pacific Standard Time; PST; GMT-08:00
Z   Time zone (RFC 822)   -0800
X   Time zone (ISO 8601)   -08; -0800; -08:00
'   to escape text, put a single quote on either side
''   two single quotes for a literal single quote   '


DateTimeFormatter patterns

Groovy's format and parse enhancements for the new date classes delegate to the DateTimeFormatter class. Its behavior is similar to what we saw for SimpleDateFormat but with slightly different conversion letters:

Conversion suffix   Description   Example
G   era   AD
u   year   2004; 04
y   year-of-era   2004; 04
D   day-of-year   189
M/L   month-of-year   7; 07; Jul; July; J
d   day-of-month   10
Q/q   quarter-of-year   3; 03; Q3; 3rd quarter
Y   week-based-year   1996; 96
w   week-of-week-based-year   27
W   week-of-month   4
E   day-of-week   Tue; Tuesday; T
e/c   localized day-of-week   2; 02; Tue; Tuesday; T
F   week-of-month   3
a   am-pm-of-day   PM
h   clock-hour-of-am-pm (1-12)   12
K   hour-of-am-pm (0-11)   0
k   clock-hour-of-day (1-24)   24
H   hour-of-day (0-23)   0
m   minute-of-hour   30
s   second-of-minute   55
S   fraction-of-second   978
A   milli-of-day   1234
n   nano-of-second   987654321
N   nano-of-day   1234000000
V   time-zone ID   America/Los_Angeles; Z; -08:30
z   time-zone name   Pacific Standard Time; PST
O   localized zone-offset   GMT+8; GMT+08:00; UTC-08:00
X   zone-offset ('Z' for zero)   Z; -08; -0830; -08:30; -083015; -08:30:15
x   zone-offset   +0000; -08; -0830; -08:30; -083015; -08:30:15
Z   zone-offset   +0000; -0800; -08:00
p   pad next
'   to escape text, put a single quote on either side
''   two single quotes for a literal single quote   '
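Here's the equivalent of the earlier SimpleDateFormat round-trip using DateTimeFormatter (note u for year rather than the year-of-era letter y):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class DtfDemo {
    public static void main(String[] args) {
        var formatter = DateTimeFormatter.ofPattern("uuuu.MM.dd 'at' HH:mm:ss", Locale.US);
        var dt = LocalDateTime.of(2001, 7, 4, 12, 8, 56);
        String text = formatter.format(dt);
        System.out.println(text);                                 // 2001.07.04 at 12:08:56
        System.out.println(LocalDateTime.parse(text, formatter)); // 2001-07-04T12:08:56
    }
}
```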


Formatter formats

The java.util.Formatter class is a base class in Java for various kinds of formatting. It can be used directly, via String.format, parse, printf, or Groovy's sprintf. We saw several examples of using printf and parse formatting in the above examples.

The Formatter class has methods which take a format string as its first argument and zero or more additional arguments. The format string typically has one or more format specifiers (starting with a percent character) which indicate that a formatted version of one of the additional arguments should be placed into the string at that point. The general form of a format specifier is:

%[argument_index$][flags][width][.precision]conversion

Most of the parts are optional. The argument_index part is only used when referencing one of the additional arguments more than once (or out of order). The precision part is only used for floating point numbers. The flags part indicates: always include sign (+), zero padding (0), locale-specific comma delimiters (,), and left justification (-). The width indicates the minimum number of characters for the output. The conversion indicates how the argument should be processed, e.g. as a numeric field, a date, a special character, or with some other special processing. Upper and lowercase variants exist for most conversions; the uppercase variant calls toUpperCase after the conversion is complete.
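A couple of made-up values exercise the specifier parts (argument_index, flags, width, and precision); Locale.US is passed explicitly so the decimal separator is predictable:

```java
import java.util.Locale;

public class SpecifierDemo {
    public static void main(String[] args) {
        // %2$ reuses the second argument; the 0 flag pads to width 5; .1 is the precision
        System.out.println(String.format(Locale.US, "%1$s scored %2$05.1f (%2$+.1f)", "Ada", 9.5));
        // -> Ada scored 009.5 (+9.5)

        // the uppercase variant applies toUpperCase; the - flag left-justifies to width 8
        System.out.println(String.format(Locale.US, "%-8S|", "left"));
        // -> LEFT    |
    }
}
```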

Conversion   Description
'b', 'B'   Treat as a boolean, or false if null
'h', 'H'   Output the argument's hashcode as a hex string
's', 'S'   Treat as a String
'c', 'C'   Treat as a Unicode character
'd'   Treat as a decimal integer
'o'   Treat as an octal integer
'x', 'X'   Treat as a hexadecimal integer
'e', 'E'   Treat as a decimal number in scientific notation
'f'   Treat as a floating point number
'g', 'G'   Treat as a floating point number in either decimal or scientific notation
'a', 'A'   Treat as a hexadecimal floating-point number
't', 'T'   Treat as the prefix for a date/time conversion
'%'   A literal percent
'n'   A line separator

When the date/time prefix is used, additional suffixes are applicable.

For formatting times:

Conversion suffix   Description
'H'   Hour of the day for the 24-hour clock as two digits, 00 - 23
'I'   Hour for the 12-hour clock as two digits, 01 - 12
'k'   Hour of the day for the 24-hour clock, 0 - 23
'l'   Hour for the 12-hour clock, 1 - 12
'M'   Minute within the hour as two digits, 00 - 59
'S'   Seconds within the minute as two digits, 00 - 60 ("60" is used for leap seconds)
'L'   Millisecond within the second as three digits, 000 - 999
'N'   Nanosecond within the second as nine digits, 000000000 - 999999999
'p'   Locale-specific morning or afternoon marker in lower case, am or pm (the conversion prefix 'T' forces this output to upper case)
'z'   RFC 822 style numeric time zone offset from GMT, e.g. -0800 (adjusted as needed for Daylight Saving Time)
'Z'   Abbreviated time zone
's'   Seconds since the beginning of the epoch starting at 1 January 1970 00:00:00 UTC
'Q'   Milliseconds since the beginning of the epoch starting at 1 January 1970 00:00:00 UTC


For formatting dates:

Conversion suffix   Description
'B'   Locale-specific full month name, e.g. January
'b', 'h'   Locale-specific abbreviated month name, e.g. Jan
'A'   Locale-specific full name of the day of the week, e.g. Sunday
'a'   Locale-specific short name of the day of the week, e.g. Sun
'C'   First two digits of four-digit year, 00 - 99
'Y'   Year as four digits, e.g. 0092
'y'   Last two digits of the year, 00 - 99
'j'   Day of year as three digits, 001 - 366
'm'   Month as two digits, 01 - 13
'd'   Day of month as two digits, 01 - 31
'e'   Day of month, 1 - 31


For formatting date/time compositions:

Conversion suffix   Description
'R'   Time formatted for the 24-hour clock as "%tH:%tM"
'T'   Time formatted for the 24-hour clock as "%tH:%tM:%tS"
'r'   Time formatted for the 12-hour clock as "%tI:%tM:%tS %Tp"; the location of the morning or afternoon marker ('%Tp') may be locale-dependent
'D'   Date formatted as "%tm/%td/%ty"
'F'   ISO 8601 date formatted as "%tY-%tm-%td"
'c'   Date and time formatted as "%ta %tb %td %tT %tZ %tY", e.g. Sun Jul 21 15:17:00 EDT 1973




Fruity Eclipse Collections

by paulk


Posted on Thursday October 13, 2022 at 11:05AM in Technology


This blog post continues, to some degree, from the previous post. Instead of deep learning, we'll look at clustering using k-means after first exploring some top methods of Eclipse Collections with fruit emoji examples.

Eclipse Collections Fruit Salad

First, we'll define a Fruit enum (it adds one additional fruit compared to the related Eclipse Collections kata):

code for fruit enum

We can use this enum in the following examples:

usage.png

The last example calculates red fruit in parallel threads. As coded, it uses virtual threads when run on JDK 19 with preview features enabled. You can follow the suggestion in the comment to run on other JDK versions or with normal threads. In addition to Eclipse Collections, we have the GPars library on our classpath. Here we are only using one method, which manages pool lifecycle for us.

Exploring emoji colors

For some fun, let's look at whether the nominated color of each fruit matches the color of the related emoji. As in the previous blog, we'll use the slightly nicer Noto Color Emoji fonts for our fruit as shown here:

2022-10-12 14_16_42-Noto Color Emoji - Google Fonts.png

We'll use an Eclipse Collections BiMap to switch back and forth between the color names and java.awt colors:

@Field public static COLOR_OF = BiMaps.immutable.ofAll([
    WHITE: WHITE, RED: RED, GREEN: GREEN, BLUE: BLUE,
    ORANGE: ORANGE, YELLOW: YELLOW, MAGENTA: MAGENTA
])
@Field public static NAME_OF = COLOR_OF.inverse()

We are also going to use some helper functions to switch between RGB and HSB color values:

static hsb(int r, int g, int b) {
    float[] hsb = new float[3]
    RGBtoHSB(r, g, b, hsb)
    hsb
}

static rgb(BufferedImage image, int x, int y) {
    int rgb = image.getRGB(x, y)
    int r = (rgb >> 16) & 0xFF
    int g = (rgb >> 8) & 0xFF
    int b = rgb & 0xFF
    [r, g, b]
}

The hue component of the HSB color space represents colors on a spectrum from 0 to 360 degrees:

Color Circle (image credit: https://nycdoe-cs4all.github.io/units/1/lessons/lesson_3.2)

We have two helper methods to assist with colors. The first picks out "mostly black" and "mostly white" colors while the second uses a switch expression to carve out some regions of the color space for our colors of interest:

static range(float[] hsb) {
    if (hsb[1] < 0.1 && hsb[2] > 0.9) return [0, WHITE]
    if (hsb[2] < 0.1) return [0, BLACK]
    int deg = (hsb[0] * 360).round()
    return [deg, range(deg)]
}

static range(int deg) {
    switch (deg) {
        case 0..<16 -> RED
        case 16..<35 -> ORANGE
        case 35..<75 -> YELLOW
        case 75..<160 -> GREEN
        case 160..<250 -> BLUE
        case 250..<330 -> MAGENTA
        default -> RED
    }
}

Note that the JDK doesn't have a standard color of PURPLE, so we combine purple with magenta by choosing an appropriately broad spectrum for MAGENTA.

We used a Plotly 3D interactive scatterplot (as supported by the Tablesaw Java dataframe and visualization library) to visualize our emoji colors (as degrees on the color spectrum) vs the XY coordinates:

2022-10-13 20_04_10-Color vs xy.png

We are going to try out three approaches for determining the predominant color of each emoji:

  1. Most common color: We find the color spectrum value for each point and count up the number of points of each color. The color with the most points will be selected. This is simple and works in many scenarios, but if an apple or cherry has 100 shades of red but only one shade of green for the stalk or a leaf, green may be selected.
  2. Most common range: We group each point into a color range. The range with the most points will be selected.
  3. Centroid of biggest cluster: We divide our emoji image into a grid of sub-images and perform k-means clustering of the RGB values for each point in the sub-image. This groups similarly colored points together in a cluster. The cluster with the most points will be selected, and its centroid will be chosen as the predominant color. This approach has the effect of pixelating our sub-image by color, and is inspired by this python article.

Most Common Color

Ignoring the background white color, the most common color for our PEACH emoji is a shade of orange. The graph below shows the count of each color:

2022-10-17 15_57_40-Color histogram for PEACH.png

Most Common Range

If instead of counting each color, we group colors into their range and count the numbers in each range, we get the following graph for PEACH:

2022-10-17 15_56_58-Range histogram for PEACH.png


K-Means

K-Means is an algorithm for finding cluster centroids. For k=3, we would start by picking 3 random points as our starting centroids.

kmeans_step1.png

We allocate all points to their closest centroid:

kmeans_step2.png

Given this allocation, we re-calculate each centroid from all of its points:

kmeans_step3.png

We repeat this process until either a stable centroid selection is found, or we have reached a certain number of iterations.
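Those steps translate into only a few lines of code. Here is a toy sketch in plain Java (not the distributed Ignite version, nor the Commons Math implementation used below), shown for 2D points:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KMeansSketch {
    // one iteration: assign each point to its nearest centroid, then recompute centroids
    static double[][] step(double[][] points, double[][] centroids) {
        Map<Integer, List<double[]>> clusters = new HashMap<>();
        for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++)
                if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
            clusters.computeIfAbsent(best, k -> new ArrayList<>()).add(p);
        }
        double[][] next = centroids.clone();
        clusters.forEach((c, pts) -> next[c] = mean(pts));
        return next;
    }

    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    static double[] mean(List<double[]> pts) {
        double[] m = new double[pts.get(0).length];
        for (double[] p : pts)
            for (int i = 0; i < m.length; i++) m[i] += p[i] / pts.size();
        return m;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centroids = {{0, 0}, {10, 10}}; // normally chosen randomly
        for (int i = 0; i < 10; i++) centroids = step(points, centroids); // repeat until stable
        System.out.println(Arrays.deepToString(centroids)); // [[0.0, 0.5], [10.0, 10.5]]
    }
}
```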

We used the K-Means algorithm from Apache Commons Math.

Here is the kind of result we would expect if run on the complete set of points for the PEACH emoji. The black dots are the centroids. It has found one green, one orange and one red centroid. The centroid with the most points allocated to it should be the most predominant color. (This is another interactive 3D scatterplot.)


RGB_3D_PEACH.png


We can plot the number of points allocated to each cluster as a bar chart. (We used a Scala plotting library to show Groovy integration with Scala.)

2022-10-17 16_56_28-Centroid sizes for PEACH.png

The code for drawing the above chart looks like this:

var trace = new Bar(intSeq([1, 2, 3]), intSeq(sizes))
    .withMarker(new Marker().withColor(oneOrSeq(colors)))

var traces = asScala([trace]).toSeq()

var layout = new Layout()
    .withTitle("Centroid sizes for $fruit")
    .withShowlegend(false)
    .withHeight(600)
    .withWidth(800)

Plotly.plot(path, traces, layout, defaultConfig, false, false, true)

K-Means with subimages

The approach we will take for our third option enhances k-means. Instead of finding centroids for the whole image, as the graphs just shown do, we divide the image into subimages and perform k-means on each subimage. Our overall predominant color is the most common of the predominant colors found across all of our subimages.

Putting it all together

Here is the final code covering all three approaches (including printing some pretty images highlighting the third approach and the Plotly 3D scatter plots):

var results = Fruit.ALL.collect { fruit ->
    var file = getClass().classLoader.getResource("${fruit.name()}.png").file as File
    var image = ImageIO.read(file)

    var colors = [:].withDefault { 0 }
    var ranges = [:].withDefault { 0 }
    for (x in 0..<image.width) {
        for (y in 0..<image.height) {
            def (int r, int g, int b) = rgb(image, x, y)
            float[] hsb = hsb(r, g, b)
            def (deg, range) = range(hsb)
            if (range != WHITE) { // ignore white background
                ranges[range]++
                colors[deg]++
            }
        }
    }
    var maxRange = ranges.max { e -> e.value }.key
    var maxColor = range(colors.max { e -> e.value }.key)

    int cols = 8, rows = 8
    int grid = 5 // thickness of black "grid" between subimages
    int stepX = image.width / cols
    int stepY = image.height / rows
    var splitImage = new BufferedImage(image.width + (cols - 1) * grid, image.height + (rows - 1) * grid, image.type)
    var g2a = splitImage.createGraphics()
    var pixelated = new BufferedImage(image.width + (cols - 1) * grid, image.height + (rows - 1) * grid, image.type)
    var g2b = pixelated.createGraphics()

    ranges = [:].withDefault { 0 }
    for (i in 0..<rows) {
        for (j in 0..<cols) {
            def clusterer = new KMeansPlusPlusClusterer(5, 100)
            List<DoublePoint> data = []
            for (x in 0..<stepX) {
                for (y in 0..<stepY) {
                    def (int r, int g, int b) = rgb(image, stepX * j + x, stepY * i + y)
                    var dp = new DoublePoint([r, g, b] as int[])
                    var hsb = hsb(r, g, b)
                    def (deg, col) = range(hsb)
                    data << dp
                }
            }
            var centroids = clusterer.cluster(data)
            var biggestCluster = centroids.max { ctrd -> ctrd.points.size() }
            var ctr = biggestCluster.center.point*.intValue()
            var hsb = hsb(*ctr)
            def (_, range) = range(hsb)
            if (range != WHITE) ranges[range]++
            g2a.drawImage(image, (stepX + grid) * j, (stepY + grid) * i, stepX * (j + 1) + grid * j, stepY * (i + 1) + grid * i,
                    stepX * j, stepY * i, stepX * (j + 1), stepY * (i + 1), null)
            g2b.color = new Color(*ctr)
            g2b.fillRect((stepX + grid) * j, (stepY + grid) * i, stepX, stepY)
        }
    }
    g2a.dispose()
    g2b.dispose()

    var swing = new SwingBuilder()
    var maxCentroid = ranges.max { e -> e.value }.key
    swing.edt {
        frame(title: 'Original vs Subimages vs K-Means',
                defaultCloseOperation: DISPOSE_ON_CLOSE, pack: true, show: true) {
            flowLayout()
            label(icon: imageIcon(image))
            label(icon: imageIcon(splitImage))
            label(icon: imageIcon(pixelated))
        }
    }

    [fruit, maxRange, maxColor, maxCentroid]
}

println "Fruit Expected      By max color  By max range  By k-means"
results.each { fruit, maxRange, maxColor, maxCentroid ->
    def colors = [fruit.color, maxColor, maxRange, maxCentroid].collect {
        NAME_OF[it].padRight(14)
    }.join().trim()
    println "${fruit.emoji.padRight(6)} $colors"
}

Here are the resulting images:

2022-10-13 20_37_25-Original.png

2022-10-13 20_37_08-Original.png

2022-10-13 20_36_49-Original.png

2022-10-13 20_36_27-Original.png

2022-10-13 20_36_07-Original.png

2022-10-13 20_35_21-Original.png

And, here are the final results:

final results

In our case, all three approaches yielded the same results. Results for other emojis may vary.

Further information



Deep Learning and Eclipse Collections

by paulk


Posted on Tuesday October 11, 2022 at 10:41AM in Technology


DeepLearning4J and Eclipse Collections revisited

In previous blogs, we have covered Eclipse Collections and Deep Learning. Recently, a couple of the highly recommended katas for Eclipse Collections have been revamped to include "pet" and "fruit" emojis for a little bit of extra fun. What could be better than Learning Eclipse Collections? Deep Learning and Eclipse Collections of course!

First, we create a PetType enum with the emoji toString, and then Pet and Person records. We'll populate a people list as is done in the kata. The full details are in the repo.
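As a rough sketch (the component and enum names here are assumed; see the repo for the actual definitions), those declarations might look something like this:

```groovy
enum PetType {
    CAT('🐱'), DOG('🐶'), HAMSTER('🐹'), TURTLE('🐢'), BIRD('🐦'), SNAKE('🐍')
    final String emoji
    PetType(String emoji) { this.emoji = emoji }
    String toString() { emoji }
}

record Pet(String name, PetType type) { }

record Person(String firstName, String lastName, List<Pet> pets) {
    String getFullName() { "$firstName $lastName" }
}

var person = new Person('Mary', 'Smith', [new Pet('Tabby', PetType.CAT)])
assert person.fullName == 'Mary Smith'
assert person.pets*.type == [PetType.CAT]
```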

Let's use a GQuery expression to explore the pre-populated list:

println GQ {
    from p in people
    select p.fullName, p.pets
}

The result is:

deep-learning-eclipse-collections pre-populated field

Now let's duplicate the assertion from the getCountsByPetType test in exercise3 which checks pet counts:

2022-10-11 20_06_12-Groovy web console.png

As we expect, it passes.

Now, for a bit of fun, we will use a neural network trained to detect cat and dog images and apply it to our emojis. We'll follow the process described here. It uses DeepLearning4J to train and then use a model. The images used to train the model were real cat and dog images, not emojis, so we aren't expecting our model to be super accurate.

The first attempt was to write the emojis into swing JLabel components and then save using a buffered image. This led to poor-looking images:

PetAsFonts.jpg

And, consequently, to poor image inference. Recent JDK versions on some platforms might do better, but we gave up on this approach.

Instead, emoji image files from the Noto Color Emoji font were used and saved under the pet type in the resources folder. These look much nicer:

2022-10-11 18_24_38-Noto Color Emoji - Google Fonts.png

Here is the code which makes use of those saved images to detect the animal types (note the use of type aliasing since we have two PetType classes; we rename one to PT):

import ramo.klevis.ml.vg16.PetType as PT
import ramo.klevis.ml.vg16.VG16ForCat

var vg16ForCat = new VG16ForCat().tap { loadModel() }
var results = []
people.each { p ->
    results << p.pets.collect { pet ->
        var file = new File("resources/${pet.type.name()}.png")
        PT petType = vg16ForCat.detectCat(file, 0.675d)
        var desc = switch (petType) {
            case PT.CAT -> 'is a cat'
            case PT.DOG -> 'is a dog'
            default -> 'is unknown'
        }
        "$pet.name $desc"
    }
}
println results.flatten().join('\n')

Note that the model exceeds the maximum allowable size for normal github repos, so you should create it following the original repo instructions and then store the resulting model.zip in the resources folder.

When we run the script, we get the following output:


[main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
...
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
...
============================================================================================================================================
VertexName (VertexType)                 nIn,nOut       TotalParams    ParamsShape                    Vertex Inputs
============================================================================================================================================
input_1 (InputVertex)                   -,-            -              -                              -
block1_conv1 (Frozen ConvolutionLayer)  3,64           1792           b:{1,64}, W:{64,3,3,3}         [input_1]
block1_conv2 (Frozen ConvolutionLayer)  64,64          36928          b:{1,64}, W:{64,64,3,3}        [block1_conv1]
block1_pool (Frozen SubsamplingLayer)   -,-            0              -                              [block1_conv2]
block2_conv1 (Frozen ConvolutionLayer)  64,128         73856          b:{1,128}, W:{128,64,3,3}      [block1_pool]
block2_conv2 (Frozen ConvolutionLayer)  128,128        147584         b:{1,128}, W:{128,128,3,3}     [block2_conv1]
block2_pool (Frozen SubsamplingLayer)   -,-            0              -                              [block2_conv2]
block3_conv1 (Frozen ConvolutionLayer)  128,256        295168         b:{1,256}, W:{256,128,3,3}     [block2_pool]
block3_conv2 (Frozen ConvolutionLayer)  256,256        590080         b:{1,256}, W:{256,256,3,3}     [block3_conv1]
block3_conv3 (Frozen ConvolutionLayer)  256,256        590080         b:{1,256}, W:{256,256,3,3}     [block3_conv2]
block3_pool (Frozen SubsamplingLayer)   -,-            0              -                              [block3_conv3]
block4_conv1 (Frozen ConvolutionLayer)  256,512        1180160        b:{1,512}, W:{512,256,3,3}     [block3_pool]
block4_conv2 (Frozen ConvolutionLayer)  512,512        2359808        b:{1,512}, W:{512,512,3,3}     [block4_conv1]
block4_conv3 (Frozen ConvolutionLayer)  512,512        2359808        b:{1,512}, W:{512,512,3,3}     [block4_conv2]
block4_pool (Frozen SubsamplingLayer)   -,-            0              -                              [block4_conv3]
block5_conv1 (Frozen ConvolutionLayer)  512,512        2359808        b:{1,512}, W:{512,512,3,3}     [block4_pool]
block5_conv2 (Frozen ConvolutionLayer)  512,512        2359808        b:{1,512}, W:{512,512,3,3}     [block5_conv1]
block5_conv3 (Frozen ConvolutionLayer)  512,512        2359808        b:{1,512}, W:{512,512,3,3}     [block5_conv2]
block5_pool (Frozen SubsamplingLayer)   -,-            0              -                              [block5_conv3]
flatten (PreprocessorVertex)            -,-            -              -                              [block5_pool]
fc1 (Frozen DenseLayer)                 25088,4096     102764544      b:{1,4096}, W:{25088,4096}     [flatten]
fc2 (Frozen DenseLayer)                 4096,4096      16781312       b:{1,4096}, W:{4096,4096}      [fc1]
predictions (OutputLayer)               4096,2         8194           b:{1,2}, W:{4096,2}            [fc2]
--------------------------------------------------------------------------------------------------------------------------------------------
            Total Parameters:  134268738
        Trainable Parameters:  8194
           Frozen Parameters:  134260544
============================================================================================================================================
...
Tabby is a cat
Dolly is a cat
Spot is a dog
Spike is a dog
Serpy is a cat
Tweety is unknown
Speedy is a dog
Fuzzy is unknown
Wuzzy is unknown

As we can see, it correctly predicted the cats (Tabby and Dolly) and dogs (Spot and Spike) but incorrectly thought a snake (Serpy) was a cat and a turtle (Speedy) was a dog. Given the lack of detail in the emoji images compared with the training images, this inaccuracy isn't unexpected. We could certainly use better images or train our model differently if we wanted better results, but it is fun to see our model not doing too badly, even with emojis!


Deck of cards with Groovy, JDK collections and Eclipse Collections

by paulk


Posted on Friday September 23, 2022 at 10:18AM in Technology


Once again, Donald Raab has produced an interesting blog post on Eclipse Collections; this one shows some code for modelling and manipulating cards with Java 17 and Eclipse Collections. His related katas are highly recommended.

Here is the same example in Groovy (4.0.5 was used here) with just a few tweaks:

enum Rank { ACE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, JACK, QUEEN, KING }

enum Suit { SPADES, DIAMONDS, HEARTS, CLUBS }

@Sortable(includes='suit,rank')
record Card(Rank rank, Suit suit) { }

var cards = Sets.cartesianProduct(EnumSet.allOf(Rank), EnumSet.allOf(Suit), Card::new)
var cardsBySuit = cards.groupBy(Card::suit)
var cardsByRank = cards.groupBy(Card::rank)

assert [cards, cardsByRank, cardsBySuit]*.size() == [52, 13, 4]

var random = new Random(42L)
var deck = cards.toList().shuffleThis(random).shuffleThis(random).shuffleThis(random).toStack()
(1..5).collect(i -> deck.pop(5).toSortedList()).each(this::println)

And here is the output:

[Card[rank=FOUR, suit=SPADES], Card[rank=FOUR, suit=DIAMONDS], Card[rank=SIX, suit=HEARTS], Card[rank=NINE, suit=CLUBS], Card[rank=JACK, suit=CLUBS]]
[Card[rank=FIVE, suit=DIAMONDS], Card[rank=TWO, suit=HEARTS], Card[rank=FIVE, suit=HEARTS], Card[rank=TEN, suit=CLUBS], Card[rank=QUEEN, suit=CLUBS]]
[Card[rank=FIVE, suit=SPADES], Card[rank=NINE, suit=SPADES], Card[rank=QUEEN, suit=SPADES], Card[rank=THREE, suit=DIAMONDS], Card[rank=TWO, suit=CLUBS]]
[Card[rank=EIGHT, suit=SPADES], Card[rank=TWO, suit=DIAMONDS], Card[rank=EIGHT, suit=DIAMONDS], Card[rank=KING, suit=HEARTS], Card[rank=FIVE, suit=CLUBS]]
[Card[rank=SIX, suit=SPADES], Card[rank=KING, suit=DIAMONDS], Card[rank=THREE, suit=HEARTS], Card[rank=TEN, suit=HEARTS], Card[rank=QUEEN, suit=HEARTS]]

We can do a similar example with the built-in JDK collections and make some additional tweaks for nicer output:

enum Rank {
    ACE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, JACK, QUEEN, KING
    String toString() { ['A', *'2'..'9', '10', 'J', 'Q', 'K'][ordinal()] }
}

enum Suit {
    SPADES, DIAMONDS, HEARTS, CLUBS
    String toString() { '♠♦♥♣'[ordinal()] }
}

@Sortable(includes='suit,rank')
record Card(Rank rank, Suit suit) {
    Card(List srPair) { this(srPair[1], srPair[0]) }
    String toString() { "${rank}${suit}" }
}

var cards = [Suit.values(), Rank.values()].combinations().collect(Card::new)
var cardsBySuit = cards.groupBy(Card::suit)
var cardsByRank = cards.groupBy(Card::rank)

assert [cards, cardsByRank, cardsBySuit]*.size() == [52, 13, 4]

var random = new Random(42L)
3.times { cards.shuffle(random) }
var deck = cards as Stack
5.times { println "Hand ${it+1}: ${(0..<5).collect{ deck.pop() }.sort()}" }

println "Remaining cards sorted:\n${deck.sort()}"

Which has this output:

Cards3.png

Both these Groovy examples run on JDK versions from 8 to 19
(emulated records are used in JDK versions 8 to 15).

Updated 25/Sep/2022: Use same ordering for the JDK collections version.


Encryption and decryption with Groovy

by paulk


Posted on Monday September 19, 2022 at 02:34PM in Technology


Inspired by this recent blog entry, here is an example showing how to encrypt and decrypt with Groovy.

Using the JDK crypto classes

First, we need some text to encrypt. We'll use an excerpt of the one from the aforementioned blog post:

var text = 'Contrary to popular belief, Lorem Ipsum is not simply random text. ' +
    'It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.'

Next, we'll create a factory for our cipher instance, generate a key, and set up an initialization vector.

First, the cipher factory:

var factory = { Cipher.getInstance('AES/CBC/PKCS5Padding') }

For our cipher algorithm, we are using the Advanced Encryption Standard (AES) algorithm, in Cipher Block Chaining (CBC) mode, with PKCS5 padding. We'll look at other options later.

Next we generate our secret key. Our secret key is our password. Only someone who has the password will be able to decrypt the encrypted message. We could use any random bits for our key, but like passwords, we want to choose a strong key rather than a weak one. Cryptographic libraries provide classes to generate such keys. We just need to provide the key size. AES supports 128, 192 and 256 bit keys. We'll choose 192 here:

var key = generateKey('AES', 192)

Our code uses this helper method:

def generateKey(String algorithm, Integer size) {
    var generator = KeyGenerator.getInstance(algorithm)
    generator.init(size)
    generator.generateKey()
}

Next, we generate an initialization vector:

var ivParameterSpec = randomParameterSpec(factory)

It uses this helper method (we're using the algorithm block size for our initialization vector size):

def randomParameterSpec(Closure<Cipher> factory) {
    var block = new byte[factory().blockSize]
    SecureRandom.instanceStrong.nextBytes(block)
    new IvParameterSpec(block)
}

An initialization vector is used to introduce some additional randomness to avoid repeating patterns in the input leading to repeating patterns in the encrypted bytes.

With all these things in place, we are almost ready to encrypt or decrypt, but first, let's define two more helper methods:

def encrypt(byte[] bytes, Key key, IvParameterSpec spec, Closure<Cipher> factory) {
    var cipher = factory()
    cipher.init(ENCRYPT_MODE, key, spec)
    cipher.doFinal(bytes)
}

def decrypt(byte[] bytes, Key key, IvParameterSpec spec, Closure<Cipher> factory) {
    var cipher = factory()
    cipher.init(DECRYPT_MODE, key, spec)
    cipher.doFinal(bytes)
}

And here is how we encrypt and decrypt:
var encrypted = encrypt(text.bytes, key, ivParameterSpec, factory)
println "Encrypted bytes : $encrypted"
println "Encrypted text : ${new String(encrypted)}"

var decrypted = decrypt(encrypted, key, ivParameterSpec, factory)
println "Decrypted bytes : $decrypted"
println "Decrypted text : ${new String(decrypted)}"

Which has this output:

Encrypted bytes : [-117, 36, 18, 69, -101, -8, 35, 93, -102, -49, -12, ..., -19, -100]
Encrypted text : ‹$E›ø#]šÏôæ”Á˜çp^µ³=L(Ö^_ŒC>CIË„ö,1É8ÆŸ.Š?vßG,Èw‰å¼zÜf>?µ›D¹éÆk€	°˜2êÔ}í©àhl$>?¹¡Kå3ÔO?±&…êî¶Ê–¾°®q®à—0ú‘ÔhO<H¦ç®Ç”ÈhAëjó QPyƒy6Ĥ*´un¼ï¯m¨´ÙjeJtëº\ó6ƪKªœíœ
Decrypted bytes : [67, 111, 110, 116, 114, 97, 114, 121, 32, 116, 111, 32, ..., 100, 46]
Decrypted text : Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.

We can see that everything worked as expected, since the final output matches our original input text.
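It is also worth seeing the initialization vector earn its keep. Encrypting the same text with the same key but two fresh IVs produces completely different ciphertext, yet each decrypts correctly. This sketch reuses the text, key and helper methods defined above:

```groovy
var spec1 = randomParameterSpec(factory)
var spec2 = randomParameterSpec(factory)
var first = encrypt(text.bytes, key, spec1, factory)
var second = encrypt(text.bytes, key, spec2, factory)

// same plaintext and key, different IVs, different ciphertext
assert first.encodeBase64().toString() != second.encodeBase64().toString()
// but both round-trip back to the original text
assert new String(decrypt(first, key, spec1, factory)) == text
assert new String(decrypt(second, key, spec2, factory)) == text
```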

Using the Bouncy Castle library

Alternatively, we can swap algorithms. There are numerous algorithms and modes supported by the JDK and others supported by third-party libraries. A nice summary can be found here.

We'll swap to use the CAST5 (CAST-128) algorithm which supports up to a 128-bit key. We'll use HMAC-SHA1 to generate our key.

import org.bouncycastle.jce.provider.BouncyCastleProvider
var bc = new BouncyCastleProvider()
factory = { Cipher.getInstance('CAST5', bc) }
key = generateKey('HmacSHA1', 128)
ivParameterSpec = randomParameterSpec(factory)

CAST5 is the default algorithm used in some versions of GPG and PGP. It isn't included by default in the JDK, so for this we'll use the Bouncy Castle library.

[Just a note: if you want to encrypt or decrypt GPG/PGP files, don't use the above code. Libraries like Bouncy Castle have dedicated classes for such scenarios.]

We now encrypt and decrypt as before:

encrypted = encrypt(text.bytes, key, ivParameterSpec, factory)
println "Encrypted text : ${new String(encrypted)}"
decrypted = decrypt(encrypted, key, ivParameterSpec, factory)
println "Decrypted text : ${new String(decrypted)}"

Which has this output:

Encrypted text : Mªá?r?v9£÷~4µT'›ÙÝÁl¿Þg¾0ñŽ¡?Ü=³9Q¬»3«ÖÁ¡µ ¾@4÷`FñÙŠfø7¥#›v¤Í–‰¼Ü¢ƒE6ôŽTÙlæÏz>o?àL›¡¢z1nÖo9]šOÔ¼SÔOÍ#Ý7LœÀî}ó5m%q•»l%/AWT´¢zH#t솱l¶£—Œ«©wˆÃ®>®Ü6ër-E
Decrypted text : Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.

Other useful functionality

Passing around binary data, like our secret key or the encrypted data, has its own problems. Groovy provides extension methods to encode such data (and corresponding decode methods). For example, we can encode our secret key in various ways:

var keyBytes = key.encoded
println keyBytes.encodeHex()
println keyBytes.encodeBase64()
println keyBytes.encodeBase64Url()

Which has this output (the key is random, so the output will differ for each run):

85a0d3f0ce0cbe6402dc9579fbffcf1d
haDT8M4MvmQC3JV5+//PHQ==
haDT8M4MvmQC3JV5-__PHQ

Groovy also provides extension methods for various checksums (but you might want to look at stronger checksum algorithms in security sensitive scenarios):

println "SHA256 : ${text.sha256()}"
println "MD5 : ${text.md5()}"

Which has this output:

SHA256 : ccb184e35e4c32bafc730d84ec924ea2980035ea5fadb012e3b2b31abf4323c9
MD5 : 46c61a174c2dc99204521ca89f09f63c

If you are encrypting and decrypting entire files, the JDK has special classes for that too, which are also easy to use from Groovy.
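For example, javax.crypto provides CipherInputStream and CipherOutputStream, which wrap ordinary streams. Here is a self-contained sketch (the file names and sample text are made up; the same streaming closure handles both directions):

```groovy
import javax.crypto.Cipher
import javax.crypto.CipherInputStream
import javax.crypto.KeyGenerator
import javax.crypto.spec.IvParameterSpec
import java.security.SecureRandom
import static javax.crypto.Cipher.DECRYPT_MODE
import static javax.crypto.Cipher.ENCRYPT_MODE

var factory = { Cipher.getInstance('AES/CBC/PKCS5Padding') }
var key = KeyGenerator.getInstance('AES').tap { init(128) }.generateKey()
var iv = new byte[factory().blockSize]
SecureRandom.instanceStrong.nextBytes(iv)
var spec = new IvParameterSpec(iv)

// stream `from` through the cipher into `to` (works for either mode)
var crypt = { File from, File to, int mode ->
    var cipher = factory()
    cipher.init(mode, key, spec)
    new CipherInputStream(from.newInputStream(), cipher).withStream { cin ->
        to.withOutputStream { out -> out << cin }
    }
}

var plain = File.createTempFile('demo', '.txt').tap { text = 'Some secret text' }
var encrypted = File.createTempFile('demo', '.enc')
var decrypted = File.createTempFile('demo', '.dec')
crypt(plain, encrypted, ENCRYPT_MODE)
crypt(encrypted, decrypted, DECRYPT_MODE)
assert decrypted.text == 'Some secret text'
```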

References

Conclusion

We have taken a brief look at encrypting and decrypting with Apache Groovy.


Calculating Fibonacci with Groovy revisited

by paulk


Posted on Thursday September 08, 2022 at 10:59AM in Technology


In a recent post, we looked at using Matrices with Groovy including using matrices to calculate Fibonacci terms. But do you need that complexity to calculate Fibonacci? No, not at all. You can do various one-liners for that scenario (to repeat the calculation from that post):

Stream.iterate([1, 1], f -> [f[1], f.sum()]).limit(8).toList()*.head()

The previous post was more about using matrices than Fibonacci per se, though hopefully learning that the Fibonacci matrix was a specialisation of the Leslie matrix was an added bonus.

Let's have a look at a few other options to write Fibonacci methods in Groovy.

Iterative style

Unless you learned a functional programming language as your first language, you may have written an iterative Factorial or Fibonacci as one of your first programming learning exercises. Such an algorithm for Fibonacci could look something like this:

def fib(n) {
    if (n <= 1) return n

    def previous = n.valueOf(1), next = previous, sum
    (n-2).times {
        sum = previous
        previous = next
        next = sum + previous
    }
    next
}

assert fib(10) == 55
assert fib(50L) == 12586269025L
assert fib(100G) == 354224848179261915075G

The only interesting part to this solution is the use of dynamic idioms. We didn't provide an explicit type for n, so duck-typing means the method works fine for Integer, Long and BigInteger values. This implementation does all calculations using the type of the supplied n, so the user of the method controls that aspect.
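The n.valueOf(1) call works because Groovy lets us invoke the static valueOf factory method via an instance, so the constant takes on whatever type n happens to be:

```groovy
assert 10.valueOf(1) instanceof Integer
assert 10L.valueOf(1) instanceof Long
assert 10G.valueOf(1) instanceof BigInteger
```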

Groovy gives the option to also specify an explicit type like Number or use TypeChecked or CompileStatic for type inference if you wanted. We'll see an example of those options later.

Recursive

Once you mastered iterative programming, your next programming learning exercise may have been the recursive version of Factorial or Fibonacci. For Fibonacci, you may have coded something like this:

def fib(n) {
    if (n <= 1) return n
    fib(n - 1) + fib(n - 2)
}

assert fib(10) == 55
assert fib(50L) == 12586269025L
assert fib(100G) == 354224848179261915075G

This naïve version is incredibly inefficient. Calling fib(6), for instance, ends up calculating fib(2) five times.

There are several ways to avoid that repetition. One option is to use the @Memoized annotation. Adding this annotation on a method causes the compiler to inject appropriate code for caching results into the method. This is ideal for pure functions like Fibonacci since they always return the same output for a given input. There are annotation attributes to adjust how big such a cache might be, but that sophistication isn't needed here.

@Memoized
def fib(n) {
    if (n <= 1) return n
    fib(n - 1) + fib(n - 2)
}

assert fib(10) == 55
assert fib(50L) == 12586269025L
assert fib(100G) == 354224848179261915075G

This now runs just as quickly as the iterative method. If you happened to use a Closure instead of a method, you can call one of the memoize methods on Closure.
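For the Closure case, that might look as follows (memoize also comes in memoizeAtMost and memoizeAtLeast flavours for bounded caches):

```groovy
def fib
fib = { n -> n <= 1 ? n : fib(n - 1) + fib(n - 2) }.memoize()

assert fib(10) == 55
assert fib(50L) == 12586269025L
assert fib(100G) == 354224848179261915075G
```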

A problem with this approach (in fact recursion in general) is that you hit a stack overflow exception for larger values of n, e.g. fib(500G). Groovy supports tail call elimination with the inclusion of the TailRecursive annotation. In this case the compiler injects an "unrolled" non-recursive version of the algorithm. In order for the "unrolling" to succeed, the algorithm needs to be re-worked so that at most one call to fib occurs in any return statement. Here is a version of the algorithm re-worked in this way:

@TailRecursive
static fib(n, a, b) {
    if (n == 0) return a
    if (n == 1) return b
    fib(n - 1, b, a + b)
}

assert fib(10, 0, 1) == 55
assert fib(50L, 0L, 1L) == 12586269025L
assert fib(100G, 0G, 1G) == 354224848179261915075G
assert fib(500G, 0G, 1G) == 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125G

This is slightly more complicated to understand than the original if you haven't seen it before but now is both efficient and handles large values of n. We can compile statically for even faster speed like this:

@TailRecursive
@CompileStatic
static fib(Number n, Number a, Number b) {
    if (n == 0) return a
    if (n == 1) return b
    fib(n - 1, b, a + b)
}

assert fib(10, 0, 1) == 55
assert fib(50L, 0L, 1L) == 12586269025L
assert fib(100G, 0G, 1G) == 354224848179261915075G
assert fib(500G, 0G, 1G) == 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125G

If you are using a Closure, you would look at using the trampoline method on Closure to achieve a similar result.
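For completeness, here is what that might look like: trampoline wraps the recursive call so it is executed iteratively rather than growing the stack.

```groovy
def fib
fib = { n, a, b ->
    n == 0 ? a : fib.trampoline(n - 1, b, a + b)
}.trampoline()

assert fib(10, 0, 1) == 55
assert fib(500G, 0G, 1G) == 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125G
```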

Streams

We saw the Stream based "one-liner" solution at the start of this blog post. Let's adopt the duck-typing idioms we have used so far and define a fib method. It could look like this:

def fib(n) {
    def zero = n.valueOf(0)
    def one = n.valueOf(1)
    Stream.iterate([zero, one], t -> [t[1], t.sum()])
          .skip(n.longValue())
          .findFirst().get()[0]
}

assert fib(10) == 55
assert fib(50L) == 12586269025L
assert fib(100G) == 354224848179261915075G

Bytecode and AST transforms

Finally, just so you know all your options, here is a version using the @Bytecode AST transform which lets you write JVM bytecode directly in your Groovy! Note this falls into the category of "don't ever ever do this" but just so you know you can, it is included here:

@Bytecode
int fib(int i) {
    l0
    iload 1
    iconst_2
    if_icmpgt l1
    iconst_1
    _goto l2
    l1
    frame SAME
    aload 0
    iload 1
    iconst_2
    isub
    invokevirtual '.fib','(I)I'
    aload 0
    iload 1
    iconst_1
    isub
    invokevirtual '.fib', '(I)I'
    iadd
    l2
    frame same1,'I'
    ireturn
}

assert fib(10) == 55

Please read the caveats for that transform before considering using it for anything but extreme situations. It's meant more as a fun thing to try than something anyone would want to do in production.

Have fun writing your own algorithms!



Solving cryptarithmetic puzzles with Groovy and constraint programming using Choco, JaCoP, and OR-Tools

by paulk


Posted on Monday September 05, 2022 at 01:43PM in Technology



Introduction

When writing solutions to problems, we frequently strive to hide away implementation details. In Object-oriented (OO) programming, we might build a rich hierarchy of classes with well-thought-out methods so that our final solution can be expressed in terms of simple nouns and verbs (methods and class instances) in our domain model. When applying functional programming idioms, we will strive to emphasise the relationship between inputs and outputs and hide away side effects and iterative steps. Constraint programming (within the same family as logic programming) also strives to hide away details. Instead of expressing an iterative implementation, it focuses on expressing declarative properties of a solution. A solver is responsible for working out the exact implementation steps.

When using constraint programming, we develop a model consisting of variables, the domain of values each variable may hold, and additional constraints between the variables. A solver does all the work. It may employ heuristic searching, inference, propagation, symmetry and backtracking to find possible solutions. We may be able to (or want to, or need to) guide the solver as to which techniques and strategies it should use. Constraint programming has been used to solve various kinds of problems including scheduling problems, and excels at problems with combinatorial possibilities that are too irregular for other mathematical optimisations. Frequently used illustrative problems are Sudoku and Wordle solvers, N-Queen problems, and other kinds of puzzles. We'll just look at cryptarithmetic puzzles.

Cryptarithmetic Problems

Cryptarithmetic problems (also known as alphametics, verbal arithmetic, cryptarithm, and word addition) are a type of mathematical game where a mathematical equation is presented with the digits in the equation replaced by letters. Traditionally, each letter represents a unique digit, and numbers don't start with the digit zero. If we look at one sample problem:

TO
+GO
=OUT

We can reason about what the solution can be by hand:

  • T, O, U, and G must be all different (game rule)
  • T, G, and O will be between 1 and 9, and U is between 0 and 9 (game rule)
  • If we added the two biggest 2-digit numbers (99 + 99), we'd get 198, so O must be 1
  • Looking at the right-most "units" column, 1 + 1 equals 2, so T must be 2
  • Looking at the "tens" column, we know there must be a carry of 1 out of that column (since the result's leading digit O is 1) and we know T is 2, so G must be 8 or 9. If G were 9, U would be 1, but it can't be the same as O, so G must be 8 and U must be 0.
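This hand reasoning is easy to double-check with a few lines of Groovy brute force:

```groovy
// try all digit assignments for TO + GO = OUT (T, G, O lead numbers, so 1..9)
def solutions = []
for (t in 1..9)
    for (g in 1..9)
        for (o in 1..9)
            for (u in 0..9)
                if ([t, g, o, u].toSet().size() == 4
                        && (10 * t + o) + (10 * g + o) == 100 * o + 10 * u + t)
                    solutions << [t: t, g: g, o: o, u: u]

// exactly one solution, matching the reasoning above
assert solutions == [[t: 2, g: 8, o: 1, u: 0]]
```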

When solving by hand, we typically reason about individual columns and account for the "carry" to the next column. We'll come back to this point later but first, let's look at a slightly bigger problem:

SEND
+MORE
=MONEY

Solving with Brute Force

This problem isn't huge, so we can solve with brute force. We simply try all possible values for the letters in the puzzle:

for (s in 1..9)
    for (e in 0..9)
        for (n in 0..9)
            for (d in 0..9)
                for (m in 1..9)
                    for (o in 0..9)
                        for (r in 0..9)
                            for (y in 0..9)
                                if ([s, e, n, d, m, o, r, y].toSet().size() == 8) {
                                    def send = 1000 * s + 100 * e + 10 * n + d
                                    def more = 1000 * m + 100 * o + 10 * r + e
                                    def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
                                    if (send + more == money) {
                                        println "s = $s, e = $e, n = $n, d = $d"
                                        println "m = $m, o = $o, r = $r, y = $y"
                                    }
                                }

This isn't very efficient though. It generates around 81 million combinations for the variables and then discards all but roughly 1.5 million of them (those with unique digits). All up, it might execute in the low tens of seconds.

Alternatively, Groovy supports calculating permutations, so we can simplify our solution to a single for loop (with some tests to eliminate unhelpful iterations):

def digits = 0..9
for (p in digits.permutations()) {
    if (p[-1] < p[-2]) continue
    def (s, e, n, d, m, o, r, y) = p
    if (s == 0 || m == 0) continue
    def send = 1000 * s + 100 * e + 10 * n + d
    def more = 1000 * m + 100 * o + 10 * r + e
    def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
    if (send + more == money) {
        println "s = $s, e = $e, n = $n, d = $d"
        println "m = $m, o = $o, r = $r, y = $y"
    }
}

This has the advantage of only generating unique combinations. It will execute in seconds.

Running either of these solutions yields:

s = 9, e = 5, n = 6, d = 7
m = 1, o = 0, r = 8, y = 2

Using Constraint Programming

For the brute force approaches, we had a condition which checked whether any potential candidate answer was a correct solution, and we had to be very explicit about how the potential candidates were created. With constraint programming, we instead define variables to represent the problem, any known bounds on those variables, and any other known properties of the solution; in our case, that is similar to the condition we previously used to check whether an answer was correct. Let's examine how to do that with three libraries, one with a variation.

Choco

Here is the code using the Choco library:

new Model("SEND+MORE=MONEY").with {
    def (S, M) = ['S', 'M'].collect { intVar(it, 1, 9) }
    def (E, N, D, O, R, Y) = ['E', 'N', 'D', 'O', 'R', 'Y'].collect { intVar(it, 0, 9) }

    allDifferent(S, E, N, D, M, O, R, Y).post()

    IntVar[] ALL = [
        S, E, N, D,
        M, O, R, E,
        M, O, N, E, Y ]
    int[] COEFFS = [
        1000, 100, 10, 1,
        1000, 100, 10, 1,
        -10000, -1000, -100, -10, -1 ]

    scalar(ALL, COEFFS, "=", 0).post()

    println solver.findSolution()
}

We define our variables and their bounds (domain). We use an allDifferent global constraint to specify the uniqueness requirement and a scalar constraint that ensures that our variables multiplied by their respective scalar coefficients equal 0. This lets us factor in whether the particular variable is representing the "units" column, the "10s" column, the "100s" column etc. This captures the "puzzle addition" constraint. We then ask the solver to find the solution. We could just as easily have asked for all solutions (if more than one existed).
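We can sanity-check the coefficient encoding with plain Groovy, using the solution found earlier: multiplying each letter occurrence by its positional coefficient really does sum to zero.

```groovy
def v = [S: 9, E: 5, N: 6, D: 7, M: 1, O: 0, R: 8, Y: 2]
def letters = ['S', 'E', 'N', 'D',
               'M', 'O', 'R', 'E',
               'M', 'O', 'N', 'E', 'Y']
def coeffs = [1000, 100, 10, 1,
              1000, 100, 10, 1,
              -10000, -1000, -100, -10, -1]

// SEND + MORE - MONEY == 0 for the known solution
assert [letters, coeffs].transpose().sum { l, c -> v[l] * c } == 0
```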

This is typical of how we solve such problems. We either define constraints directly between one or more variables or use whatever global constraints our library might support. If our library doesn't support the constraint we need, we find a way to express it using multiple simpler constraints.

The end result is that our code is more declarative than our brute force approaches, and the solution is found in tens of milliseconds. The solver has very efficient strategies for solving such puzzles.

JaCoP

We can solve the same problem using JaCoP:

def store = new Store()
def (S, M) = ['S', 'M'].collect { new IntVar(store, it, 1, 9) }
def (E, N, D, O, R, Y) = ['E', 'N', 'D', 'O', 'R', 'Y'].collect { new IntVar(store, it, 0, 9) }
var ctr = new Alldifferent(S, E, N, D, M, O, R, Y)
store.impose(ctr)

IntVar[] ALL = [
    S, E, N, D,
    M, O, R, E,
    M, O, N, E, Y]
int[] COEFFS = [
    1000, 100, 10, 1,
    1000, 100, 10, 1,
    -10000, -1000, -100, -10, -1]
var lin = new LinearInt(ALL, COEFFS, "==", 0)
store.impose(lin)

var label = new DepthFirstSearch()
var select = new InputOrderSelect(store, ALL, new IndomainMin())
label.labeling(store, select)

There are some slight differences in this API, but nearly everything has a one-to-one correspondence with what we saw earlier. Here we explicitly select the search and selection strategies, whereas Choco chose defaults for us. In both cases, explicit creation of such classes allows the strategies to be altered for particular scenarios if needed.

When run, the output looks like this:

Labeling has finished with return value of true
DFS1: DFS([S = 9, E = 5, N = 6, D = 7, M = 1, O = 0, R = 8, Y = 2], InputOrder, (org.jacop.search.IndomainMin@45394b31))

We can see here that the code is very similar, as is the execution time.
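To demystify names like DepthFirstSearch, InputOrderSelect and IndomainMin a little, here is a toy hand-rolled version of the same ideas in plain Java: a depth-first search that labels variables in input order, trying the smallest domain value first. It is only a sketch - real solvers interleave constraint propagation between assignments - and ToyLabeling is our own illustrative class, not anything from JaCoP:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ToyLabeling {
    static final char[] LETTERS = "SENDMORY".toCharArray();

    // numeric value of a word under the current (complete) assignment
    static int value(String word, Map<Character, Integer> bound) {
        int v = 0;
        for (char c : word.toCharArray()) v = v * 10 + bound.get(c);
        return v;
    }

    // depth-first search: input-order variable selection, in-domain-min value selection
    public static boolean label(int idx, Map<Character, Integer> bound) {
        if (idx == LETTERS.length) {
            return value("SEND", bound) + value("MORE", bound) == value("MONEY", bound);
        }
        char var = LETTERS[idx];                     // next variable in input order
        int lo = (var == 'S' || var == 'M') ? 1 : 0; // leading digits are non-zero
        for (int v = lo; v <= 9; v++) {              // smallest value first
            if (bound.containsValue(v)) continue;    // crude allDifferent pruning
            bound.put(var, v);
            if (label(idx + 1, bound)) return true;  // stop at the first solution
            bound.remove(var);                       // backtrack
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Character, Integer> bound = new LinkedHashMap<>();
        if (label(0, bound)) System.out.println(bound);
    }
}
```

Unlike this sketch, the real libraries shrink the remaining domains after every assignment, which is what makes them so much faster than brute force.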

OR-Tools

We can repeat the solution using OR-Tools. Here is the code:

Loader.loadNativeLibraries()

new Solver('Send+More=Money').with {
    def s = makeIntVar(1, 9, 's')
    def e = makeIntVar(0, 9, 'e')
    def n = makeIntVar(0, 9, 'n')
    def d = makeIntVar(0, 9, 'd')
    def m = makeIntVar(1, 9, 'm')
    def o = makeIntVar(0, 9, 'o')
    def r = makeIntVar(0, 9, 'r')
    def y = makeIntVar(0, 9, 'y')

    IntVar[] all = [s, e, n, d, m, o, r, y]
    IntVar[] scalar = [s, e, n, d, m, o, r, e, m, o, n, e, y]
    int[] coeffs = [
        1000, 100, 10, 1,            // S E N D +
        1000, 100, 10, 1,            // M O R E =
        -10000, -1000, -100, -10, -1 // M O N E Y
    ]

    addConstraint(makeScalProdEquality(scalar, coeffs, 0))
    addConstraint(makeAllDifferent(all))

    def db = makePhase(all, INT_VAR_DEFAULT, INT_VALUE_DEFAULT)
    newSearch(db)
    while (nextSolution()) {
        println all.join(' ')
    }
    endSearch()

    // Statistics
    println "Solutions: ${solutions()}"
    println "Failures: ${failures()}"
    println "Branches: ${branches()}"
    println "Wall time: ${wallTime()}ms"
}

which has this output:

s(9) e(5) n(6) d(7) m(1) o(0) r(8) y(2)
Solutions: 1
Failures: 5
Branches: 10
Wall time: 60ms

OR-Tools is written in C++ but has interfaces for numerous languages including Java - which is perfect for Groovy use.

Choco with JSR331

It is great to have multiple libraries to pick from, but having a standard API can help when switching between them. This is where JSR331 comes in. It defines a standard API for interacting with constraint and linear solvers. Here we use a JSR331 implementation backed by an earlier version of the Choco library. The code looks like this:

import javax.constraints.*

ProblemFactory.newProblem('SEND+MORE=MONEY').with {
    def (S, M) = ['S', 'M'].collect { variable(it, 1, 9) }
    def (E, N, D, O, R, Y) = ['E', 'N', 'D', 'O', 'R', 'Y'].collect { variable(it, 0, 9) }

    postAllDifferent(S, E, N, D, M, O, R, Y)

    Var[] ALL = [
        S, E, N, D,
        M, O, R, E,
        M, O, N, E, Y]
    int[] COEFFS = [
        1000, 100, 10, 1,
        1000, 100, 10, 1,
        -10000, -1000, -100, -10, -1]

    post(COEFFS, ALL, '=', 0)

    def solver = getSolver()
    def solution = solver.findSolution()
    println solution ?: 'No solution'
    solver.logStats()
}

It is quite similar to the earlier examples but now exclusively uses the JSR331 classes in the javax.constraints package. Those classes are backed by several alternative implementations, so it would indeed be possible to swap between them. When run, the output is:

Solution #1:
	 S[9] M[1] E[5] N[6] D[7] O[0] R[8] Y[2]

Having said that, at the time of writing, JSR331 popularity doesn't appear to be on the rise. Most folks using constraint programming libraries seem to be using the direct library classes. Indeed, the version of the Choco implementation used by the JSR331 implementation is over 10 years old.

Incorporating Carry

The scalar product global constraint we have used in the previous examples is very powerful and probably would be our first choice for this problem. We can, however, model constraint programming problems in multiple ways, so let's look at a solution that avoids that global constraint.

Instead, we will develop a model that mirrors how we reasoned about the original TO + GO = OUT problem that we solved by hand. For that, we just considered a column at a time and accounted for the carry. We'll explicitly introduce variables to hold the carry (0 if no carry, or 1 if there is a carry) into our model. Then we'll express the mathematical constraints that are applicable for each column.
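
Before handing the column model to a solver, it is easy to sanity-check the column arithmetic ourselves. Here is a small solver-free check in plain Java (CarryCheck is our own helper; the digit values are the known solution):

```java
import java.util.Arrays;

public class CarryCheck {
    // returns {c0, c1, c2, c3} if all columns are consistent, else null
    public static int[] carries(int S, int E, int N, int D, int M, int O, int R, int Y) {
        int c0 = (D + E) / 10;                          // units:     D + E = Y + 10*c0
        if ((D + E) % 10 != Y) return null;
        int c1 = (c0 + N + R) / 10;                     // tens:      c0 + N + R = E + 10*c1
        if ((c0 + N + R) % 10 != E) return null;
        int c2 = (c1 + E + O) / 10;                     // hundreds:  c1 + E + O = N + 10*c2
        if ((c1 + E + O) % 10 != N) return null;
        int c3 = (c2 + S + M) / 10;                     // thousands: c2 + S + M = O + 10*c3
        if ((c2 + S + M) % 10 != O) return null;
        return c3 == M ? new int[]{c0, c1, c2, c3} : null; // final carry is M itself
    }

    public static void main(String[] args) {
        // known solution digits: S=9, E=5, N=6, D=7, M=1, O=0, R=8, Y=2
        System.out.println(Arrays.toString(carries(9, 5, 6, 7, 1, 0, 8, 2))); // [1, 1, 0, 1]
    }
}
```

The carries agree with the C0..C3 values the solver reports below.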

Here is the code:

new Model("SEND+MORE=MONEY").with {
    def (S, M) = ['S', 'M'].collect { intVar(it, 1, 9) }
    def (E, N, D, O, R, Y) = ['E', 'N', 'D', 'O', 'R', 'Y'].collect { intVar(it, 0, 9) }
    def C = (0..3).collect { intVar("C$it", 0, 9) }

    allDifferent(S, E, N, D, M, O, R, Y).post()
    C[3].eq(M).post()                                 //   C3 C2 C1 C0
    C[2].add(S).add(M).eq(O.add(C[3].mul(10))).post() //    S  E  N  D
    C[1].add(E).add(O).eq(N.add(C[2].mul(10))).post() //    M  O  R  E
    C[0].add(N).add(R).eq(E.add(C[1].mul(10))).post() // -------------
    D.add(E).eq(Y.add(C[0].mul(10))).post()           // M  O  N  E  Y

    println solver.findSolution()
}

We can see that there is no longer a scalar product global constraint; instead, we have a constraint for each column.

When run, the output looks like this:

Solution: S=9, M=1, E=5, N=6, D=7, O=0, R=8, Y=2, C0=1, C1=1, C2=0, C3=1, sum_exp_1=9,
sum_exp_2=10, (C3*10)=10, sum_exp_3=10, sum_exp_4=6, sum_exp_5=6, (C2*10)=0, sum_exp_6=6,
sum_exp_7=7, sum_exp_8=15, (C1*10)=10, sum_exp_9=15, sum_exp_10=12, (C0*10)=10, sum_exp_11=12,

We can see that as we were defining our constraints for each column, subexpressions were being created in the model, and these are reflected in the solution. They are, if you like, temporary calculations along the way to getting the answer - or more accurately, a snapshot of ever-changing temporary calculations. They don't form part of the answer that interests us, so we would be free to print out only the part of the solution we care about if we wanted.

Creating a DSL

The previous example has lots of calls to add and mul methods. We can create a small DSL to provide some syntactic sugar for our previous examples: Groovy's operator overloading, ranges when specifying the domain of a variable, and a few other niceties. Our code becomes:

model("SEND+MORE=MONEY") {
    def (S, M) = ['S', 'M'].collect { intVar(it, 1..9) }
    def (E, N, D, O, R, Y) = ['E', 'N', 'D', 'O', 'R', 'Y'].collect { intVar(it, 0..9) }
    def C = intVarArray(4, 0..1)

    [allDifferent(S, E, N, D, M, O, R, Y), //   C3 C2 C1 C0
     C[3].eq(M),                           //    S  E  N  D
     (C[2] + S + M).eq(O + C[3] * 10),     //    M  O  R  E
     (C[1] + E + O).eq(N + C[2] * 10),     // -------------
     (C[0] + N + R).eq(E + C[1] * 10),     // M  O  N  E  Y
     (D + E).eq(Y + C[0] * 10)]*.post()

    println solver.findSolution()
}

It has the same output as previously.

You might wonder how the solver finds the solution. You can watch the variables in the debugger and use tools like choco-cpviz, but it is quite a convoluted process until you are used to it. We'll try to give you a flavor of what is going on. Basically, there will be various steps of pruning wherever possible and branching with possible backtracking. Below are some snapshots for our example above.

To start with, we have nearly 90 light green squares which represent our problem search space. We walk our way through the rules looking for ways to prune the search space:

choco_step1.png

 choco_step2.png

 choco_step3.png

choco_step4.png

choco_step5.png

choco_step6.png

choco_step7.png

choco_step8.png

choco_step9.png

choco_step10.png

choco_step11.png

choco_step12.png

choco_step13.png

choco_step14.png

As we lock in the values of variables, we can substitute them into our constraints and simplify. When we reapply the simplified constraints, they will be quicker to evaluate and may reveal more information.

At this point we only have 2 of our variables locked down but our search space is nearly half what we started with and we have simplified some of our constraints. We would continue branching and solving at this point until we find our solution or determine that no solution is possible.
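
To give a concrete feel for the kind of deduction happening in those snapshots, consider the thousands column. The sum of two 4-digit numbers is less than 20000, so M must be 1, and that alone collapses the domains of S and O. The following plain-Java enumeration (our own sketch, not Choco's actual propagation) shows what survives:

```java
import java.util.ArrayList;
import java.util.List;

public class PruneSketch {
    // surviving possibilities for the thousands column once M is fixed at 1:
    // c2 + S + M == O + 10*c3 with c3 == M, honouring allDifferent on S, O, M
    public static List<int[]> options() {
        List<int[]> res = new ArrayList<>();
        int M = 1;
        for (int S = 1; S <= 9; S++)
            for (int O = 0; O <= 9; O++)
                for (int c2 = 0; c2 <= 1; c2++)
                    if (c2 + S + M == O + 10 && O != M && S != M && S != O)
                        res.add(new int[]{S, O, c2});
        return res;
    }

    public static void main(String[] args) {
        for (int[] o : options())
            System.out.println("S=" + o[0] + " O=" + o[1] + " c2=" + o[2]);
        // only S=8 or S=9 with O=0 survive: the domains of S and O have been pruned
    }
}
```

The solver automates exactly this sort of reasoning, constraint by constraint, over the whole model.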

Looking at other languages

The example repo also contains solutions for this problem in other languages, so you can compare and contrast: Clojure, Haskell (Frege), Java, JavaScript (Nashorn), Ruby (JRuby), Python (Jython), Kotlin, Lua (Luaj), Prolog (tuprolog), and Scala.

other language logos

Other examples

To wrap up, let's look at solving a few more examples (using Choco), taken from an interesting blog on the history of cryptarithmetic problems:

  • ABCD * 4 = DCBA
  • AA + BB + CC = ABC
  • HALF + HALF = WHOLE
  • HALF + FIFTH + TENTH + TENTH + TENTH = WHOLE

Here is the code:

new Model("ABCD*4=DCBA").with {
    def (A, D) = ['A', 'D'].collect { intVar(it, 1, 9) }
    def (B, C) = ['B', 'C'].collect { intVar(it, 0, 9) }
    def R = (0..2).collect { intVar(0, 9) }

    allDifferent(A, B, C, D).post()
    R[2].add(A.mul(4)).eq(D).post()
    R[1].add(B.mul(4)).eq(C.add(R[2].mul(10))).post()
    R[0].add(C.mul(4)).eq(B.add(R[1].mul(10))).post()
    D.mul(4).eq(A.add(R[0].mul(10))).post()
    solver.findAllSolutions().each {
        println "$name: ${pretty(it, [A, B, C, D, ' * 4 = ', D, C, B, A])}\n$it\n"
    }
}

new Model("AA+BB+CC=ABC").with {
    def (A, B, C) = ['A', 'B', 'C'].collect { intVar(it, 1, 9) }
    allDifferent(A, B, C).post()
    A.mul(11).add(B.mul(11).add(C.mul(11))).eq(A.mul(100).add(B.mul(10)).add(C)).post()
    solver.findAllSolutions().each {
        println "$name: ${pretty(it, [A, A, ' + ', B, B, ' + ', C, C, ' = ', A, B, C])}\n$it\n"
    }
}

new Model("HALF+HALF=WHOLE").with {
    def (H, W) = ['H', 'W'].collect { intVar(it, 1, 9) }
    def (A, E, F, L, O) = ['A', 'E', 'F', 'L', 'O'].collect { intVar(it, 0, 9) }
    allDifferent(H, W, A, E, F, L, O).post()
    IntVar[] ALL = [
        H, A, L, F,
        W, H, O, L, E]
    int[] COEFFS = [
        2000, 200, 20, 2,
        -10000, -1000, -100, -10, -1]
    scalar(ALL, COEFFS, "=", 0).post()
    solver.findAllSolutions().each {
        println "$name: ${pretty(it, [H, A, L, F, ' + ', H, A, L, F, ' = ', W, H, O, L, E])}\n$it\n"
    }
}

new Model("HALF+FIFTH+TENTH+TENTH+TENTH=WHOLE").with {
    def (H, F, T, W) = ['H', 'F', 'T', 'W'].collect { intVar(it, 1, 9) }
    def (A, L, I, E, N, O) = ['A', 'L', 'I', 'E', 'N', 'O'].collect { intVar(it, 0, 9) }
    allDifferent(H, F, T, W, A, L, I, E, N, O).post()
    IntVar[] ALL = [
        H, A, L, F,
        F, I, F, T, H,
        T, E, N, T, H,
        T, E, N, T, H,
        T, E, N, T, H,
        W, H, O, L, E]
    int[] COEFFS = [
        1000, 100, 10, 1,
        10000, 1000, 100, 10, 1,
        10000, 1000, 100, 10, 1,
        10000, 1000, 100, 10, 1,
        10000, 1000, 100, 10, 1,
        -10000, -1000, -100, -10, -1]
    scalar(ALL, COEFFS, "=", 0).post()
    solver.findAllSolutions().each {
        def parts = [H, A, L, F, '+', F, I, F, T, H, '+', T, E, N, T, H, '+',
                     T, E, N, T, H, '+', T, E, N, T, H, '=', W, H, O, L, E]
        println "$name: ${pretty(it, parts)}\n$it\n"
    }
}

// helper method to print solutions
def pretty(model, parts) {
    parts.collect { p -> p instanceof IntVar ? model.getIntVal(p) : p }.join()
}

which has this output:

ABCD*4=DCBA: 2178 * 4 = 8712
Solution: A=2, D=8, B=1, C=7, IV_1=3, IV_2=3, IV_3=0, (A*4)=8, sum_exp_4=8, (B*4)=4, ..., 

AA+BB+CC=ABC: 11 + 99 + 88 = 198
Solution: A=1, B=9, C=8, (A*11)=11, (B*11)=99, (C*11)=88, ..., 

HALF+HALF=WHOLE: 9604 + 9604 = 19208
Solution: H=9, W=1, A=6, E=8, F=4, L=0, O=2, 

HALF+HALF=WHOLE: 9703 + 9703 = 19406
Solution: H=9, W=1, A=7, E=6, F=3, L=0, O=4, 

HALF+HALF=WHOLE: 9802 + 9802 = 19604
Solution: H=9, W=1, A=8, E=4, F=2, L=0, O=6, 

HALF+FIFTH+TENTH+TENTH+TENTH=WHOLE: 6701+14126+25326+25326+25326=96805
Solution: H=6, F=1, T=2, W=9, A=7, L=0, I=4, E=5, N=3, O=8, 

You should see the common patterns used for solving these puzzles.
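
The recurring pattern - a positional value for each word plus an equality over the words - can be captured in a few lines of plain Java as a checker, useful for verifying any solution a solver reports. WordSum and holds are our own illustrative names, not part of any library:

```java
import java.util.Map;

public class WordSum {
    // positional (base 10) value of a word under a digit assignment
    static int value(String word, Map<Character, Integer> digits) {
        int v = 0;
        for (char c : word.toCharArray()) v = v * 10 + digits.get(c);
        return v;
    }

    // does the sum of the addends equal the total under this assignment?
    public static boolean holds(Map<Character, Integer> digits, String total, String... addends) {
        int sum = 0;
        for (String w : addends) sum += value(w, digits);
        return sum == value(total, digits);
    }

    public static void main(String[] args) {
        // one of the HALF+HALF=WHOLE solutions reported above
        var half = Map.of('H', 9, 'A', 6, 'L', 0, 'F', 4, 'W', 1, 'O', 2, 'E', 8);
        System.out.println(holds(half, "WHOLE", "HALF", "HALF")); // true: 9604 + 9604 = 19208
    }
}
```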

Further Information

Conclusion

We have looked at using Groovy and a few constraint programming libraries to solve cryptarithmetic puzzles. Why not try solving some puzzles of your own?


Groovy List Processing Cheat Sheet

by paulk


Posted on Sunday August 28, 2022 at 08:46AM in Technology


Declaring lists

Several styles are supported for declaring lists:

var pets    = ['cat', 'canary', 'dog', 'fish', 'gerbil']  // idiomatic Groovy
var nums = [1, 2, 3, 4] as LinkedList // use 'as' for other kinds of list
var primes = List.of(2, 3, 5, 7) // Java 9+ style (immutable)
var range = 1..4 // a range is also a list (immutable)
var bigNums = [1000, 2000].asUnmodifiable() // unmodifiable (backed by original)
var negNums = [-100, -200].asImmutable() // immutable (backed by copy)

List elements and properties

Java methods for accessing list elements and list properties:

assert !pets.isEmpty()
assert pets.size() == 5
assert pets.get(0) == 'cat'
assert pets.contains('dog')
assert pets.containsAll('cat', 'dog')
pets.forEach { assert it.size() > 2 }
assert ['a', 'b', 'a'].indexOf('a') == 0
assert ['a', 'b', 'a'].lastIndexOf('a') == 2

Groovy extensions for accessing list elements and list properties:

assert pets[0] == 'cat'
assert pets?[0] == 'cat' // safe indexing returns null if pets was null instead of NPE
assert pets.first() == 'cat'
assert pets.head() == 'cat'
assert pets[-1] == 'gerbil'
assert pets[1..3] == ['canary', 'dog', 'fish']
assert pets[3..1] == ['fish', 'dog', 'canary'] // reverse range
assert pets[1, 3, 3] == ['canary', 'fish', 'fish'] // arbitrary collection
assert pets[0..1, [3, 3]] == ['cat', 'canary', 'fish', 'fish'] // nested collections
assert [1, 2, 3, 1].count(1) == 2
assert [1, 2, 3, 4].min() == 1
assert [1, 2, 3, 4].max() == 4
assert [1, 2, 3, 4].sum() == 10
assert [1, 2, 3, 4].average() == 2.5
[1, 2, 3, 4].eachWithIndex{ val, idx -> assert val == idx + 1 }
def cpets = pets[0..1]
assert cpets == ['cat', 'canary']
assert pets.findAll { it =~ /c.*/ } == cpets
assert pets.find { it =~ /c.*/ } == cpets[0]
assert pets.grep(~/c.*/) == cpets
assert cpets.min { it.size() } == 'cat' // cpet with smallest size name
assert cpets.max { it.size() } == 'canary' // cpet with largest name
assert pets.groupBy { it.size() } == [3: ['cat', 'dog'], 4: ['fish'], 6: ['canary', 'gerbil']]
assert pets.countBy { it.size() } == [3: 2, 4: 1, 6: 2]
assert pets.sum{ it.size() } == 22 // total size of all pet names
assert pets.average{ it.size() } == 4.4 // average size of pet names
assert pets.indices == 0..4
assert pets.findIndexValues { it.size() == 6 } == [1, 4]
assert cpets.indexed() == [0: 'cat', 1: 'canary']
assert cpets.withIndex()*.toList() == [['cat', 0], ['canary', 1]]

Modifying mutable lists

Methods from Java for modifying lists:

pets.remove(2);                        assert pets == ['cat', 'canary', 'fish', 'gerbil']  // remove by index
pets.remove('fish'); assert pets == ['cat', 'canary', 'gerbil'] // remove by element
pets.removeIf(p -> p.startsWith("c")); assert pets == ['gerbil'] // remove by condition
pets.clear(); assert pets.isEmpty() // make empty
pets.add("kangaroo"); assert pets == ['kangaroo'] // add element
pets.add(0, "koala"); assert pets == ['koala', 'kangaroo'] // add element at index position
pets.addAll(['ant', 'bee']); assert pets == ['koala', 'kangaroo', 'ant', 'bee'] // add collection
pets.addAll(1, ['ant']); assert pets == ['koala', 'ant', 'kangaroo', 'ant', 'bee'] // add collection at index
pets.removeAll(['ant', 'flea']); assert pets == ['koala', 'kangaroo', 'bee'] // remove from collection
pets.retainAll(['bee', 'koala']); assert pets == ['koala', 'bee'] // retain from collection
pets.set(0, "zebra"); assert pets == ['zebra', 'bee'] // set element at index
pets.replaceAll(String::toUpperCase); assert pets == ['ZEBRA', 'BEE'] // replace elements by unary operator
pets.sort(Comparator.naturalOrder()); assert pets == ['BEE', 'ZEBRA'] // sort

Groovy extensions for modifying lists:

pets << 'rock';                        assert pets == ['BEE', 'ZEBRA', 'rock']  // leftShift append
pets += ['rabbit', 'rock', 'hare']; assert pets == ['BEE', 'ZEBRA', 'rock', 'rabbit', 'rock', 'hare'] // append collection
pets.unique(); assert pets == ['BEE', 'ZEBRA', 'rock', 'rabbit', 'hare'] // remove duplicates
pets.sort{ it.size() }; assert pets == ['BEE', 'rock', 'hare', 'ZEBRA', 'rabbit'] // sort by size
pets[0] = 'ant'; assert pets == ['ant', 'rock', 'hare', 'ZEBRA', 'rabbit'] // replace element by index
pets[1..3] = ['snake', 'SNAKE']; assert pets == ['ant', 'snake', 'SNAKE', 'rabbit'] // replace range of elements
pets.unique{it.toLowerCase() }; assert pets == ['ant', 'snake', 'rabbit'] // remove duplicates ignoring case
pets.reverse(true); assert pets == ['rabbit', 'snake', 'ant'] // flip
pets.shuffle(); assert pets.size() == 3; // shuffle elements; resulting order will vary
def dice = [1, 2, 3, 4, 5, 6]
dice.removeAt(2); assert dice == [1, 2, 4, 5, 6]
dice.removeElement(2); assert dice == [1, 4, 5, 6]
dice.removeLast(); assert dice == [1, 4, 5]
dice.swap(0, 2); assert dice == [5, 4, 1]

Additional list functionality

Methods and operators which return new lists or values:

assert [1, 2, 3] + [1] == [1, 2, 3, 1]
assert [1, 2, 3, 1] - [1] == [2, 3]
assert [1, 2, 3] * 2 == [1, 2, 3, 1, 2, 3]
assert [1, [2, 3]].flatten() == [1, 2, 3]
assert [1, 2, 3].disjoint([4, 5, 6])
assert [1, 2, 3].intersect([4, 3, 1]) == [1, 3]
assert [1, 2, 3].collect { it + 3 } == [4, 5, 6]
assert [1, 2, 3, 4].collect { 1 } == [1, 1, 1, 1]
assert [4, 2, 1, 3].findAll { it % 2 == 0 } == [4, 2]
assert [1, 2, 3, 4].take(3) == [1, 2, 3]
assert [1, 2, 3, 4].takeRight(3) == [2, 3, 4]
assert [1, 2, 3, 4].takeWhile{ it < 3 } == [1, 2]
assert [1, 2, 3, 4].drop(2) == [3, 4]
assert [1, 2, 3, 4].dropRight(2) == [1, 2]
assert [1, 2, 3, 4].dropWhile{it < 3 } == [3, 4]
assert [1, 2, 3, 4].join('-') == '1-2-3-4'
assert [1, 2, 3, 4].tail() == [2, 3, 4]
assert [1, 2, 3, 4].init() == [1, 2, 3]
assert [1, 2, 3, 4].tails() == [[1, 2, 3, 4], [2, 3, 4], [3, 4], [4], []]
assert [1, 2, 3, 4].inits() == [[1, 2, 3, 4], [1, 2, 3], [1, 2], [1], []]
assert [1, 2, 3, 4].reverse() == [4, 3, 2, 1]
assert [1, 2, 3, 1].toUnique() == [1, 2, 3]
assert [1, 2, 3, 1].toSorted() == [1, 1, 2, 3]
assert [1, 2, 3, 4].collect { it * 2 } == [2, 4, 6, 8]
assert [[1, 2], [3, 4]].collectNested { it * 2 } == [[2, 4], [6, 8]]
def squaresAndCubesOfEvens = { it % 2 ? [] : [it**2, it**3] }
assert [1, 2, 3, 4].collectMany(squaresAndCubesOfEvens) == [4, 8, 16, 64]
assert [1, 2, 3, 4].any { it > 3 }
assert [1, 2, 3, 4].every { it < 5 }
assert ![1, 2, 3, 4].every { it > 3 }
assert [1, 2, 3, 4].find { it > 2 } == 3
assert [1, 2, 3, 4].findAll { it > 2 } == [3, 4]
assert [1, 2, 3, 4].findIndexOf { it > 2 } == 2
assert [1, 2, 3, 1].findLastIndexOf { it > 2 } == 2
assert [1, 2, 3, 4].inject { acc, i -> acc + i } == 10
assert (1..10).collate(3) == [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
assert (1..10).chop(1, 3, 2, -1) == [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10]]
assert [1,2,3].permutations().toList() == [
[1, 2, 3], [3, 2, 1], [2, 1, 3], [3, 1, 2], [1, 3, 2], [2, 3, 1]
]
def matrix = [['a', 'b'], [ 1 , 2 ]]
assert matrix.transpose() == [ ['a', 1], ['b', 2] ]
assert matrix.combinations() == [ ['a', 1], ['b', 1], ['a', 2], ['b', 2] ]
assert [1, 2, 3].subsequences()*.toList() == [[1], [1, 2, 3], [2], [2, 3], [1, 2], [3], [1, 3]]
def answers = [1, 2, 3].withDefault{ 42 }
assert answers[2] == 3 && answers[99] == 42

GINQ processing

Groovy also offers language-integrated query (GINQ) support for processing lists:

// squares of odd numbers between 1 and 5
assert [1, 9, 25] == GQL {
    from n in 1..5 where n % 2 != 0 select n ** 2
}

// group pets by name size
assert ["3:[cat, dog]", "4:[fish]", "6:[canary, gerbil]"] == GQL {
    from p in pets
    groupby p.size() as size
    select size, agg(p) as names
}*.with{ "$it.size:$it.names" }

Stream methods

Useful stream methods (it is worthwhile comparing some of the examples to earlier non-stream variants):

pets = ['cat', 'canary', 'dog', 'fish', 'gerbil']
assert pets.stream().filter(p -> p.size() == 3).map(String::toUpperCase).toList() == ['CAT', 'DOG']
assert pets.stream().map(p -> p.size()).distinct().sorted().toList() == [3, 4, 6] // ordered pet name sizes
assert nums.stream().reduce{ a, b -> a + b }.get() == 10
assert (1..10).stream().skip(3).limit(5).filter(i -> i % 2 == 0).map(i -> i ** 2).toList() == [16, 36, 64]
assert [1, 2, 3, 4].stream().flatMap(i -> i % 2 ? Stream.empty() : Stream.of(i**2, i**3)).toList() == [4, 8, 16, 64]
assert pets.stream().collect(Collectors.groupingBy(p -> p.size())) == [3:['cat', 'dog'], 4:['fish'], 6:['canary', 'gerbil']]
assert [1, 2, 3, 4].stream().map(Integer::toString).collect(Collectors.joining('-')) == '1-2-3-4'
Arrays.stream(0..9 as int[]).summaryStatistics().with {
    assert sum == 45 && min == 0 && max == 9 && average == 4.5 && count == 10
}
assert pets.stream().allMatch(w -> w ==~ /.*[aeiou].*/) // all pet names contain a vowel

GPars

Before looking at GPars, it is worth looking at parallel stream processing:

// calculate squares of odd numbers from input list
assert (1..5).parallelStream().filter{ it % 2 != 0 }.map(n -> n ** 2).toList() == [1, 9, 25]

GPars was designed to provide similar functionality long before streams processing was available. It still has some useful features.

Groovy has several tricks for removing the outer "withPool" clauses, but we'll use the longhand form here. Here are two GPars variations of the above streams example:

GParsPool.withPool {
    assert (1..5).findAllParallel{ it % 2 }.collectParallel{ it ** 2 } == [1, 9, 25]
    assert (1..5).parallel.filter{ it % 2 }.map{ it ** 2 }.collection == [1, 9, 25]
}

Or using (JEP 425) virtual threads:

GParsExecutorsPool.withExistingPool(Executors.newVirtualThreadPerTaskExecutor()) {
    assert (1..5).findAllParallel{ it % 2 }.collectParallel{ it ** 2 } == [1, 9, 25]
}

Other libraries

There are numerous list-related libraries on the JVM. We'll look at just a few.

Eclipse collections

Eclipse Collections comes with many container types including immutable collections, primitive collections, bimaps, multimaps and bags as well as numerous utility methods. It focuses on reduced memory footprint and efficient containers. It might be of particular interest if you need primitive collections, immutable collections or some more exotic collection types like bags or bidirectional maps. Here are just a few examples:

var certainties = Lists.immutable.of('death', 'taxes')
assert certainties.reduce{ a, b -> "$a & $b" }.get() == 'death & taxes'
var numBag = Bags.immutable.with('One', 'One', 'Two', 'Three')
assert numBag.toMapOfItemToCount() == [One:2, Two:1, Three:1]
var biMap = BiMaps.immutable.with(6, "six", 2, "two")
assert biMap.inverse().six == 6

Guava

Guava provides a number of extensions to the JDK collections ecosystem. In particular, it has immutable collections, new collection types like multisets and bidirectional maps and various powerful extensions and utilities. Here are a few examples:

var set = TreeMultiset.create([1, 2, 3])
assert set == TreeMultiset.create([3, 2, 1])
set.addAll([1, 3, 5])
assert set.size() == 6 && set.elementSet().size() == 4
assert set.toList() == [1, 1, 2, 3, 3, 5]
var bimap = HashBiMap.create()
bimap.five = 5
assert bimap.inverse()[5] == 'five'

Apache Commons Collections

The Apache Commons Collections library extends upon the JDK collections framework adding some new types like bidirectional maps and bags as well as providing many comparator and iterator implementations. The library was designed to fill gaps in the JDK offerings and while some of those gaps in the JDK have now been filled by the JDK itself, Commons Collections still contains much useful functionality. Here are a few examples:

var six = [six: 6] as TreeBidiMap
assert six.inverseBidiMap() == [6: 'six']
var bag = new HashBag(['one'] * 6)
bag.remove('one', 2)
assert bag.getCount('one') == 4

Further Information

Conclusion

We have looked at the more common methods for list processing with Groovy and a few other useful libraries.



Matrix calculations with Groovy, Apache Commons Math, ojAlgo, Nd4j and EJML

by paulk


Posted on Thursday August 18, 2022 at 01:41PM in Technology


This blog looks at performing matrix calculations with Groovy using various libraries: Apache Commons Math, ojAlgo, EJML, and Nd4j (part of Eclipse Deeplearning4j). We'll also take a quick look at using the incubating Vector API for matrix calculations (JEPs 338, 414, 417, and 426).

Fibonacci

The Fibonacci sequence has origins in India centuries before Fibonacci, but is named after the Italian author of the publication The Book of Calculation, published in 1202. In that publication, Fibonacci proposed the sequence as a means to calculate the growth of idealized (biologically unrealistic) rabbit populations. He proposed that a newly born breeding pair of rabbits is put in a field; each breeding pair mates at the age of one month and, from the end of their second month, always produces another pair of rabbits; and rabbits never die, but continue breeding forever. Fibonacci posed the puzzle: how many pairs will there be in one year? The sequence goes like this:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233

We can solve this problem using matrices. If we multiply the matrix [[1, 1], [1, 0]] by itself n times, the resulting entries are [[F(n+1), F(n)], [F(n), F(n-1)]], where F(n) is the nth Fibonacci number. This is an operation known as matrix exponentiation. Let's explore this problem using four of the most popular and maintained matrix libraries.
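
To make the idea concrete before reaching for any library, here is the same matrix exponentiation sketched in plain Java (FibMatrix, mul, and matPow are our own helper names):

```java
public class FibMatrix {
    // multiply two 2x2 matrices
    static long[][] mul(long[][] a, long[][] b) {
        return new long[][]{
            {a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]},
            {a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]}
        };
    }

    // naive matrix exponentiation: multiply n-1 times
    public static long[][] matPow(long[][] m, int n) {
        long[][] result = m;
        for (int i = 1; i < n; i++) result = mul(result, m);
        return result;
    }

    public static void main(String[] args) {
        long[][] m = {{1, 1}, {1, 0}};
        long[][] p = matPow(m, 12);
        // entries are F(13), F(12), F(11)
        System.out.println(p[0][0] + " " + p[0][1] + " " + p[1][1]); // 233 144 89
    }
}
```

The libraries below do the same multiplication, only with richer APIs, better numerics and, for large matrices, far better performance.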

Apache Commons Math

Let's explore solving this problem using Apache Commons Math. Apache Commons Math is a library of lightweight self-contained mathematics and statistics components. Matrices are part of the linear algebra portion of this library, and in that context, matrices of double values are relevant. So, we'll represent our Fibonacci numbers as double values.

double[][] data = [[1d, 1d], [1d, 0d]]
def m = MatrixUtils.createRealMatrix(data)
println m * m * m
println m**6

Commons Math has a factory method for creating matrices from double arrays. The names of the methods for multiplication and exponentiation happen to align with Groovy's methods available for operator overloading, namely multiply and power, so we can use Groovy's convenient shorthands.

When we run the script, the output looks like this:

Array2DRowRealMatrix{{3.0,2.0},{2.0,1.0}}
Array2DRowRealMatrix{{13.0,8.0},{8.0,5.0}}

We could go a little further and print the values from the matrix, but the result is clear enough. We see the values in the Fibonacci sequence appearing in the output.

EJML

EJML (Efficient Java Matrix Library) is a linear algebra library for manipulating real, complex, dense, and sparse matrices. It is also a 100% Java solution. It has some novel features including import from Matlab and support for semirings (GraphBLAS) which can be used for graph algorithms where a sparse matrix may be used to represent a graph as an adjacency matrix or incidence matrix.

We can do the same calculation using EJML:

def m = new SimpleMatrix([[1d, 1d], [1d, 0d]] as double[][])
def ans = m.mult(m).mult(m)
println ans
6.times { ans = ans.mult(m) }
println ans

The name of the multiply method differs from the one where we can use automatic operator overloading shorthands, so we just call the method that EJML provides. See a little later under language extensibility on how we could in fact add support to use the same shorthands as we saw for Commons Math.

EJML doesn't have an exponentiation method, but we can call the multiply method the requisite number of times to achieve the same effect. Note that we bumped the number of iterations for the second matrix to reveal the next bunch of elements from the Fibonacci sequence.

Running this script has this output:

Type = DDRM , rows = 2 , cols = 2
 3.0000E+00  2.0000E+00 
 2.0000E+00  1.0000E+00 

Type = DDRM , rows = 2 , cols = 2
 5.5000E+01  3.4000E+01 
 3.4000E+01  2.1000E+01

The first matrix has the same number as previously, and the second reveals the next numbers in the Fibonacci sequence (21, 34, and 55).

Nd4j

Nd4j provides Java users with functionality available in Python. It contains a mix of numpy operations and tensorflow/pytorch operations. Nd4j makes use of native backends to allow it to work on different platforms and provide efficient operation when scaling up.

The code for our Fibonacci solution is very similar to EJML:

def m = Nd4j.create([[1, 1], [1, 0]] as int[][])
def ans = m.mmul(m).mmul(m)
println ans
9.times { ans = ans.mmul(m) }
println ans

One feature that is different to the previous two libraries is that Nd4j supports integer matrices (as well as doubles and numerous other numerical and related types).

Running the script gives the following output:

...
[main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
...
[main] INFO org.nd4j.linalg.cpu.nativecpu.CpuNDArrayFactory - Binary level Generic x86 optimization level AVX/AVX2
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [OPENBLAS]
...
[[         3,         2], 
 [         2,         1]]
[[       233,       144], 
 [       144,        89]]

Again, the first matrix is the same as we have seen previously, the second has been bumped along the Fibonacci sequence by three more elements.

ojAlgo

The next library we'll look at is ojAlgo (oj! Algorithms). It is an open source all-Java offering for mathematics, linear algebra and optimisation, supporting data science, machine learning and scientific computing. It claims to be the fastest pure-Java linear algebra library available, and the project website provides links to numerous benchmarks backing that claim.

Here is the code for our Fibonacci example:

def factory = Primitive64Matrix.FACTORY
def m = factory.rows([[1d, 1d], [1d, 0d]] as double[][])
println m * m * m
println m**15

We can see it supports the same operator overloading features we saw for Commons Math.

When we run the script, it has the following output:

org.ojalgo.matrix.Primitive64Matrix < 2 x 2 >
{ { 3.0,	2.0 },
{ 2.0,	1.0 } }
org.ojalgo.matrix.Primitive64Matrix < 2 x 2 >
{ { 987.0,	610.0 },
{ 610.0,	377.0 } }

As expected, the first matrix is as we've seen before, while the second reveals the next three numbers in the sequence.

Exploring the Vector API and EJML

From JDK 16, various versions (JEPs 338, 414, 417, and 426) of the Vector API have been available as an incubating preview feature. The HotSpot compiler already has some minimal optimisations that can leverage vector hardware instructions, but the Vector API expands the scope of possible optimisations considerably. We could write our own code that makes use of the Vector API and perhaps perform our matrix multiplications ourselves, but one of the libraries has already done just that, so we'll explore that path.

The main contributor to the EJML library has published a repo for the purposes of very early prototyping and benchmarking. We'll use the methods from one of its classes to explore use of the vector API for our Fibonacci example. The MatrixMultiplication class has three methods: mult_ikj_simple is coded in the way any of us might write a multiplication method as a first pass from its definition without any attempts at optimisation, mult_ikj is coded in a highly-optimised fashion and corresponds to the code EJML would normally use, and mult_ikj_vector uses the Vector API. Note, you can think of these methods as "one layer down" from the mult method we called in the previous example, i.e. the mult method we used previously would be calling one of these under the covers. That's why we pass the internal "matrix" representation instead of our SimpleMatrix instance.

For our little calculations, the optimisations offered by the Vector API would not be expected to be huge. However, we'll repeat the first-step matrix calculation we did for all of the libraries, in a loop with 1000 iterations, for each of the three methods (mult_ikj_simple, mult_ikj, and mult_ikj_vector). The code looks like this:

def m = new SimpleMatrix([[1, 1], [1, 0]] as double[][])
double[] expected = [3, 2, 2, 1]
def step1, result
long t0 = System.nanoTime()
1000.times {
    step1 = new SimpleMatrix(2, 2)
    result = new SimpleMatrix(2, 2)
    MatrixMultiplication.mult_ikj_simple(m.matrix, m.matrix, step1.matrix)
    MatrixMultiplication.mult_ikj_simple(step1.matrix, m.matrix, result.matrix)
    assert result.matrix.data == expected
}
long t1 = System.nanoTime()
1000.times {
    step1 = new SimpleMatrix(2, 2)
    result = new SimpleMatrix(2, 2)
    MatrixMultiplication.mult_ikj(m.matrix, m.matrix, step1.matrix)
    MatrixMultiplication.mult_ikj(step1.matrix, m.matrix, result.matrix)
    assert result.matrix.data == expected
}
long t2 = System.nanoTime()
1000.times {
    step1 = new SimpleMatrix(2, 2)
    result = new SimpleMatrix(2, 2)
    MatrixMultiplication.mult_ikj_vector(m.matrix, m.matrix, step1.matrix)
    MatrixMultiplication.mult_ikj_vector(step1.matrix, m.matrix, result.matrix)
    assert result.matrix.data == expected
}
long t3 = System.nanoTime()
printf "Simple:    %6.2f ms\n", (t1 - t0)/1_000_000
printf "Optimized: %6.2f ms\n", (t2 - t1)/1_000_000
printf "Vector:    %6.2f ms\n", (t3 - t2)/1_000_000

This example was run on JDK16 with the following VM options: "--enable-preview --add-modules jdk.incubator.vector".

The output looks like this:

WARNING: Using incubator modules: jdk.incubator.vector
Simple:    116.34 ms
Optimized:  34.91 ms
Vector:     21.94 ms

We can see here that we have some improvement even for our trivial little calculation. Certainly, for bigger problems, the benefit of using the Vector API could be quite substantial.

We should give a big disclaimer here. This little microbenchmark using a loop of 1000 iterations will give highly variable results and was just done to provide a very simple performance comparison. For a more predictable comparison, consider running the JMH benchmark in the aforementioned repo. And you may wish to wait until the Vector API is out of preview before relying on it for any production code, but by all means consider trying it out now.

Leslie Matrices

Earlier, we described the Fibonacci sequence as being for unrealistic rabbit populations, where rabbits never died and continued breeding forever. It turns out that Fibonacci matrices are a special case of a more generalised model which can model realistic rabbit populations (among other things). These are Leslie matrices. For Leslie matrices, populations are divided into classes, and we keep track of birth rates and survival rates over a particular period for each class. We store this information in a matrix in a special form. The populations for each class for the next period can be calculated from those for the current period through multiplication by the Leslie matrix.

This technique can be used for animal populations or human population calculations. A Leslie matrix can help you find out if there will be enough GenX, Millennial, and GenZ taxpayers to support an ageing and soon-to-retire baby boomer population. Sophisticated animal models might track populations for an animal and for its predators or its prey. The survival and birth rates might be adjusted based on such information. Given that only females give birth, Leslie models will often be done only in terms of the female population, with the total population extrapolated from that.

We'll show an example for a kangaroo population, based on this video tutorial for Leslie matrices. It can help us find out if the kangaroo population is under threat (perhaps drought, fires or floods have impacted their feeding habitat) or if favourable conditions are leading to overpopulation.

Following that example, we divide kangaroos into 3 population classes: ages 0 to 3, 3 to 6, and 6 to 9. We are going to look at the population every three years. The 0-3 year olds' birthrate (B1) is 0 since they are pre-fertile. The most fertile 3-6 year olds have a birthrate (B2) of 2.3. The old roos (6-9) have a birthrate (B3) of 0.4. We assume no kangaroos survive past 9 years old. 60% (S1) of the young kangaroos survive to move into the next age group. 30% (S2) of the middle-aged kangaroos survive into old age. Initially, we have 400 kangaroos in each age group.

Here is what the code looks like for this model:

double[] init = [400,   // 0..<3
                 400,   // 3..<6
                 400]   // 6..<9
def p0 = MatrixUtils.createRealVector(init)
println "Initial populations: $p0"

double[][] data = [
    [0,   2.3, 0.4],   // B1 B2 B3
    [0.6, 0,   0  ],   // S1 0  0
    [0,   0.3, 0  ]    // 0  S2 0
]
def L = MatrixUtils.createRealMatrix(data)
def p1 = L.operate(p0)
println "Population after 1 round: $p1"

def p2 = L.operate(p1)
println "Population after 2 rounds: $p2"

def L10 = L ** 10
println "Population after 10 rounds: ${L10.operate(p0).toArray()*.round()}"

This code produces the following output:

Initial populations: {400; 400; 400}
Population after 1 round: {1,080; 240; 120}
Population after 2 rounds: {600; 648; 72}
Population after 10 rounds: [3019, 2558, 365]

After the first round, we see many young roos but a worrying drop-off in the older age groups. After the second round, only the oldest age group looks worryingly small. However, with the healthy numbers in the young generation, we can see that after 10 generations the overall population is indeed not at risk. In fact, overpopulation might become a problem.
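To make the underlying arithmetic concrete, here is a plain-Groovy sketch (a hypothetical leslieStep helper, no matrix library) that reproduces the first round by hand:

```groovy
// One Leslie step: each new class population is the dot product of
// the corresponding matrix row with the current population vector
def leslieStep(List<List<Number>> L, List<Number> pop) {
    L.collect { row -> [row, pop].transpose().sum { rate, p -> rate * p } }
}

def L = [[0, 2.3, 0.4],    // birth rates B1, B2, B3
         [0.6, 0, 0],      // survival rate S1
         [0, 0.3, 0]]      // survival rate S2
assert leslieStep(L, [400, 400, 400]) == [1080, 240, 120]
```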

Encryption with matrices

An early technique to encrypt a message was the Caesar cipher. It substitutes letters in the alphabet by the letter shifted a certain amount along in the alphabet, e.g. "IBM" becomes "HAL" if shifting to the previous letter and "VMS" becomes "WNT" if shifting one letter forward in the alphabet. This kind of cipher can be broken by looking at frequency analysis of letters or pattern words.
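The Caesar cipher is simple enough to sketch in a few lines of Groovy (our own illustrative caesar helper, restricted to upper-case letters), which also hints at how little is hiding the message:

```groovy
// Shift each letter by a fixed amount, wrapping around the 26-letter alphabet
def caesar(String text, int shift) {
    text.collect { letter ->
        int offset = (letter as char) - ('A' as char)
        (char) (('A' as char) + (offset + shift + 26) % 26)
    }.join()
}

assert caesar('IBM', -1) == 'HAL'   // shift to the previous letter
assert caesar('VMS', 1) == 'WNT'    // shift one letter forward
```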

The Hill cipher improves upon the Caesar cipher by factoring multiple letters into each letter of the encrypted text. Using matrices made it practical to look at three or more symbols at once. In general, an N-by-N matrix (the key) is multiplied by an encoding of the message in matrix form. The result is a matrix representation (encoded form) of the encrypted text. We use the inverse matrix of our key to decrypt our message.

We need a way to convert our text message to and from a matrix. A common scheme is to encode A as 1, B as 2, and so on. We'll just use the ASCII value of each character. We define encode and decode methods to do this:

double[][] encode(String s) { s.bytes*.intValue().collate(3) as double[][] }
String decode(double[] data) { data*.round() as char[] }

We'll define a 2-by-2 matrix as our key and use it to encrypt. We'll find the inverse of our key and use that to decrypt. If we wanted to, we could use a 3-by-3 key for improved security at the cost of more processing time.

Our code looks like this:

def message = 'GROOVY'
def m = new SimpleMatrix(encode(message))
println "Original: $message"

def enKey = new SimpleMatrix([[1, 3], [-1, 2]] as double[][])
def encrypted = enKey.mult(m)
println "Encrypted: ${decode(encrypted.matrix.data)}"

def deKey = enKey.invert()
def decrypted = deKey.mult(encrypted)
println "Decrypted: ${decode(decrypted.matrix.data)}"

When run, it has the following output:

Original: GROOVY
Encrypted: ĴŔŚWZc
Decrypted: GROOVY

This offers far more security than the Caesar cipher. However, given today's computing power, Hill ciphers can still eventually be broken with sufficient brute force. For this reason, Hill ciphers are seldom used on their own for encryption, but they are often used in combination with other techniques to add diffusion, strengthening the security offered by those other techniques.

Shape manipulation

Our final example looks at geometrically transforming shapes. To do this, we represent the points of the shape as vectors and multiply them using transforms represented as matrices. We need only worry about the corners, since we'll use Swing to draw our shape and it has a method for drawing a polygon by giving its corners.

First we'll use Groovy's SwingBuilder to set up our frame:

new SwingBuilder().edt {
    def frame = frame(title: 'Shapes', size: [420, 440], show: true, defaultCloseOperation: DISPOSE_ON_CLOSE) {
        widget(new CustomPaintComponent())
    }
    frame.contentPane.background = Color.WHITE
}

We aren't really using much of SwingBuilder's functionality here, but if we wanted to add more functionality, SwingBuilder would make that task easier.

We will actually draw our shapes within a custom component. We'll define a few color constants, and a drawPolygon method which, given a list of points, will draw those points as a polygon. We'll also define a vectors method to convert a list of points (the corners) into vectors, and a transform method which is a factory method for creating a transform matrix.

Here is the code:

class CustomPaintComponent extends Component {
    static final Color violet = new Color(0x67, 0x27, 0x7A, 127)
    static final Color seaGreen = new Color(0x69, 0xCC, 0x67, 127)
    static final Color crystalBlue = new Color(0x06, 0x4B, 0x93, 127)

    static drawPolygon(Graphics g, List pts, boolean fill) {
        def poly = new Polygon().tap {
            pts.each {
                addPoint(*it.toRawCopy1D()*.round()*.intValue().collect { it + 200 })
            }
        }
        fill ? g.fillPolygon(poly) : g.drawPolygon(poly)
    }

    static List<Primitive64Matrix> vectors(List<Integer>... pts) {
        pts.collect { factory.column(*it) }
    }

    static transform(List<Number>... lists) {
        factory.rows(lists as double[][])
    }

    void paint(Graphics g) {
        g.drawLine(0, 200, 400, 200)
        g.drawLine(200, 0, 200, 400)
        g.stroke = new BasicStroke(2)

        def triangle = vectors([-85, -150], [-145, -30], [-25, -30])
        g.color = seaGreen
        drawPolygon(g, triangle, true)
        // transform triangle ...

        def rectangle = vectors([0, -110], [0, -45], [95, -45], [95, -110])
        g.color = crystalBlue
        drawPolygon(g, rectangle, true)
        // transform rectangle ...

        def trapezoid = vectors([50, 50], [70, 100], [100, 100], [120, 50])
        g.color = violet
        drawPolygon(g, trapezoid, true)
        // transform trapezoid ...
    }
}

When we run this code we see our three shapes:

2022-08-18 00_02_11-Shapes.png

We can now add our transforms. We'll have one which rotates by 90 degrees anti-clockwise, another which enlarges a shape by 25%, and one that shrinks a shape by 25%. We can combine transforms simply by multiplying them together. We'll make two transformations of our triangle. We'll rotate in both cases, but we'll shrink one and enlarge the other. We apply a transform simply by multiplying each point by the transform matrix. Then we'll draw both transformed shapes as an outline rather than filled (to make it easier to distinguish the original and transformed versions). Here is the code:

def rot90 = transform([0, 1], [-1, 0])
def bigger = transform([1.25, 0], [0, 1.25])
def smaller = transform([0.75, 0], [0, 0.75])
def triangle_ = triangle.collect{ rot90 * bigger * it }
drawPolygon(g, triangle_, false)
triangle_ = triangle.collect{ rot90 * smaller * it }
drawPolygon(g, triangle_, false)

For our rectangle, we'll have one simple transform which flips the shape in the vertical axis. A second transform combines multiple changes in one transform. We could have split this into smaller transforms and then multiplied them together, but here they are all in one. We flip in the horizontal axis and then apply a shear. We then draw the transformed shapes as outlines:

def flipV = transform([1, 0], [0, -1])
def rectangle_ = rectangle.collect{ flipV * it }
drawPolygon(g, rectangle_, false)
def flipH_shear = transform([-1, 0.5], [0, 1])
rectangle_ = rectangle.collect{ flipH_shear * it }
drawPolygon(g, rectangle_, false)

For our trapezoid, we create a transform which rotates 45 degrees anti-clockwise (recall sin 45° = cos 45° = 0.707). Then we apply its first six powers, rotating by 45, 90, 135 degrees and so on. We draw each transformed shape as an outline:

def rot45 = transform([0.707, 0.707], [-0.707, 0.707])
def trapezoid_
(1..6).each { z ->
    trapezoid_ = trapezoid.collect { rot45 ** z * it }
    drawPolygon(g, trapezoid_, false)
}

When we run the entire example, here is what the output looks like:

2022-08-18 00_00_48-Shapes.png

We can see here that matrix transforms give us powerful ways to manipulate images. We have used 2D images here, but the same techniques would apply to 3D shapes.

Language and tool extensibility

We saw earlier that some of the examples could make use of Groovy operator shorthand syntax, while others couldn't. Here is a summary of some common methods in the libraries:

Groovy operator    +       -          *          **
Groovy method      plus    minus      multiply   power
Commons math       add     subtract   multiply   power
EJML               plus    minus      mult       -
Nd4j               add     sub        mmul       -
ojAlgo             add     subtract   multiply   power

Where the library used the same name as Groovy's method, the shorthand could be used.

Groovy offers numerous extensibility features. We won't go into all of the details but instead refer readers to the History of Groovy paper which gives more details.

In summary, that paper defined the following extensions for Commons Math matrices:

RealMatrix.metaClass {
    plus << { RealMatrix ma -> delegate.add(ma) }
    plus << { double d -> delegate.scalarAdd(d) }
    multiply { double d -> delegate.scalarMultiply(d) }
    bitwiseNegate { -> new LUDecomposition(delegate).solver.inverse }
    constructor = { List l -> MatrixUtils.createRealMatrix(l as double[][]) }
}

This fixes some of the method name discrepancies in the above table. We can now use the operator shorthand for matrix and scalar addition as well as scalar multiply. We can also use the ~ (bitwiseNegate) operator when finding the inverse. The mention of double[][] during matrix construction is now also gone.

The paper goes on to discuss how to automatically add the necessary imports for a matrix library and provide aliases if needed. The imports aren't shown for the code listings in this blog but are in the listings in the sample code repo. In any case, the imports can be "configured away" as the paper discusses. The end result is our code in its entirety can look like this:

var m = [[1, 1], [1, 0]] as Matrix
m**6

The paper also discusses tooling extensibility, in particular the visualisation aspects of the GroovyConsole. It shows how to define an output transform which renders any matrix result using a jlatexmath library. So instead of seeing "Array2DRowRealMatrix{{13.0,8.0},{8.0,5.0}}", they will see a graphical rendition of the matrix. So, the final end-user experience when using the GroovyConsole looks like this:

2022-08-18 00_28_04-bin.png

When used in Jupyter-style environments, other pretty output styles may be supported.

Further information

Conclusion

We have examined a number of simple applications of matrices using the Groovy programming language and the Apache Commons Math, ojAlgo, Nd4j, and EJML libraries. You should be convinced that using matrices on the JVM isn't hard and you have numerous options. We also saw a brief glimpse at the Vector API which looks like an exciting addition to the JVM (hopefully arriving soon in non-preview mode).



Life on Mars: Units of Measurement systems, Groovy, and domain specific languages (DSLs)

by paulk


Posted on Saturday August 13, 2022 at 06:31AM in Technology


The Mars Climate Orbiter was launched in 1998 as part of a multi-faceted Mars exploration program. It was lost due to a trajectory calculation error when nearing Mars. An investigation attributed the failure to a measurement mismatch between two software systems: metric units were used by NASA and US customary units were used by spacecraft builder Lockheed Martin.

Mars Climate Orbiter - image credit: wikipedia

If the system in question had been developed using a units of measurement system, perhaps the failure could have been avoided.

Units of measurement systems

All programming languages have types for representing numbers. As an example, we may have three integers, one representing a height, one a weight and one a temperature. We can write code to add those three integers together, but the result may have no useful meaning. We could start writing our own class hierarchies for each integer type but it can quickly become cumbersome to use the resulting classes. Units of measurement systems attempt to provide a range of commonplace units, ways to construct quantities from those units and ways to manipulate them. Manipulation might involve performing numerical calculation or conversions. Goals of such systems include the ability to provide runtime and/or compile-time type safety. So, we should fail early when trying to perform the earlier mentioned addition of three unrelated quantities.
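To make this concrete, here is a small plain-Groovy sketch (the Metres class is purely illustrative, not part of any units API) of both the problem and the home-grown alternative:

```groovy
// Nothing stops a meaningless sum of raw numbers:
int heightCm = 180, weightKg = 80, tempC = 21
assert heightCm + weightKg + tempC == 281   // compiles and runs, but the result means nothing

// A hand-rolled wrapper restores safety, but we'd need a class like
// this (plus conversions) for every unit we care about:
class Metres {
    BigDecimal value
    Metres plus(Metres other) { new Metres(value: value + other.value) }
}
assert (new Metres(value: 1.5) + new Metres(value: 0.5)).value == 2.0
// new Metres(value: 1.5) + weightKg   // would now fail early rather than silently succeed
```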

Units of measurement systems aren't new. There are existing systems in F# and Swift, and Java has seen several attempts at standardising such a system, including:

  • JSR 275: Units Specification (rejected earlier attempt)
  • JSR 363: Units of Measurement API
  • JSR 385: Units of Measurement API 2.0

There are also existing Java libraries like JScience which have been developed, and shown early promise, but now don't seem to be actively maintained. While the jury is still out about whether units of measurement systems will spread further into mainstream programming, now seems like a great time to check the status. The JSR 385 maintenance release was approved last year and the latest version of the reference implementation was released earlier this year.

JSR 385: Units of Measurement API 2.0

The first thing we need to do is bring in our dependencies (shown for Gradle - other options). The main one is the reference implementation (it brings in the javax.measure API transitively):

implementation 'tech.units:indriya:2.1.3'

JSR 385 is extensible. We'll also bring in some units from the Unicode CLDR units, e.g. MILE:

implementation 'systems.uom:systems-unicode:2.1'
implementation 'systems.uom:systems-quantity:2.1'

Let's follow on with the theme of visiting Mars. We can create variables for the mass and diameter of Mars as follows:

var mass = Quantities.getQuantity(6.39E23, KILO(GRAM))
var diameter = Quantities.getQuantity(6_779, KILOMETER)
println mass
println diameter

JSR 385 has metric prefix qualifiers like MICRO, MILLI, CENTI, KILO, MEGA, GIGA and many more. The CLDR units also define some commonly used units like KILOMETER. We could choose either here.

When we run this script, it has the following output:

639000000000000000000000 kg
6779 km

If we try to compare or add those two values, we'll see an error. Groovy has both static and dynamic natures. Using dynamic code like this:

println mass > diameter

We'll see a runtime error like this:

javax.measure.IncommensurableException: km is not compatible with kg

Or, with TypeChecked or CompileStatic in play for a statement like this:

println mass.add(diameter)

We'd see a compile-time error like this:

[Static type checking] - Cannot call tech.units.indriya.ComparableQuantity#add(javax.measure.Quantity<javax.measure.quantity.Mass>)
with arguments [tech.units.indriya.ComparableQuantity<javax.measure.quantity.Length>]

If for some strange reason we did want to compare or perform calculations between incommensurable types, we can explicitly get the value:

assert mass.value > diameter.value

This escape hatch removes a layer of type safety but requires explicit work to do so. You would rarely, if ever, want to do this.

JSR 385 also supports ranges and conversions. We can look at the minimum and maximum temperatures on Mars:

var minTemp = Quantities.getQuantity(-128, CELSIUS)
var maxTemp = Quantities.getQuantity(70, FAHRENHEIT)
println minTemp
println minTemp.to(FAHRENHEIT)
println maxTemp
println maxTemp.to(CELSIUS)
println QuantityRange.of(minTemp, maxTemp)

It's quite a bit colder than Earth! When run, this script has the following output:

-128 ℃
-198.400 ((K*5)/9)+459.67
70 ((K*5)/9)+459.67
21.1111111111111111111111111111111 ℃
min= -128 ℃, max= 70 ((K*5)/9)+459.67

In case you're wondering about the strange-looking unit display for Fahrenheit temperatures: the unit definition we are using defines degrees Fahrenheit via a formula based on the temperature in kelvins.
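We can sanity-check those conversions ourselves. Assuming the standard relationships °F = K × 9/5 − 459.67 and K = °C + 273.15, a quick plain-Groovy sketch (hypothetical helper closures) reproduces the converted values shown above:

```groovy
// Temperature conversions via kelvin, as suggested by the unit display
def celsiusToKelvin = { double c -> c + 273.15 }
def kelvinToFahrenheit = { double k -> k * 9 / 5 - 459.67 }
def fahrenheitToCelsius = { double f -> (f + 459.67) * 5 / 9 - 273.15 }

assert Math.abs(kelvinToFahrenheit(celsiusToKelvin(-128)) - (-198.4)) < 1e-9
assert Math.abs(fahrenheitToCelsius(70) - 21.1111111111) < 1e-9
```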

We'd see the same thing if using the MILE unit:

println diameter.to(MILE)

Which shows us that the diameter of Mars is a little over 4200 miles:

4212.275312176886980036586335798934 (m*1609344)/1000

Adding some metaprogramming

Groovy has various features which allow methods to be (apparently) added to classes. We'll use extension methods. This technique involves writing static methods in a helper class following certain conventions. The first parameter in all such methods is the target of the extension. Groovy code referencing instances of the target class can then call such methods as if they existed on the target class. In reality, the Groovy compiler or runtime funnels the call through the helper class. For us, it means we will have methods like getMeters() on the Number class which, using Groovy's shorthand for property notation, allows for very compact quantity definitions like 5.meters. We'll also add some methods to allow Groovy's normal operator overloading syntax to apply:

class UomExtensions {
    static Quantity<Length> getCentimeters(Number num) { Quantities.getQuantity(num, CENTI(METRE)) }

    static Quantity<Length> getMeters(Number num) { Quantities.getQuantity(num, METRE) }

    static Quantity<Length> getKilometers(Number num) { Quantities.getQuantity(num, KILO(METRE)) }

    static Quantity<Length> getCm(Number num) { getCentimeters(num) }

    static Quantity<Length> getM(Number num) { getMeters(num) }

    static Quantity<Length> getKm(Number num) { getKilometers(num) }

    static Quantity<Mass> getKilograms(Number num) { Quantities.getQuantity(num, KILO(GRAM)) }

    static Quantity<Mass> getKgs(Number num) { getKilograms(num) }

    static Quantity<Time> getHours(Number num) { Quantities.getQuantity(num, HOUR) }

    static Quantity<Time> getSeconds(Number num) { Quantities.getQuantity(num, SECOND) }

    static Quantity<Time> getHr(Number num) { getHours(num) }

    static Quantity<Time> getS(Number num) { getSeconds(num) }

    static Quantity<Speed> div(Quantity<Length> q, Quantity<Time> divisor) { q.divide(divisor) as Quantity<Speed> }

    static <Q> Quantity<Q> div(Quantity<Q> q, Number divisor) { q.divide(divisor) }

    static <Q> Quantity<Q> plus(Quantity<Q> q, Quantity<Q> other) { q.add(other) }

    static <Q> Quantity<Q> minus(Quantity<Q> q, Quantity<Q> other) { q.subtract(other) }
}

Note that we have longer and shorter versions of many of the methods, e.g. kgs and kilograms, m and meters. We didn't need a method for multiply since it is already using the name Groovy expects.
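For Groovy to discover such extension methods automatically, they are typically packaged in a jar with an extension module descriptor. Here is a minimal sketch, assuming our helper class is named UomExtensions in the default package (adjust the names for your own project):

```properties
# src/main/resources/META-INF/groovy/org.codehaus.groovy.runtime.ExtensionModule
moduleName = uom-extensions
moduleVersion = 1.0.0
extensionClasses = UomExtensions
```

With the descriptor on the classpath, the methods above become available without any registration code.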

Now we can write very short definitions to declare or compare times and lengths:

def s = 1.s
assert 1000.meters == 1.km && 1.m == 100.cm

We can also declare variables for acceleration due to gravity on Earth and Mars. Gravity is a lot less on Mars:

var gMars = 3.7.m/s/s
var gEarth = 9.8.m/s/s
assert gEarth.toString() == '9.8 m/s²'
assert gEarth > gMars

We can also use the operator overloading in calculations (here showing that Earth's diameter is between 1.8 and 2 times that of Mars):

var diameterEarth = 12_742.kilometers   // diameter (for Mars) was declared earlier
assert diameter + diameter > diameterEarth
assert diameterEarth - diameter < diameter
assert diameter * 1.8 < diameterEarth

Even though we have more compact expressions, the same data types are in play that we saw previously. They're just a little nicer to type.

A dynamic DSL for controlling a Mars Rover

Let's now look at how you could write a little Domain-Specific-Language (DSL) to control a Mars rover robot.

Mars Rover Selfie - image credit: wikipedia

First, we'll write a Direction enum as part of our robot domain model:

enum Direction {
    left, right, forward, backward
}

There are many ways to write DSLs in Groovy. We'll use a little trick where the verbs are represented as keys in a map. Our DSL then looks like this:

def move(Direction dir) {
    [by: { Quantity<Length> dist ->
        [at: { Quantity<Speed> speed ->
            println "robot moved $dir by $dist at $speed"
        }]
    }]
}

Here the implementation is just going to print out a message indicating all of the values it is processing. The real robot would send signals to the rover's robotic subsystems.

Our script for controlling the rover now looks like this:

move right by 2.m at 5.cm/s

Which when run gives this output:

robot moved right by 2 m at 5 cm/s

As we saw earlier, this is backed by our JSR 385 types. We'll certainly get fail-early runtime errors if there are any calculations involving mismatched types.

If we enable static typing, some additional errors will be detected at compile time, but because of the very dynamic style of our DSL implementation, not all runtime errors are reflected in the typing information. If we want, we can change our DSL implementation to use richer types, and that will support better static type checking. We'll look at one way to do that next.

A type-rich DSL for the Rover

Now, instead of using our nested map style we saw previously, we create several richly-typed helper classes and define our move method in terms of those classes:

class MoveHolder {
    Direction dir

    ByHolder by(Quantity<Length> dist) {
        new ByHolder(dist: dist, dir: dir)
    }
}

class ByHolder {
    Quantity<Length> dist
    Direction dir

    void at(Quantity<Speed> speed) {
        println "robot moved $dir by $dist at $speed"
    }
}

static MoveHolder move(Direction dir) {
    new MoveHolder(dir: dir)
}

While our DSL implementation has changed, the robot scripts remain the same:

move right by 2.m at 5.cm/s

Indeed, if we use Groovy's dynamic nature, we can still run the same script and will notice no change.

If however, we enable static checking and have a script with an error like this:

move forward by 2.kgs

We'll now see a compile-time error:

[Static type checking] - Cannot call MoveHolder#by(javax.measure.Quantity<javax.measure.quantity.Length>) with arguments [javax.measure.Quantity<javax.measure.quantity.Mass>]

It is great to get this additional, earlier feedback on script errors, so you may wonder why we wouldn't write our DSL implementations like this all of the time. Actually, both the dynamic and static flavours of our DSL can be useful at different times. When prototyping our script DSL, and deciding on all the nouns and verbs we should use to control our robot, the dynamic style can be much quicker to write, especially during early iterations which might evolve and change rapidly. Once the DSL language has been locked down, we can invest in adding the richer types. In the rover scenario, it might also be the case that the rover itself has limited power, so we may not want it to perform additional type checking steps. We might instead run all scripts through a type checker back at mission control before sending them to the rover, where they would be enacted in dynamic mode.

Adding custom type checking

There is one additional language feature of Groovy we haven't mentioned. Groovy's type checking mechanism is extensible, so we'll have a look at using that feature here. The rover's speed is rather limited, "In the case of exploring Mars, however, speed isn't the most relevant quality. It's about the journey and the destinations along the way. The slow pace is energy-efficient...". Let's look at limiting the speed to avoid unsafe or energy wasting movement.

We could put early defensive checks in our DSL implementation to detect undesirable manoeuvres but we can also use type checking extensions for certain kinds of errors. Groovy in fact has its own DSL for writing such extensions. That's a topic for its own blog but here's what the code looks like:

afterMethodCall { call ->
    def method = getTargetMethod(call)
    if (method.name != 'at') return
    if (call.arguments.size() != 1) return
    def arg = call.arguments[0]
    if (arg !instanceof BinaryExpression) return
    def left = arg.leftExpression
    if (left !instanceof PropertyExpression) return
    def obj = left.objectExpression
    if (obj !instanceof ConstantExpression) return
    if (obj.value > 5) {
        addStaticTypeError("Speed of $obj.value is too fast!", call)
        handled = true
    }
}

This is only a partial implementation; it makes numerous assumptions. We could remove those assumptions by adding more code, but for now we'll keep this simplified version.

So, now the following script (with the above type checking extension applied) compiles fine:

move right by 2.m at 5.cm/s

But this script fails:

move right by 2.m at 6.cm/s

The error message is:

[Static type checking] - Speed of 6 is too fast!

Further information

Conclusion

We have looked at using the JSR 385 javax.measure API using Groovy and added some DSL examples to make using the API a little nicer.


Natural Language Processing with Groovy, OpenNLP, CoreNLP, Nlp4j, Datumbox, Smile, Spark NLP, DJL and TensorFlow

by paulk


Posted on Sunday August 07, 2022 at 07:34AM in Technology


Natural Language Processing is certainly a large and sometimes complex topic with many aspects. Some of those aspects deserve entire blogs in their own right. For this blog, we will briefly look at a few simple use cases illustrating where you might be able to use NLP technology in your own project.

Language Detection

Knowing what language some text represents can be a critical first step to subsequent processing. Let's look at how to predict the language using a pre-built model and Apache OpenNLP. Here, ResourceHelper is a utility class used to download and cache the model. The first run may take a little while as it downloads the model. Subsequent runs should be fast. Here we are using a well-known model referenced in the OpenNLP documentation.

def helper = new ResourceHelper('https://dlcdn.apache.org/opennlp/models/langdetect/1.8.3/')
def model = new LanguageDetectorModel(helper.load('langdetect-183'))
def detector = new LanguageDetectorME(model)

[spa: 'Bienvenido a Madrid', fra: 'Bienvenue à Paris',
 dan: 'Velkommen til København', bul: 'Добре дошли в София'
].each { k, v ->
    assert detector.predictLanguage(v).lang == k
}

The LanguageDetectorME class lets us predict the language. In general, the predictor may not be accurate on small samples of text but it was good enough for our example. We've used the language code as the key in our map and we check that against the predicted language.

A more complex scenario is training your own model. Let's look at how to do that with Datumbox. Datumbox has a pre-trained models zoo but its language detection model didn't seem to work well for the small snippets in the next example, so we'll train our own model. First, we'll define our datasets:

def datasets = [
    English   : getClass().classLoader.getResource("training.language.en.txt").toURI(),
    French    : getClass().classLoader.getResource("training.language.fr.txt").toURI(),
    German    : getClass().classLoader.getResource("training.language.de.txt").toURI(),
    Spanish   : getClass().classLoader.getResource("training.language.es.txt").toURI(),
    Indonesian: getClass().classLoader.getResource("training.language.id.txt").toURI()
]

The German (de) training dataset comes from the Datumbox examples. The training datasets for the other languages are from Kaggle.

We set up the training parameters needed by our algorithm:

def trainingParams = new TextClassifier.TrainingParameters(
    numericalScalerTrainingParameters: null,
    featureSelectorTrainingParametersList: [new ChisquareSelect.TrainingParameters()],
    textExtractorParameters: new NgramsExtractor.Parameters(),
    modelerTrainingParameters: new MultinomialNaiveBayes.TrainingParameters()
)

We'll use a Naïve Bayes model with Chisquare feature selection.

Next we create our algorithm, train it with our training dataset, and then validate it against the training dataset. We'd normally want to split the data into training and testing datasets, to give us a more accurate estimate of the accuracy of our model. But for simplicity, while still illustrating the API, we'll train and validate with our entire dataset:

def config = Configuration.configuration
def classifier = MLBuilder.create(trainingParams, config)
classifier.fit(datasets)
def metrics = classifier.validate(datasets)
println "Classifier Accuracy (using training data): $metrics.accuracy"

When run, we see the following output:

Classifier Accuracy (using training data): 0.9975609756097561

Our test dataset will consist of some hard-coded illustrative phrases. Let's use our model to predict the language for each phrase:

[   'Bienvenido a Madrid', 'Bienvenue à Paris', 'Welcome to London',
'Willkommen in Berlin', 'Selamat Datang di Jakarta'
].each { txt ->
def r = classifier.predict(txt)
def predicted = r.YPredicted
def probability = sprintf '%4.2f', r.YPredictedProbabilities.get(predicted)
println "Classifying: '$txt', Predicted: $predicted, Probability: $probability"
}

When run, it has this output:

Classifying: 'Bienvenido a Madrid',  Predicted: Spanish,  Probability: 0.83
Classifying: 'Bienvenue à Paris',  Predicted: French,  Probability: 0.71
Classifying: 'Welcome to London',  Predicted: English,  Probability: 1.00
Classifying: 'Willkommen in Berlin',  Predicted: German,  Probability: 0.84
Classifying: 'Selamat Datang di Jakarta',  Predicted: Indonesian,  Probability: 1.00
Given these phrases are very short, it is nice to get them all correct, and the probabilities all seem reasonable for this scenario.

Parts of Speech

Parts of speech (POS) analysers examine each part of a sentence (the words and potentially punctuation) in terms of the role it plays. A typical analyser will annotate words with their role, identifying nouns, verbs, adjectives and so forth. This can be a key early step for tools like the voice assistants from Amazon, Apple and Google.

We'll start by looking at a perhaps lesser known library Nlp4j before looking at some others. In fact, there are multiple Nlp4j libraries. We'll use the one from nlp4j.org, which seems to be the most active and recently updated.

This library uses the Stanford CoreNLP library under the covers for its English POS functionality. The library has the concept of documents, and annotators that work on documents. Once annotated, we can print out all of the discovered words and their annotations:

var doc = new DefaultDocument()
doc.putAttribute('text', 'I eat sushi with chopsticks.')
var ann = new StanfordPosAnnotator()
ann.setProperty('target', 'text')
ann.annotate(doc)
println doc.keywords.collect{ k -> "${k.facet - 'word.'}(${k.str})" }.join(' ')

When run, we see the following output:

PRP(I) VBP(eat) NN(sushi) IN(with) NNS(chopsticks) .(.)

The annotations, also known as tags or facets, for this example are as follows:

PRP: Personal pronoun
VBP: Present tense verb
NN: Noun, singular
IN: Preposition
NNS: Noun, plural

The documentation for the libraries we are using gives a more complete list of such annotations.

A nice aspect of this library is support for other languages, in particular, Japanese. The code is very similar but uses a different annotator:

doc = new DefaultDocument()
doc.putAttribute('text', '私は学校に行きました。')
ann = new KuromojiAnnotator()
ann.setProperty('target', 'text')
ann.annotate(doc)
println doc.keywords.collect{ k -> "${k.facet}(${k.str})" }.join(' ')

When run, we see the following output:

名詞(私) 助詞(は) 名詞(学校) 助詞(に) 動詞(行き) 助動詞(まし) 助動詞(た) 記号(。)

Before progressing, we'll highlight the result visualization capabilities of the GroovyConsole. This feature lets us write a small Groovy script which converts results to any swing component. In our case we'll convert lists of annotated strings to a JLabel component containing HTML including colored annotation boxes. The details aren't included here but can be found in the repo. We need to copy that file into our ~/.groovy folder and then enable script visualization as shown here:

Screenshot from 2022-08-04 21-57-35.png

Then we should see the following when running the script:

Screenshot from 2022-08-04 21-59-47.png

The visualization is purely optional but adds a nice touch. If using Groovy in notebook environments like Jupyter/BeakerX, there might be visualization tools in those environments too.

Let's look at a larger example using the Smile library.

First, the sentences that we'll examine:

def sentences = [
'Paul has two sisters, Maree and Christine.',
'No wise fish would go anywhere without a porpoise',
'His bark was much worse than his bite',
'Turn on the lights to the main bedroom',
"Light 'em all up",
'Make it dark downstairs'
]

A couple of those sentences might seem a little strange but they are selected to show off quite a few of the different POS tags.

Smile has a tokenizer class which splits a sentence into words. It handles numerous cases like contractions and abbreviations ("e.g.", "'tis", "won't"). Smile also has a POS class based on the hidden Markov model and a built-in model is used for that class. Here is our code using those classes:

def tokenizer = new SimpleTokenizer(true)
sentences.each {
def tokens = Arrays.stream(tokenizer.split(it)).toArray(String[]::new)
def tags = HMMPOSTagger.default.tag(tokens)*.toString()
println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}

We run the tokenizer for each sentence. Each token is then displayed directly or with its tag if it has one.

Running the script gives the following annotated output (rendered in the GroovyConsole as colored annotation boxes):

NNP(Paul) VBZ(has) CD(two) NNS(sisters) , NNP(Maree) CC(and) NNP(Christine) .
DT(No) JJ(wise) NN(fish) MD(would) VB(go) RB(anywhere) IN(without) DT(a) NN(porpoise)
PRP$(His) NN(bark) VBD(was) RB(much) JJR(worse) IN(than) PRP$(his) NN(bite)
VB(Turn) IN(on) DT(the) NNS(lights) TO(to) DT(the) JJ(main) NN(bedroom)
NNP(Light) PRP('em) RB(all) RB(up)
VB(Make) PRP(it) JJ(dark) NN(downstairs)

[Note: the scripts in the repo just print to stdout which is perfect when using the command-line or IDEs. The visualization in the GroovyConsole kicks in only for the actual result. So, if you are following along at home and want to use the GroovyConsole, you'd change the each to collect and remove the println, and you should be good for visualization.]

The OpenNLP code is very similar:

def tokenizer = SimpleTokenizer.INSTANCE
sentences.each {
String[] tokens = tokenizer.tokenize(it)
def posTagger = new POSTaggerME('en')
String[] tags = posTagger.tag(tokens)
println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}

OpenNLP allows you to supply your own POS model but downloads a default one if none is specified.

When the script is run, it gives the following annotated output (again rendered as colored boxes in the GroovyConsole):

PROPN(Paul) VERB(has) NUM(two) NOUN(sisters) PUNCT(,) PROPN(Maree) CCONJ(and) PROPN(Christine) PUNCT(.)
DET(No) ADJ(wise) NOUN(fish) AUX(would) VERB(go) ADV(anywhere) ADP(without) DET(a) NOUN(porpoise)
PRON(His) NOUN(bark) AUX(was) ADV(much) ADJ(worse) ADP(than) PRON(his) NOUN(bite)
VERB(Turn) ADP(on) DET(the) NOUN(lights) ADP(to) DET(the) ADJ(main) NOUN(bedroom)
NOUN(Light) PUNCT(') NOUN(em) ADV(all) ADP(up)
VERB(Make) PRON(it) ADJ(dark) NOUN(downstairs)

The observant reader may have noticed some slight differences in the tags used in this library. They are essentially the same but using slightly different names. This is something to be aware of when swapping between POS libraries or models. Make sure you look up the documentation for the library/model you are using to understand the available tag types.
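For the sentences above, a partial mapping between the two tag sets looks roughly like this (illustrative only; consult the Penn Treebank and Universal Dependencies documentation for the authoritative lists):

```groovy
// Partial Penn Treebank -> Universal Dependencies mapping (illustrative, not exhaustive)
def pennToUd = [
    NNP: 'PROPN', NN: 'NOUN', NNS: 'NOUN',
    VB: 'VERB', JJ: 'ADJ', DT: 'DET',
    CD: 'NUM', CC: 'CCONJ', RB: 'ADV',
    IN: 'ADP', PRP: 'PRON', MD: 'AUX'
]
assert pennToUd.NNP == 'PROPN'
assert ['NN', 'NNS'].every { pennToUd[it] == 'NOUN' }
```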

Entity Detection

Named entity recognition (NER) seeks to identify and classify named entities in text. Categories of interest might be persons, organizations, locations, dates, etc. It is another technology used in many fields of NLP.

We'll start with our sentences to analyse:

String[] sentences = [
"A commit by Daniel Sun on December 6, 2020 improved Groovy 4's language integrated query.",
"A commit by Daniel on Sun., December 6, 2020 improved Groovy 4's language integrated query.",
'The Groovy in Action book by Dierk Koenig et. al. is a bargain at $50, or indeed any price.',
'The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.',
'I saw Ms. May Smith waving to June Jones.',
'The parcel was passed from May to June.',
'The Mona Lisa by Leonardo da Vinci has been on display in the Louvre, Paris since 1797.'
]

We'll use some well-known pre-trained models, focusing on the person, money, date, time, and location models:

def base = 'http://opennlp.sourceforge.net/models-1.5'
def modelNames = ['person', 'money', 'date', 'time', 'location']
def finders = modelNames.collect { model ->
new NameFinderME(DownloadUtil.downloadModel(new URL("$base/en-ner-${model}.bin"), TokenNameFinderModel))
}

We'll now tokenize our sentences:

def tokenizer = SimpleTokenizer.INSTANCE
sentences.each { sentence ->
String[] tokens = tokenizer.tokenize(sentence)
Span[] tokenSpans = tokenizer.tokenizePos(sentence)
def entityText = [:]
def entityPos = [:]
finders.indices.each {fi ->
// could be made smarter by looking at probabilities and overlapping spans
Span[] spans = finders[fi].find(tokens)
spans.each{span ->
def se = span.start..<span.end
def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end)
entityPos[span.start] = pos
entityText[span.start] = "$span.type(${sentence[pos]})"
}
}
entityPos.keySet().sort().reverseEach {
def pos = entityPos[it]
def (from, to) = [pos.from, pos.to + 1]
sentence = sentence[0..<from] + entityText[it] + sentence[to..-1]
}
println sentence
}

When visualized, this shows:

A commit by person(Daniel Sun) on date(December 6, 2020) improved Groovy 4's language integrated query.
A commit by person(Daniel) on Sun., date(December 6, 2020) improved Groovy 4's language integrated query.
The Groovy in Action book by person(Dierk Koenig) et. al. is a bargain at money($50), or indeed any price.
The conference wrapped up date(yesterday) at time(5:30 p.m.) in location(Copenhagen), location(Denmark).
I saw Ms. person(May Smith) waving to person(June Jones).
The parcel was passed from date(May to June).
The Mona Lisa by person(Leonardo da Vinci) has been on display in the Louvre, location(Paris) date(since 1797).

We can see here that most examples have been categorized as we might expect. We'd have to improve our model for it to do a better job on the "May to June" example.
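One detail of the splicing code above deserves a mention: annotations are spliced back into the sentence in reverse position order, so earlier character offsets remain valid as the string grows. Here is a standalone sketch of the same trick, with hypothetical spans:

```groovy
// Splice annotations in from right to left so earlier offsets stay valid
def annotate(String s, Map spans) {
    spans.keySet().sort { -it.from }.each { r ->
        s = s[0..<r.from] + "${spans[r]}(${s[r]})" + s[(r.to + 1)..-1]
    }
    s
}

def sentence = 'I saw May Smith waving to June Jones.'
def entities = [(6..14): 'person', (26..35): 'person']
assert annotate(sentence, entities) ==
    'I saw person(May Smith) waving to person(June Jones).'
```

Splicing left to right would require adjusting every subsequent offset by the length each annotation adds.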

Scaling Entity Detection

We can also run our named entity detection algorithms on platforms like Spark NLP which adds NLP functionality to Apache Spark. We'll use glove_100d embeddings and the onto_100 NER model.

var assembler = new DocumentAssembler(inputCol: 'text', outputCol: 'document', cleanupMode: 'disabled')

var tokenizer = new Tokenizer(inputCols: ['document'] as String[], outputCol: 'token')

var embeddings = WordEmbeddingsModel.pretrained('glove_100d').tap {
inputCols = ['document', 'token'] as String[]
outputCol = 'embeddings'
}

var model = NerDLModel.pretrained('onto_100', 'en').tap {
inputCols = ['document', 'token', 'embeddings'] as String[]
outputCol ='ner'
}

var converter = new NerConverter(inputCols: ['document', 'token', 'ner'] as String[], outputCol: 'ner_chunk')

var pipeline = new Pipeline(stages: [assembler, tokenizer, embeddings, model, converter] as PipelineStage[])

var spark = SparkNLP.start(false, false, '16G', '', '', '')

var text = [
"The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris."
]
var data = spark.createDataset(text, Encoders.STRING()).toDF('text')

var pipelineModel = pipeline.fit(data)

var transformed = pipelineModel.transform(data)
transformed.show()

use(SparkCategory) {
transformed.collectAsList().each { row ->
def res = row.text
def chunks = row.ner_chunk.reverseIterator()
while (chunks.hasNext()) {
def chunk = chunks.next()
int begin = chunk.begin
int end = chunk.end
def entity = chunk.metadata.get('entity').get()
res = res[0..<begin] + "$entity($chunk.result)" + res[end<..-1]
}
println res
}
}

We won't go into all of the details here. In summary, the code sets up a pipeline that transforms our input sentences, via a series of steps, into chunks, where each chunk corresponds to a detected entity. Each chunk has a start and ending position, and an associated tag type.

This may not seem like it is much different to our earlier examples, but if we had large volumes of data and we were running in a large cluster, the work could be spread across worker nodes within the cluster.

Here we have used a utility SparkCategory class which makes accessing the information in Spark Row instances a little nicer in terms of Groovy shorthand syntax. We can use row.text instead of row.get(row.fieldIndex('text')). Here is the code for this utility class:

class SparkCategory {
static get(Row r, String field) { r.get(r.fieldIndex(field)) }
}

If doing more than this simple example, the use of SparkCategory could be made implicit through various standard Groovy techniques.
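For readers unfamiliar with Groovy categories, here is a minimal self-contained sketch of the same pattern using a toy row class (the class and method names are invented for illustration):

```groovy
// A toy row class with index-based access, mimicking the Spark Row pattern
class ToyRow {
    List cells
    List names
    int fieldIndex(String name) { names.indexOf(name) }
    def fetch(int i) { cells[i] }
}

// The category adds a get(row, field) method, enabling row.text style access
class ToyCategory {
    static get(ToyRow r, String field) { r.fetch(r.fieldIndex(field)) }
}

def row = new ToyRow(cells: ['hello', 42], names: ['text', 'total'])
use(ToyCategory) {
    assert row.text == 'hello'
    assert row.total == 42
}
```

Outside the use block, the category method is not active and the shorthand property access is unavailable.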

When we run our script, we see the following output:

22/08/07 12:31:39 INFO SparkContext: Running Spark version 3.3.0
...
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
...
onto_100 download started this may take some time.
Approximate size to download 13.5 MB
...
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|          embeddings|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The Mona Lisa is ...|[{document, 0, 98...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 12, T...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
PERSON(The Mona Lisa) is a DATE(16th century) oil painting created by PERSON(Leonardo). It's held at the FAC(Louvre) in GPE(Paris).

The result has the following visualization:

PERSON(The Mona Lisa) is a DATE(16th century) oil painting created by PERSON(Leonardo). It's held at the FAC(Louvre) in GPE(Paris).

Here FAC is facility (buildings, airports, highways, bridges, etc.) and GPE is Geo-Political Entity (countries, cities, states, etc.).

Sentence Detection

Detecting sentences in text might seem a simple concept at first but there are numerous special cases.

Consider the following text:

def text = '''
The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951. It describes a method for
measuring the amount of protein (even as small as 0.2 γ, were γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf. Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings. He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.
'''

There are full stops at the end of each sentence (though in general, it could also be other punctuation like exclamation marks and question marks). There are also full stops and decimal points in abbreviations, URLs, decimal numbers and so forth. Sentence detection algorithms might have some special hard-coded cases, like "Dr.", "Ms.", or emoticons, and may also use some heuristics. In general, they might also be trained with examples like the above.
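To see why something smarter than simple pattern matching is needed, consider a naive splitter that breaks on a full stop followed by whitespace and a capital letter. This sketch (purely illustrative, and not how OpenNLP works) is fooled by abbreviations and initials:

```groovy
// Naive splitting on "." followed by whitespace and a capital letter (illustrative only)
def naiveSplit(String text) {
    text.split(/(?<=\.)\s+(?=[A-Z])/)*.trim().findAll()
}

def sample = 'Dr. Lowry moved to Harvard under A. Baird Hastings. He stayed 29 years.'
// A proper detector finds 2 sentences; the naive split wrongly yields 4 pieces
assert naiveSplit(sample).size() == 4
```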

Here is some code for OpenNLP for detecting sentences in the above:

def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')
def model = new SentenceModel(helper.load('en-sent'))
def detector = new SentenceDetectorME(model)
def sentences = detector.sentDetect(text)
assert text.count('.') == 28
assert sentences.size() == 4
println "Found ${sentences.size()} sentences:\n" + sentences.join('\n\n')

It has the following output:

Downloading en-sent
Found 4 sentences:
The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951.

It describes a method for
measuring the amount of protein (even as small as 0.2 γ, were γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf.

Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings.

He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.

We can see here, it handled all of the tricky cases in the example.

Relationship Extraction with Triples

The next step after detecting named entities and the various parts of speech of certain words is to explore relationships between them. This is often done in the form of subject-predicate-object triplets. In our earlier NER example, for the sentence "The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.", we found various date, time and location named entities.

We can extract triples using the MinIE library (which in turn uses the Stanford CoreNLP library) with the following code:

def parser = CoreNLPUtils.StanfordDepNNParser()
sentences.each { sentence ->
def minie = new MinIE(sentence, parser, MinIE.Mode.SAFE)

println "\nInput sentence: $sentence"
println '============================='
println 'Extractions:'
for (ap in minie.propositions) {
println "\tTriple: $ap.tripleAsString"
def attr = ap.attribution.attributionPhrase ? ap.attribution.toStringCompact() : 'NONE'
println "\tFactuality: $ap.factualityAsString\tAttribution: $attr"
println '\t----------'
}
}

The output for the previously mentioned sentence is shown below:

Input sentence: The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.
=============================
Extractions:
        Triple: "conference"    "wrapped up yesterday at"       "5:30 p.m."
        Factuality: (+,CT)      Attribution: NONE
        ----------
        Triple: "conference"    "wrapped up yesterday in"       "Copenhagen"
        Factuality: (+,CT)      Attribution: NONE
        ----------
        Triple: "conference"    "wrapped up"    "yesterday"
        Factuality: (+,CT)      Attribution: NONE

We can now piece together the relationships between the earlier entities we detected.

There was also a problematic case amongst the earlier NER examples, "The parcel was passed from May to June.". The previous model detected "May to June" as a date. Let's explore that using CoreNLP's triple extraction directly. We won't show the source code here but CoreNLP supports simple and more powerful approaches to solving this problem. The output for the sentence in question using the more powerful technique is:

Sentence #7: The parcel was passed from May to June.
root(ROOT-0, passed-4)
det(parcel-2, The-1)
nsubj:pass(passed-4, parcel-2)
aux:pass(passed-4, was-3)
case(May-6, from-5)
obl:from(passed-4, May-6)
case(June-8, to-7)
obl:to(passed-4, June-8)
punct(passed-4, .-9)

Triples:
1.0	parcel	was	passed
1.0	parcel	was passed to	June
1.0	parcel	was	passed from May to June
1.0	parcel	was passed from	May

We can see that this has done a better job of piecing together what entities we have and their relationships.

Sentiment Analysis

Sentiment analysis is an NLP technique used to determine whether data is positive, negative, or neutral. Stanford CoreNLP has default models it uses for this purpose:

def doc = new Document('''
StanfordNLP is fantastic!
Groovy is great fun!
Math can be hard!
''')
for (sent in doc.sentences()) {
println "${sent.toString().padRight(40)} ${sent.sentiment()}"
}

Which has the following output:

[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.6 sec].
[main] INFO edu.stanford.nlp.sentiment.SentimentModel - Loading sentiment model edu/stanford/nlp/models/sentiment/sentiment.ser.gz ... done [0.1 sec].
StanfordNLP is fantastic!                POSITIVE
Groovy is great fun!                     VERY_POSITIVE
Math can be hard!                        NEUTRAL

We can also train our own. Let's start with two datasets:

def datasets = [
positive: getClass().classLoader.getResource("rt-polarity.pos").toURI(),
negative: getClass().classLoader.getResource("rt-polarity.neg").toURI()
]

We'll first use Datumbox which, as we saw earlier, requires training parameters for our algorithm:

def trainingParams = new TextClassifier.TrainingParameters(
numericalScalerTrainingParameters: null,
featureSelectorTrainingParametersList: [new ChisquareSelect.TrainingParameters()],
textExtractorParameters: new NgramsExtractor.Parameters(),
modelerTrainingParameters: new MultinomialNaiveBayes.TrainingParameters()
)

We now create our algorithm, train it with our training dataset, and for illustrative purposes validate against the training dataset:

def config = Configuration.configuration
TextClassifier classifier = MLBuilder.create(trainingParams, config)
classifier.fit(datasets)
def metrics = classifier.validate(datasets)
println "Classifier Accuracy (using training data): $metrics.accuracy"

The output is shown here:

[main] INFO com.datumbox.framework.core.common.dataobjects.Dataframe$Builder - Dataset Parsing positive class
[main] INFO com.datumbox.framework.core.common.dataobjects.Dataframe$Builder - Dataset Parsing negative class
...
Classifier Accuracy (using training data): 0.8275959103273615

Now we can test our model against several sentences:

['Datumbox is divine!', 'Groovy is great fun!', 'Math can be hard!'].each {
def r = classifier.predict(it)
def predicted = r.YPredicted
def probability = sprintf '%4.2f', r.YPredictedProbabilities.get(predicted)
println "Classifying: '$it', Predicted: $predicted, Probability: $probability"
}

Which has this output:

...
[main] INFO com.datumbox.framework.applications.nlp.TextClassifier - predict()
...
Classifying: 'Datumbox is divine!',  Predicted: positive,  Probability: 0.83
Classifying: 'Groovy is great fun!',  Predicted: positive,  Probability: 0.80
Classifying: 'Math can be hard!',  Predicted: negative,  Probability: 0.95

We can do the same thing but with OpenNLP. First, we collect our input data. OpenNLP is expecting it in a single dataset with tagged examples:

def trainingCollection = datasets.collect { k, v ->
new File(v).readLines().collect{"$k $it".toString() }
}.sum()

Now, we'll train two model variants: one using Naïve Bayes, the other Maxent.

def variants = [
Maxent : new TrainingParameters(),
NaiveBayes: new TrainingParameters((CUTOFF_PARAM): '0', (ALGORITHM_PARAM): NAIVE_BAYES_VALUE)
]
def models = [:]
variants.each{ key, trainingParams ->
def trainingStream = new CollectionObjectStream(trainingCollection)
def sampleStream = new DocumentSampleStream(trainingStream)
println "\nTraining using $key"
models[key] = DocumentCategorizerME.train('en', sampleStream, trainingParams, new DoccatFactory())
}

Now we run sentiment predictions on our sample sentences using both variants:

def w = sentences*.size().max()

variants.each { key, params ->
def categorizer = new DocumentCategorizerME(models[key])
println "\nAnalyzing using $key"
sentences.each {
def result = categorizer.categorize(it.split('[ !]'))
def category = categorizer.getBestCategory(result)
def prob = sprintf '%4.2f', result[categorizer.getIndex(category)]
println "${it.padRight(w)} $category ($prob)"
}
}

When we run this we get:

Training using Maxent ...done.
...

Training using NaiveBayes ...done.
...

Analyzing using Maxent
OpenNLP is fantastic! positive (0.64)
Groovy is great fun!  positive (0.74)
Math can be hard!     negative (0.61)

Analyzing using NaiveBayes
OpenNLP is fantastic! positive (0.72)
Groovy is great fun!  positive (0.81)
Math can be hard!     negative (0.72)

The models here appear to have lower probability levels compared to the model we trained for Datumbox. We could try tweaking the training parameters further if this was a problem. We'd probably also need a bigger testing set to convince ourselves of the relative merits of each model. Some models can be over-trained on small datasets and perform very well with data similar to their training datasets but perform much worse for other data.
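If we did want a held-out test set, a simple shuffled split over the labeled lines would do. A minimal sketch (an 80/20 split with a fixed seed for repeatability; the method name is invented):

```groovy
// Shuffled 80/20 train/test split with a fixed seed for repeatability
def trainTestSplit(List data, double trainFraction = 0.8, long seed = 42) {
    def copy = new ArrayList(data)
    Collections.shuffle(copy, new Random(seed))
    int cut = (copy.size() * trainFraction) as int
    [copy[0..<cut], copy[cut..<copy.size()]]
}

def labeled = (1..10).collect { 'positive sample ' + it }
def (train, test) = trainTestSplit(labeled)
assert train.size() == 8 && test.size() == 2
assert (train + test).toSet() == labeled.toSet()
```

We would then fit on the training portion only and report accuracy on the held-out portion.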

Universal Sentence Encoding

This example is inspired by the UniversalSentenceEncoder example in the DJL examples module. It looks at using the universal sentence encoder model from TensorFlow Hub via the Deep Java Library (DJL) API.

First we define a translator. The Translator interface lets us specify pre- and post-processing functionality.

class MyTranslator implements NoBatchifyTranslator<String[], double[][]> {
@Override
NDList processInput(TranslatorContext ctx, String[] raw) {
var factory = ctx.NDManager
var inputs = new NDList(raw.collect(factory::create))
new NDList(NDArrays.stack(inputs))
}

@Override
double[][] processOutput(TranslatorContext ctx, NDList list) {
long numOutputs = list.singletonOrThrow().shape.get(0)
NDList result = []
for (i in 0..<numOutputs) {
result << list.singletonOrThrow().get(i)
}
result*.toFloatArray() as double[][]
}
}

Here, we manually pack our input sentences into the required n-dimensional data types, and extract our output calculations into a 2D double array.

Next, we create our predict method by first defining the criteria for our prediction algorithm. We are going to use our translator, use the TensorFlow engine, use a predefined sentence encoder model from the TensorFlow Hub, and indicate that we are creating a text embedding application:

def predict(String[] inputs) {
String modelUrl = "https://storage.googleapis.com/tfhub-modules/google/universal-sentence-encoder/4.tar.gz"

Criteria<String[], double[][]> criteria =
Criteria.builder()
.optApplication(Application.NLP.TEXT_EMBEDDING)
.setTypes(String[], double[][])
.optModelUrls(modelUrl)
.optTranslator(new MyTranslator())
.optEngine("TensorFlow")
.optProgress(new ProgressBar())
.build()
try (var model = criteria.loadModel()
var predictor = model.newPredictor()) {
predictor.predict(inputs)
}
}

Next, let's define our input strings:

String[] inputs = [
"Cycling is low impact and great for cardio",
"Swimming is low impact and good for fitness",
"Pilates is good for fitness and flexibility",
"Weights are good for strength and fitness",
"Orchids can be tricky to grow",
"Sunflowers are fun to grow",
"Radishes are easy to grow",
"The taste of radishes grows on you after a while",
]
var k = inputs.size()

Now, we'll use our predictor method to calculate the embeddings for each sentence. We'll print out the embeddings and also calculate the dot product of the embeddings. The dot product (the same as the inner product for this case) reveals how related the sentences are.

var embeddings = predict(inputs)

var z = new double[k][k]
for (i in 0..<k) {
println "Embedding for: ${inputs[i]}\n${embeddings[i]}"
for (j in 0..<k) {
z[i][j] = dot(embeddings[i], embeddings[j])
}
}
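The dot helper isn't shown above. A minimal version, assuming two equal-length vectors, might look like this:

```groovy
// Dot product of two equal-length embedding vectors
double dot(double[] a, double[] b) {
    assert a.length == b.length
    double sum = 0
    for (i in 0..<a.length) {
        sum += a[i] * b[i]
    }
    sum
}

assert dot([1d, 2d, 3d] as double[], [4d, 5d, 6d] as double[]) == 32d
```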

Finally, we'll use the Heatmap class from Smile to present a nice display highlighting what the data reveals:

new Heatmap(inputs, inputs, z, Palette.heat(20).reverse()).canvas().with {
title = 'Semantic textual similarity'
setAxisLabels('', '')
window()
}

The output shows us the embeddings:

Loading:     100% |========================================|
2022-08-07 17:10:43.212697: ... This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
...
2022-08-07 17:10:52.589396: ... SavedModel load for tags { serve }; Status: success: OK...
...
Embedding for: Cycling is low impact and great for cardio
[-0.02865048497915268, 0.02069241739809513, 0.010843578726053238, -0.04450441896915436, ...]
...
Embedding for: The taste of radishes grows on you after a while
[0.015841705724596977, -0.03129228577017784, 0.01183396577835083, 0.022753292694687843, ...]

The embeddings are an indication of similarity. Two sentences with similar meaning typically have similar embeddings.

The displayed graphic is shown below:

2022-08-06 22_18_05-Smile Plot 1.png

This graphic shows that our first four sentences are somewhat related, as are the last four sentences, but that there is minimal relationship between those two groups.

More information

Further examples can be found in the related repos:

https://github.com/paulk-asert/groovy-data-science/blob/master/subprojects/LanguageProcessing

https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/LanguageProcessingSparkNLP

https://github.com/paulk-asert/groovy-data-science/tree/master/subprojects/LanguageProcessingDjl

Conclusion

We have looked at a range of NLP examples using various NLP libraries. Hopefully you can see some cases where you could use additional NLP technologies in some of your own applications.



Detecting objects with Groovy, the Deep Java Library (DJL), and Apache MXNet

by paulk


Posted on Monday August 01, 2022 at 11:52AM in Technology


This blog post looks at using Apache Groovy with the Deep Java Library (DJL), backed by the Apache MXNet engine, to detect objects within an image. (Apache MXNet is an incubating project at the ASF.)

Deep Learning

Deep learning falls under the branches of machine learning and artificial intelligence. It involves multiple layers (hence the "deep") of an artificial neural network. There are lots of ways to configure such networks and the details are beyond the scope of this blog post, but we can give some basic details. We will have four input nodes corresponding to the measurements of our four characteristics. We will have three output nodes corresponding to each possible class (species). We will also have one or more additional layers in between.

deep_network.png

Each node in this network mimics to some degree a neuron in the human brain. Again, we'll simplify the details. Each node has multiple inputs, which are given a particular weight, as well as an activation function which will determine whether our node "fires". Training the model is a process which works out what the best weights should be.

deep_node.png
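The behaviour of a single node can be sketched in a few lines of Groovy: a weighted sum of the inputs plus a bias, passed through an activation function (a sigmoid here; real networks use a variety of activations, and these weights are purely illustrative):

```groovy
// A single neuron: weighted sum of inputs plus bias, through a sigmoid activation
double sigmoid(double x) { 1 / (1 + Math.exp(-x)) }

double neuron(double[] inputs, double[] weights, double bias) {
    double sum = bias
    inputs.eachWithIndex { v, i -> sum += v * weights[i] }
    sigmoid(sum)
}

def out = neuron([0.5d, 0.1d, 0.9d, 0.3d] as double[],
                 [0.4d, -0.2d, 0.7d, 0.1d] as double[], 0.05d)
assert 0 < out && out < 1   // a sigmoid always outputs a value in (0, 1)
```

Training adjusts the weights and bias so the network's outputs match the expected classes.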

Deep Java Library (DJL) & Apache MXNet

Rather than writing your own neural networks, libraries such as DJL provide high-level abstractions which automate to some degree the creation of the necessary neural network layers. DJL is engine agnostic, so it's capable of supporting different backends including Apache MXNet, PyTorch, TensorFlow and ONNX Runtime. We'll use the default engine which for our application (at the time of writing) is Apache MXNet.

Apache MXNet provides the underlying engine. It has support for imperative and symbolic execution, distributed training of your models using multi-gpu or multi-host hardware, and multiple language bindings. Groovy is fully compatible with the Java binding.

Using DJL with Groovy

Groovy uses the Java binding. Consider looking at the DJL beginner tutorials for Java - they will work almost unchanged for Groovy.

For our example, the first thing we need to do is download the image we want to run the object detection model on:

Path tempDir = Files.createTempDirectory("resnetssd")
def imageName = 'dog-ssd.jpg'
Path localImage = tempDir.resolve(imageName)
def url = new URL("https://s3.amazonaws.com/model-server/inputs/$imageName")
DownloadUtils.download(url, localImage, new ProgressBar())
Image img = ImageFactory.instance.fromFile(localImage)

It happens to be a well-known, already available image. We'll store a local copy of the image in a temporary directory and we'll use a utility class that comes with DJL to provide a nice progress bar while the image is downloading. DJL provides its own image classes, so we'll create an instance using the appropriate class from the downloaded image.

Next we want to configure our neural network layers:

def criteria = Criteria.builder()
        .optApplication(Application.CV.OBJECT_DETECTION)
        .setTypes(Image, DetectedObjects)
        .optFilter("backbone", "resnet50")
        .optEngine(Engine.defaultEngineName)
        .optProgress(new ProgressBar())
        .build()

DJL supports numerous model applications including image classification, word recognition, sentiment analysis, linear regression, and others. We'll select object detection. This kind of application looks for the bounding box of known objects within an image. The types configuration option identifies that our input will be an image and the output will be detected objects. The filter option indicates that we will be using ResNet-50 (a 50-layer deep convolutional neural network often used as a backbone for many computer vision tasks). We set the engine to be the default engine, which happens to be Apache MXNet. We also configure an optional progress bar to provide feedback while our model is running.

Now that we have our configuration sorted, we'll use it to load a model and then use the model to make object predictions:

def detection = criteria.loadModel().withCloseable { model ->
    model.newPredictor().predict(img)
}
detection.items().each { println it }
img.drawBoundingBoxes(detection)

For good measure, we'll draw the bounding boxes into our image.

Next, we save our image into a file and display it using Groovy's SwingBuilder.

Path imageSaved = tempDir.resolve('detected.png')
imageSaved.withOutputStream { os -> img.save(os, 'png') }
def saved = ImageIO.read(imageSaved.toFile())
new SwingBuilder().edt {
    frame(title: "$detection.numberOfObjects detected objects",
            size: [saved.width, saved.height],
            defaultCloseOperation: DISPOSE_ON_CLOSE,
            show: true) { label(icon: imageIcon(image: saved)) }
}

Building and running our application

Our code is stored in a source file called ObjectDetect.groovy.

We used Gradle for our build file:

apply plugin: 'groovy'
apply plugin: 'application'

repositories {
    mavenCentral()
}

application {
    mainClass = 'ObjectDetect'
}

dependencies {
    implementation "ai.djl:api:0.18.0"
    implementation "org.apache.groovy:groovy:4.0.4"
    implementation "org.apache.groovy:groovy-swing:4.0.4"
    runtimeOnly "ai.djl:model-zoo:0.18.0"
    runtimeOnly "ai.djl.mxnet:mxnet-engine:0.18.0"
    runtimeOnly "ai.djl.mxnet:mxnet-model-zoo:0.18.0"
    runtimeOnly "ai.djl.mxnet:mxnet-native-auto:1.8.0"
    runtimeOnly "org.apache.groovy:groovy-nio:4.0.4"
    runtimeOnly "org.slf4j:slf4j-jdk14:1.7.36"
}

We run the application with the gradle run task:

paulk@pop-os:/extra/projects/groovy-data-science$ ./gradlew DLMXNet:run
> Task :DeepLearningMxnet:run
Downloading: 100% |████████████████████████████████████████| dog-ssd.jpg
Loading:     100% |████████████████████████████████████████|
...
class: "car", probability: 0.99991, bounds: [x=0.611, y=0.137, width=0.293, height=0.160]
class: "bicycle", probability: 0.95385, bounds: [x=0.162, y=0.207, width=0.594, height=0.588]
class: "dog", probability: 0.93752, bounds: [x=0.168, y=0.350, width=0.274, height=0.593]

The displayed image looks like this:
2022-08-01 21_28_33-3 detected objects.png

Further Information

The full source code can be found in the following repo:
https://github.com/paulk-asert/groovy-data-science/subprojects/DeepLearningMxnet

Conclusion

We have examined using Apache Groovy, DJL and Apache MXNet to detect objects within an image. We used a rich deep learning model but we didn't need to get into the details of the model or its neural network layers. DJL and Apache MXNet did the heavy lifting for us. Groovy provided a simple coding experience for building our application.


Working with SQL databases with Groovy and GraalVM

by paulk


Posted on Friday July 29, 2022 at 02:07PM in Technology


During the week, there was an interesting video and blog post on the latest GraalVM 22.2 Release. The release has numerous new features and improvements including:

  • smaller native executables
  • the ability to generate heap dumps in native executables
  • experimental native image debugging within IntelliJ IDEA
  • the ability to embed a Software Bill of Materials (SBOM) into the executable for improved security (when using GraalVM Enterprise)
  • native metadata integration.

This blog looks at the last of these. We'll use the running example of the H2 database which the video discusses.

Native Metadata

Anyone who has used GraalVM will know that certain information must frequently be given to the native compiler. Certain classes can be initialized at build time; others should be initialized at runtime. If accessing certain kinds of resources, knowledge of those resources must be given to the compiler. Parts of the application which might be invoked through reflection or involve serialization might not be deemed reachable and won't automatically be included by the compiler.

Each library used within an application will have its own set of classes and resources which commonly need to be dealt with by anyone using that library. The Native Metadata repository keeps a shared copy of this information on a per-library basis. Once someone has populated the metadata, other projects using the same library can get that information automatically. We'll look more at metadata integration shortly, but first, let's look at our database application.

Working with SQL in Groovy

The application creates and then populates a customer database with four customers. It then prints them out:

import groovy.sql.Sql
import groovy.transform.CompileStatic

@CompileStatic
class H2Demo {
    static void main(args) {
        Sql.withInstance('jdbc:h2:./data/test') { sql ->
            sql.execute 'DROP TABLE IF EXISTS customers'
            sql.execute 'CREATE TABLE customers(id INTEGER AUTO_INCREMENT, name VARCHAR)'
            for (cust in ['Lord Archimonde', 'Arthur', 'Gilbert', 'Grug']) {
                sql.executeInsert "INSERT INTO customers(name) VALUES $cust"
            }
            println sql.rows('SELECT * FROM customers').join('\n')
        }
    }
}

Groovy's Sql class makes this relatively easy. The withInstance method creates a database connection and closes it when finished. The executeInsert method uses a Groovy interpolated String (GString) which creates a prepared statement under the covers.
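To make the prepared statement point concrete, here is a sketch (assuming an open `sql` instance as in the listing above) showing that the interpolated form is roughly equivalent to an explicit placeholder version:

```groovy
// The GString parameter is converted into a '?' placeholder in a
// prepared statement, guarding against SQL injection.
def name = 'Arthur'
sql.executeInsert "INSERT INTO customers(name) VALUES $name"
// roughly equivalent to the explicit form:
sql.executeInsert 'INSERT INTO customers(name) VALUES ?', [name]
```
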

Configuring our native build

Here is our build file:

plugins {
    id 'application'
    id 'groovy'
    id 'org.graalvm.buildtools.native'
}

application {
    mainClass = 'H2Demo'
}

repositories {
    mavenCentral()
}

dependencies {
    implementation 'com.h2database:h2:2.1.210'
    implementation 'org.apache.groovy:groovy:4.0.4'
    implementation 'org.apache.groovy:groovy-sql:4.0.4'
}

graalvmNative {
    agent {
        defaultMode = 'standard'
    }
    metadataRepository {
        enabled = true
    }
    binaries {
        main {
            buildArgs.addAll(
                // '-H:IncludeSBOM=cyclonedx',
                '--report-unsupported-elements-at-runtime',
                '--initialize-at-run-time=groovy.grape.GrapeIvy,org.h2.store.fs.niomem.FileNioMemData',
                '--initialize-at-build-time',
                '--no-fallback',
            )
        }
    }
}

We make use of the GraalVM native build plugin. We define our Groovy and H2 dependencies. We can also supply any needed parameters to the native compiler. Importantly, we enable integration with the metadata repository.

When we run the build, it will automatically create the native app for us:

paulk@pop-os:/extra/projects/groovy-graalvm-h2$ ./gradlew clean nativeRun
...
> Task :nativeCompile
[native-image-plugin] Using executable path: /extra/devtools/graalvm-ce-java17-22.2.0/bin/native-image
========================================================================================================================
GraalVM Native Image: Generating 'H2Demo' (executable)...
========================================================================================================================
...
[1/7] Initializing...                                                                                    (5.3s @ 0.26GB)
 Version info: 'GraalVM 22.2.0 Java 17 CE'
 Java version info: '17.0.4+8-jvmci-22.2-b06'
 C compiler: gcc (linux, x86_64, 11.2.0)
 Garbage collector: Serial GC
 1 user-specific feature(s)
 - com.oracle.svm.polyglot.groovy.GroovyIndyInterfaceFeature
[2/7] Performing analysis...  [************]                                                            (51.7s @ 1.82GB)
  10,597 (90.60%) of 11,697 classes reachable
  17,002 (64.13%) of 26,510 fields reachable
  58,165 (63.45%) of 91,666 methods reachable
     393 classes,   100 fields, and 2,057 methods registered for reflection
      65 classes,    74 fields, and    55 methods registered for JNI access
       4 native libraries: dl, pthread, rt, z
[3/7] Building universe...                                                                               (8.0s @ 4.02GB)
[4/7] Parsing methods...      [**]                                                                       (4.8s @ 3.85GB)
[5/7] Inlining methods...     [***]                                                                      (3.0s @ 1.72GB)
[6/7] Compiling methods...    [******]                                                                  (38.0s @ 3.63GB)
[7/7] Creating image...                                                                                  (5.9s @ 1.70GB)
  26.65MB (46.64%) for code area:    38,890 compilation units
  28.04MB (49.05%) for image heap:  359,812 objects and 66 resources
   2.46MB ( 4.31%) for other data
  57.15MB in total
------------------------------------------------------------------------------------------------------------------------
Top 10 packages in code area:                               Top 10 object types in image heap:
   1.48MB sun.security.ssl                                     5.85MB byte[] for code metadata
   1.06MB java.util                                            2.82MB java.lang.String
 979.43KB java.lang.invoke                                     2.78MB java.lang.Class
 758.29KB org.apache.groovy.parser.antlr4                      2.47MB byte[] for general heap data
 723.92KB com.sun.crypto.provider                              2.04MB byte[] for java.lang.String
 588.57KB org.h2.table                                       910.68KB com.oracle.svm.core.hub.DynamicHubCompanion
 582.06KB org.h2.command                                     764.95KB java.util.HashMap$Node
 494.23KB org.codehaus.groovy.classgen                       761.53KB java.lang.Object[]
 476.03KB c.s.org.apache.xerces.internal.impl.xs.traversers  715.65KB byte[] for embedded resources
 468.69KB java.lang                                          584.75KB java.util.HashMap$Node[]
  18.87MB for 370 more packages                                8.28MB for 2535 more object types
------------------------------------------------------------------------------------------------------------------------
                        3.9s (3.2% of total time) in 30 GCs | Peak RSS: 6.22GB | CPU load: 6.48
------------------------------------------------------------------------------------------------------------------------
Produced artifacts:
 /extra/projects/groovy-graalvm-h2/build/native/nativeCompile/H2Demo (executable)
 /extra/projects/groovy-graalvm-h2/build/native/nativeCompile/H2Demo.build_artifacts.txt (txt)
========================================================================================================================
Finished generating 'H2Demo' in 2m 1s.
    [native-image-plugin] Native Image written to: /extra/projects/groovy-graalvm-h2/build/native/nativeCompile

> Task :nativeRun
[ID:1, NAME:Lord Archimonde]
[ID:2, NAME:Arthur]
[ID:3, NAME:Gilbert]
[ID:4, NAME:Grug]

Checking the native image speed

We can also check the speed once the native image is built:

paulk@pop-os:/extra/projects/groovy-graalvm-h2$ time build/native/nativeCompile/H2Demo
[ID:1, NAME:Lord Archimonde]
[ID:2, NAME:Arthur]
[ID:3, NAME:Gilbert]
[ID:4, NAME:Grug]

real	0m0.027s
user	0m0.010s
sys	0m0.011s

More information

Check out the full source code from the repo: https://github.com/paulk-asert/groovy-graalvm-h2.

Conclusion

We have looked at a simple H2 database application and the steps involved in creating a native application with Groovy and GraalVM.


Reading and Writing CSV files with Groovy

by paulk


Posted on Monday July 25, 2022 at 02:26PM in Technology


In this post, we'll look at reading and writing CSV files using Groovy.

Aren't CSV files just text files?

For simple cases, we can treat CSV files no differently than we would other text files. Suppose we have the following data that we would like to write to a CSV file:

def data = [
    ['place', 'firstname', 'lastname', 'team'],
    ['1', 'Lorena', 'Wiebes', 'Team DSM'],
    ['2', 'Marianne', 'Vos', 'Team Jumbo Visma'],
    ['3', 'Lotte', 'Kopecky', 'Team SD Worx']
]

Groovy uses File or Path objects similar to Java. We'll use a File object here and, for our purposes, we'll just use a temporary file since we are just going to read it back in and check it against our data. Here is how to create a temporary file:

def file = File.createTempFile('FemmesStage1Podium', '.csv')

Writing our CSV (in this simple example) is as simple as joining the data with commas and the lines with line separator character(s):

file.text = data*.join(',').join(System.lineSeparator())

Here we "wrote" the entire file contents in one go but there are options for writing a line or character or byte at a time.
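For instance, the same content (modulo a trailing newline) could be produced a line at a time, a sketch using the BufferedWriter that Groovy's withWriter supplies:

```groovy
// Write the CSV one line at a time instead of in one go.
file.withWriter { w ->
    data.each { row -> w.writeLine(row.join(',')) }
}
```
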

Reading the data in is just as simple. We read the lines and split on commas:

assert file.readLines()*.split(',') == data

In general, we might want to further process the data. Groovy provides nice options for this too. Suppose we have the following existing CSV file:
HommesOverall.png
We can read in the file and select various columns of interest with code like below:

def file = new File('HommesStageWinners.csv')
def rows = file.readLines().tail()*.split(',')
int total = rows.size()
Set names = rows.collect { it[1] + ' ' + it[2] }
Set teams = rows*.getAt(3)
Set countries = rows*.getAt(4)
String result = "Across $total stages, ${names.size()} riders from " +
    "${teams.size()} teams and ${countries.size()} countries won stages."
assert result == 'Across 21 stages, 15 riders from 10 teams and 9 countries won stages.'

Here, the tail() method skips over the header line. Column 0 contains the stage number which we ignore. Column 1 contains the first name, column 2 the last name, column 3 the team, and column 4 the country of the rider. We store away the full names, teams and countries in sets to remove duplicates. We then create an overall result message using the size of those sets.

While the coding was fairly simple for this example, hand processing CSV files in this fashion isn't recommended. The details of CSV can quickly get messy. What if the values themselves contain commas or newlines? Perhaps we can surround values in double quotes, but then what if a value contains a double quote? And so forth. For this reason, CSV libraries are recommended.
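A quick illustration of how hand processing goes wrong: a quoted field containing a comma defeats a naive split (the line below is made up for illustration):

```groovy
// '"Vos, Marianne"' is a single quoted field, but a naive split
// on commas sees four pieces instead of the three fields we wanted.
def line = '2,"Vos, Marianne",Team Jumbo Visma'
assert line.split(',').size() == 4
```
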

We'll look at three shortly, but first let's summarise some of the highlights of the tour by looking at multiple winners. Here is some code which summarises our CSV data:

def byValueDesc = { -it.value }
def bySize = { k, v -> [k, v.size()] }
def isMultiple = { it.value > 1 }
def multipleWins = { Closure select -> rows
    .groupBy(select)
    .collectEntries(bySize)
    .findAll(isMultiple)
    .sort(byValueDesc)
    .entrySet()
    .join(', ')
}
println 'Multiple wins by country:\n' + multipleWins{ it[4] }
println 'Multiple wins by rider:\n' + multipleWins{ it[1] + ' ' + it[2] }
println 'Multiple wins by team:\n' + multipleWins{ it[3] }

This summary has nothing in particular to do with CSV files but is summarised in honour of the great riding during the tour! Here's the output:

MultipleWins.png

Okay, now let's look at our three CSV libraries.

Commons CSV

The Apache Commons CSV library makes writing and parsing CSV files easier. Here is the code for writing our CSV which makes use of the CSVPrinter class:

file.withWriter { w ->
    new CSVPrinter(w, CSVFormat.DEFAULT).printRecords(data)
}

And here is the code for reading it back in which uses the RFC4180 parser factory singleton:

file.withReader { r ->
    assert RFC4180.parse(r).records*.toList() == data
}

There are other singleton factories for tab-separated values and other common formats and builders to let you set a whole variety of options like escape characters, quote options, whether to use an enum to define header names, and whether to ignore empty lines or nulls.
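As a sketch of such configuration (option names taken from the Commons CSV builder API; consult the javadoc for the full list), a custom format might look like:

```groovy
import static org.apache.commons.csv.CSVFormat.DEFAULT

// Build a variant of the default format with a semicolon delimiter,
// explicit quote character, and empty lines ignored.
def format = DEFAULT.builder()
        .setDelimiter(';' as char)
        .setQuote('"' as char)
        .setIgnoreEmptyLines(true)
        .build()
```
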

For our more elaborate example, we have a tiny bit more work to do. We'll use the builder to tell the parser to skip the header row. We could have chosen to use the tail() trick we used earlier but we decided to use the parser features instead. The code would look like this:

file.withReader { r ->
    def rows = RFC4180.builder()
        .setHeader()
        .setSkipHeaderRecord(true)
        .build()
        .parse(r)
        .records
    assert rows.size() == 21
    assert rows.collect { it.firstname + ' ' + it.lastname }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

You can see here that we have used column names rather than column numbers during our processing. Using column names is another advantage of using the CSV library; it would be quite a lot of work to do that aspect by hand. Also note that, for simplicity, we didn't create the entire result message as in the earlier example. Instead, we just checked the size of all of the relevant sets that we calculated previously.

OpenCSV

The OpenCSV library handles the messy CSV details when needed but doesn't get in the way for simple cases. For our first example, the CSVReader and CSVWriter classes will be suitable. Here is the code for writing our CSV file in the same way as earlier:

file.withWriter { w ->
    new CSVWriter(w).writeAll(data.collect { it as String[] })
}

And here is the code for reading data:

file.withReader { r ->
    assert new CSVReader(r).readAll() == data
}

If we look at the produced file, it is already a little fancier than earlier with double quotes around all data:

FemmesPodiumStage1.png

If we want to do more elaborate processing, the CSVReaderHeaderAware class is aware of the initial header row and its column names. Here is our more elaborate example which processed some of the data further:

file.withReader { r ->
    def rows = []
    def reader = new CSVReaderHeaderAware(r)
    def next
    while ((next = reader.readMap())) rows << next
    assert rows.size() == 21
    assert rows.collect { it.firstname + ' ' + it.lastname }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

You can see here that we have again used column names rather than column numbers during our processing. For simplicity, we followed the same style as in the Commons CSV example and just checked the size of all of the relevant sets that we calculated previously.

OpenCSV also supports transforming CSV files into JavaBean instances. First, we define our target class (or annotate an existing domain class):

class Cyclist {
    @CsvBindByName(column = 'firstname')
    String first
    @CsvBindByName(column = 'lastname')
    String last
    @CsvBindByName
    String team
    @CsvBindByName
    String country
}

For two of the columns, we've indicated that the column name in the CSV file doesn't match our class property. The annotation attribute caters for that scenario.

Then, we can use this code to convert our CSV file into a list of domain objects:

file.withReader { r ->
    List<Cyclist> rows = new CsvToBeanBuilder(r).withType(Cyclist).build().parse()
    assert rows.size() == 21
    assert rows.collect { it.first + ' ' + it.last }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

OpenCSV has many options we didn't show. When writing files you can specify the separator and quote characters, when reading CSV you can specify column positions, types, and validate data.
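For instance, a sketch of writing with a custom separator and no quoting, using the constants and constructor arguments from the OpenCSV CSVWriter API:

```groovy
// Write semicolon-separated values without quoting any fields.
file.withWriter { w ->
    new CSVWriter(w, ';' as char,
            CSVWriter.NO_QUOTE_CHARACTER,
            CSVWriter.DEFAULT_ESCAPE_CHARACTER,
            CSVWriter.DEFAULT_LINE_END).writeAll(data.collect { it as String[] })
}
```
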

Jackson Databind CSV

The Jackson Databind library supports the CSV format (as well as many others).

Writing CSV files from existing data is simple, as shown here for our running example:

file.withWriter { w ->
    new CsvMapper().writeValue(w, data)
}

This writes the data into our temporary file as we saw with previous examples. One minor difference is that, by default, only the values containing spaces will be double quoted but, like the other libraries, there are many configuration options to tweak such settings.
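As one such tweak (a sketch using a generator feature from the Jackson CSV module; check the Jackson docs for the full set), we could ask for every string value to be quoted:

```groovy
// Enable the generator feature that quotes all string values,
// not just those containing spaces or special characters.
def mapper = new CsvMapper().enable(CsvGenerator.Feature.ALWAYS_QUOTE_STRINGS)
file.withWriter { w ->
    mapper.writeValue(w, data)
}
```
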

Reading the data can be achieved using the following code:

def mapper = new CsvMapper().readerForListOf(String).with(CsvParser.Feature.WRAP_AS_ARRAY)
file.withReader { r ->
    assert mapper.readValues(r).readAll() == data
}

Our more elaborate example is done in a similar way:

def schema = CsvSchema.emptySchema().withHeader()
def mapper = new CsvMapper().readerForMapOf(String).with(schema)
file.withReader { r ->
    def rows = mapper.readValues(r).readAll()
    assert rows.size() == 21
    assert rows.collect { it.firstname + ' ' + it.lastname }.toSet().size() == 15
    assert rows*.team.toSet().size() == 10
    assert rows*.country.toSet().size() == 9
}

Here, we tell the library to make use of our header row and store each row of data in a map.

Jackson Databind also supports writing to classes including JavaBeans as well as records. Let's create a record to hold our cyclist information:

@JsonCreator
record Cyclist(
    @JsonProperty('stage') int stage,
    @JsonProperty('firstname') String first,
    @JsonProperty('lastname') String last,
    @JsonProperty('team') String team,
    @JsonProperty('country') String country) {
    String full() { "$first $last" }
}

Note again that we can indicate where our record component names don't match the names used in the CSV file; we simply supply the alternate name when specifying the property. There are other options like indicating that a field is required or giving its column position, but we don't need those options for our example. We've also added a full() helper method to return the full name of the cyclist.

Groovy will use native records on platforms that support it (JDK16+) or emulated records on earlier platforms.

Now we can write our code for record deserialization:

def schema = CsvSchema.emptySchema().withHeader()
def mapper = new CsvMapper().readerFor(Cyclist).with(schema)
file.withReader { r ->
    List<Cyclist> records = mapper.readValues(r).readAll()
    assert records.size() == 21
    assert records*.full().toSet().size() == 15
    assert records*.team.toSet().size() == 10
    assert records*.country.toSet().size() == 9
}

Conclusion

We have looked at writing and reading CSV files to Strings and domain classes and records. We had a look at handling simple cases by hand and also looked at the OpenCSV, Commons CSV and Jackson Databind CSV libraries.

Code for these examples:
https://github.com/paulk-asert/CsvGroovy

Code for other examples of using Groovy for Data Science:
https://github.com/paulk-asert/groovy-data-science