samadhiweb

smalltalk programming for the web



SQLite Boolean

20 May 2018

From its documentation, SQLite does not have a separate Boolean storage class. Instead, Boolean values are stored as integers 0 (false) and 1 (true).

I've added Boolean handling to UDBC-SQLite. When writing to an SQLite database from Pharo, true is written as integer 1 and false as integer 0. SQLite uses dynamic typing, and any column in an SQLite database, except an INTEGER PRIMARY KEY column, may be used to store a value of any type, irrespective of the column's type declaration. As such, when writing Boolean values to a database, UDBC-SQLite does not check the database's type declaration.

When reading an SQLite database, UDBC-SQLite does check a column's type declaration: If a column is Boolean, UDBC-SQLite reads 1 as true, 0 as false, NULL as nil, and any other integer values raises an exception. I've encountered real world data where the string "t" means true and "f" means false for a Boolean column, so UDBC-SQLite handles these cases too.

Glorp has been similarly updated. Loading GlorpSQLite, from my development fork for now, installs both UDBC-SQLite and Glorp:

Metacello new
  repository: 'github://PierceNg/glorp-sqlite3';
  baseline: 'GlorpSQLite';
  load.

All Glorp unit tests should pass. Tested on Linux using fresh 32- and 64-bit 60540 images.

OpenSSL RIPEMD160

18 March 2018

I've just added RIPEMD160 to the EVP interface in OpenSSL-Pharo. This post serves as a HOWTO.

OpenSSL's C interface defines RIPEMD160 thusly:

const EVP_MD *EVP_ripemd160(void);

Create LcLibCrypto>>apiEvpRIPEMD160 for it:

apiEvpRIPEMD160
  ^ self ffiCall: #(EVP_MD* EVP_ripemd160 ())
    module: self library

Next, create LcEvpRIPEMD160 as a subclass of LcEvpMessageDigest:

LcEvpMessageDigest subclass: #LcEvpRIPEMD160
  instanceVariableNames: ''
  classVariableNames: ''
  package: 'OpenSSL-EVP'

LcEvpRIPEMD160>>initialize
  super initialize.
  handle := LcLibCrypto current apiEvpRIPEMD160.
  self errorIfNull: handle

Add class-side accessors:

LcEvpRIPEMD160 class>>blocksize
  ^ 64

LcEvpRIPEMD160 class>>hashsize
  ^ 20

And that's it! Using the test vectors from the RIPEMD160 home page and RFC 2286, the unit tests verify that we can now use RIPEMD160 for hashing and HMAC from within Pharo:

LcEvpRIPEMD160Test>>testDigest1
  | msg result |

  msg := ''.
  result := ByteArray readHexFrom: '9c1185a5c5e9fc54612808977ee8f548b2258d31' readStream.
  self assert: (md hashMessage: msg) equals: result

LcEvpRIPEMD160Test>>testHMAC1
  | msg result expectedResult |

  msg := 'Hi There'.
  key := ByteArray readHexFrom: '0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b' readStream.
  expectedResult := ByteArray readHexFrom: '24cb4bd67d20fc1a5d2ed7732dcc39377f0a5668' readStream.
  result := (HMAC on: LcEvpRIPEMD160)
    key: key;
    digestMessage: msg asByteArray.
  self assert: result equals: expectedResult

SQLite Loadable Extensions

4 March 2018

SQLite has the ability to load extensions at run-time. I've now implemented this functionality in UDBC-SQLite.

An SQLite extension is built as a .so/dylib/dll shared library file. Let's use SQLite's rot13 extension as our example. The source file rot13.c is located in the SQLite source code's ext/misc directory. To build the rot13 extension, also download the amalgamation. Unzip the amalgamation and copy rot13.c into its directory. Build the extension:

% gcc -fPIC -shared -o rot13.so -I. rot13.c

Verify that the extension works:

% sqlite3
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> select rot13('qwerty');
Error: no such function: rot13
sqlite> .load "./rot13.so"
sqlite> select rot13('qwerty');
djregl
sqlite> select rot13('djregl');
qwerty
sqlite> 

For use with Pharo, copy rot13.so into the Pharo VM directory where all the other .so files are.

Next steps are done in Pharo. For the purpose of this blog post, I downloaded a fresh Pharo 60536 64-bit image. Start the image and install GlorpSQLite from the Catalog browser, which installs the latest UDBC-SQLite. (This also installs Glorp, of course. If you run Glorp's unit tests from Test Runner you should get all 890 tests passed.)

In a playground, run this snippet and Transcript should show, first, the text "no such function: rot13" and then the rot13 outputs.

| db rs r1 r2 |
Transcript clear.
db := UDBCSQLite3Connection openOn: ':memory:'.
[   [ db execute: 'select rot13(''qwerty'')' ]
      on: UDBCSQLite3Error
      do: [ ex: | Transcript show: ex messageText; cr ].
    db enableExtensions;          "<== New stuff."
      loadExtension: 'rot13.so'.  "<== New stuff."
    rs := db execute: 'select rot13(''qwerty'') as r1'.
    r1 := rs next at: 'r1'.
    Transcript show: r1; cr. 
    rs := db execute: 'select rot13(?) as r2' with: (Array with: r1).
    r2 := rs next at: 'r2'.
    Transcript show: r2; cr.
] ensure: [ db close ]

Note the messages #enableExtensions and #loadExtension:. For security reasons, extension loading is disabled by default.

Tested on Linux latest Pharo 64-bit 60536.

Hello, digits!

4 February 2018

The Hello World intro to machine learning is usually by way of the Iris flower image classification or the MNIST handwritten digit recognition. In this post, I describe training a neural network in Pharo to perform handwritten digit recognition. Instead of the MNIST dataset, I'll use the smaller UCI dataset. According to the description, this dataset consists of two files:

  • optdigits.tra, 3823 records
  • optdigits.tes, 1797 records
  • Each record consists of 64 inputs and 1 class attributes.
  • Input attributes are integers in the range 0..16.
  • The class attribute is the class code 0..9, i.e., it denotes the digit that the 64 input attributes encode.

The files are in CSV format. Let's use the excellent NeoCSV package to read the data:

| rb training testing |

rb := [ :fn |
  | r |
  r := NeoCSVReader on: fn asFileReference readStream.
  1 to: 64 do: [ :i |
    r addFieldConverter: [ :string | string asInteger / 16.0 ]].
  "The converter normalizes the 64 input integers into floats between 0 and 1."
  r addIntegerField.
  r upToEnd ].

training := rb value: '/tmp/optdigits.tra'.
testing := rb value: '/tmp/optdigits.tes'.

Next, install MLNeuralNetwork by Oleksandr Zaytsev:

Metacello new 
  repository: 'http://smalltalkhub.com/mc/Oleks/NeuralNetwork/main';
  configuration: 'MLNeuralNetwork';
  version: #development;
  load.

MLNeuralNetwork operates on MLDataset instances, so modify the CSV reader accordingly:

| ohv rb dsprep training testing |

"Create and cache the 10 one-hot vectors."
ohv := IdentityDictionary new.
0 to: 9 do: [ :i |
  ohv at: i put: (MLMnistReader onehot: i) ].

"No change."
rb := [ :fn |
  | r |
  r := NeoCSVReader on: fn asFileReference readStream.
  1 to: 64 do: [ :i |
    r addFieldConverter: [ :string | string asInteger / 16.0 ]].
  "The converter normalizes the 64 input integers into floats between 0 and 1."
  r addIntegerField.
  r upToEnd ].

"Turn the output of 'rb' into a MLDataset."
dsprep := [ :aa |
  MLDataset new 
    input: (aa collect: [ :ea | ea allButLast asPMVector ]) 
    output: (aa collect: [ :ea | ohv at: ea last ]) ].

training := dsprep value: (rb value: '/tmp/optdigits.tra').
testing := dsprep value: (rb value: '/tmp/optdigits.tes').

Note MLMnistReader>>onehot: which creates a 'one-hot' vector for each digit. One-hot vectors make machine learning more effective. They are easy to understand "pictorially":

  • For the digit 0, [1 0 0 0 0 0 0 0 0 0].
  • For the digit 1, [0 1 0 0 0 0 0 0 0 0].
  • For the digit 3, [0 0 0 1 0 0 0 0 0 0].
  • ...
  • For the digit 9, [0 0 0 0 0 0 0 0 0 1].

Since there are over 5,000 records, we precompute the one-hot vectors and reuse them, instead of creating one vector per record.

Now create a 3-layer neural network of 64 input, 96 hidden, and 10 output neurons, set it to learn from the training data for 500 epochs, and test it:

| net result | 
net := MLNeuralNetwork new initialize: #(64 96 10).
net costFunction: (MLSoftmaxCrossEntropy  new).
net outputLayer activationFunction: (MLSoftmax new).
net learningRate: 0.5.
net learn: training epochs: 500.

result := Array new: testing size.
testing input doWithIndex: [ :ea :i |
  | ret |
  ret := net value: ea.
  result at: i
    put: (Array 
            "'i' identifies the record in the test set."
            with: i
            "The neuron that is 'most activated' serves as the prediction."
            with: (ret indexOf: (ret max: [ :x | x ])) - 1 
            "The actual digit from the test set, decoded from its one-hot vector."
            with: ((testing output at: i) findFirst: [ :x | x = 1 ]) - 1) ].
result inspect.

From the inspector, we can see that the network got records 3, 6 and 20 wrong:

In the inspector's code pane, the following snippet shows that the network's accuracy is about 92%.

1 - (self select: [ :x | (x second = x third) not ]) size / 1797.0)

Not bad for a first attempt. However, the data set's source states that the K-Nearest Neighbours algorithm achieved up to 98% accuracy on the testing set, so there's plenty of room for improvement for the network.

Here's a screenshot showing some of the 8x8 digits with their predicted and actual values. I don't know about you, but the top right "digit" looks more like a smudge to me than any number.

Code to generate the digit images follows:

| mb | 
mb := [ :array :row :predicted :actual |
  | b lb |
  b := RTMondrian new.
  b shape box
    size: 1;
    color: Color gray.
  b nodes: array.
  b normalizer normalizeColorAsGray: [ :x | x * 16 ].
  b layout gridWithPerRow: 8;
    gapSize: 0.
  b build.
  lb := RTLegendBuilder new.
  lb view: b view.
  lb textSize: 9;
    addText: 'Row ', row asString, 
      ': Predicted = ', predicted asString,
      '. Actual = ', actual asString,
      '.'.
  lb build.
  b ].

rand := Random new.
6 timesRepeat: [
  | x |
  x := rand nextInt: 1797.
  (mb value: (testing input at: x)
    value: x
    value: (result at: x) second
    value: (result at: x) third)
  inspect ]

Let's also look at the failed predictions:

| failed |
failed := (result select: [ :x | (x second = x third) not ])
  collect: [ :x | x first ].
6 timesRepeat: [
  | x |
  x := rand nextInt: 1797.
  [ failed includes: x ] whileFalse: [ x := rand nextInt: 1797 ].
  (mb value: (testing input at: x)
    value: x
    value: (result at: x) second
    value: (result at: x) third)
  inspect ]

Putting the code altogether:

| ohv rb dsprep training testing net result mb rand failed |

ohv := IdentityDictionary new.
0 to: 9 do: [ :i |
  ohv at: i put: (MLMnistReader onehot: i) ].

rb := [ :fn |
  | r |
  r := NeoCSVReader on: fn asFileReference readStream.
  1 to: 64 do: [ :i |
    r addFieldConverter: [ :string | string asInteger / 16.0 ]].
  r addIntegerField.
  r upToEnd ].

dsprep := [ :aa |
  MLDataset new 
    input: (aa collect: [ :ea | ea allButLast asPMVector ]) 
    output: (aa collect: [ :ea | ohv at: ea last ]) ].

training := dsprep value: (rb value: '/tmp/optdigits.tra').
testing := dsprep value: (rb value: '/tmp/optdigits.tes').

net := MLNeuralNetwork new initialize: #(64 96 10).
net costFunction: (MLSoftmaxCrossEntropy  new).
net outputLayer activationFunction: (MLSoftmax new).
net learningRate: 0.5.
net learn: training epochs: 500.

result := Array new: testing size.
testing input doWithIndex: [ :ea :i |
  | ret |
  ret := net value: ea.
  result at: i 
    put: (Array 
            "'i' identifies the record in the test set."
            with: i
            "The neuron that is 'most activated' serves as the prediction."
            with: (ret indexOf: (ret max: [ :x | x ])) - 1 
            "The actual digit from the test set, decoded from its one-hot vector."
            with: ((testing output at: i) findFirst: [ :x | x = 1 ]) - 1) ].
result inspect.

mb := [ :array :row :predicted :actual |
  | b lb |
  b := RTMondrian new.
  b shape box
    size: 1;
    color: Color gray.
  b nodes: array.
  b normalizer normalizeColorAsGray: [ :x | x * 16 ].
  b layout gridWithPerRow: 8;
    gapSize: 0.
  b build.
  lb := RTLegendBuilder new.
  lb view: b view.
  lb textSize: 8;
    addText: 'Row ', row asString, 
      ': Predicted = ', predicted asString,
      '. Actual = ', actual asString,
      '.'.
  lb build.
  b ].

rand := Random new.
6 timesRepeat: [ 
  | x |
  x := rand nextInt: 1797.
  (mb value: (testing input at: x)
    value: x
    value: (result at: x) second
    value: (result at: x) third)
  inspect ].

failed := (result select: [ :x | (x second = x third) not ]) collect: [ :x | x first ].
6 timesRepeat: [
  | x |
  x := rand nextInt: 1797.
  [ failed includes: x ] whileFalse: [ x := rand nextInt: 1797 ].
  (mb value: (testing input at: x)
    value: x
    value: (result at: x) second
    value: (result at: x) third)
  inspect ]