FraBoMat
Concepts for analysing, joining and reading Biblical manuscripts and Genizah fragments (concepts for the automatic reading and identification of manuscripts from Geniza finds and libraries).
31.3.2021, updated 6.2.2022
In September 2020 I did some programming for computer-based manuscript analysis in the field of Hebrew studies. I have been working on manuscripts from the Cairo Genizah(s) for years and have enjoyed the new possibilities offered by the Friedberg Genizah Project (FGP). This project offers analyses of manuscripts that yield, among other things, the number of written lines on a page of a Hebrew fragment, the height of the lines, and the distances between the lines. Sometimes I would like to overwrite the data coming from the FGP with my own. This desire led me, around 2018, to start experimenting with neural network libraries like TensorFlow and PyTorch and with reinforcement learning mechanisms. Because I worked for eight years in the computer business, mostly as a programmer, in the first decade of this millennium, I never lost interest in the field, but preferred to return to theology: since 2009 I have been a religion teacher at schools in Austria and Germany, and I like it. Programming is leisure time ... and fun.
And because my previous work with the FGP has, since 2014, resulted in several finds of new manuscripts of a rhymed Hebrew paraphrase of a Hebrew original text of Ben Sira, I am motivated to continue this mixture of manuscript studies and computer programming. Understanding how machine learning works has helped me a lot in using the FGP system and the query possibilities its databases offer. Now I would like to program my own system which can be used on any image of Biblical manuscripts or Genizah fragments.
Crazy? Yes, but it was also crazy to find Hebrew manuscripts the exegetical community had slept on for decades ;-)
Here I can show some first results of my own programming.
The following image shows the classic approach using cross-correlation, and it was very effective from the beginning. This heuristic allows fast detection of almost 98% of the letters. Using the neural network heuristic for Hebrew letters (see below), most of the wrong identifications can be discarded.
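A minimal sketch of such a cross-correlation search, using scikit-image's match_template on a greyscale page and a cropped letter template (the file names and the threshold are only placeholders, not my actual settings), could look like this:

# Sketch of cross-correlation template matching for letter detection.
# "page.png" and "aleph_template.png" are placeholder file names.
import numpy as np
from skimage import io
from skimage.feature import match_template

page = io.imread("page.png", as_gray=True)
template = io.imread("aleph_template.png", as_gray=True)

# match_template computes the normalized cross-correlation;
# values close to 1.0 mean a near-perfect match.
result = match_template(page, template)

# keep every position whose correlation exceeds a threshold
threshold = 0.8
ys, xs = np.where(result > threshold)
for y, x in zip(ys, xs):
    print(f"candidate letter at x={x}, y={y}, score={result[y, x]:.3f}")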
The writing area on the manuscript can be found by using convex hull mechanisms after cleaning the manuscript of the darkened areas with some filters like the Otsu filter and similar mechanisms:
When looking at this convex hull analysis of the page of an Esther scroll fragment you might be surprised, as I was, by the clearly visible movement of the edges on the left and right. I had seen that the writer had a tendency to write further to the left at the bottom, from line to line, but I had not seen this on the right side before the marking by the hull mechanism. This small drift in the writing practice can help identify writers, together with the height of the letters, the spacing of the letters, the line spacing and the letter chart of any writer (like a fingerprint of the writer).
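A minimal sketch of the Otsu-plus-convex-hull step with scikit-image (the file name and the speck size are placeholders) could look like this:

# Sketch: find the writing area via Otsu thresholding and a convex hull.
from skimage import io
from skimage.filters import threshold_otsu
from skimage.morphology import convex_hull_image, remove_small_objects

page = io.imread("fragment.png", as_gray=True)

# Otsu gives a global threshold separating ink from background;
# the ink is darker, so we keep the pixels below the threshold.
ink = page < threshold_otsu(page)

# drop small specks (stains, noise) before building the hull
ink = remove_small_objects(ink, min_size=20)

# the convex hull of all remaining ink pixels marks the writing area
writing_area = convex_hull_image(ink)
io.imsave("writing_area_mask.png", (writing_area * 255).astype("uint8"))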
Some other mechanisms, like mathematical morphology methods applied to whole pages, also help to find the writing area.
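A rough sketch of this morphological idea, with guessed structuring element sizes, might look like the following: the letters are merged into solid text blocks whose bounding box marks the writing area.

# Sketch: closing bridges the gaps between letters and words,
# opening removes thin remnants that are not text.
import numpy as np
from skimage import io
from skimage.filters import threshold_otsu
from skimage.morphology import binary_closing, binary_opening

page = io.imread("fragment.png", as_gray=True)
ink = page < threshold_otsu(page)

blocks = binary_closing(ink, np.ones((5, 25), dtype=bool))
blocks = binary_opening(blocks, np.ones((3, 3), dtype=bool))

# bounding box of everything that survived the morphology
rows = np.any(blocks, axis=1)
cols = np.any(blocks, axis=0)
top, bottom = np.where(rows)[0][[0, -1]]
left, right = np.where(cols)[0][[0, -1]]
print(f"writing area: rows {top}-{bottom}, columns {left}-{right}")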
One technique I thought about and developed on my own, but then found already programmed in skimage, is the polar transform of letters and the analysis of rotation based on it. You can see a Hebrew letter Aleph in the original (scaled up from a small version in GIMP to make it blurry) and a polar transform of the original. On the right you see the letter rotated by a known number of degrees and its polar transform. It is mathematically clear that the result is just a shift by the amount of the rotation, as you can see by comparing the polar-transformed images. The program can identify the rotation with a precision of 1-3°.
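A small sketch of this rotation estimation with skimage's warp_polar (the output shape and the plain dot-product scoring are my simplifications, not necessarily what the program does) could look like this:

# Sketch of rotation estimation with a polar transform, assuming two
# 2-D greyscale arrays `letter` and `letter_rotated` of the same size.
import numpy as np
from skimage.transform import warp_polar, rotate

def estimate_rotation(letter, letter_rotated):
    # in polar coordinates a rotation of the image becomes a plain
    # shift along the angle axis (axis 0 of warp_polar's output)
    p1 = warp_polar(letter, output_shape=(360, 64))
    p2 = warp_polar(letter_rotated, output_shape=(360, 64))
    # reduce to angular profiles and find the circular shift with the
    # highest correlation; that shift (in rows) is the angle in degrees
    a1 = p1.sum(axis=1)
    a2 = p2.sum(axis=1)
    scores = [np.dot(np.roll(a2, -shift), a1) for shift in range(360)]
    return int(np.argmax(scores))

# synthetic test: rotate a letter image by a known angle, e.g.
# aleph = skimage.io.imread("aleph.png", as_gray=True)
# print(estimate_rotation(aleph, rotate(aleph, 25)))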
Another technique I use is the DCT (discrete cosine transform), which analyses the frequencies of every line and column of the image and results in a frequency spectrum of the image. By masking out some (high) frequencies in both directions and reconstructing the image, some disturbing small artifacts in the images can be reduced. The result above shows the reconstructed image after using a high-pass filter and another technique called mathematical morphology (using dilation, erosion, and combining both as closing and opening). This helps in determining written areas (I have exaggerated here to make the effect more clearly visible).
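A minimal sketch of the DCT masking with scipy.fft on a greyscale array (the number of kept coefficients is only a guess) could look like this:

# Sketch of the DCT idea: transform, mask part of the spectrum,
# transform back. `page` is a 2-D float array.
import numpy as np
from scipy.fft import dctn, idctn

def dct_filter(page, keep_low=64):
    # 2-D DCT: the low frequencies sit in the top-left corner
    spectrum = dctn(page, norm="ortho")
    mask = np.zeros_like(spectrum)
    # keeping only the low frequencies smooths away small artifacts;
    # inverting the mask instead would give the high-pass variant
    mask[:keep_low, :keep_low] = 1.0
    return idctn(spectrum * mask, norm="ortho")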
The next step in my computer-based manuscript study was producing a dataset of Hebrew letters. The easiest way was to make a synthetic dataset from fonts: what you can write you can also read. What a computer can write it can also read, once trained on it.
The basic idea upon which this undertaking rests is that nowadays any computer understands fonts, because it can write them from installed fonts based on the Unicode standard. א (Aleph) has its own Unicode number and can be written in most modern Unicode fonts. When I started my internet activities in 1994, more than 25 years ago, this was not standard and had only just started to spread on the internet. Now it is standard, and young people do not even think about the problems of the dinosaurs. Therefore I installed all the Hebrew fonts I could find and wrote programs to convert all Hebrew glyphs into a database.
One program is written in Python and works, but I did not actually use it, because I preferred my old visually oriented programming environment LiveCode. The result looks like this:
The result is a database of Hebrew glyphs in a 32x32 pixel format (= 1024 bytes with grey values 0 to 255) together with the correct label, consisting of the character number and the name of the font.
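The LiveCode tool itself is not shown here; a hypothetical Python/Pillow sketch of the same idea, with a placeholder font list and guessed layout values, could look like this:

# Sketch: render each Hebrew letter of a font into a 32x32 greyscale
# array plus a label (character number, font name).
import numpy as np
from PIL import Image, ImageDraw, ImageFont

HEBREW = [chr(c) for c in range(0x05D0, 0x05EB)]  # Aleph to Tav, incl. final forms

def render_glyph(char, font_path, size=32):
    font = ImageFont.truetype(font_path, 48)
    img = Image.new("L", (64, 64), color=255)                  # white canvas
    ImageDraw.Draw(img).text((8, 4), char, font=font, fill=0)  # black glyph
    img = img.resize((size, size))                             # 32x32 = 1024 bytes
    return np.asarray(img, dtype=np.uint8)

dataset, labels = [], []
for font_path in ["SBLHebrew.ttf"]:                  # placeholder font list
    for idx, char in enumerate(HEBREW):
        dataset.append(render_glyph(char, font_path))
        labels.append((idx, font_path))              # character number + font name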
In a first attempt the dataset consists of about 13,000 glyphs. I programmed a technique to mix around 20 randomly chosen glyphs, which results in glyphs similar to scanned glyphs from manuscripts. But mathematically it might be only a small gain to offer the neural networks thousands of these generated and mixed glyphs: the parameters adapted to reading all the built-in glyphs should handle the mixed glyphs as well.
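A simple sketch of this mixing idea, assuming the dataset and labels from the sketch above and interpreting the mix as an average over examples of the same character, could look like this:

# Sketch: average about 20 randomly chosen examples of the same
# character to get a softer, more "scanned-looking" glyph.
import numpy as np

def mix_glyphs(glyphs, rng, n=20):
    # glyphs: list of 32x32 uint8 arrays showing the same character
    idx = rng.choice(len(glyphs), size=min(n, len(glyphs)), replace=False)
    stack = np.stack([glyphs[i].astype(np.float32) for i in idx])
    return stack.mean(axis=0).astype(np.uint8)

rng = np.random.default_rng(0)
# aleph_variants = [g for g, (label, _) in zip(dataset, labels) if label == 0]
# blurred_aleph = mix_glyphs(aleph_variants, rng)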
For the neural network I avoided CNNs (convolutional networks) but experimented with small sequential nets in Python, TensorFlow and Keras, and it worked on my 10-year-old Windows notebook with 4 GB RAM. I had installed Conda on this Win7 notebook (without a graphics card usable by CUDA, but with one of the first Core i5 processors) and PyCharm. The first attempt resulted in 95% accuracy after a run of 30 minutes. The following model, with one more hidden layer (the 512-node layer at level 2 instead of going directly to 256), resulted in 97% accuracy and ran for about 40 minutes for the 300 epochs (300 passes through the whole dataset, adapting the parameters for reading the glyphs). That means only about 30 errors in the correct identification of all glyphs out of a set of 1000 different pixel representations. The graphic above shows 20x50 = 1000 glyphs of Hebrew characters in different fonts, and statistically 30 of them are misinterpreted. That is near perfect and enough for me ;-).
model = tf.keras.models.Sequential([
    # input: one 32x32 glyph flattened to 1024 grey values
    tf.keras.layers.Dense(512, input_shape=(1024,), activation=tf.nn.relu),
    #tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(256, activation=tf.nn.relu),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    #tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64, activation=tf.nn.relu),
    #tf.keras.layers.Dropout(0.1),
    # output: one softmax probability per character class
    tf.keras.layers.Dense(28, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Epoch 300/300
...
9999/9999 [==============================] - 4s 436us/step - loss: 0.0496 - acc: 0.9830
2765/2765 [==============================] - 0s 112us/step
test loss, test acc: [0.07396609171457293, 0.9775768535262206]
A second concept, based on a neural network that also includes normalized image moments and Hu moments, reached 99% accuracy on the test set (training on 9,999 items of the 12,764-item dataset; the test set consists of the remaining 2,765 items).
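A small sketch of how such moment features can be computed with skimage (the log scaling and the concatenation are my own simplifications, not necessarily the exact features used here) could look like this:

# Sketch: normalized central moments and the seven Hu moments of a
# 32x32 glyph array, usable as extra input features for the network.
import numpy as np
from skimage.measure import moments_central, moments_normalized, moments_hu

def moment_features(glyph):
    glyph = glyph.astype(np.float64)
    mu = moments_central(glyph)       # central moments (translation invariant)
    nu = moments_normalized(mu)       # normalized moments (NaN for order < 2)
    hu = moments_hu(nu)               # the seven Hu moments (rotation invariant)
    # log scaling keeps the very small Hu values in a usable numeric range
    hu = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
    return np.concatenate([nu[np.isfinite(nu)], hu])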
A new approach was a variational autoencoder applied to the FraBoMat Hebrew letter dataset.
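A minimal sketch of such a variational autoencoder in tf.keras for the 1024-pixel glyph vectors (the latent size and layer widths are guesses, not the topologies a) to c) below) could look like this:

# Sketch of a small VAE: encoder -> latent distribution -> decoder.
import tensorflow as tf

LATENT = 16  # guessed latent dimension

class Sampling(tf.keras.layers.Layer):
    # reparameterization trick: z = mean + exp(log_var / 2) * epsilon
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# encoder: 1024 grey values -> latent mean and log-variance
enc_in = tf.keras.Input(shape=(1024,))
h = tf.keras.layers.Dense(256, activation="relu")(enc_in)
z_mean = tf.keras.layers.Dense(LATENT)(h)
z_log_var = tf.keras.layers.Dense(LATENT)(h)
z = Sampling()([z_mean, z_log_var])
encoder = tf.keras.Model(enc_in, [z_mean, z_log_var, z])

# decoder: latent vector -> reconstructed 1024-pixel glyph
dec_in = tf.keras.Input(shape=(LATENT,))
h2 = tf.keras.layers.Dense(256, activation="relu")(dec_in)
dec_out = tf.keras.layers.Dense(1024, activation="sigmoid")(h2)
decoder = tf.keras.Model(dec_in, dec_out)

class VAE(tf.keras.Model):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def call(self, x):
        z_mean, z_log_var, z = self.encoder(x)
        recon = self.decoder(z)
        # reconstruction error plus KL divergence of the latent distribution
        rec_loss = tf.reduce_mean(tf.reduce_sum(tf.square(x - recon), axis=1))
        kl_loss = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
        self.add_loss(rec_loss + kl_loss)
        return recon

# vae = VAE(encoder, decoder)
# vae.compile(optimizer="adam")
# vae.fit(x_train, epochs=50, batch_size=128)   # x_train: (N, 1024), scaled to 0..1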
The results with different network topologies:
a)
b)
c)