Preparatory work for Hebrew manuscript studies using neural networks
Posted on Saturday, 23 January 2021 at 9:38 by Franz Böhmisch
In September 2020 I did some programming for computer-based manuscript analysis in the field of Hebrew studies. I have been working on manuscripts from the Cairo Genizah(s) for years and have enjoyed the new possibilities offered by the Friedberg Genizah Project (FGP). This project offers analysis of manuscripts, yielding the number of written lines on a page of a Hebrew fragment, the height of the lines, the distances between the lines, and so on.
Sometimes I would like to override the data coming from the FGP with my own. This desire led me, around 2018, to start experimenting with neural network libraries such as TensorFlow and PyTorch and with reinforcement learning techniques. Because I worked for eight years in the computer business, mostly as a programmer, in the first decade of this millennium, I never lost interest in the field, but I preferred to work again in the area of theology: since 2009 I have been a religion teacher at schools in Austria and Germany, and I like it. Programming is leisure time ... and fun. And because my previous work with the FGP has led, since 2014, to several finds of new manuscripts of a rhymed Hebrew paraphrase of a Hebrew original text of Ben Sira, I am motivated to continue this mixture of manuscript studies and computer programming. Understanding how machine learning works has helped me a lot in using the FGP system and the query possibilities its databases offer. Now I would like to program my own tools.
Here I can show the first results of my own programming.
First I produced my own synthetic Hebrew dataset. This is necessary for supervised learning: you need the pixel representation of a glyph (= the visual form of a character) together with its label in a database. A neural network can then, in a machine learning process, go through thousands of items in the labelled database and optimize its parameters to identify characters from pixel glyphs. To avoid analysing hundreds of manuscripts by hand (as many current initiatives, even at the academic level, do with thousands of hours of volunteer work identifying glyphs on manuscripts), I had to produce a synthetic dataset of very different Hebrew scripts with correct labels in a database.
The basic idea upon which this undertaking rests is that nowadays any computer "understands" fonts and scripts, because it can render them from installed fonts based on the Unicode standard. א (Aleph) has its own Unicode code point and can be displayed in most modern Unicode fonts. When I started my internet activities in 1995, more than 25 years ago, this was not yet standard and was just beginning to spread on the internet. Now it is standard, and young people do not even think about the problems of the dinosaurs. Therefore I installed all the Hebrew fonts I could find and wrote programs to convert all their Hebrew glyphs into a database.
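This can be checked directly in Python: Unicode assigns Aleph the code point U+05D0, and the Hebrew letters run from there up to U+05EA (Tav).

```python
# Aleph (א) has the Unicode code point U+05D0; the Hebrew letters
# Aleph through Tav occupy the range U+05D0 to U+05EA.
print(hex(ord("א")))   # code point of Aleph -> 0x5d0
print(chr(0x05EA))     # last letter, Tav -> ת
```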
One program is written in Python and works, but in the end I did not use it because I preferred my old visually oriented programming environment LiveCode. The result looks like this:
The result is a database of Hebrew glyphs in the format 32x32 pixels (= 1024 bytes with grey values 0 to 255), together with the correct label: the character number and the name of the font.
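The extraction idea can be sketched in Python (my actual pipeline runs in LiveCode, so this is only an illustration; it requires Pillow and NumPy, and the font path you pass in is up to you - any installed Hebrew TrueType font will do):

```python
# Sketch: render Hebrew glyphs from an installed Unicode font into
# 32x32 grayscale arrays (1024 bytes each) with labels.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

# Hebrew letters Aleph..Tav, including the five final forms: 27 characters
HEBREW_LETTERS = [chr(c) for c in range(0x05D0, 0x05EB)]

def render_glyph(char, font_path, size=32):
    """Render one character as a size x size grayscale array (0 = black ink)."""
    font = ImageFont.truetype(font_path, size=size - 4)
    img = Image.new("L", (size, size), color=255)   # white background
    ImageDraw.Draw(img).text((2, 0), char, fill=0, font=font)
    return np.asarray(img, dtype=np.uint8)

def build_records(font_path):
    """One labelled record per letter: (1024-byte pixel row, label, font name)."""
    return [(render_glyph(ch, font_path).reshape(-1), idx, font_path)
            for idx, ch in enumerate(HEBREW_LETTERS)]
```

Looping `build_records` over every installed Hebrew font yields exactly the kind of labelled glyph database described above.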
In a first attempt the dataset consists of 13,000 glyphs. I programmed a technique to mix around 20 randomly chosen glyphs, which results in glyphs similar to scanned glyphs from manuscripts. But mathematically it might be only a small step forward to offer the neural network thousands of these generated and mixed glyphs: the parameters adapted by reading all the built-in glyphs should do the same for the mixed glyphs as well.
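The mixing step can be sketched in NumPy; my actual routine is in LiveCode, so the random convex weighting below is just my assumption about one reasonable way to blend glyphs:

```python
import numpy as np

def mix_glyphs(glyphs, rng):
    """Blend several 32x32 glyph images of the same character into one
    weighted average, which looks softer and more like a scanned glyph."""
    weights = rng.dirichlet(np.ones(len(glyphs)))   # random weights summing to 1
    mixed = np.tensordot(weights, np.stack(glyphs).astype(float), axes=1)
    return mixed.clip(0, 255).astype(np.uint8)

# Demo with 20 random stand-in arrays (real input: the same letter in 20 fonts)
rng = np.random.default_rng(0)
variants = [rng.integers(0, 256, size=(32, 32), dtype=np.uint8) for _ in range(20)]
blended = mix_glyphs(variants, rng)
print(blended.shape)   # -> (32, 32)
```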
For the neural network I avoided CNNs (convolutions) and experimented instead with small sequential nets in Python, TensorFlow and Keras - and it worked on my ten-year-old Windows notebook with 4 GB RAM. I had installed Conda on this Win7 notebook (without a graphics card usable by CUDA, but with one of the first Core i5 processors) and PyCharm. The first attempt resulted in 95% accuracy after a run of 30 minutes. The following model, with one hidden layer more (the 512-node layer at level 2 instead of going straight to 256), resulted in 97% accuracy and ran about 40 minutes for the 300 epochs (300 passes through the whole dataset, adapting the parameters for reading the glyphs). That means only 30 errors in the correct identification of glyphs out of a set of 1000 different pixel representations. The graphic above shows 20x50 = 1000 glyphs of Hebrew characters in different fonts, and statistically 30 of them are misinterpreted. That is near perfect and enough for me ;-).
import tensorflow as tf

# Small fully connected classifier for 32x32 = 1024-pixel glyph images
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(512, input_shape=(1024,), activation=tf.nn.relu),
    #tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(256, activation=tf.nn.relu),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    #tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64, activation=tf.nn.relu),
    #tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(28, activation=tf.nn.softmax)  # 28 character classes
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
Epoch 300/300
  64/9999 [..............................] - ETA: 4s - loss: 0.0424 - acc: 0.9844
 192/9999 [..............................] - ETA: 4s - loss: 0.0363 - acc: 0.9792
 320/9999 [..............................] - ETA: 4s - loss: 0.0278 - acc: 0.9844
 448/9999 [>.............................] - ETA: 4s - loss: 0.0316 - acc: 0.9821
 576/9999 [>.............................] - ETA: 4s - loss: 0.0308 - acc: 0.9826
 704/9999 [=>............................] - ETA: 4s - loss: 0.0254 - acc: 0.9858
 832/9999 [=>............................] - ETA: 4s - loss: 0.0282 - acc: 0.9844
 960/9999 [=>............................] - ETA: 3s - loss: 0.0260 - acc: 0.9865
1088/9999 [==>...........................] - ETA: 3s - loss: 0.0270 - acc: 0.9871
1216/9999 [==>...........................] - ETA: 3s - loss: 0.0251 - acc: 0.9885
1344/9999 [===>..........................] - ETA: 3s - loss: 0.0280 - acc: 0.9881
1472/9999 [===>..........................] - ETA: 3s - loss: 0.0267 - acc: 0.9891
1600/9999 [===>..........................] - ETA: 3s - loss: 0.0268 - acc: 0.9888
...
8768/9999 [=========================>....] - ETA: 0s - loss: 0.0452 - acc: 0.9841
8896/9999 [=========================>....] - ETA: 0s - loss: 0.0459 - acc: 0.9838
9024/9999 [==========================>...] - ETA: 0s - loss: 0.0470 - acc: 0.9836
9152/9999 [==========================>...] - ETA: 0s - loss: 0.0469 - acc: 0.9836
9280/9999 [==========================>...] - ETA: 0s - loss: 0.0484 - acc: 0.9834
9408/9999 [===========================>..] - ETA: 0s - loss: 0.0493 - acc: 0.9832
9536/9999 [===========================>..] - ETA: 0s - loss: 0.0490 - acc: 0.9833
9664/9999 [===========================>..] - ETA: 0s - loss: 0.0485 - acc: 0.9835
9792/9999 [============================>.] - ETA: 0s - loss: 0.0490 - acc: 0.9834
9920/9999 [============================>.] - ETA: 0s - loss: 0.0497 - acc: 0.9831
9999/9999 [==============================] - 4s 436us/step - loss: 0.0496 - acc: 0.9830
  64/2765 [..............................] - ETA: 1s
 576/2765 [=====>........................] - ETA: 0s
1088/2765 [==========>...................] - ETA: 0s
1600/2765 [================>.............] - ETA: 0s
2112/2765 [=====================>........] - ETA: 0s
2624/2765 [===========================>..] - ETA: 0s
2765/2765 [==============================] - 0s 112us/step
test loss, test acc: [0.07396609171457293, 0.9775768535262206]
Generate predictions for 3 samples
predictions shape: (3, 28)
[[0.0000000e+00 0.0000000e+00 1.4492371e-36 3.6740185e-35 0.0000000e+00
0.0000000e+00 0.0000000e+00 1.7022430e-25 0.0000000e+00 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00 4.0820628e-27 4.2639808e-37 0.0000000e+00 1.8963427e-22
0.0000000e+00 0.0000000e+00 1.0000000e+00 2.5980322e-29 4.8806806e-16
0.0000000e+00 0.0000000e+00 5.7977794e-26]
[0.0000000e+00 5.0976535e-15 2.1335409e-28 3.8345570e-25 1.0014407e-27
0.0000000e+00 1.0041099e-36 1.5026116e-28 0.0000000e+00 1.4207416e-19
1.0817977e-31 0.0000000e+00 3.8180895e-34 3.0210756e-24 1.7604077e-35
4.1450524e-33 0.0000000e+00 1.1173805e-33 2.5169558e-37 1.2636250e-10
1.9431216e-29 2.0433610e-22 6.8561061e-24 1.0000000e+00 0.0000000e+00
0.0000000e+00 4.1955911e-19 0.0000000e+00]
[0.0000000e+00 2.1200031e-18 1.8191180e-35 1.4755366e-32 6.1646172e-35
0.0000000e+00 0.0000000e+00 3.4135426e-34 0.0000000e+00 4.7979412e-24
0.0000000e+00 0.0000000e+00 0.0000000e+00 1.1714270e-29 0.0000000e+00
0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 8.2938552e-11
1.7726182e-35 2.1993817e-25 4.4254627e-27 1.0000000e+00 0.0000000e+00
0.0000000e+00 1.7615296e-20 0.0000000e+00]]
Process finished with exit code 0
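To turn such softmax vectors into character predictions, one simply takes the argmax over the 28 class probabilities. A small illustration (the peak positions match the three vectors in the output above):

```python
import numpy as np

# Each prediction row is a probability distribution over the 28 character
# classes; the predicted class is the index with the highest probability.
predictions = np.zeros((3, 28))
predictions[0, 22] = 1.0   # first sample above peaks at index 22
predictions[1, 23] = 1.0   # second and third samples peak at index 23
predictions[2, 23] = 1.0

predicted_classes = predictions.argmax(axis=1)
print(predicted_classes.tolist())   # -> [22, 23, 23]
```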
Edited on: Saturday, 23 January 2021 13:12