How should we get to know a Jewish community, if an abandoned cemetery and registers in provincial archives are all that’s left of it?

Working with what we’ve got.

Describing a cemetery is painstaking, time-consuming manual work that consists of several stages: collecting field material (photographs of tombstones, epitaph texts, a map of the cemetery), followed by processing, verification, cataloguing, and analysis.

What happens if you try to do this using computer vision technologies?
Our Data
The collection of David N. Goberman from the archives of the Center "Petersburg Judaica" at the European University (1950s–1960s)

David Goberman was a true artist: it was not the epitaphs that drew him to the tombstones, but the ornaments. His photographs do not provide complete information about the cemeteries he visited. Yet many of the tombstones he captured no longer exist: the artist's photographs are all that remains of them.

Fieldwork materials of the Center "Sefer" (2018–2019)

This collection was created by researchers who processed the epitaphs thoroughly and photographed each tombstone. Using modern technologies, they also captured geodata, which makes the task of mapping much easier.


Map of places where the photo archive was collected:
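Such a map can be rebuilt from the fieldwork geotags in a few lines of Python. Here is a minimal sketch using the folium library; the place names and coordinates are hypothetical placeholders, not the project's actual collection points.

```python
import folium

# Hypothetical place names and coordinates; the real points come
# from the geotags captured during fieldwork.
points = [
    ("Town A", 48.52, 26.85),
    ("Town B", 49.03, 27.68),
    ("Town C", 48.38, 25.93),
]

# Centre the map roughly on the surveyed region.
m = folium.Map(location=[48.7, 26.8], zoom_start=7)

for name, lat, lon in points:
    folium.Marker([lat, lon], popup=name).add_to(m)

m.save("collection_map.html")  # open the HTML file in a browser
```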
Data Limitations
and their impact on recognition quality
  • several tombstones in one photograph;
  • enumeration cards instead of tombstones;
  • only parts of tombstones visible in a photograph (we do not analyse texts containing less than one line of epitaph);
  • photographs that are too blurry, bright, or dark (see the quality-check sketch after this list);
  • tombstones tilted or photographed at an inconvenient angle: even a 5° deviation of the text from the horizontal reduces recognition quality;
  • epitaphs that are extremely difficult to read;
  • no tombstone in the photograph at all.
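Some of these limitations can be pre-filtered automatically. A minimal sketch of such a quality check in Python with OpenCV: the variance of the Laplacian flags blurry images, and mean brightness flags over- and under-exposed ones. The thresholds here are illustrative guesses, not the project's values.

```python
import cv2

def photo_quality(path, blur_thresh=100.0, dark=40, bright=215):
    """Rough automatic pre-filter for field photographs.

    Thresholds are illustrative guesses, not the project's values.
    """
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return "unreadable file"
    # Variance of the Laplacian: low values indicate a blurry image.
    sharpness = cv2.Laplacian(img, cv2.CV_64F).var()
    brightness = img.mean()
    if sharpness < blur_thresh:
        return "too blurry"
    if brightness < dark:
        return "too dark"
    if brightness > bright:
        return "too bright"
    return "ok"

print(photo_quality("tombstone_0001.jpg"))  # hypothetical file name
```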

Naive First Take
What we started with
1. We studied the subject of OCR
We explored the principles of the optical character recognition services from Google and Yandex.

2. We tested the quality of recognition
We sampled 100 photos of epitaphs that already had transcriptions, ran OCR on them, and calculated similarity metrics against the reference texts. Out of the 100 images in the sample, only 12 gave good results.

3. We analysed the results
We compiled a list of factors affecting the quality of text recognition: for example, the tilt of the text, other objects in the frame, and the way the text is engraved into the stone.
The first take allowed us to plan the general workflow of the project.
Workflow of the Project
[Workflow diagram. Tasks: classification, segmentation, text recognition for epitaphs. Methods: the Toloka crowdsourcing service, artificial neural networks, optical recognition systems, filters.]
Toloka, or how to crowdsource people's help
For the image classification and segmentation tasks, we used the Toloka crowdsourcing service.
The platform allows rapid processing of large volumes of data.
We decomposed the tasks into simple assignments that require no special education or training.

We placed two types of assignments in Toloka.
The first assignment was to check the photos for compliance with the following requirements:

  1. There is a tombstone in a photo;
  2. The tombstone is visible;
  3. The text of the epitaph on the tombstone is longer than two lines.
Selected images formed the basis for the second task: marking the tombstones. Using a special tool, the Tolokers marked the fragments with epitaphs strictly along the borders of the text. If there were two tombstones in one photo, the text segments were marked on both.
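Toloka assignments are usually issued with overlap, so several performers see the same photo and their answers have to be aggregated. A minimal majority-vote sketch in Python, assuming a hypothetical CSV export with task_id and answer columns (real pipelines often weight votes by performer skill instead):

```python
import pandas as pd

# Hypothetical export of Toloka responses: one row per performer answer,
# with columns "task_id" (the photo) and "answer" ("ok" / "reject").
responses = pd.read_csv("toloka_responses.csv")

# Simple majority vote per photo: take the most frequent answer.
verdicts = (
    responses.groupby("task_id")["answer"]
    .agg(lambda answers: answers.mode().iloc[0])
)

accepted = verdicts[verdicts == "ok"].index.tolist()
print(f"{len(accepted)} photos passed classification")
```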
Classification and segmentation using neural networks
With the help of Toloka, we got two sets of images:

  • images that passed classification;
  • images that passed both classification and segmentation.

Based on these data, we built two artificial neural networks: one for classification of the images, the other for their segmentation.

Classification
We used a standard TensorFlow classifier; the classification accuracy was ~0.6 after about 30 epochs.
We also tried EfficientNet, one of the newest neural network architectures, but the results were unsatisfactory: the set of images is too heterogeneous for training the network.
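For illustration, here is a minimal sketch of the kind of standard TensorFlow/Keras binary classifier we mean; the directory name, image size, and layer sizes are illustrative, not the project's exact configuration.

```python
import tensorflow as tf

# Hypothetical directory of Toloka-labelled photos, one subfolder per class
# ("tombstone" / "no_tombstone"); image size and layers are illustrative.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "labelled_photos", image_size=(224, 224), batch_size=32
)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # tombstone vs. not
])

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=30)  # ~30 epochs, as in our experiments
```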


Segmentation
For this task, we used UNet.
After training, the segmentation accuracy was 93%.
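To show the idea behind UNet, here is a compact sketch of a U-Net-style encoder-decoder in Keras: the encoder downsamples, the decoder upsamples, and skip connections preserve spatial detail. Depths and filter counts are illustrative, not the configuration we trained.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_unet(input_shape=(256, 256, 3)):
    """Small U-Net-style model: a downsampling encoder, an upsampling
    decoder, and skip connections between matching resolutions."""
    inputs = layers.Input(shape=input_shape)

    # Encoder
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)

    # Bottleneck
    b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)

    # Decoder with skip connections
    u2 = layers.UpSampling2D()(b)
    u2 = layers.Concatenate()([u2, c2])
    c3 = layers.Conv2D(64, 3, padding="same", activation="relu")(u2)
    u1 = layers.UpSampling2D()(c3)
    u1 = layers.Concatenate()([u1, c1])
    c4 = layers.Conv2D(32, 3, padding="same", activation="relu")(u1)

    # One-channel mask: probability that a pixel belongs to the epitaph
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return tf.keras.Model(inputs, outputs)

model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```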
Google Cloud Vision API and Yandex Vision
For text recognition, OCR services train a language model on a specific text corpus. You can either select the model yourself or let it be selected automatically.

The service highlights the text it finds in the image and groups it by levels: words into lines, lines into blocks, blocks into pages.
The response also indicates the service's confidence in the result of character recognition: for example, the value "confidence: 0.9412244558" means the text is recognised correctly with a probability of about 94%.
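For example, a minimal sketch of calling Google Cloud Vision from Python with the google-cloud-vision client (credentials setup omitted; the file name and language hints are our assumptions):

```python
from google.cloud import vision

# Assumes GOOGLE_APPLICATION_CREDENTIALS points to a service-account key.
client = vision.ImageAnnotatorClient()

with open("epitaph_crop.jpg", "rb") as f:  # hypothetical file name
    image = vision.Image(content=f.read())

# document_text_detection returns the word/line/block/page hierarchy
# described above; language hints (here Hebrew and Yiddish) are optional.
response = client.document_text_detection(
    image=image,
    image_context={"language_hints": ["he", "yi"]},
)

print(response.full_text_annotation.text)
for page in response.full_text_annotation.pages:
    for block in page.blocks:
        print("block confidence:", block.confidence)
```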

Poorly recognised:
  • handwritten text and artistic fonts;
  • vertical text;
  • random sets of letters and numbers (e.g., license plates);
  • characters written in separate cells (questionnaires);
  • short words and numbers in table cells;
  • very large text.

Since the first calculated similarity metrics were unsatisfactory, we started to think about possible improvements.
Filters
To improve the OCR results, we tried applying standard conversion filters as well as filters that increase the quality of photos and the "readability" of characters. Only a few of them gave positive results.
Moreover, the responses of Yandex OCR often contained stray words like "price", "cleaning", and "cheap".
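One of the filters that did help (see the results below) is CLAHE, adaptive histogram equalisation that boosts local contrast without over-amplifying noise. A minimal OpenCV sketch with illustrative parameters:

```python
import cv2

img = cv2.imread("epitaph_crop.jpg", cv2.IMREAD_GRAYSCALE)

# CLAHE: equalise the histogram in local tiles, clipping the contrast
# gain so noise is not over-amplified; parameters are illustrative.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)

cv2.imwrite("epitaph_enhanced.jpg", enhanced)  # then send to OCR
```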
Evaluation of the Effectiveness of Our Work
Results
Our research shows that, for the application and high-quality tuning of UNet, the images from the Sefer and Goberman collections must be classified separately.
The calculated metrics are based on the analysis of images selected and marked with the help of Toloka. We also formulated a set of recommendations for image selection in order to improve the future dataset and the training of the artificial neural network.

Predicted masks of the trained UNet
(accuracy = 0.9, loss = 0.23)
Metrics
How can we estimate how similar two texts are without knowing their languages?

There are various metrics for determining the similarity of texts.
We applied three of them: the Levenshtein ratio, the Jaccard similarity measure, and the Hamming distance.
Here we focus on the first one.

Levenshtein ratio

The Levenshtein distance (or "edit distance") counts how many actions (insertions, deletions, or replacements of one character with another) must be performed to turn one string into another. The Levenshtein ratio normalises this distance by the length of the strings, so the higher the ratio, the more correctly the text is recognised.
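A small self-contained sketch: the distance is computed by dynamic programming and then normalised into a ratio. This is one common normalisation; library implementations such as python-Levenshtein weight the operations slightly differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[len(b)]

def levenshtein_ratio(a: str, b: str) -> float:
    """One common normalisation: 1 means identical strings, 0 no overlap."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein_distance(a, b) / max(len(a), len(b))

# Comparing OCR output with a reference transcription (toy example):
print(levenshtein_ratio("sara bat moshe", "sara bat mosha"))  # ≈ 0.93
```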
What we've accomplished
1. Mastered the crowdsourcing platform Toloka, which accelerated the classification and segmentation of the desired areas in photographs of tombstones.

2. Formulated requirements for the quality of images in the dataset.

3. Established that the Tesseract deskew and CLAHE filters significantly improve recognition results (many other filters we tested did not).

4. Identified the conditions for improving the performance of the UNet neural network.
Plans and Prospects
1. A separate optical model is required for the classification and segmentation of individual text characters.

2. Diacritical marks on tombstones are poorly visible. They can be reconstructed from context, but this requires a good language model.

3. The language of the epitaphs differs from the languages on which the models in universal OCR services have been trained. Adapting the language model to the texts found on the gravestones is crucial.

4. Training the neural network on a higher-quality collection of pre-selected images is possible.
Our Team
Alexei Artamonov
Project Supervisor, Yandex
Ekaterina Karaseva
Project Supervisor, EUSP
Anna Bulina
Project Advisor, Toloka
Julia Amatuni
Developer,
Project Manager of the Programme, EUSP
Dmitrii Serebrennikov
Developer, Project Manager
Ekaterina Alieva
Developer
Kira Kovalenko
Developer
Tatiana Tkacheva
Developer
Aigul Ashrafulina
Developer
Olga Saveleva
Developer
Ekaterina Tyurina
Developer

The Jewish Tombstones is one of the four educational group projects of the Programme in Applied Data Analysis (PANDAN).
PANDAN is a joint programme of the European University and Yandex.


https://pandan.eusp.org/