How should we get to know a Jewish community, if an abandoned cemetery and registers in provincial archives are all that’s left of it?

Working with what we’ve got.

Describing a cemetery is painstaking, time-consuming manual work that consists of several stages: collecting field material (photographs of tombstones, epitaph texts, a map of the cemetery), followed by processing, verification, cataloguing, and analysis.

What happens if you try to do this using computer vision technologies?
Our Data
The collection of David N. Goberman from the archives of the Center "Petersburg Judaica" at the European University (1950s–1960s)

David Goberman was a true artist: it was not the epitaphs that drew him to the tombstones, but the ornaments. His photographs do not provide complete information about the cemeteries he visited. Yet many of the tombstones he captured no longer exist: the artist's photographs are all that remains of them.

Fieldwork materials of the Center "Sefer" (2018–2019)

This collection was created by researchers who processed the epitaphs thoroughly and photographed each tombstone. Using modern technologies, they also captured geodata, which makes the task of mapping much easier.


Map of places where the photo archive was collected:
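Such a map can be rebuilt from the fieldwork geotags in a few lines of Python. Here is a minimal sketch using the folium library; the place names and coordinates are hypothetical placeholders, not the project's actual collection points.

```python
import folium

# Hypothetical place names and coordinates; the real points come
# from the geotags captured during fieldwork.
points = [
    ("Town A", 48.52, 26.85),
    ("Town B", 49.03, 27.68),
    ("Town C", 48.38, 25.93),
]

# Centre the map roughly on the surveyed region.
m = folium.Map(location=[48.7, 26.8], zoom_start=7)

for name, lat, lon in points:
    folium.Marker([lat, lon], popup=name).add_to(m)

m.save("collection_map.html")  # open the HTML file in a browser
```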
Data Limitations
and their impact on recognition quality
  • several tombstones in one photograph;
  • enumeration cards instead of tombstones;
  • only parts of tombstones visible in a photograph (we do not analyse texts containing less than one line of epitaph);
  • photographs that are too blurry, bright, or dark (see the quality-check sketch after this list);
  • tombstones tilted or photographed at an inconvenient angle: even a 5° deviation of the text from the horizontal reduces recognition quality;
  • epitaphs that are extremely difficult to read;
  • no tombstone in the photograph at all.
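Some of these limitations can be pre-filtered automatically. A minimal sketch of such a quality check in Python with OpenCV: the variance of the Laplacian flags blurry images, and mean brightness flags over- and under-exposed ones. The thresholds here are illustrative guesses, not the project's values.

```python
import cv2

def photo_quality(path, blur_thresh=100.0, dark=40, bright=215):
    """Rough automatic pre-filter for field photographs.

    Thresholds are illustrative guesses, not the project's values.
    """
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return "unreadable file"
    # Variance of the Laplacian: low values indicate a blurry image.
    sharpness = cv2.Laplacian(img, cv2.CV_64F).var()
    brightness = img.mean()
    if sharpness < blur_thresh:
        return "too blurry"
    if brightness < dark:
        return "too dark"
    if brightness > bright:
        return "too bright"
    return "ok"

print(photo_quality("tombstone_0001.jpg"))  # hypothetical file name
```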

Naive First Take
What we started with
1. We studied the subject of OCR
We explored the principles of the optical character recognition services from Google and Yandex.

2. We tested the quality of recognition
We sampled 100 photos of epitaphs that already had transcriptions, ran OCR on them, and calculated similarity metrics against the reference texts. Out of the 100 images in the sample, only 12 gave good results.

3. We analysed the results
We compiled a list of factors affecting the quality of text recognition: for example, the tilt of the text, other objects in the frame, and the way the text is engraved into the stone.
The first take allowed us to plan the general workflow of the project.
Workflow of the Project
[Workflow diagram. Tasks: classification, segmentation, text recognition for epitaphs. Methods: the Toloka crowdsourcing service, artificial neural networks, optical recognition systems, filters.]
Toloka, or how to crowdsource people's help
For the image classification and segmentation tasks, we used the Toloka crowdsourcing service.
The platform allows rapid processing of large volumes of data.
We decomposed the tasks into simple assignments that require no special education or training.

We placed two types of assignments in Toloka.
The first assignment was to check the photos for compliance with the following requirements:

  1. There is a tombstone in a photo;
  2. The tombstone is visible;
  3. The text of the epitaph on the tombstone is longer than two lines.
Selected images formed the basis for the second task: marking the tombstones. Using a special tool, the Tolokers marked the fragments with epitaphs strictly along the borders of the text. If there were two tombstones in one photo, the text segments were marked on both.
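Toloka assignments are usually issued with overlap, so several performers see the same photo and their answers have to be aggregated. A minimal majority-vote sketch in Python, assuming a hypothetical CSV export with task_id and answer columns (real pipelines often weight votes by performer skill instead):

```python
import pandas as pd

# Hypothetical export of Toloka responses: one row per performer answer,
# with columns "task_id" (the photo) and "answer" ("ok" / "reject").
responses = pd.read_csv("toloka_responses.csv")

# Simple majority vote per photo: take the most frequent answer.
verdicts = (
    responses.groupby("task_id")["answer"]
    .agg(lambda answers: answers.mode().iloc[0])
)

accepted = verdicts[verdicts == "ok"].index.tolist()
print(f"{len(accepted)} photos passed classification")
```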
Classification and segmentation using neural networks
With the help of Toloka, we got two sets of images:

  • images that passed classification;
  • images that passed both classification and segmentation.

Based on these data, we built two artificial neural networks: one for classification of the images, the other for their segmentation.

Classification
We used a standard TensorFlow classifier; the classification accuracy was ~0.6 after about 30 epochs.
We also tried EfficientNet, one of the newest neural network architectures, but the results were unsatisfactory: the set of images is too heterogeneous for training the network.
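For illustration, here is a minimal sketch of the kind of standard TensorFlow/Keras binary classifier we mean; the directory name, image size, and layer sizes are illustrative, not the project's exact configuration.

```python
import tensorflow as tf

# Hypothetical directory of Toloka-labelled photos, one subfolder per class
# ("tombstone" / "no_tombstone"); image size and layers are illustrative.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "labelled_photos", image_size=(224, 224), batch_size=32
)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # tombstone vs. not
])

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=30)  # ~30 epochs, as in our experiments
```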


Segmentation
For this task, we used UNet.
After training, the segmentation accuracy was 93%.
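To show the idea behind UNet, here is a compact sketch of a U-Net-style encoder-decoder in Keras: the encoder downsamples, the decoder upsamples, and skip connections preserve spatial detail. Depths and filter counts are illustrative, not the configuration we trained.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_unet(input_shape=(256, 256, 3)):
    """Small U-Net-style model: a downsampling encoder, an upsampling
    decoder, and skip connections between matching resolutions."""
    inputs = layers.Input(shape=input_shape)

    # Encoder
    c1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(c2)

    # Bottleneck
    b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)

    # Decoder with skip connections
    u2 = layers.UpSampling2D()(b)
    u2 = layers.Concatenate()([u2, c2])
    c3 = layers.Conv2D(64, 3, padding="same", activation="relu")(u2)
    u1 = layers.UpSampling2D()(c3)
    u1 = layers.Concatenate()([u1, c1])
    c4 = layers.Conv2D(32, 3, padding="same", activation="relu")(u1)

    # One-channel mask: probability that a pixel belongs to the epitaph
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return tf.keras.Model(inputs, outputs)

model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```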
Google Cloud Vision API and Yandex Vision
For text recognition, OCR services train a language model on a specific text corpus. You can either select the model yourself or let it be selected automatically.

The service highlights the text it finds in the image and groups it by levels: words into lines, lines into blocks, blocks into pages.
The response also indicates the service's confidence in the result of character recognition: for example, the value "confidence: 0.9412244558" means the text is recognised correctly with a probability of about 94%.
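For example, a minimal sketch of calling Google Cloud Vision from Python with the google-cloud-vision client (credentials setup omitted; the file name and language hints are our assumptions):

```python
from google.cloud import vision

# Assumes GOOGLE_APPLICATION_CREDENTIALS points to a service-account key.
client = vision.ImageAnnotatorClient()

with open("epitaph_crop.jpg", "rb") as f:  # hypothetical file name
    image = vision.Image(content=f.read())

# document_text_detection returns the word/line/block/page hierarchy
# described above; language hints (here Hebrew and Yiddish) are optional.
response = client.document_text_detection(
    image=image,
    image_context={"language_hints": ["he", "yi"]},
)

print(response.full_text_annotation.text)
for page in response.full_text_annotation.pages:
    for block in page.blocks:
        print("block confidence:", block.confidence)
```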

Poorly recognised:
  • handwritten text and artistic fonts;
  • vertical text;
  • random sets of letters and numbers (e.g., license plates);
  • characters written in separate cells (questionnaires);
  • short words and numbers in table cells;
  • very large text.

Since the first calculated similarity metrics were unsatisfactory, we started to think about possible improvements.
Filters
To improve the OCR results, we tried applying standard conversion filters as well as filters that increase the quality of photos and the "readability" of characters. Only a few of them gave positive results.
Moreover, the responses of Yandex OCR often contained stray words like "price", "cleaning", and "cheap".
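One of the filters that did help (see the results below) is CLAHE, adaptive histogram equalisation that boosts local contrast without over-amplifying noise. A minimal OpenCV sketch with illustrative parameters:

```python
import cv2

img = cv2.imread("epitaph_crop.jpg", cv2.IMREAD_GRAYSCALE)

# CLAHE: equalise the histogram in local tiles, clipping the contrast
# gain so noise is not over-amplified; parameters are illustrative.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)

cv2.imwrite("epitaph_enhanced.jpg", enhanced)  # then send to OCR
```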
Evaluation of the Effectiveness of Our Work
Results
Our research shows that, for the application and high-quality tuning of UNet, the images from the Sefer and Goberman collections must be classified separately.
The calculated metrics are based on the analysis of images selected and marked with the help of Toloka. We also formulated a set of recommendations for image selection in order to improve the future dataset and the training of the artificial neural network.

Predicted masks of the trained UNet
(accuracy = 0.9, loss = 0.23)
Metrics
How can we estimate how similar two texts are without knowing their languages?

There are various metrics for determining the similarity of texts.
We applied three of them: the Levenshtein ratio, the Jaccard similarity measure, and the Hamming distance.
Here we focus on the first one.

Levenshtein ratio

The Levenshtein distance (or "edit distance") counts how many actions (insertions, deletions, or replacements of one character with another) must be performed to turn one string into another. The Levenshtein ratio normalises this distance by the length of the strings, so the higher the ratio, the more correctly the text is recognised.
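A small self-contained sketch: the distance is computed by dynamic programming and then normalised into a ratio. This is one common normalisation; library implementations such as python-Levenshtein weight the operations slightly differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[len(b)]

def levenshtein_ratio(a: str, b: str) -> float:
    """One common normalisation: 1 means identical strings, 0 no overlap."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein_distance(a, b) / max(len(a), len(b))

# Comparing OCR output with a reference transcription (toy example):
print(levenshtein_ratio("sara bat moshe", "sara bat mosha"))  # ≈ 0.93
```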
What we've accomplished
1. Mastered the crowdsourcing platform Toloka, which accelerated the classification and segmentation of the desired areas in photographs of tombstones.

2. Formulated requirements for the quality of images in the dataset.

3. Established that the Tesseract deskew and CLAHE filters significantly improve recognition results (many other filters we tested did not).

4. Identified the conditions for improving the performance of the UNet neural network.
Plans and Prospects
1. A separate optical model is required for the classification and segmentation of individual text characters.

2. Diacritical marks on tombstones are poorly visible. They can be reconstructed from context, but this requires a good language model.

3. The language of the epitaphs differs from the languages on which the models in universal OCR services have been trained. Adapting the language model to the texts found on the gravestones is crucial.

4. Training the neural network on a higher-quality collection of pre-selected images is possible.
Our Team
Alexei Artamonov
Project Supervisor, Yandex
Ekaterina Karaseva
Project Supervisor, EUSP
Anna Bulina
Project Advisor, Toloka
Julia Amatuni
Developer,
Project Manager of the Programme, EUSP
Dmitrii Serebrennikov
Developer, Project Manager
Ekaterina Alieva
Developer
Kira Kovalenko
Developer
Tatiana Tkacheva
Developer
Aigul Ashrafulina
Developer
Olga Saveleva
Developer
Ekaterina Tyurina
Developer

The Jewish Tombstones is one of the four educational group projects of the Programme in Applied Data Analysis (PANDAN).
PANDAN is a joint programme of the European University and Yandex.


https://pandan.eusp.org/