The most important part of training is specifying the shape of every character.
This is done by creating image files containing each character, and specifying in a
separate text file the coordinates and UTF-8 codepoint of each character. One can
then run several programs distributed with Tesseract to extract and store the character
shapes. This is generally referred to as the tif/box step, as historically the only
image format Tesseract supported was TIFF, and the text file specifying character
coordinates is called a box file.
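
To make the box file concrete, the following is a minimal sketch of reading one line of it in Python. It assumes the Tesseract 3 layout, in which each line holds the character followed by its left, bottom, right and top pixel coordinates (measured from the bottom-left corner of the image) and a page number. The names Box and parse_box_line, and the sample coordinates, are invented for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str    # the character the box encloses, as UTF-8 text
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box file line, assuming 'char left bottom right top page'."""
    # Split from the right so that only the five numeric fields are peeled off.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


# Hypothetical line describing an alpha with smooth breathing and acute accent.
box = parse_box_line("ἄ 120 340 138 372 0")
print(box.char, box.right - box.left, box.top - box.bottom)
```

Editing a box file by hand amounts to correcting the first field and adjusting the four coordinates on lines such as this one.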
The official advice is to use scanned images for training, run Tesseract in a special
mode to guess at the correct location of each character, and edit the resulting
coordinates and UTF-8 characters as appropriate. However, there are several issues
that make this difficult for Ancient Greek.
For a reliable training process every character needs to occur at least a few times,
which can be difficult with a large character set or if one is interested in including
uncommon characters. While Ancient Greek does not have a large alphabet, for the
purposes of OCR with Tesseract it does. Ancient Greek has two types of diacritical
marks: breathing marks (which can be smooth:  ̓ or rough:  ̔), and accents (which
can be acute:  ́, grave: ` or circumflex:  ͂). These can be applied in a large variety of
combinations to all vowels, and are placed either above the character, or to its left
for the upper case.
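
The size of the resulting character set can be illustrated with a short Python sketch that combines each lower case vowel with the breathing marks and accents named above. It keeps only those combinations for which Unicode provides a single precomposed codepoint, which is an assumption about what one would want to train. The function name accented_vowels is invented for illustration, and the sketch covers only the marks discussed here.

```python
import itertools
import unicodedata

VOWELS = "αεηιουω"
BREATHINGS = ["", "\u0313", "\u0314"]          # none, smooth, rough
ACCENTS = ["", "\u0301", "\u0300", "\u0342"]   # none, acute, grave, circumflex


def accented_vowels(vowels: str = VOWELS) -> list[str]:
    """Return each vowel/breathing/accent combination that composes to one codepoint."""
    found = []
    for vowel, breathing, accent in itertools.product(vowels, BREATHINGS, ACCENTS):
        # The breathing mark precedes the accent, as in Unicode's own decompositions.
        composed = unicodedata.normalize("NFC", vowel + breathing + accent)
        if len(composed) == 1:
            found.append(composed)
    return found


combos = accented_vowels()
print(len(combos), "".join(combos))
```

Combinations with no precomposed form, such as a circumflex on epsilon, are dropped by the length check. Passing the upper case vowels instead gives the corresponding capital forms.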
Being designed originally for English, Tesseract has no concept of diacritics, and
thus cannot separately recognise a character and a diacritical mark above it, and output
the individual UTF-8 codepoints for the character and the combining diacritical
mark. Instead it must be trained on every possible combination of characters and diacritical
marks. Moreover, one cannot just scan character lists, as Tesseract requires
the image to be as close to the format of printed text as possible, in order to make
informed choices about relative character position and size. Using scans would therefore
require many pages, and a great deal of time to create the corresponding box files,
to ensure every possible character was accounted for.
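
Whether a given body of training text accounts for every character can be checked mechanically. The sketch below counts how often each required character occurs and reports those that fall short. The function name undertrained and the threshold of three occurrences are assumptions made for illustration; the text above says only that each character should occur "at least a few times".

```python
from collections import Counter
import unicodedata


def undertrained(text: str, required: str, min_count: int = 3) -> dict[str, int]:
    """Map each required character seen fewer than min_count times to its count."""
    # Normalise both inputs so precomposed and combining forms are counted together.
    counts = Counter(unicodedata.normalize("NFC", text))
    wanted = unicodedata.normalize("NFC", required)
    return {ch: counts[ch] for ch in wanted if counts[ch] < min_count}


sample_text = "ἄνθρωπος ἄγει ἄλλον ὁ ἵππος"
print(undertrained(sample_text, "ἄὁἵ"))
```

Run against the transcription of a set of scanned pages, a check of this kind shows which characters would still need further pages to be found and boxed.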