Use this tool to normalize your input text: copy and paste the text or use the file select. You can configure your own normalization pipeline. If you need a decomposition, choose one or more.
(Select one; this is the Unicode normalization applied to the output. Other normalization steps may use a different Unicode normalization internally.)
Unicode NFD:
Unicode NFC:
Unicode NFKD:
Unicode NFKC:
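The four output forms can be tried out with Python's standard unicodedata module. This is only a sketch of what the choice means, not the tool's own code; the function name normalize_output is hypothetical.

```python
import unicodedata

# The four Unicode normalization forms offered as output options.
FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def normalize_output(text: str, form: str = "NFC") -> str:
    """Return text in the chosen Unicode normalization form."""
    if form not in FORMS:
        raise ValueError(f"unknown normalization form: {form}")
    return unicodedata.normalize(form, text)


# Precomposed Greek alpha with psili, followed by the Latin "fi" ligature.
sample = "\u1f00\ufb01"
for form in FORMS:
    result = normalize_output(sample, form)
    # Print the code points so the difference between the forms is visible.
    print(form, [f"U+{ord(ch):04X}" for ch in result])
```

The canonical forms (NFD, NFC) only split or join letters and combining marks; the compatibility forms (NFKD, NFKC) additionally replace characters such as ligatures with their plain equivalents.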
1.2 Sign equalization
Disambiguate diacritics:
Disambiguate dashes:
Text output latin u-v: (replaces all u with v)
Text output latin j-i: (replaces all j with i)
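The two letter equalizations amount to simple character replacements. A minimal sketch, assuming that uppercase letters are mapped the same way as lowercase ones (the descriptions above do not say); the function names are hypothetical.

```python
# Translation tables; the uppercase mappings are an assumption.
UV_TABLE = str.maketrans({"u": "v", "U": "V"})
JI_TABLE = str.maketrans({"j": "i", "J": "I"})


def equalize_uv(text: str) -> str:
    """Replace every u with v."""
    return text.translate(UV_TABLE)


def equalize_ji(text: str) -> str:
    """Replace every j with i."""
    return text.translate(JI_TABLE)


print(equalize_uv("Iulius"))  # Ivlivs
print(equalize_ji("jam"))     # iam
```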
1.3 Markup / Format
Without markup: (takes a string and returns it with markup (HTML / XML) removed)
Delete punctuation: (takes a string and returns it without punctuation)
Without newline: (takes a string and returns it with line breaks removed)
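The three format steps can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: markup is matched with a simple tag pattern (no real HTML parsing), punctuation means every character in a Unicode "P" category, and all function names are hypothetical.

```python
import re
import unicodedata

# Simplified pattern for tags such as <p> or </b>; an assumption,
# not a full HTML / XML parser.
TAG_PATTERN = re.compile(r"<[^>]+>")
NEWLINE_PATTERN = re.compile(r"\r\n|\r|\n")


def strip_markup(text: str) -> str:
    """Remove tags and keep the text between them."""
    return TAG_PATTERN.sub("", text)


def delete_punctuation(text: str) -> str:
    """Drop every character whose Unicode category starts with 'P'."""
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))


def remove_newlines(text: str, replacement: str = "") -> str:
    """Remove line breaks, or replace them (for example with a space)."""
    return NEWLINE_PATTERN.sub(replacement, text)


print(strip_markup("<p>arma <b>virum</b></p>"))  # arma virum
print(delete_punctuation("arma, virum; cano."))  # arma virum cano
print(remove_newlines("arma\nvirum", " "))       # arma virum
```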
1.4 Word level conversions
Elision expansion: (elided forms are expanded)
Alpha privativum / copulativum: (takes a UTF-8 Greek string and splits the alpha privativum and copulativum from word forms)
Text output without numbering: (takes a string and returns it without the edition numbering, e.g. [2])
Text output no hyphenation: (removes hyphenation)
1.5 Single normalization steps
Iota sub to ad: (takes a Greek UTF-8 string and replaces iota subscriptum with iota adscriptum)
Text output trailing sigma uniform: (equalizes the trailing (word-final) sigma)
Text output without diacritics: (replaces letters carrying diacritics with their base letters)
Text output without some signs: (deletes some signs unknown to the programmer: †, *, ⋖, #)
Text output without ligature: (takes a string and returns it with ligatures turned into single letters)
Text output equal case: (takes a string and returns it in all lowercase letters)
Text output no brackets: (takes a string and returns it without brackets)
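Three of the single steps above can be approximated with Unicode decomposition. A minimal sketch under assumptions: diacritics are taken to be all nonspacing combining marks, the final sigma is unified towards the medial form σ (the direction is not stated above), and the function names are hypothetical.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"  # combining iota subscript
FINAL_SIGMA = "\u03c2"    # ς
MEDIAL_SIGMA = "\u03c3"   # σ
IOTA = "\u03b9"           # ι


def iota_sub_to_ad(text: str) -> str:
    """Replace iota subscriptum with a full iota written after the letter."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(YPOGEGRAMMENI, IOTA))


def unify_final_sigma(text: str) -> str:
    """Map the final sigma to the medial sigma (assumed direction)."""
    return text.replace(FINAL_SIGMA, MEDIAL_SIGMA)


def strip_diacritics(text: str) -> str:
    """Decompose, drop nonspacing marks (category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


word = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"  # ἄνθρωπος
print(strip_diacritics(word))
print(unify_final_sigma(word))
print(iota_sub_to_ad("\u1ff3"))  # omega with iota subscript
```

Because the iota subscript is itself a combining mark, strip_diacritics removes it as well; run iota_sub_to_ad first if the iota should survive.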
1.6 Combinations
(Select one of the combined normalization functions (none of the single steps is used).)
Text output basic norm: (basic equalization and hyphenation reversal)
Text output all deleted: (equalizes U/V, J/I, final sigma and case; removes brackets, hyphenation, ligatures, punctuation, edition numbering, unknown signs and diacritics)
Text output is a combination of steps: (diacritics disambiguation, normalization, hyphenation removal, linebreak to space, punctuation separation and bracket removal)
1.7 Transliteration
(Select one of the transliterations.)
Text transliteration (gr-la): (takes a Greek UTF-8 string and returns the transliterated Latin UTF-8 string)
Text transliteration (la-gr): (takes a Latin UTF-8 string and returns the transliterated Greek UTF-8 string)
Separation of diacritics: (takes a string and returns an array containing an array of diacritics and an array of letters)
Without consonants: (string without consonants)
Without vowels: (string without vowels)
Small words: (string with just small words (stopwords))
Big words: (string with just big words (not stopwords))
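The vowel and consonant filters can be illustrated as below. This is a sketch only: the vowel inventory (Latin a, e, i, o, u and Greek α, ε, η, ι, ο, υ, ω) is an assumption, non-letters such as spaces are kept in both outputs, and the function names are hypothetical.

```python
import unicodedata

# Assumed vowel inventory for Latin and Greek; adjust as needed.
VOWELS = set("aeiou" + "\u03b1\u03b5\u03b7\u03b9\u03bf\u03c5\u03c9")


def is_vowel(ch: str) -> bool:
    """Test the base letter, so accented vowels count as vowels too."""
    base = unicodedata.normalize("NFD", ch)[0]
    return base.lower() in VOWELS


def is_consonant(ch: str) -> bool:
    """A consonant is any alphabetic character that is not a vowel."""
    return ch.isalpha() and not is_vowel(ch)


def without_consonants(text: str) -> str:
    return "".join(ch for ch in text if not is_consonant(ch))


def without_vowels(text: str) -> str:
    return "".join(ch for ch in text if not is_vowel(ch))


print(without_consonants("arma virumque"))  # aa iuue
print(without_vowels("arma virumque"))      # rm vrmq
```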
2.2 General N-Gram decomposition
Use N-Gram: (check this to enable ngram decomposition and use the configuration below)
Gram-level:
N:
Padding:
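The three settings can be read as parameters of one function. A minimal sketch, assuming that the gram level is either characters or whitespace-separated words and that padding adds n-1 pad symbols on each side; the function name, the level names and the pad symbol "_" are hypothetical.

```python
def ngrams(text: str, n: int, level: str = "char",
           padding: bool = False, pad_symbol: str = "_") -> list[str]:
    """Return all overlapping n-grams of text at the given level."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if level == "char":
        units, joiner = list(text), ""
    elif level == "word":
        units, joiner = text.split(), " "
    else:
        raise ValueError(f"unknown gram level: {level}")
    if padding:
        # Pad both ends so the first and last unit occur in n grams each.
        pad = [pad_symbol] * (n - 1)
        units = pad + units + pad
    return [joiner.join(units[i:i + n]) for i in range(len(units) - n + 1)]


print(ngrams("arma", 2))                # ['ar', 'rm', 'ma']
print(ngrams("arma", 2, padding=True))  # ['_a', 'ar', 'rm', 'ma', 'a_']
print(ngrams("arma virumque cano", 2, level="word"))
```

An input shorter than n yields an empty list unless padding is enabled.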
2.3 Special decompositions (fixed size)
Pseudo-syllables:
Head-body-Coda I:
All partitions (Head-body-Coda II):
2.4 Heuristic
Flat neighborhood:
3. Input
(Copy and paste input is ignored if files are selected. Use only one of the two.)
3.1 Copy and paste
Run normalization / decomposition
3.2 File select
(Just choose the files, then the selection above will be applied.)