String normalization / decomposition tool

Use this tool to normalize your input text: copy and paste the text, or select files. You can configure your own normalization pipeline from the options in section 1. If you also need a decomposition, choose one or more of the options in section 2.

1. NORMALIZATION

(Please check http://ecomparatio.net/~khk/NORM-DECOMP-DIST/textnorm.html to see some examples of how the selection would work.)

1.1 Unicode normalization

(Select one; this is the Unicode normalization form of the output. Other normalization steps may use a different Unicode normalization form internally.)
Unicode NFD:
Unicode NFC:
Unicode NFKD:
Unicode NFKC:
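
To illustrate how the four forms differ, here is a minimal sketch using Python's standard unicodedata module. It is not the tool's own code; the sample string and the helper name normalize_all are made up for this example.

```python
import unicodedata

# Sample: precomposed Greek alpha with tonos (U+03AC) followed by the
# Latin ligature "fi" (U+FB01). Chosen only for illustration.
sample = "\u03ac\ufb01"

def normalize_all(text):
    """Return the text in each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFD", "NFC", "NFKD", "NFKC")}

forms = normalize_all(sample)
for name, value in forms.items():
    # Print the code points so the differences are visible.
    print(name, [hex(ord(ch)) for ch in value])
```

NFD and NFKD split the accented letter into base letter plus combining accent, while NFC and NFKC recompose it. Only the two "K" (compatibility) forms also replace the ligature with the plain letters f and i.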

1.2 Sign equalization

Disambiguate diacritics:
Disambiguate dashes:
Text output Latin u-v: (replaces all u with v)
Text output Latin j-i: (replaces all j with i)
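
The u-v, j-i and dash equalizations can be sketched as simple character translations. This is an assumption about how such steps could look, not the tool's implementation: the function names, the handling of capital letters and the list of dash characters are all hypothetical.

```python
# Hypothetical set of dash-like characters to unify; the tool's actual
# list is not documented here.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"

def equalize_uv(text):
    """Replace every u with v (capitals included, as an assumption)."""
    return text.translate(str.maketrans("uU", "vV"))

def equalize_ji(text):
    """Replace every j with i (capitals included, as an assumption)."""
    return text.translate(str.maketrans("jJ", "iI"))

def equalize_dashes(text):
    """Map each dash variant to the plain ASCII hyphen-minus."""
    return text.translate({ord(d): "-" for d in DASHES})

print(equalize_uv("uva Iuno"))
print(equalize_ji("Julius major"))
print(equalize_dashes("a\u2013b\u2014c"))
```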

1.3 Markup / Format

Without markup: (input a string and get it back with markup (html / xml) removed)
Delete punctuation: (takes string and returns the string without punctuation)
Without newline: (input string and get it back with linebreaks removed)
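
A minimal sketch of the three markup / format steps, assuming a naive tag pattern and Unicode punctuation categories. The function names are hypothetical, and replacing line breaks with a single space (rather than deleting them) is an assumption.

```python
import re
import unicodedata

def strip_markup(text):
    """Remove anything that looks like an HTML/XML tag (naive pattern)."""
    return re.sub(r"<[^>]*>", "", text)

def strip_punctuation(text):
    """Drop every character in a Unicode punctuation category (P*).

    Note that this also removes hyphens and brackets.
    """
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))

def strip_newlines(text):
    """Replace each line break and its surrounding whitespace with one space."""
    return re.sub(r"\s*\r?\n\s*", " ", text).strip()

print(strip_markup("<p>arma <b>virumque</b></p>"))
print(strip_punctuation("arma, virumque; cano."))
print(strip_newlines("arma\nvirumque\r\ncano"))
```

A regular expression is only adequate for simple markup; real HTML or XML input would be better handled by a proper parser.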

1.4 Word level conversions

Elision expansion: (elided word forms are expanded)
Alpha privativum / copulativum: (takes a UTF-8 Greek string and splits the alpha privativum and copulativum off the word forms)
Text output without numbering: (takes a string, returns the string without the edition numbering, e.g. [2])
Text output no hyphenation: (removes hyphenation)
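
Two of these steps can be sketched with regular expressions. The sketch assumes that edition numbering always appears as digits in square brackets, as in the [2] example, and that hyphenation means a hyphen directly before a line break. Both assumptions and both function names are hypothetical.

```python
import re

def strip_numbering(text):
    """Remove bracketed numbers such as [2], plus the whitespace before them."""
    return re.sub(r"\s*\[\d+\]", "", text)

def join_hyphenation(text):
    """Rejoin a word that was split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

print(strip_numbering("arma [2] virumque"))
print(join_hyphenation("virum-\nque cano"))
```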

1.5 Single normalization steps

Iota sub to ad: (takes a UTF-8 Greek string and replaces iota subscriptum with iota adscriptum)
Text output final sigma uniform: (equalizes the final sigma)
Text output without diacritics: (replaces letters carrying diacritics with their base letters)
Text output without some signs: (deletes some signs whose meaning is unknown to the programmer: †, *, ⋖, #)
Text output without ligature: (takes a string, returns the string with ligatures turned into single letters)
Text output equal case: (input a string and get it back in all lowercase letters)
Text output no brackets: (input string and get it back with no brackets)
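
Most of the single steps above reduce to Unicode decomposition plus character replacement. The following is a sketch under stated assumptions, not the tool's code: the ligature table, the bracket set and all function names are hypothetical, and final sigma is assumed to be unified to the medial form σ.

```python
import unicodedata

# Hypothetical tables; the tool's own lists are not given in this document.
LIGATURES = {"æ": "ae", "œ": "oe", "Æ": "AE", "Œ": "OE",
             "\ufb01": "fi", "\ufb02": "fl"}
BRACKETS = "()[]{}<>\u27e8\u27e9"
UNKNOWN_SIGNS = "†*⋖#"

def iota_sub_to_ad(text):
    """Turn the combining iota subscript (U+0345) into a full iota letter."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace("\u0345", "\u03b9"))

def uniform_sigma(text):
    """Replace final sigma (U+03C2) with medial sigma (U+03C3)."""
    return text.replace("\u03c2", "\u03c3")

def strip_diacritics(text):
    """Decompose, drop all combining marks (category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

def expand_ligatures(text):
    """Replace each ligature from the table with its component letters."""
    for ligature, letters in LIGATURES.items():
        text = text.replace(ligature, letters)
    return text

def delete_chars(text, chars):
    """Delete every occurrence of the given characters (signs, brackets)."""
    return text.translate({ord(c): None for c in chars})

print(iota_sub_to_ad("\u1fb3"))
print(strip_diacritics("\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"))
print(delete_chars("†arma* [virum]", UNKNOWN_SIGNS + BRACKETS))
```

In canonically decomposed text the iota subscript is ordered after the accents and breathings of its letter, so replacing it in place leaves those marks attached to the preceding vowel.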

1.6 Combinations

(Select one of the combined normalization functions; the single steps selected above are then not used.)
Text output basic norm: (basic equalization and hyphenation reversal)
Text output all deleted: (deletes UV/JI, brackets, sigma, lower, hyphenation, ligatures, punctuation, edition numbering, unknown signs, diacritics)
Text output is a combination of steps: (diacritics disambiguation, normalization, hyphenation removal, linebreak to space, punctuation separation and bracket removal)

1.7 Transliteration

(Select one of the transliterations.)
Text transliteration (gr-la): (takes greek utf8 string and returns transliterated latin utf8 string)
Text transliteration (la-gr): (takes latin utf8 string and returns transliterated greek utf8 string)
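
A Greek-to-Latin transliteration can be sketched as a lookup table applied after the diacritics are stripped. The table below is a simplified, hypothetical scheme and not necessarily the one the tool uses: it lowercases the input and ignores breathings, so a rough breathing is not rendered as h.

```python
import unicodedata

# Hypothetical, simplified transliteration table (lowercase only).
GR_LA = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "e", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "o",
}

def transliterate_gr_la(text):
    """Lowercase, strip combining marks, then map letter by letter."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    base = (ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Characters without a table entry (spaces, Latin letters) pass through.
    return "".join(GR_LA.get(ch, ch) for ch in base)

print(transliterate_gr_la("λόγος"))
print(transliterate_gr_la("ψυχή"))
```

Because η and ω collapse to e and o here, this mapping is not reversible. A la-gr direction would need a richer scheme.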

2. FEATURES / DECOMPOSITION

(If more than one decomposition is selected, correspondingly more output files are generated. See http://ecomparatio.net/~khk/NORM-DECOMP-DIST/zerl.html for some examples of how this works.)

2.1 Word level decomposition

Separation of diacritics: (takes a string and returns an array of arrays of diacritics and an array of letters)
Without consonants: (string without consonants)
Without vowels: (string without vowels)
Small words: (string with just small words (stopwords))
Big words: (string with just big words (not stopwords))
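
The word-level decompositions can be sketched as follows. The vowel set, the stopword list, the function names and the exact shape of the return values are hypothetical; the tool's own lists and output format may differ.

```python
import unicodedata

# Hypothetical vowel set and stopword list, for illustration only.
VOWELS = set("aeiouyαεηιουω")
STOPWORDS = {"et", "in", "ad", "και", "δε"}

def split_diacritics(word):
    """Return (marks, letters): one list of combining marks per base letter."""
    letters, marks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.category(ch) == "Mn":
            if marks:                 # ignore a mark with no base letter
                marks[-1].append(ch)
        else:
            letters.append(ch)
            marks.append([])
    return marks, letters

def is_vowel(ch):
    """Check the base letter of a (possibly accented) character."""
    return unicodedata.normalize("NFD", ch)[0].lower() in VOWELS

def without_consonants(text):
    """Keep vowels and non-letters; drop consonants."""
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if not ch.isalpha() or is_vowel(ch))

def without_vowels(text):
    """Drop vowels; keep everything else."""
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if not is_vowel(ch))

def small_words(text):
    """Keep only the stopwords."""
    return " ".join(w for w in text.split() if w.lower() in STOPWORDS)

def big_words(text):
    """Keep only the words that are not stopwords."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

print(split_diacritics("\u1f04\u03bd"))
print(without_consonants("arma virumque"), "|", without_vowels("arma virumque"))
print(small_words("arma et virum in"), "|", big_words("arma et virum in"))
```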

2.2 General N-Gram decomposition

Use N-Gram: (check this to enable ngram decomposition and use the configuration below)
Gram-level:
N:
Padding:
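
The three settings can be read as parameters of a sliding window. The sketch below assumes that the gram level selects between characters and words, and that padding adds N-1 filler symbols on each side. The function name, the parameter names and these interpretations are assumptions, not the tool's documented behavior.

```python
def ngrams(text, n, level="char", padding=None):
    """Return all n-grams of the text as tuples.

    level:   "char" for character n-grams, "word" for word n-grams.
    padding: optional filler symbol, added n-1 times at both ends.
    """
    units = text.split() if level == "word" else list(text)
    if padding is not None:
        pad = [padding] * (n - 1)
        units = pad + units + pad
    # Slide a window of width n over the units.
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

print(ngrams("arma", 2))
print(ngrams("arma", 2, padding="_"))
print(ngrams("arma virumque cano", 2, level="word"))
```

With padding, the first and last units also appear at every position within an n-gram, which can matter when n-gram profiles of short words are compared.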

2.3 Special decompositions (fixed size)

Pseudo-syllables:
Head-body-Coda I:
All partitions (Head-body-Coda II):

2.4 Heuristic

Flat neighborhood:

3. Input

(Copy-and-paste input is ignored if files are selected. Use only one of the two input methods.)

3.1 Copy and paste

Run normalization / decomposition

3.2 File select

(Just choose the files; the selection above will then be applied to them.)

University Trier / Ancient History Trier