Nominex

Creating Derived Forms

First, a number of derived versions are produced for each surname spelling. Together with the columns already present in the original database, this gives the following list of original and derived forms for each surname spelling:

Database Column | Notes
1. Existing Columns:
1a. Raw Spelling | Exactly as originally recorded in the database.
1b. 'Standardized' version | Not available for most datasets, but it does exist for the working datasets used in this project (NBI and 1881 census). The standard forms were assigned previously by manual inspection for each of the original projects. They are not used at all in the algorithmic generation of the ranked matches, but they provide a useful benchmark against which the program’s performance can be measured.
1c. No. of Occurrences | The frequency of this spelling in the dataset. It may not be immediately available in the dataset, but can usually be calculated.
2. Derived Forms:
2a. Revised spelling | Derived from the cleaned spelling. Incorporates minor punctuation changes such as removal of spaces and quote-marks, reduction of any 3-character repeats to 2 characters, and expansion of ST to SAINT. Spaces, dashes and apostrophes are allowed, as in GRACEY-JONES, GRACEY JONES, O'BRIEN. Other anomalies may be corrected, such as reducing a double space to a single space character. For double-barrelled names three versions are generated: one for each component, and a composite version with the space (or dash) character removed. This is because some names that have been recorded as two words may originally have been a single surname, e.g. Green-Field as a version of ‘Greenfield’.
2b. Phonetic (IPA) version | Derived from the Revised spelling and recorded in the working database using the SAMPA encoding of the International Phonetic Alphabet (IPA). The current version of the system can create two different phonetic versions where necessary, for those cases where alternative pronunciations of a surname are either known or suspected. See the IPA Conversion page for detailed information.
2c. Syllable count | The syllable count of each surname is estimated by counting the number of vowel sounds in its IPA version, allowing values >1.0 for long vowels and diphthongs. The syllable count for most surnames ranges from 1.0 up to 3.0 or so.
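
The syllable-count estimate in row 2c can be illustrated with a short Python sketch. This is not the Nominex code: the vowel inventory below is the standard SAMPA set for English, and the weight given to long vowels and diphthongs (`long_weight`, set to 1.5 here) is purely a placeholder, since the actual values used by Nominex are not stated on this page. The function name `syllable_count` is likewise hypothetical.

```python
# SAMPA symbols for English vowels. Two-character symbols (long vowels and
# diphthongs) are checked before single-character short vowels.
LONG_OR_DIPHTHONG = {
    "i:", "A:", "O:", "u:", "3:",                      # long vowels
    "eI", "aI", "OI", "@U", "aU", "I@", "e@", "U@",    # diphthongs
}
SHORT = {"I", "e", "{", "Q", "V", "U", "@"}


def syllable_count(sampa: str, long_weight: float = 1.5) -> float:
    """Estimate a syllable count from a SAMPA transcription.

    Each short vowel adds 1.0; each long vowel or diphthong adds
    long_weight (an assumed value, not taken from Nominex).
    """
    total = 0.0
    i = 0
    while i < len(sampa):
        if sampa[i:i + 2] in LONG_OR_DIPHTHONG:
            total += long_weight
            i += 2
        elif sampa[i] in SHORT:
            total += 1.0
            i += 1
        else:
            i += 1  # consonant or other symbol: contributes nothing
    return total


if __name__ == "__main__":
    for name, ipa in [("Smith", "smIT"), ("Brown", "braUn"), ("Greenfield", "gri:nfi:ld")]:
        print(name, ipa, syllable_count(ipa))
```

With the placeholder weight, a one-syllable name with a short vowel scores 1.0 and one with a diphthong scores slightly more, which is the behaviour the table describes; the exact figures depend entirely on the weight chosen.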

The Derived Forms in the above table are generated from the Cleaned Spelling column by a batch process. This can take some time: roughly half an hour for a dataset of c. 400,000 spellings.
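
As a rough illustration of one such batch step, the Python sketch below derives the Revised spelling versions (row 2a) from a list of cleaned spellings. It is a minimal sketch under stated assumptions, not the Nominex implementation: the function names are hypothetical, output is upper-cased to match the examples in the table, only double-quote marks are removed (apostrophes are kept), and ST is expanded only when it is a leading token followed by a space.

```python
import re


def revise_spelling(cleaned: str) -> str:
    """Apply the minor punctuation fixes described for the Revised spelling."""
    s = cleaned.upper()                    # assumption: forms are stored in upper case
    s = s.replace('"', "")                 # drop quote-marks (assumption: apostrophes stay)
    s = re.sub(r"\s+", " ", s).strip()     # double spaces -> single space, trim the ends
    s = re.sub(r"(.)\1{2,}", r"\1\1", s)   # 3 or more repeats of a character -> 2
    s = re.sub(r"^ST\.? ", "SAINT ", s)    # assumption: only a leading ST token is expanded
    return s


def derived_versions(cleaned: str) -> list[str]:
    """Return the revised spelling, or three versions for a double-barrelled name."""
    revised = revise_spelling(cleaned)
    parts = [p for p in re.split(r"[ -]", revised) if p]
    if len(parts) == 2:
        # each component, plus the composite with the space or dash removed
        return [parts[0], parts[1], parts[0] + parts[1]]
    return [revised]


def batch_derive(cleaned_spellings: list[str]) -> dict[str, list[str]]:
    """Run the derivation over a whole list of cleaned spellings."""
    return {s: derived_versions(s) for s in cleaned_spellings}


if __name__ == "__main__":
    for spelling, versions in batch_derive(["Green-Field", "O'BRIEN", "HARRRIS"]).items():
        print(spelling, "->", versions)
```

A real run over several hundred thousand spellings would also have to produce the phonetic version and syllable count for each entry, which this sketch leaves out.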
