How does Nominex work - Overview
The algorithms behind Nominex are based upon sound academic principles, relying heavily upon the phonetic rules described by Edward Carney (A Survey of English Spelling, 1994).
Each surname spelling is translated into its phonetic version using the symbols of the IPA (international phonetic alphabet). Carney’s list of 225 spelling-to-pronunciation rules was used as the basis of these IPA versions, supplemented by additional rules created to deal with his numerous exceptions, as well as many other exceptions found in the corpus of British surnames. The total rule set is currently almost 2,000. Click here for more information on conversion to IPA and the difficulties this poses.
The similarity scores created by Nominex rely upon comparing each surname with all other surnames in the database. This comparison can be performed on the fly in a live system in which case there will be a performance hit. Alternatively they can be pre-calculated, at the expense of some disk space. Those names whose scores fall in the top 20% or 25% may be regarded as variants that are close enough to be useful, while the lower ranking pairs would generally be discarded.
In order to compute the scores a number of steps are involved, these are described next:
Derived Forms
First, for each surname spelling a number of derived versions are produced. There are also other columns in the original database, giving for each surnanme spelling the following list or original and derived forms:
| Database Column | Notes | |
|---|---|---|
| 1. Existing Columns: | ||
| 1a. | Raw Spelling | Exactly as originally recorded in the database. |
| 1b. | 'Standardized' version | Won’t be available for most datasets, but does exist for the working datasets used in this project (NBI and 1881 census). The standard forms were assigned previously by manual inspection for each of the original projects. They are not used at all in the algorithmic generation of the ranked matches, but have provided a useful benchmark against which the program’s performance can be measured. |
| 1c. | No. of Occurrences | The frequency of this spelling in the dataset. May or may not be immediately available in the dataset, but can usually be calculated. |
| 2. Derived Columns: | ||
| 2a. | Revised spelling | Derived from the cleaned spelling. Incorporates minor punctuation changes such as removal of spaces and quote-marks, reduction of any 3-character repeats to 2 characters; and expansion of ST to SAINT. Spaces, dashes and apostrophes are allowed - as in GRACEY-JONES, GRACEY JONES, O'BRIEN. Other anomalies may be corrected, such as making a double space into a single space character. For double-barrelled names three versions are generated, one for each component and a composite version with the space (or dash) character removed. This is because some names that have been recorded as two words may originally have been a single surname, e.g. Green-Field as a version of ‘Greenfield’. |
| 2b. | Phonetic version | Derived from the Revised spelling and recorded in the working database using the Sampa version of the International Phonetic Alphabet (IPA). Click for detailed information. |
| 2c. | Syllable count | For each surname its syllable count is estimated - by counting the number of vowel sounds in its IPA version and allowing values >1.0 for for long vowels and diphthongs. The range of syllable counts for most surnames is from 1.0, up to 3.0 or so. |
The Derived Columns in the above table are generated from the Cleaned Spelling column using a batch process, this typically might take half an hour or so to process, for a dataset of c.400,000 spellings.
Comparison Metrics
Second, the program computes weighted pairs. This is the most time-consuming part of the process since it potentially involves comparing each spelling with every other spelling. In practice one wouldn't bother to compare (say) SMITH with JONES, since the score would be so low. But every surname beginning with a given letter is compared with every other surname with that letter, and often with other sets of names too. Thus, all 'G' surnames are compared with each other, but also with all the 'J' names (e.g. to get a link between GERRARD and JERRARD). Similarly, all F- are compared with all PH-, and R- with WR-.
The metrics computed by the system are as follows:
| Metric | Description | |
|---|---|---|
| 1. Punctuation Issues: | ||
| 1a. | Pre-process penalty | A small penalty imposed if a punctuation change was necessary. |
| 2. Phonetic Comparison: | ||
| 2a. | IPA Comparison | The most important of the comparison algorithms. For each pair of surnames (head word & candidate word) the program compares their respective IPA versions and creates a 'difference score'. This uses a letter-matrix created from the two spellings, identifying a "minimum cost route" through the matrix. Each cell of the comparison matrix is populated from a pre-existing ‘penalty cost table’, i.e. a table of similarities between every possible pair of phonemes. e.g. ‘d’ compared to ‘t’ would be a lower cost than ‘d’ compared to ‘m’. The raw version of the minimum cost score is then modified in various ways:
|
| 2b. | Initial Letter Weight | Allows additional weight to be given to the initial letter, much as Soundex does. |
| 2c. | First vowel weight | Additional weighting on the first syllable on the grounds that this is where the stress lies in most surnames. |
| 2d. | Syllable count difference | The metric is the difference between the syllable count of the head word and the current word. |
| 3. Orthographic Comparison [spelling]: | ||
| 3a. | Edit Difference | The second most important comparison algorithm.Uses the classic Levenshtein edit-difference algorithm, i.e. the number of insert / delete / substitute steps required to turn one string into the other. |
Other comparison metrics were tried, such as N-gram, Guth etc. But these contributed little to the success of the algorithm so were eventually dropped. Old algorithms such as Soundex and Metaphone were not used at all.
Combining Scores
To combine the scores from the above comparison metrics, various different methods and weightings were tried. An overall phonetic score is computed by adding Nos. 2a, 2b, 2c, 2d above. This overall Phonetic Score is then compared to the Edit Distance Score (No.3 above), and the final difference score is the minimum of these two (suitable weighting being applied first).
The final step is to turn the combined difference metric into a percentage, where 100% is an exact match. A cut-off is generally applied, e.g. discarding values lower than (say) 80%.

