Nominex

You are in:   Home > How does Nominex work

Home

Existing Search Methods

Precision vs. Recall

Demo

How does Nominex work?

Overview

IPA Conversion

FAQ

Links

References

Acknowledgements

How does Nominex work - Overview

The algorithms behind Nominex are based upon sound academic principles, relying heavily upon the phonetic rules described by Edward Carney (A Survey of English Spelling, 1994).

Each surname spelling is translated into its phonetic version using the symbols of the IPA (international phonetic alphabet). Carney’s list of 225 spelling-to-pronunciation rules was used as the basis of these IPA versions, supplemented by additional rules created to deal with his numerous exceptions, as well as many other exceptions found in the corpus of British surnames. The total rule set is currently almost 2,000. Click here for more information on conversion to IPA and the difficulties this poses.

The similarity scores created by Nominex rely upon comparing each surname with all other surnames in the database. This comparison can be performed on the fly in a live system in which case there will be a performance hit. Alternatively they can be pre-calculated, at the expense of some disk space. Those names whose scores fall in the top 20% or 25% may be regarded as variants that are close enough to be useful, while the lower ranking pairs would generally be discarded.

In order to compute the scores a number of steps are involved, these are described next:

Derived Forms

First, for each surname spelling a number of derived versions are produced. There are also other columns in the original database, giving for each surnanme spelling the following list or original and derived forms:

Database ColumnNotes
1.   Existing Columns:
1a.Raw SpellingExactly as originally recorded in the database.
1b.'Standardized' versionWon’t be available for most datasets, but does exist for the working datasets used in this project (NBI and 1881 census). The standard forms were assigned previously by manual inspection for each of the original projects. They are not used at all in the algorithmic generation of the ranked matches, but have provided a useful benchmark against which the program’s performance can be measured.
1c.No. of OccurrencesThe frequency of this spelling in the dataset. May or may not be immediately available in the dataset, but can usually be calculated.
2.   Derived Columns:
2a.Revised spellingDerived from the cleaned spelling. Incorporates minor punctuation changes such as removal of spaces and quote-marks, reduction of any 3-character repeats to 2 characters; and expansion of ST to SAINT. Spaces, dashes and apostrophes are allowed - as in GRACEY-JONES, GRACEY JONES, O'BRIEN. Other anomalies may be corrected, such as making a double space into a single space character. For double-barrelled names three versions are generated, one for each component and a composite version with the space (or dash) character removed. This is because some names that have been recorded as two words may originally have been a single surname, e.g. Green-Field as a version of ‘Greenfield’.
2b.Phonetic versionDerived from the Revised spelling and recorded in the working database using the Sampa version of the International Phonetic Alphabet (IPA). Click for detailed information.
2c.Syllable countFor each surname its syllable count is estimated - by counting the number of vowel sounds in its IPA version and allowing values >1.0 for for long vowels and diphthongs. The range of syllable counts for most surnames is from 1.0, up to 3.0 or so.

The Derived Columns in the above table are generated from the Cleaned Spelling column using a batch process, this typically might take half an hour or so to process, for a dataset of c.400,000 spellings.

Comparison Metrics

Second, the program computes weighted pairs. This is the most time-consuming part of the process since it potentially involves comparing each spelling with every other spelling. In practice one wouldn't bother to compare (say) SMITH with JONES, since the score would be so low. But every surname beginning with a given letter is compared with every other surname with that letter, and often with other sets of names too. Thus, all 'G' surnames are compared with each other, but also with all the 'J' names (e.g. to get a link between GERRARD and JERRARD). Similarly, all F- are compared with all PH-, and R- with WR-.

The metrics computed by the system are as follows:

MetricDescription
1.   Punctuation Issues:
1a.Pre-process penaltyA small penalty imposed if a punctuation change was necessary.
2.   Phonetic Comparison:
2a.IPA ComparisonThe most important of the comparison algorithms. For each pair of surnames (head word & candidate word) the program compares their respective IPA versions and creates a 'difference score'. This uses a letter-matrix created from the two spellings, identifying a "minimum cost route" through the matrix. Each cell of the comparison matrix is populated from a pre-existing ‘penalty cost table’, i.e. a table of similarities between every possible pair of phonemes. e.g. ‘d’ compared to ‘t’ would be a lower cost than ‘d’ compared to ‘m’. The raw version of the minimum cost score is then modified in various ways:
  1. reducing the insertion score certain cluster comparisons (e.g. comparing 'ns' with 'nds' as in Simmons vs. Simmonds). There are currently some 275 of these comparisons.
  2. ‘tapering’ is applied, whereby lesser weight is given to the end of each word on the grounds that suffixes in longer surnames tend to get corrupted.
  3. the score is then ‘normalized’, to take account of differing word length.
2b.Initial Letter WeightAllows additional weight to be given to the initial letter, much as Soundex does.
2c.First vowel weightAdditional weighting on the first syllable on the grounds that this is where the stress lies in most surnames.
2d.Syllable count differenceThe metric is the difference between the syllable count of the head word and the current word.
3.   Orthographic Comparison [spelling]:
3a.Edit DifferenceThe second most important comparison algorithm.Uses the classic Levenshtein edit-difference algorithm, i.e. the number of insert / delete / substitute steps required to turn one string into the other.

Other comparison metrics were tried, such as N-gram, Guth etc. But these contributed little to the success of the algorithm so were eventually dropped. Old algorithms such as Soundex and Metaphone were not used at all.

Combining Scores

To combine the scores from the above comparison metrics, various different methods and weightings were tried. An overall phonetic score is computed by adding Nos. 2a, 2b, 2c, 2d above. This overall Phonetic Score is then compared to the Edit Distance Score (No.3 above), and the final difference score is the minimum of these two (suitable weighting being applied first).

The final step is to turn the combined difference metric into a percentage, where 100% is an exact match. A cut-off is generally applied, e.g. discarding values lower than (say) 80%.