NameX critique

NameX

NameX is a proprietary name matching solution from Image Partners that bears a close resemblance to Nominex. It produces a ranked list of variants, derived from name pairs to which scores are assigned. The list of variants can be generated at runtime, though to speed up searches the list is usually pre-generated. The algorithms behind it are not published, except in very general terms.

You can view the NameX test site at Image Partners, and also on the Origins website.

For most tests the names returned look plausible, i.e. Precision is fairly good. But it’s more difficult to assess Recall because the demo sites only display results down to a threshold value of 75% similarity. In other words, there may be plausible variants that are not shown because they either don’t meet the 75% threshold, or perhaps were not considered at all by the application that created the lists. Click here to read more about Precision and Recall.
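To make the effect of a display threshold concrete, here is a minimal Python sketch of how Precision and Recall could be computed for a single search. The function name, the list of judged variants and all scores except the 92% for Goff are invented for illustration; nothing here comes from NameX itself.

```python
def precision_recall(scored, relevant, threshold=0.75):
    """Return (precision, recall) for one search.

    scored:   dict of candidate name -> similarity score in [0, 1]
    relevant: set of names a human judge accepts as true variants
    """
    # Only candidates at or above the threshold are displayed.
    shown = {name for name, score in scored.items() if score >= threshold}
    true_hits = shown & relevant
    precision = len(true_hits) / len(shown) if shown else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list for a search on "Gough" (scores are made up,
# apart from Goff at 0.92, which is the figure quoted in this article).
scored = {"Goff": 0.92, "Gould": 0.78, "Goffe": 0.70}
relevant = {"Goff", "Goffe"}

print(precision_recall(scored, relevant))        # at the 75% cut-off
print(precision_recall(scored, relevant, 0.0))   # with no cut-off
```

With a cut-off, a true variant scored below 75% (Goffe in this made-up list) counts against Recall but is never seen by the user, which is why Recall is hard to judge from the demo sites alone.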

NameX processes a huge number of surname spelling variants - many millions in fact. However, most of these are extremely low-frequency occurrences in real datasets, and most will simply be transcription or typing errors. When looking at the test site it's difficult to judge which are the 'important' spellings, since frequencies are not displayed. This disguises the fact that many important correspondences are either missing altogether or linked at a relatively low percentage match. Some of these are detailed below.

NameX issues

There are a number of problems with the current implementation of NameX, of which the following are examples.

	Problem	Examples
1.	Phonetic pairs	Some phonetically similar pairs don't appear at all in the same NameX search, even down to the lowest displayed match, i.e. 75%. These are pairs of surnames with identical or very similar pronunciations - and are nearly all important spellings with lots of possible hits. So, if you enter one member from each pair you won't get the other amongst the search results. Amongst these problem surnames are: Appleby & Appelbee, Ayres & Eyres, Barclay & Berkley, Boyle & Boil, Bough & Boff, Burn & Byrne, Cawley & Corley, Childs & Chiles, Chisholm & Chisam/-um, Coombs & Combs, Cru[i]se & Crews & Cruwys, Daw[e] & Dorr/Dore, England & Ingland, English & Inglis[h], Euridge & Uridge, Evans & Evins, Farley & Farleigh, Farrow & Pharaoh, Francombe & Frankham, Gambling & Gamlin, Gail & Gayle, Gerrard & Garrard, Gough & Goffe, Graham & Graeme, Hague & Haigh, Handcocks & Hancox, Haw & Hoare, Hazlewood & Aizlewood, Heywood & Heyward, Horwick & Horridge, Hugh[e]s & Hew[e]s, Icke & Eyke, Ingram & Engram, Irwin & Urwin, Keogh & Kehoe, Knowle & Noel, Knox & Nocks, Larcombe & Larkham, Lee & Leigh, Lineker & Linacre, Lyell & L'Isle, Mayor & Mare/Mair, McLeod & McCloud, McEwen & McKeown, Maguire & McGuire, McGee & Magee, McGuinness & McGinnis, McQueen & McQuinn, McVay & McVeigh, Mile & Myall, Moore/Moor/More & Maw[er], Moore & Moir, Morris & Maurice, Muir & Mewer/Meur/Mure, Neil & Kneale, Nye & Nigh[y], Ogden & Hogden, Osborne & Usborne, Peel & Peale, Pincombe & Pinkham, Pugh & Pew, Quayle & Quail[e], Rees/Reece & Rhys, Riley & Reilly/Reilley, Rochester & Rodchester, Rough & Ruff, Salmon & Sammon, Sawer/Soar(e)/Sore/Saw(e), Shaw & Shore, Simmonds & Symmons, Sleigh & Slay, Ure & Eure, Vaughan & Vaughn, Wallis & Wollis, Waugh & Warr[e], Waters & Warters, Wiltshire & Willsher, Worcester & Wooster, Yeo & Yoe, Yewen & Ewen, Young & Yong. The same problem would have affected certain other pairs, were it not for what appears to be some manual intervention. 
Thus Gough and Goff have been linked (at 92% as it happens), but not Gough and Goffe. And since each of these pairs has several plausible subvariants, simply providing a look-up from one frequent spelling to another frequent spelling isn't sufficient; ideally the solution needs to be more sophisticated. Other pairs show up in the same NameX search, but at a fairly low percentage match despite their similar pronunciation. There are many of these. Just a few are: Faulkner & Falkner (86%), Hurd & Heard (88%), Lewis & Louis (81%), Peeps & Pepys (85%), Redknapp & Rednap (76%), Singleton & Shingleton (88%), Wallace & Wallis (92%). Searching on the phonetically similar surnames Burley, Birley, Burleigh, Berley and Burly each returns most but not all of the other variants, depending on which you start with. As it happens, only starting with the least frequent variant Burly guarantees returning all five with a greater than 80% match. See this Rootsweb discussion from 2001. Ewell, Hewell, Whewell, Youell, Yewell, Yewle, Youll and Yule are a particularly difficult group whose members are phonetically close, and each has numerous subvariants. See this Rootsweb discussion. Another example is McHugh, McCue, McKew, McQue, also with subvariants. None of these are handled well by NameX. See this Rootsweb discussion from 1996. Eyles is another awkward surname, which historically has been pronounced in various ways. It can match to Ayles, I[s]les and Eels, as well as other variants of these. See this Rootsweb discussion from 2001. NameX only deals with some of these correspondences. Another tricky group is the Irish surname Tuoh[e]y, Twoh[e]y, Tooh[e]y, Tuey, Touhey (plus other variants). Of these, the two with the greatest number of hits in Origins - Tuohy (540) & Toohey (529) - don't appear in the same NameX search. See this Rootsweb discussion from 2005.
2.	Initial letters	NameX doesn't appear, in general, to pair up plausible variants where the initial letter is different. This particularly affects names where an initial H can be dropped, so that a search on Hancock will fail to return Ancock or several other plausible sub-variants. Other examples are Humphrey & Umphrey; Hainsworth & Ainsworth; Askew & Haskew; Earnshaw & Hearnshaw; Hebblethwaite & Ebblethwaite; Holdham & Oldham; Honeycombe & Unicombe; Holdroyd & Oldroyd; Hepworth & Epworth; Horton, Orton & Auton. Regarding Eaton & Heaton, see this Rootsweb discussion from 2000. The problem also affects other initial letter pairs such as E/I as in Englefield & Inglefield; A/E as in Alexander & Elexander, and A/O as in Austerbury & Osterbury.
3.	Prefixes	Names with separate prefixes such as St, De, La, Le, Van etc. don’t appear to be catered for in NameX. This may be because they are regarded as equivalent to double-barrelled names. A search on St Clair only performs a look-up on the first element (St), while searching on Clair doesn’t include the St Clair variants. Searching on Saint Claire, Saintclaire or Sinclair won't produce St Clair. There are many examples with the De prefix: e.g. searching on Devine doesn't produce De Vine amongst its search results, even though both versions of the surname occur today; similarly Devere & De Vere and Delamare & De La Mare (searching on 'De La Mare' strips the 'La Mare' and matches only on 'De'). See this Rootsweb posting from 2000. Examples with Van: Van Gelder & Vangelder, Van Dyke & Vandyke. The problem also exists for those Mc/Mac names where the prefix has been separated out, e.g. Mc Donald and Mac Donald. Such forms occur in large numbers in most historic datasets.
4.	Double-barrelled names	These occur in most surname datasets, with either a linking '-' character or a space character. The NameX website explains that only one element is used, whereas there are a number of surname instances where a compound name (either with or without hyphen) probably should be treated as a single entity, e.g. Green-Field, Fitz Herbert.
5.	Patronymics	In historical databases there are generally spellings such as Edwds or Edds (Edwards), Rbts (Roberts), Wms (Williams), Jno (John), i.e. surnames that have undergone the same shortening as the equivalent forename. Welsh patronymics represent the extreme version of these problems, where not only is the distinction between surname and forename blurred, but name elements have also been abbreviated, e.g. Dd (David), Lld (Lloyd), Hl (Howel?).
6.	Greek letters	Historically Greek chi has sometimes been used to represent Ch-, sometimes combined with rho for Chr-, especially in earlier sources. In typed and computerised databases these generally get replaced by the letters they most closely resemble, hence Xtmas or Xpmas (= Christmas), Xtopher (= Christopher). Although these are low occurrences, in these situations NameX doesn't expand 'X' (chi) to 'christ'.
7.	Non-intuitive pronunciations	e.g. Beauchamp [='Beecham'], Cholmondeley [='Chumley'], Cockburn [='Coburn'], Dalziel [='Dyell' etc], Knollys [='Knowles'], Mainwaring [='Mannering'], Marjoribanks [='Marshbanks'], Urquhart [='Urkut']. These have generated additional versions that are closer phonetically to the actual pronunciations. These arguably ought to be linked. See also the Debretts website for a list of such surnames, also the Wikipedia article.
8.	Placename derivations	Rather like the previous category, surnames derived from certain placenames have conventional pronunciations that differ from their spellings, e.g. Leicester, Gloucester, Bicester, Worcester. These again have produced phonetically similar spellings, i.e. Lester, Gloster, Bister and Wooster respectively. Each of these has numerous further sub-variants. NameX doesn't deal with these correspondences.
9.	Misinterpreted letters in documents	Letters u/v and i/j are often interchangeable in early sources. For example Euans (Evans or possibly Ewans?), Steuenson (Stevenson), Oliuer (Oliver), Dauidson (Davidson), Vnderwood (Underwood), Iacobs (Jacobs), Beniamin (Benjamin), FitzIames (FitzJames). Arguably such spellings should be processed such that they achieve the ranking they deserve.
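
Several of the gaps in the table above (separated prefixes, compound names, chi abbreviations, u/v and i/j spellings, a dropped initial H) are the kind of thing that could, in principle, be narrowed by normalising a name into one or more search keys before any look-up. The Python sketch below illustrates that idea only. It is not how NameX or Nominex works; the function names and the deliberately narrow rule set are assumptions made for this example, and rules this crude would also generate some false candidates.

```python
import re

VOWELS = "aeiou"


def base_key(name):
    """Lower-case a name and join its elements: 'De La Mare' -> 'delamare'.

    Anything other than a-z is dropped, so spaces, hyphens and apostrophes
    disappear (accented letters would need transliterating first).
    """
    key = re.sub(r"[^a-z]", "", name.lower())
    # Greek chi typed as X, followed by t or p (rho), standing for 'Christ'.
    return re.sub(r"^x[tp]", "christ", key)


def candidate_keys(name):
    """Return a set of keys under which the name could be looked up."""
    key = base_key(name)
    keys = {key}

    # u/v and i/j as written in early sources. Only three cases are covered:
    # u between vowels, initial v before a consonant, initial i before a vowel.
    modern = re.sub(rf"(?<=[{VOWELS}])u(?=[{VOWELS}])", "v", key)
    modern = re.sub(rf"^v(?=[^{VOWELS}])", "u", modern)
    modern = re.sub(rf"^i(?=[{VOWELS}])", "j", modern)
    keys.add(modern)

    # Dropped or added initial H: Hancock <-> Ancock.
    for k in list(keys):
        if len(k) > 1 and k[0] == "h" and k[1] in VOWELS:
            keys.add(k[1:])
        elif k and k[0] in VOWELS:
            keys.add("h" + k)
    return keys


print(base_key("De La Mare"), base_key("Xtopher"))
print(sorted(candidate_keys("Steuenson")))
print(sorted(candidate_keys("Hancock") & candidate_keys("Ancock")))
```

Two names would then be treated as potential matches when their key sets overlap. Medial cases such as Beniamin or FitzIames, equivalences such as St Clair and Sinclair, and the abbreviated forms in row 5 are not covered by these rules and would need a look-up table or further rules.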

Conclusions

According to the NameX website the scores are weighted averages of some six different comparison metrics. In this respect it’s similar to Nominex, but comparing the output from the two suggests that NameX probably relies more heavily on orthographical (spelling) differences and less on phonetic comparisons. All of the results returned by NameX seem reasonable, but some key variants are missing. Thus it appears to perform well on Precision, but less well on Recall (see Precision vs Recall for more information on these terms).
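
Since the actual metrics and weights are not published, the following Python sketch only illustrates the general idea of a weighted average of comparison metrics. The two metrics used here (a spelling ratio from the standard library's difflib, and a simple same-initial check) and the weights are stand-ins chosen for the example, not NameX's or Nominex's own.

```python
from difflib import SequenceMatcher


def spelling_similarity(a, b):
    """Orthographic similarity in [0, 1] based on matching character runs."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_initial(a, b):
    """1.0 if both names start with the same letter, otherwise 0.0."""
    return 1.0 if a[:1].lower() == b[:1].lower() else 0.0


def combined_score(a, b, weighted_metrics):
    """Weighted average of metric(a, b) over (metric, weight) pairs.

    Weights are assumed to be positive.
    """
    total_weight = sum(weight for _, weight in weighted_metrics)
    return sum(weight * metric(a, b) for metric, weight in weighted_metrics) / total_weight


spelling_only = [(spelling_similarity, 1.0)]
with_initial = [(spelling_similarity, 0.5), (same_initial, 0.5)]

print(combined_score("Hancock", "Ancock", spelling_only))
print(combined_score("Hancock", "Ancock", with_initial))
```

The point of the comparison is that the same pair can land well above or well below a 75% cut-off depending purely on which metrics are included and how they are weighted, so a scheme weighted towards spelling will tend to miss pairs that differ on the page but sound alike.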

NameX is probably good as a generic solution for surnames deriving from all cultures, but is less well adjusted to British surnames. Nominex, however, has been designed using datasets specifically drawn from British sources.