Introduction
During the person registration scenario, Connect ID performs a duplicate check based on first name, last name, date of birth and gender. If potential duplicates are detected, the system tries to contact all MAs "owning" the potential duplicates to get personal details. In most implementations, the detailed personal information coming from other MAs is displayed to a human operator so that they can determine whether the record is a duplicate or not.
Some MAs implementing Connect ID have chosen to also show the user information about the duplicate scoring (the similarity between the original record and a potential duplicate found). This information comes not from other MAs but directly from Connect ID:
- score
- queried hash
- queried date of birth
- matched hash
- matched date of birth
This article explains how to interpret the values of these fields (e.g. "Exact" or "SwapNames") should you need to show them to the end user. An example of what can be shown to the user can be found in this article.
The first chapter discusses the details of the duplicate detection and scoring algorithm, while the next one lists all potential values of the fields above with explanations. If you don't need to understand how the algorithm works, jump to the chapter "All variant types explained".
Duplicate detection and scoring algorithm
To better understand the meaning of the different fields, we first need to look at how Connect ID stores data and performs duplicate detection.
First, we will assume that Person A has already been registered in Connect ID. Effectively, this means that a number of variants of Person A's name and date of birth have been stored. For example, if we register "John Smith" born on "1990-05-31", the system will store the following variants of the name and date of birth (names are hashed, therefore we will sometimes refer to them as hashes):
Name | Variant/hash | Hash explanation | Date of birth | Date of birth explanation |
John Smith | john smith | Type of hash: "Exact". The name is stored in lower case and this hash will have a score of 1.0. | 1990-05-31 | Type: "Original". Exact date of birth will be stored along the hash, with a score of 1.0. The total score for this variant will be 1.0 x 1.0 = 1.0. |
as above | as above | as above | 1990-05-30 | Type: "OneDayBefore". Date different by one day will be stored along the hash, with a score of 0.6. The total score of this variant will be 1.0 x 0.6 = 0.6. |
as above | as above | as above | 1990-06-01 | Type: "OneDayAfter". Date different by one day will be stored along the hash, with a score of 0.6. The total score of this variant will be 1.0 x 0.6 = 0.6. |
as above | smith john | Type of hash: "SwapNames". Algorithm will store a hash for swapped names and give it a score of 0.90. | 1990-05-31 | Type: "Original". Exact date of birth will be stored along the hash, with a score of 1.0. The total score for this variant will be 0.9 x 1.0 = 0.9. |
as above | as above | as above | 1990-06-01 | Type: "OneDayAfter". Date different by one day will be stored along the hash, with a score of 0.6. The total score of this variant will be 0.9 x 0.6 = 0.54. |
(...) | (...) | (...) | (...) | (...) |
Many more variants (up to 1000) are stored in Connect ID in order to detect duplicates better. The ones given above are enough, however, to explain how duplicates are detected and how the score is calculated.
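The way name and date variants combine into stored rows can be illustrated with a minimal Python sketch. It is not the Connect ID implementation: the function names are hypothetical, plain lower-case strings stand in for the real hashes, and only the variant types and partial scores quoted in the table are modelled.

```python
from datetime import date, timedelta

# Partial scores quoted in the table; the real system stores many more variant types.
HASH_SCORES = {"Exact": 1.0, "SwapNames": 0.9}
DOB_SCORES = {"Original": 1.0, "OneDayBefore": 0.6, "OneDayAfter": 0.6}


def name_variants(first_name, last_name):
    """Return {hash type: normalised name}. Real hashing is omitted (assumption)."""
    first, last = first_name.lower(), last_name.lower()
    return {"Exact": f"{first} {last}", "SwapNames": f"{last} {first}"}


def dob_variants(dob):
    """Return {date type: date} for the exact date and its two neighbours."""
    one_day = timedelta(days=1)
    return {"Original": dob, "OneDayBefore": dob - one_day, "OneDayAfter": dob + one_day}


def stored_variants(first_name, last_name, dob):
    """Combine every name variant with every date variant and score each pair."""
    rows = []
    for hash_type, name in name_variants(first_name, last_name).items():
        for dob_type, day in dob_variants(dob).items():
            score = round(HASH_SCORES[hash_type] * DOB_SCORES[dob_type], 3)
            rows.append((name, hash_type, day.isoformat(), dob_type, score))
    return rows


for row in stored_variants("John", "Smith", date(1990, 5, 31)):
    print(row)
```

For "John Smith" born on 1990-05-31 this produces six rows, including the five listed in the table (for example "smith john" with "OneDayAfter" at 0.54).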
Now let's consider Person B, whom we want to register. As before, the system will create a number of variants for Person B and then try to register them. The algorithm will try to match any variant of Person B with any of the existing variants. If a duplicate is found, Connect ID will return the following information:
- queried hash - the type of the hash of Person B for which a duplicate has been found
- queried date of birth - the type of the date of birth variant of Person B for which a duplicate has been found
- matched hash - the type of the hash of the duplicate found which matched Person B's hash
- matched date of birth - the type of the date of birth variant of the duplicate found which matched Person B's date of birth
- score - the product of the partial scores for all of the above (a value between 0 and 1).
Let's look at a couple of examples to understand this better. We will continue using the example of Person A and their hashes above. We will consider several different Persons B, all of which match Person A one way or another.
Person B example | Queried hash | Queried date of birth | Matched hash | Matched date of birth | Score |
John Smith, 1990-05-31 | "Exact", because it was the exact hash ("john smith") for which a duplicate has been found. | "Original", because it was the exact date of birth for which a duplicate has been found. | "Exact", because it was the exact hash of Person A that was found. | "Original", because it was the exact date of birth of Person A that was found. | 1.0 x 1.0 x 1.0 x 1.0 = 1.0 |
John Smith, 1990-06-01 | as above | "Original", because a duplicate has been found for the exact date "1990-06-01". | as above | "OneDayAfter", because when the system searched for "1990-06-01", it found a variant of Person A's date of birth (the date moved by one day). | 1.0 x 1.0 x 1.0 x 0.6 = 0.6 |
John Smith, 1990-06-02 | as above | "OneDayBefore", because a duplicate has been found when the system searched for the non-exact variant "1990-06-01". | as above | "OneDayAfter", because when the system searched for "1990-06-01", it found a variant of Person A's date of birth (the date moved by one day). | 1.0 x 0.6 x 1.0 x 0.6 = 0.36 |
Smith John, 1990-06-02 | "Exact", because it was the exact hash ("smith john") for which a duplicate has been found. | as above | "SwapNames", because it was the non-exact swapped-names hash of Person A that was matched with "smith john". | as above | 1.0 x 0.6 x 0.9 x 0.6 = 0.324 |
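The examples in the table can be reproduced with a short Python sketch of the matching step. It rests on several assumptions: plain lower-case strings stand in for hashes, only the "Exact"/"SwapNames" and "Original"/"OneDayBefore"/"OneDayAfter" variants are modelled, and the hypothetical `best_match` function returns only the highest-scoring pair. When two pairs tie (as happens for the second and fourth rows), the sketch keeps the first one found, which is an arbitrary choice and not necessarily what Connect ID does.

```python
from datetime import date, timedelta
from itertools import product

# Partial scores quoted in this article; other variant types are not modelled.
HASH_SCORES = {"Exact": 1.0, "SwapNames": 0.9}
DOB_SCORES = {"Original": 1.0, "OneDayBefore": 0.6, "OneDayAfter": 0.6}
DOB_OFFSETS = {"Original": 0, "OneDayBefore": -1, "OneDayAfter": 1}


def variants(first_name, last_name, dob):
    """Yield (name, hash type, date, date type) for every name/date variant pair."""
    first, last = first_name.lower(), last_name.lower()
    names = {"Exact": f"{first} {last}", "SwapNames": f"{last} {first}"}
    for (hash_type, name), (dob_type, offset) in product(names.items(), DOB_OFFSETS.items()):
        yield name, hash_type, dob + timedelta(days=offset), dob_type


def best_match(person_a, person_b):
    """Return the highest-scoring match of any variant of B with any variant of A."""
    best = None
    for b_name, b_hash, b_dob, b_dob_type in variants(*person_b):
        for a_name, a_hash, a_dob, a_dob_type in variants(*person_a):
            if (a_name, a_dob) != (b_name, b_dob):
                continue  # name and date must both coincide
            score = (HASH_SCORES[b_hash] * DOB_SCORES[b_dob_type]
                     * HASH_SCORES[a_hash] * DOB_SCORES[a_dob_type])
            result = {
                "queriedHash": b_hash,
                "queriedDateOfBirth": b_dob_type,
                "matchedHash": a_hash,
                "matchedDateOfBirth": a_dob_type,
                "score": round(score, 3),
            }
            # Strict comparison: on a tie the first pair found is kept.
            if best is None or result["score"] > best["score"]:
                best = result
    return best


person_a = ("John", "Smith", date(1990, 5, 31))
print(best_match(person_a, ("Smith", "John", date(1990, 6, 2))))
```

For "Smith John" born on 1990-06-02 the sketch reports "Exact", "OneDayBefore", "SwapNames" and "OneDayAfter" with a score of 0.324, the same as the last row of the table.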
All variant types explained
The table showing all potential values of queriedHash, matchedHash, queriedDateOfBirth and matchedDateOfBirth can be found in this article.
More advanced topics
The fact that non-exact variants of Person B can be matched with non-exact variants of Person A has an interesting side effect:
- by design, Connect ID is able to match dates that are 1 day apart (because we have the OneDayAfter and OneDayBefore variants). In practice, it can also match dates that are 2 days apart (because a non-exact OneDayAfter variant will be matched with a non-exact OneDayBefore variant).
- by design, Connect ID is able to match names that are 1 letter apart (e.g. "Smith" will be matched with "Smit", because they share the variant "Smit"). In practice, it can sometimes match names that are 2 letters apart ("Smith" and "Amit" will be matched as well, because they share the non-exact variant "Smit").
Such matches are not very precise and will have a low score, because the partial scores are multiplied. Knowing about them may help, though, to understand why certain records pop up as potential duplicates.
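The date side effect can be checked with a small Python sketch. It assumes only the three date variant types and the partial scores quoted in this article (1.0 for "Original", 0.6 for "OneDayBefore" and "OneDayAfter"); `dob_matches` is a hypothetical helper, not part of Connect ID.

```python
from datetime import date, timedelta

# Date variant type -> (offset in days, partial score), as quoted in this article.
DOB_VARIANTS = {"Original": (0, 1.0), "OneDayBefore": (-1, 0.6), "OneDayAfter": (1, 0.6)}


def dob_matches(dob_a, dob_b):
    """List (queried type, matched type, score) for every coinciding date variant pair."""
    matches = []
    for b_type, (b_offset, b_score) in DOB_VARIANTS.items():
        for a_type, (a_offset, a_score) in DOB_VARIANTS.items():
            if dob_b + timedelta(days=b_offset) == dob_a + timedelta(days=a_offset):
                matches.append((b_type, a_type, round(b_score * a_score, 2)))
    return matches


person_a = date(1990, 5, 31)
for days_apart in range(4):
    print(days_apart, dob_matches(person_a, person_a + timedelta(days=days_apart)))
```

Under these assumptions, dates 2 days apart still produce one match ("OneDayBefore" against "OneDayAfter") with a date score of 0.6 x 0.6 = 0.36, while dates 3 days apart produce none.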