Record De-Duplication, Entity Resolution, Merge/Purge and Data Quality

CRM Data Quality (CDQ)

CDQ cleans large contact databases by finding complex patterns of duplicate data, achieves a single version of truth through its Entity Resolution algorithms and executes real-time de-duplication checks on applications within the enterprise.

Main Features

Assimilate data from multiple files and systems

CDQ enables a data dictionary to be defined using the field constructor option. Hundreds of fields can be defined for the contact data, each of string, numeric, text or date type. This definition can be modified at any time.

Data can be uploaded from Comma Separated Values (CSV) or text files. The upload feature automatically attempts to map the incoming data to the fields in the data dictionary. Fields can be re-mapped or skipped if needed.

Dozens of encoding options are available during import, with UTF-8 selected by default. During the import process, checks are performed on the completeness of data. The data can be added to an existing dataset or by default a new one will be created after a successful upload.
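
The upload steps described above (decoding, automatic field mapping, skipping unmapped columns and a completeness check) can be sketched roughly as follows. This is only an illustration and not CDQ's actual importer: the data dictionary, its alias lists and the function names are all hypothetical.

```python
import csv
import io
import re

# Hypothetical data dictionary: canonical field -> accepted header aliases.
DATA_DICTIONARY = {
    "full_name": {"name", "fullname", "contactname"},
    "email": {"email", "emailaddress", "mail"},
    "mobile": {"mobile", "mobileno", "phone", "cell"},
}

def normalize_header(header: str) -> str:
    """Lowercase a header and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", header.lower())

def map_headers(headers):
    """Return {incoming header: dictionary field, or None if unmapped}."""
    mapping = {}
    for header in headers:
        key = normalize_header(header)
        mapping[header] = next(
            (field for field, aliases in DATA_DICTIONARY.items()
             if key == normalize_header(field) or key in aliases),
            None,
        )
    return mapping

def load_rows(raw_bytes: bytes, encoding: str = "utf-8"):
    """Decode the upload, map columns and report completeness per field."""
    reader = csv.DictReader(io.StringIO(raw_bytes.decode(encoding)))
    mapping = map_headers(reader.fieldnames or [])
    rows = []
    filled = {field: 0 for field in DATA_DICTIONARY}
    for raw in reader:
        row = {}
        for header, field in mapping.items():
            if field is None:
                continue  # unmapped columns are skipped
            value = (raw.get(header) or "").strip()
            row[field] = value
            if value:
                filled[field] += 1
        rows.append(row)
    completeness = {f: (filled[f] / len(rows) if rows else 0.0) for f in filled}
    return rows, mapping, completeness

sample = ("Full Name,E-mail Address,Mobile No.,Notes\n"
          "Sheela Joshi,sheela@example.com,,vip\n").encode("utf-8")
rows, mapping, completeness = load_rows(sample)
print(mapping)
print(completeness)
```

In this sketch the "Notes" column has no counterpart in the dictionary and is skipped, while the empty mobile value lowers the completeness score for that field.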

External data can also be imported by connecting to applications like Salesforce, Eloqua and Zoho CRM. Any system with API access can be integrated with CDQ.

Create single contact records from disconnected datasets

Data from multiple datasets can be merged after a common factor between them has been discovered by CDQ. Using its de-duplication rules, CDQ forms a cluster of records with this common criterion. These records can then be collapsed into a single golden record. The data custodian can do this merging manually or instruct CDQ to perform it automatically.

The following example will make it clearer. CDQ has found 4 common records which have been collapsed into a single golden record within CRM by the data custodian.

Identify duplicate records across databases

Using its pattern matching algorithms, CDQ can find common records across databases and multiple datasets. While CSV and text files are usually uploaded directly into a dataset within CDQ, data from external systems can either be imported into CDQ or be represented by storing only the indexed keys of the records of that database.

Storing the data within CDQ is optional. Essentially, it stores the key which it constructs after indexing the datasets. An external data system can be indexed by CDQ and that would be sufficient for it to compare that database with others or perform a real-time de-duplication on it.
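
The idea of keeping only keys rather than the records themselves can be illustrated with a small sketch. It is an assumption-laden simplification: the key recipe (normalized name plus mobile, hashed with SHA-256), the class name and the field names are hypothetical, and CDQ's real index keys are not described here. This key only catches records that are identical after normalization; it is not a fuzzy key.

```python
import hashlib
import re
from collections import defaultdict

def match_key(record: dict, fields=("name", "mobile")) -> str:
    """Build a hashed key from normalized field values (hypothetical recipe)."""
    parts = []
    for field in fields:
        value = str(record.get(field, "")).lower()
        parts.append(re.sub(r"[^a-z0-9]", "", value))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

class KeyIndex:
    """Stores only keys and record references, never the record content."""

    def __init__(self):
        self._ids_by_key = defaultdict(list)

    def add(self, source: str, record_id, record: dict) -> None:
        self._ids_by_key[match_key(record)].append((source, record_id))

    def lookup(self, record: dict):
        """Return references to already indexed records with the same key."""
        return list(self._ids_by_key.get(match_key(record), []))

index = KeyIndex()
index.add("crm", 17, {"name": "Sheela Joshi", "mobile": "+63-906-222-1520"})

# An incoming record from another system is checked against the index.
hits = index.lookup({"name": "sheela  joshi", "mobile": "63 906 222 1520"})
print(hits)
```

Because the external system only contributes keys and record identifiers, the comparison can run without a copy of its data being held in the index.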

Automatically standardize, enhance and repair data

CDQ has dozens of functions to manually or automatically repair, standardize and enhance the data within its datasets. The strength of these functions is shown in the example below in which an incoming record on the left is passed through them, resulting in the final record on the right.

The name field was cleaned and converted to proper format.
Before converting the email to its proper format, CDQ checked whether it was a valid email by sending a request to the SMTP server of its domain. It turned out to be an address of bad quality (which is displayed in the last row). CDQ then corrected the spelling to recommend an alternative email address and verified it to be a valid address.
The company name, Address-1, Postal code and Website are converted to proper format.
The designation is standardized.
A typo (sectorr) in Address-2 is automatically corrected.
"gurgaon" in the City field refers to an old name of this city. The new official name is "Gurugram", which is recommended by CDQ.
The State and Country fields were blank in the original data. These were populated by CDQ from its internal database.
The registration date was corrected and normalized to a format readable by other systems.

CDQ can be customized further by connecting to other external databases, such as company or country specific regional definitions, or through APIs for geo-location or mobile number validation.

Check for duplicates in real-time from millions of records

CDQ can find duplicates within seconds, across tens of millions of records. This real-time de-duplication is being used by our customers from data entry to identification, and in other innovative ways as explained in the example below:

Before CDQ integration, the data comes from a customer's web submission form and is passed to CRM, telesales or lead-nurturing programs. In the example above, CDQ is made to tap into this email (without integration); it performs de-duplication on incoming records in real-time and sends that information to telesales. The actions taken on the original mail, and their effectiveness, become very different from those taken on the one sent by CDQ.

Resolve entity names and cluster within large bulk files

From merchant data records in banks to epidemiological data in hospitals and product descriptions in warranty management, large files with millions of records need to be quickly resolved to determine the right entity to be used through some agreed logic.

In the example above, a financial institution receives a text file of a million records comprising merchant names. CDQ can take this entire data set and, within an hour, provide an entity resolution map showing the identity of the main entity and which other records in the incoming file correspond to that entity. Here, CDQ shows (based on configured rules) that Starbucks (S) Pte Ltd is the main entity and that the 5 other names should correspond to this main record within the text file. With this resolution, the file can now be easily processed.

Merge duplicate contact records automatically or manually

Finding the duplicate records is part of the solution; merging them into a single version of truth is the next step. Once the cluster of duplicate records is found, it is important to determine the master record which will lead the merging process. CDQ has a rich set of configurable rules to determine which record would be the master record within a cluster. The user can always override this suggestion.

What happens to the records which get merged can be configured too. They can get deleted or moved to a special data set to be analyzed later. Then there are values that can get overwritten or ones which get appended.

Lastly, the merging process itself can be automatic: CDQ can take responsibility for determining the master record and merging the duplicate records of tens of thousands of clusters into their master records within an hour.
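
A minimal sketch of such a merge, under stated assumptions, is given below. The master-selection rule (most filled fields, then most recent update), the `updated` and `notes` field names and the "; " separator for appended values are all hypothetical choices made for illustration; CDQ's configurable rules are not reproduced here.

```python
from datetime import date

def filled_fields(record: dict) -> int:
    """Count non-empty fields, ignoring the bookkeeping 'updated' field."""
    return sum(1 for k, v in record.items() if k != "updated" and v not in (None, ""))

def pick_master(cluster):
    """Hypothetical rule: most complete record wins, ties go to the newest."""
    return max(cluster, key=lambda r: (filled_fields(r), r["updated"]))

def merge_cluster(cluster, append_fields=("notes",)):
    """Collapse a cluster into one golden record led by the master record."""
    master = pick_master(cluster)
    golden = dict(master)
    for record in cluster:
        if record is master:
            continue
        for field, value in record.items():
            if field == "updated" or value in (None, ""):
                continue
            if field in append_fields:
                # Appended fields collect distinct values from every record.
                parts = [p for p in (golden.get(field) or "").split("; ") if p]
                if value not in parts:
                    parts.append(value)
                golden[field] = "; ".join(parts)
            elif golden.get(field) in (None, ""):
                # Blank master fields are filled; filled ones are kept.
                golden[field] = value
    return golden

a = {"name": "Sanjiv Kumar", "email": "", "company": "HPE India Private Limited",
     "notes": "met at expo", "updated": date(2020, 1, 5)}
b = {"name": "Sanjeev Kumar", "email": "sanjiv@example.com", "company": "",
     "notes": "asked for demo", "updated": date(2021, 3, 1)}
c = {"name": "Sanjiv Kumar", "email": "sanjiv@example.com",
     "company": "HPE India Private Ltd.", "notes": "", "updated": date(2019, 6, 1)}

cluster = [a, b, c]
golden = merge_cluster(cluster)
print(golden)
```

Here all three records are equally complete, so the most recently updated one leads the merge; its blank company is filled from another record and the notes are appended rather than overwritten.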

Examples of Duplicate Clusters

CDQ extracts clusters of duplicate records from single or multiple datasets, which could comprise millions of records. Here are some examples. The majority of them are close to real cases of duplicate data found by the product during its usage. The real data has been changed for confidentiality, but the discovered pattern is intact.

Full Name - First Example

Example of Name de-duplication taken from a medical institution in India. Combinations of salutations, qualifications and swaps of first names and surnames were considered by CDQ:

Sheela Joshi
Dr. Sheela Joshi, PhD
Mrs. Joshi, Sheela
Sheela Joshi, M.B.B.S.
Joshi Sheela
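
One simple way to see why these five entries fall into the same cluster is to strip salutations and qualifications and ignore token order. The sketch below does exactly that; the word lists and the function name are hypothetical, and it is not a description of CDQ's own matching logic.

```python
import re

# Hypothetical, deliberately short word lists for illustration only.
SALUTATIONS = {"dr", "mr", "mrs", "ms", "prof"}
QUALIFICATIONS = {"phd", "mbbs", "mba"}

def name_key(raw: str) -> tuple:
    """Drop salutations/qualifications and sort the remaining name tokens."""
    tokens = []
    for token in re.split(r"[\s,]+", raw.lower()):
        token = token.replace(".", "")  # "M.B.B.S." -> "mbbs", "Dr." -> "dr"
        if token and token not in SALUTATIONS and token not in QUALIFICATIONS:
            tokens.append(token)
    return tuple(sorted(tokens))  # sorting makes "Joshi Sheela" == "Sheela Joshi"

names = [
    "Sheela Joshi",
    "Dr. Sheela Joshi, PhD",
    "Mrs. Joshi, Sheela",
    "Sheela Joshi, M.B.B.S.",
    "Joshi Sheela",
]
keys = {name_key(n) for n in names}
print(keys)
```

All five variants reduce to the same key in this sketch, so they would be grouped together.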

Full Name - Second Example

Example of Name de-duplication taken from a warranty registration database in the Philippines:

Ivy Mathew Griffin
Ivy M. Griffin
Ivy Matt Griffin

Full Name - Third Example

A powerful example of CDQ's capabilities. Example of Name de-duplication taken from a database of a South Asian country:

Mohammed Qasim
Mohammad Kasim
Mohd. Kasim
Mhd. Kasim
Md. Kasim
Muhammad Kasim

Full Name - Fourth Example

Example of variations of a name considered in a suspected duplicate cluster by CDQ:

Casey Pabilla
Cassey Pabilla
Caseyy Pabilla
Caesey Pabilla
Caseey Pabilla
Caseey Pabillaa
Caseey Pabella
Caseeey Pabilla
Caasaay Pabilla
Caasey Pabilla
Caseey Pabilla
Casey Pabillla
Casey Pabellla
Casey Pabella
Casiy Pabilla
Casii Pabilla
Casey Pabiilla
Casey Pabilla
Caseeyy Pabilla
Caassey Pabilla
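
Spelling variations like these are typically caught with a string-similarity score rather than exact comparison. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in; the 0.8 threshold is an assumption chosen for this example, and CDQ's own fuzzy algorithms are not shown here.

```python
from difflib import SequenceMatcher

REFERENCE = "Casey Pabilla"

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_suspected_duplicate(candidate: str, threshold: float = 0.8) -> bool:
    """Flag a candidate whose similarity to the reference meets the threshold."""
    return similarity(REFERENCE, candidate) >= threshold

variants = [
    "Cassey Pabilla", "Caseyy Pabilla", "Caesey Pabilla", "Caseey Pabillaa",
    "Caseey Pabella", "Caasaay Pabilla", "Casey Pabellla", "Casii Pabilla",
    "Casey Pabiilla", "Caassey Pabilla",
]
flags = {v: is_suspected_duplicate(v) for v in variants}
for variant in variants:
    print(f"{variant:18} {similarity(REFERENCE, variant):.3f}")
```

With this threshold every variant listed in the sketch is flagged, while an unrelated name such as "Ivy Griffin" is not. A production system would normally combine such a score with other evidence before treating records as duplicates.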

Address - First Example

This is one of the best examples of Address de-duplication, highlighted by CDQ within a massive CRM database in India. Not only are there inconsistent abbreviations and spelling errors, but the old and new official names of the city have also been detected as duplicates:

43/2, Industrial Road, Sector 65, Bangalore
sector 65, bangalore - 43/2 (industrial) rd
43\2, sectorr 65 - industrial rd,, BENGALURU
43 2 indl rd sec 65 bangalore
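
The four addresses above can be reduced to one comparable form by expanding abbreviations, mapping the new city name to the old one, correcting near-miss spellings and ignoring punctuation and word order. The following is a rough sketch of that idea only: the alias table, the vocabulary and the 0.85 cutoff are hypothetical and far smaller than anything a real product would need.

```python
import re
from difflib import get_close_matches

# Hypothetical alias table: abbreviation or alternative name -> canonical token.
ALIASES = {"rd": "road", "indl": "industrial", "sec": "sector",
           "bengaluru": "bangalore"}
# Known canonical words used to repair small typos such as "sectorr".
VOCABULARY = ["bangalore", "industrial", "road", "sector"]

def address_key(raw: str) -> tuple:
    """Return an order-independent tuple of canonical address tokens."""
    tokens = []
    for token in re.findall(r"[a-z0-9]+", raw.lower()):
        token = ALIASES.get(token, token)
        if token.isalpha() and token not in VOCABULARY:
            close = get_close_matches(token, VOCABULARY, n=1, cutoff=0.85)
            if close:
                token = close[0]
        tokens.append(token)
    return tuple(sorted(tokens))

addresses = [
    "43/2, Industrial Road, Sector 65, Bangalore",
    "sector 65, bangalore - 43/2 (industrial) rd",
    r"43\2, sectorr 65 - industrial rd,, BENGALURU",  # raw string keeps the backslash
    "43 2 indl rd sec 65 bangalore",
]
keys = {address_key(a) for a in addresses}
print(keys)
```

In this sketch all four addresses collapse to a single key. Ignoring word order is a simplification that can over-match on real data, so it would normally be one signal among several.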

Address - Second Example

An example of Address de-duplication, from Singapore:

#01-33, 92 Whampoa Annexe, Causeway Drive
01 33, 92-whampoa annx, causeway dr
01--33 whampoa anx #92, causeway drv
#92. Whampoa Anex. Causeway= Drive, 01,33

Address - Third Example

An example of a duplicate address cluster from Australia. Note the abbreviations and variations of the state name captured in the duplicate cluster:

7th Floor, 43/2 Miller Plaza, Industrial Highway, Sydney, New South Wales
7th fl miller plz, (industrial) hw – 43 2, sydney, nsw
Seventh Floor. Indl Hway. 43-2 Plaza. Miller. Sydney. N.S.W.
Flr 7th, #43—2, miller pz, sydney indl hwye, ns.w

Mobile Numbers

CDQ finds mobile numbers in multiple formats and groups the duplicates together. Here's an example of such a group:

+63-906-222-1520
0063 9 06 22 21 520
+(906).222.1520
0-906-22-21-520
9 0 6 2 2 2 1 5 2 0
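
A common way to compare numbers written in different formats is to strip everything except digits and keep only the national part. The sketch below assumes a 10-digit national number, which fits the group above but is not universal; the function name is hypothetical.

```python
import re

def mobile_key(raw: str, national_length: int = 10) -> str:
    """Keep digits only, then the last `national_length` digits.

    Assumption: country code and trunk prefix ("+63", "0063", "0") sit in
    front of a national number of fixed length.
    """
    digits = re.sub(r"\D", "", raw)
    return digits[-national_length:]

numbers = [
    "+63-906-222-1520",
    "0063 9 06 22 21 520",
    "+(906).222.1520",
    "0-906-22-21-520",
    "9 0 6 2 2 2 1 5 2 0",
]
keys = {mobile_key(n) for n in numbers}
print(keys)
```

All five formats yield the same 10-digit key in this sketch.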

Company Name - First Example

An example of Company Name de-duplication, from India:

HPE India Private Limited
HPE India Private Ltd.
HPE Pvt. Ltd. – India
H.P.E. Pvt. Ltd.
HPE Limited

Company Name - Second Example

Another similar example of Company Name de-duplication from the Philippines:

HPE Philippines Incorporated
HPE Philippines Inc.
HPE Phils, Inc.
H.P.E. Incorporated
HPE Inc
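
Company names like these differ mostly in punctuation, legal-form suffixes and abbreviations. A rough sketch of one way to compare them follows; the legal-form list, the alias table and the loose subset rule are hypothetical choices for illustration and are not CDQ's configured rules.

```python
import re

# Hypothetical word lists, kept short for the example.
LEGAL_FORMS = {"private", "pvt", "limited", "ltd", "incorporated", "inc"}
ALIASES = {"phils": "philippines"}

def company_tokens(raw: str) -> frozenset:
    """Remove dots and legal forms, expand aliases, return the core tokens."""
    cleaned = raw.lower().replace(".", "")  # "H.P.E." -> "hpe"
    tokens = (ALIASES.get(t, t) for t in re.findall(r"[a-z0-9]+", cleaned))
    return frozenset(t for t in tokens if t not in LEGAL_FORMS)

def same_company(a: str, b: str) -> bool:
    """Loose rule: one core token set must be contained in the other."""
    ta, tb = company_tokens(a), company_tokens(b)
    return bool(ta) and bool(tb) and (ta <= tb or tb <= ta)

names = [
    "HPE Philippines Incorporated",
    "HPE Philippines Inc.",
    "HPE Phils, Inc.",
    "H.P.E. Incorporated",
    "HPE Inc",
]
matches = [same_company(names[0], other) for other in names[1:]]
print(matches)
```

The subset rule is deliberately loose: a bare "HPE Inc" would match the entity of any country under it, so in practice it would need supporting evidence such as an address or a contact.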

Name + Mobile Number

Example of 5 duplicate Name and Mobile Number combinations as found by CDQ:

Name: Mohammad Kasim
Mobile: +91-98336-90611

Name: Mohd. Kasim
Mobile: 0091 98 33 69 06 11

Name: Mhd. Kasim
Mobile: (9833) 690-611

Name: Md. Kasim
Mobile: 0-98336-90611

Name: Muhammad Kasim
Mobile: 9 8 3 3 6 9 0 6 1 1

Name + Date of Birth

Example of 4 duplicate Person's Name and Date of Birth combinations as found by CDQ:

Person's Name: Narendra Bajpayee
Date of Birth: 15/11/1984

Person's Name: Narindir Bajpayee
Date of Birth: 11-15-1984

Person's Name: Narender Bejpeyee
Date of Birth: 15.11.84

Person's Name: Nariinder Baajpayii
Date of Birth: 1984, novembr 15
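
The four dates of birth above are written in different orders and styles, one of them with a misspelled month. A small sketch of how such values might be brought to one form is shown below. The rules are assumptions made for this example: a 4-digit number is the year, a trailing 2-digit year belongs to the 1900s, a month name is recognized by its first three letters, and a number above 12 must be the day.

```python
import re
from datetime import date

MONTH_PREFIXES = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
                  "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

def parse_dob(raw: str, century: int = 1900):
    """Return a date, or None when the value is ambiguous or invalid."""
    year = month = day = None
    numbers = []
    for token in re.findall(r"[a-z]+|\d+", raw.lower()):
        if token.isdigit():
            if len(token) == 4:
                year = int(token)
            else:
                numbers.append(int(token))
        elif token[:3] in MONTH_PREFIXES:
            month = MONTH_PREFIXES[token[:3]]  # "novembr" -> 11
    if year is None and len(numbers) == 3:
        year = century + numbers.pop()  # trailing two-digit year (assumed 19xx)
    if month is None and len(numbers) == 2:
        first, second = numbers
        if first > 12:
            day, month = first, second
        elif second > 12:
            month, day = first, second
        else:
            return None  # cannot tell day from month, e.g. 03/04/1984
    elif month is not None and len(numbers) == 1:
        day = numbers[0]
    if None in (year, month, day):
        return None
    try:
        return date(year, month, day)
    except ValueError:
        return None

values = ["15/11/1984", "11-15-1984", "15.11.84", "1984, novembr 15"]
parsed = [parse_dob(v) for v in values]
print(parsed)
```

All four values resolve to 15 November 1984 in this sketch, while a genuinely ambiguous value is left unresolved instead of being guessed.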

Name + Company

Example of 4 duplicate combinations of Person's and Company Names as found by CDQ:

Person's Name: Sanjiv Kumar
Company's Name: HPE India Private Limited

Person's Name: Sanjeve Kumarr
Company's Name: HPE India Private Ltd.

Person's Name: Sanjeev Qumar
Company's Name: HPE Pvt. Ltd

Person's Name: Sanjive Koomar
Company's Name: HPE Limited

Website URL

CDQ groups different Website URLs which refer to the same page in a single cluster. Here's an example of such a group:

contactous.com
http://www.contactous.com/index.htm
www4.contactous.com/?query=malaysia
https://contactous.com:8080/
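
For URLs, a comparable key can be built by discarding the scheme, a leading "www" or "www4" label, the port, the path and the query string. The sketch below does this with Python's standard `urllib.parse`; reducing every URL to its site is an assumption that fits the group above, and the function name is hypothetical.

```python
import re
from urllib.parse import urlsplit

def site_key(raw: str) -> str:
    """Reduce a URL to its host without scheme, port or 'www' style prefix."""
    text = raw.strip().lower()
    if "://" not in text:
        text = "//" + text  # lets urlsplit treat a bare host as a network location
    host = urlsplit(text).hostname or ""
    return re.sub(r"^www\d*\.", "", host)

urls = [
    "contactous.com",
    "http://www.contactous.com/index.htm",
    "www4.contactous.com/?query=malaysia",
    "https://contactous.com:8080/",
]
keys = {site_key(u) for u in urls}
print(keys)
```

All four URLs map to the same key in this sketch. Paths and queries are ignored entirely, which would be too coarse where different pages of one site must be kept apart.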

Frequently Discussed Topics

On location of hosted data

CDQ's computational servers run on AWS Singapore and are separate from the servers where databases are kept. Our compute servers are fixed, but the database could reside in one of the following 3 configurations:

1) Data on Contactous' Servers - Here, we create a separate environment for the customer's data on one of our AWS Singapore instances. This is a good option if the data is of low volume or for a pilot project. This option is also used by our customers as a pre-CRM space, to scrub and clean the data before transferring it to other systems.

2) Data on Customer's own AWS Instance - We recommend this option as it is fast to set up and gives our customers assurance, since database access is managed by them. We help to set up the system and establish the API access, after which full control of the database is with the customer.

3) Data on Customer's own data center - This on-premise database location is also possible. The setup is similar to #2, but co-ordination and testing usually take more time.

Security of application and platform

Contactous' CDQ has a 100% SaaS architecture and is hosted on AWS Singapore. Compliance certificates are available at:
https://aws.amazon.com/compliance/programs/

The entire Contactous application (including all web services) is at https://web.contactous.com/. All external access to our platform is protected by HTTPS over TLS 1.2. Strong encryption algorithms (AES 128 GCM) are used. Validated by SSLLabs (rated A).

The risk of data being intercepted by a third party during transmission is minimal. Validate at: https://www.ssllabs.com/ssltest/analyze.html?d=web.contactous.com

Every instance used by a customer runs a separate set of programs isolated from others.

We are listed on the application exchanges of Salesforce and Microsoft, both of which have the highest level of security standards, which we comply with. Check:
https://itunes.apple.com/sg/app/contactous/id1161332503?mt=8 and
https://play.google.com/store/apps/details?id=com.contactous&hl=en

De-duplication speed and data volume

CDQ has been designed to give fast de-duplication results. In our sample database with over 2 million records, the duplicate pattern is found within a second. This speed has enabled our customers to use the system in real-time.

When the records are read by CDQ, they get indexed in multiple ways through its complex algorithms. For a smaller number of records, such as a couple of thousand, this indexing takes place automatically. When large datasets are imported, the indexing can be manually triggered.

Large datasets can be uploaded into CDQ as CSV files. Our customers frequently upload datasets of half a million records as CSVs. If a very large database of several million records needs to be indexed, our consultants can help to ensure that the data upload is done successfully.

Algorithms in use and customization

The algorithms used for de-duplication and entity resolution within CDQ are founded on proven data matching frameworks, but have been developed from scratch. These algorithms make heavy use of fuzzy logic and other probabilistic methods to arrive at decisions. A big emphasis has been placed on the quality of output, which will remain the guiding principle in future. The algorithms in the current version of CDQ have been proven by running on millions of actual records.

Our algorithms give 3 types of results: 1) Exact - The output is the result of a direct match, and this functionality can be compared to the de-duplication offered by many CRMs. 2) Fuzzy - Based on our own AI based algorithms. This is probabilistic, whereas an Exact match is deterministic. 3) Smart match - It shows results where our confidence level is very high, and it is a combination of deterministic and probabilistic approaches.
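
To make the distinction between the three result types more concrete, here is a small sketch. It is only one possible reading and not CDQ's implementation: the fields compared, the use of `difflib.SequenceMatcher` for the probabilistic part, the 0.8 threshold and the rule that a "smart" match needs a deterministic mobile agreement plus a fuzzy name agreement are all assumptions.

```python
import re
from difflib import SequenceMatcher

def mobile_digits(value: str) -> str:
    """Deterministic key: last 10 digits of the number (assumed length)."""
    return re.sub(r"\D", "", value)[-10:]

def name_similarity(a: str, b: str) -> float:
    """Probabilistic signal: similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify(a: dict, b: dict, fuzzy_threshold: float = 0.8):
    """Return 'exact', 'smart', 'fuzzy' or None for a pair of records."""
    same_name = a["name"].strip().lower() == b["name"].strip().lower()
    same_raw_mobile = a["mobile"].strip() == b["mobile"].strip()
    if same_name and same_raw_mobile:
        return "exact"  # direct field-by-field match
    key_a, key_b = mobile_digits(a["mobile"]), mobile_digits(b["mobile"])
    name_close = name_similarity(a["name"], b["name"]) >= fuzzy_threshold
    if name_close and key_a and key_a == key_b:
        return "smart"  # deterministic key agreement plus a fuzzy name
    if name_close:
        return "fuzzy"  # probabilistic evidence only
    return None

base = {"name": "Mohammad Kasim", "mobile": "+91-98336-90611"}
results = [
    classify(base, {"name": "Mohammad Kasim", "mobile": "+91-98336-90611"}),
    classify(base, {"name": "Muhammad Kasim", "mobile": "9 8 3 3 6 9 0 6 1 1"}),
    classify(base, {"name": "Muhammad Kasim", "mobile": "+91-90000-00000"}),
    classify(base, {"name": "Ivy Griffin", "mobile": "+63-906-222-1520"}),
]
print(results)
```

The four pairs in this sketch come out as exact, smart, fuzzy and no match respectively, which mirrors the ordering of confidence described above.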

On Pricing

CDQ follows a yearly subscription model and its pricing is based on the number of records that are stored or indexed by the application.

There is no setup cost if the database is kept on Contactous' own servers. However, there is a one-time setup charge if the database is kept on the user's own AWS instance or on-premise. Consulting services, if taken, are charged separately.

There are no other charges.

Consulting services and partners

8 out of 10 users of CDQ are using it the way it is designed. All frequent combinations of de-duplication keys, which are used across industries in both B2C and B2B configurations, have been programmed in the system. Creation of the data dictionary is also always handled by users themselves, due to its simplicity.

Still, there are 7 areas where our services have been requested for more specialized tasks.

1) Creation of special de-duplication criteria
2) Implementation of user's own de-duplication algorithm
3) Custom logic of entity resolution methods
4) Special logic to merge duplicate records
5) Large data scrubbing and readiness
6) Extraction programs for unstructured data
7) Consulting services on data ETL

Our experienced consultants and network of partners are available for such tasks. With proper non-disclosure and confidentiality agreements in place, our team is ready to work with our users on such requirements.
