Contactous
  • Products
    • Contact Management >
      • Enterprise Contact Manager (ECM)
      • ECM Pricing
    • Data Quality >
      • DeDupe API
      • CRM Data Quality
    • Data Parser >
      • On-Premise Data Parser
      • Cloud-based Data Extractor and Parser
    • AI Content >
      • Personalized Learning
    • RAG-as-a-service >
      • Answerous
      • Free Trial
    • Carbon Estimation API
  • Use Cases
    • Digital Business Cards
    • Customer Golden Record
    • Sales Funnel from Dealers
    • Automated Document Parser
    • Relationship Intelligence
    • Marketing Leads Management
    • Intelligent Data Import
    • CRM Data Consolidation
    • Webinars and Events
    • Physical Business Cards
    • Enterprise Pre-CRM
  • Company
    • Contact Us
    • Terms Of Use
    • Privacy Policy
  • Login

Complex de-duplication and entity resolution

CRM Data Quality (CDQ)

CDQ cleans large contact databases by finding complex patterns of duplicate data, achieves a single version of truth through its Entity Resolution algorithms, and executes real-time de-duplication checks on applications within the enterprise.

Main Features

Assimilate data from multiple files and systems
CDQ enables a data dictionary to be defined using the field constructor option. Hundreds of fields can be defined for the contact data, each as a string, numeric, text or date type. This definition can be modified at any time.

Data can be uploaded from Comma Separated Values (CSV) or text files. The upload feature automatically attempts to map the incoming data to the fields in the data dictionary. Fields can be re-mapped or skipped if needed. 

Dozens of encoding options are available during import, with UTF-8 selected by default. During the import process, the data is checked for completeness. The data can be added to an existing dataset; by default, a new dataset is created after a successful upload.
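The automatic mapping step can be pictured with a short Python sketch. It is an illustration only and not CDQ's implementation: the dictionary field names, the similarity cutoff of 0.6 and the helper names `map_headers` and `read_rows` are all assumptions made for this example.

```python
import csv
import difflib
import io

# Hypothetical data dictionary: field name -> type (names are illustrative only).
DATA_DICTIONARY = {
    "full_name": "string",
    "email": "string",
    "mobile": "string",
    "registration_date": "date",
}

def _simplify(label):
    """Lower-case a label and keep only letters and digits."""
    return "".join(ch for ch in label.lower() if ch.isalnum())

def map_headers(headers, dictionary=DATA_DICTIONARY, cutoff=0.6):
    """Map each incoming header to the closest dictionary field, or None to skip it."""
    known = {_simplify(name): name for name in dictionary}
    mapping = {}
    for header in headers:
        close = difflib.get_close_matches(_simplify(header), list(known), n=1, cutoff=cutoff)
        mapping[header] = known[close[0]] if close else None
    return mapping

def read_rows(raw_bytes, encoding="utf-8"):
    """Decode an uploaded CSV and return its rows keyed by dictionary field names."""
    reader = csv.DictReader(io.StringIO(raw_bytes.decode(encoding), newline=""))
    mapping = map_headers(reader.fieldnames or [])
    # Columns without a mapping are skipped, mirroring the "skip" option.
    return [{mapping[h]: v for h, v in row.items() if mapping.get(h)} for row in reader]
```

In this sketch a header such as "Mobile No" lands on the `mobile` field, while an unrelated header such as "Fax" is left unmapped so that a user could re-map or skip it.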

External data can also be imported by connecting to applications like Salesforce, Eloqua and Zoho CRM. Any system with API access can be integrated with CDQ.
Create single contact records from disconnected datasets
Data from multiple datasets can be merged after CDQ has discovered a common factor between them. Using its de-duplication rules, CDQ forms a cluster of records that share this common criterion. These records can then be collapsed into a single golden record. The data custodian can do this merging manually or instruct CDQ to perform it automatically.

In the example below, CDQ has found 4 common records, which the data custodian has collapsed into a single golden record within the CRM.
Picture
Identify duplicate records across databases
Using its pattern matching algorithms, CDQ can find common records across databases and multiple datasets. While CSV and text files are usually uploaded directly into a dataset within CDQ, data from external systems can be handled either by importing it into CDQ or by storing only the indexed keys of the records of that database.

Storing the data within CDQ is optional. Essentially, CDQ stores the key that it constructs when indexing a dataset. Indexing an external data system is sufficient for CDQ to compare that database with others or to perform real-time de-duplication on it.
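The idea of keeping only keys rather than full records can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the key here is a plain SHA-256 hash over two lightly normalized fields, and the names `index_key` and `KeyIndex` are hypothetical. The keys CDQ actually constructs are not described in this document and would have to support fuzzy matching as well.

```python
import hashlib

def index_key(record, fields=("name", "email")):
    """Build a fixed-length key from normalized field values (hypothetical scheme)."""
    # Lower-case each value and collapse runs of whitespace before hashing.
    parts = [" ".join(str(record.get(f, "")).lower().split()) for f in fields]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

class KeyIndex:
    """Stores only keys and external record IDs, never the records themselves."""

    def __init__(self):
        self._ids_by_key = {}

    def add(self, record_id, record):
        """Index one record from an external system under its key."""
        self._ids_by_key.setdefault(index_key(record), []).append(record_id)

    def find_duplicates(self, record):
        """Return IDs of indexed records that share the incoming record's key."""
        return list(self._ids_by_key.get(index_key(record), []))
```

Because a lookup is a single dictionary access, a check of this kind can run at data-entry time without the source records ever leaving the external system.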
Automatically standardize, enhance and repair data
CDQ has dozens of functions to manually or automatically repair, standardize and enhance the data within its datasets. The example below shows the strength of these functions: an incoming record on the left is passed through them, resulting in the final record on the right.
Picture
  • The name field was cleaned and converted to proper format.
  • Before converting the email to its proper format, CDQ checked whether it is a valid address by sending a request to the SMTP server of its domain. It turned out to be an address of bad quality (which is displayed in the last row). CDQ then corrected the spelling to recommend an alternative email address and verified that the alternative is valid.
  • The company name, Address-1, Postal code and Website are converted to proper format.
  • The designation is standardized.
  • A typo (sectorr) in Address-2 is automatically corrected.
  • "gurgaon" in the City field refers to an old name of this city. The new official name is "Gurugram", which is recommended by CDQ.
  • The State and Country fields were blank in the original data. These were populated by CDQ from its internal database.
  • The registration date was corrected and normalized to a format readable by other systems.
CDQ can be customized further by connecting to other external databases, such as company-specific or country-specific regional definitions, or through APIs for geo-location or mobile number validation.
Check for duplicates in real-time from millions of records
CDQ can find duplicates within seconds, across tens of millions of records. Our customers use this real-time de-duplication from data entry to identification, and in other innovative ways such as the one explained in the example below:
Picture
Before CDQ integration, the data comes from a customer's web submission form and is passed to CRM, telesales or lead-nurturing programs. In the example above, CDQ is made to tap into this email (without integration); it performs de-duplication on incoming records in real-time and sends that information to telesales. The actions taken on the original mail (and their effectiveness) become very different from those taken on the one sent by CDQ.
Resolve entity names and cluster within large bulk files
From merchant data records in banks to epidemiological data in hospitals and product descriptions in warranty management, large files with millions of records need to be quickly resolved to determine the right entity to be used through some agreed logic.
Picture
In the example above, a financial institution receives a text file of a million records comprising merchant names. CDQ can take this entire dataset and, within an hour, provide an entity resolution map that identifies the main entity and the other records in the incoming file that correspond to it. Here, CDQ shows (based on configured rules) that Starbucks (S) Pte Ltd is the main entity and that the 5 other names should correspond to this main record within the text file. With this resolution, the file can now be easily processed.
Merge duplicate contact records automatically or manually
Finding the duplicate records is part of the solution; merging them into a single version of truth is the next step. Once a cluster of duplicate records is found, it is important to determine the master record that will lead the merging process. CDQ has a rich set of configurable rules to determine which record becomes the master record within a cluster. The user can always override this suggestion.

What happens to the records that get merged can be configured too. They can be deleted or moved to a special dataset to be analyzed later. Individual values can likewise be configured to be overwritten or appended.

Lastly, the merging process itself can be automatic: CDQ can take responsibility for determining the master record and merge the duplicate records of tens of thousands of clusters into their master records within an hour.
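A minimal Python sketch of such a merge is given below. The master-record rule used here (the record with the most filled-in fields wins), the `notes` field and the function names are assumptions chosen for illustration. CDQ's own configurable rules are not specified in this document.

```python
def pick_master(cluster):
    """Hypothetical rule: the record with the most non-empty fields leads the merge."""
    return max(cluster, key=lambda rec: sum(1 for value in rec.values() if value))

def merge_cluster(cluster, append_fields=("notes",)):
    """Merge a cluster into one golden record; return it with the retired records."""
    master = pick_master(cluster)
    golden = dict(master)
    retired = [rec for rec in cluster if rec is not master]
    for rec in retired:
        for field, value in rec.items():
            if not value:
                continue
            if field in append_fields and golden.get(field) and value != golden[field]:
                # Append: keep both values for fields configured as appendable.
                golden[field] = golden[field] + "; " + value
            elif not golden.get(field):
                # Fill a blank in the master with the duplicate's value.
                golden[field] = value
    return golden, retired
```

The retired records are returned rather than discarded, so a caller could either delete them or move them to a separate dataset for later analysis.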

Examples of Duplicate Clusters

CDQ extracts clusters of duplicate records from single or multiple datasets, which could comprise millions of records. Here are some examples. Most of them are close to real cases of duplicate data found by the product during its usage. The real data has been changed for confidentiality, but the discovered pattern is intact.
Full Name - First Example
Example of Name de-duplication taken from a medical institution in India. CDQ considered combinations of salutations, qualifications and swapped first names and surnames:
  • Sheela Joshi
  • Dr. Sheela Joshi, PhD
  • Mrs. Joshi, Sheela
  • Sheela Joshi, M.B.B.S.
  • Joshi Sheela
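One way to see why these five entries belong together is to reduce each name to an order-independent key. The Python sketch below is an assumption-laden illustration, not CDQ's algorithm: the salutation and qualification lists are tiny hypothetical samples, and only ASCII letters are handled.

```python
import re

# Small illustrative lists; real data would need far more entries.
SALUTATIONS = {"dr", "mr", "mrs", "ms", "prof"}
QUALIFICATIONS = {"phd", "mbbs", "mba"}

def name_key(full_name):
    """Return the name tokens, minus titles and degrees, in sorted order."""
    # Remove dots first so "M.B.B.S." collapses to "mbbs", then split on non-letters.
    cleaned = full_name.lower().replace(".", "")
    tokens = re.split(r"[^a-z]+", cleaned)
    ignored = SALUTATIONS | QUALIFICATIONS
    return tuple(sorted(t for t in tokens if t and t not in ignored))
```

Applied to the list above, every entry reduces to the same key, `("joshi", "sheela")`, because sorting the remaining tokens cancels out the swap of first name and surname.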
Full Name - Second Example
Example of Name de-duplication taken from a warranty registration database in the Philippines:
  • Ivy Mathew Griffin
  • Ivy M. Griffin
  • Ivy Matt Griffin
Full Name - Third Example
A powerful example of CDQ's capabilities. Example of Name de-duplication taken from a database of a South Asian country:
  • Mohammed Qasim
  • Mohammad Kasim
  • Mohd. Kasim
  • Mhd. Kasim
  • Md. Kasim
  • Muhammad Kasim
Full Name - Fourth Example
Example of variations of a name considered in a suspected duplicate cluster by CDQ:
  • Casey Pabilla
  • Cassey Pabilla
  • Caseyy Pabilla
  • Caesey Pabilla
  • Caseey Pabilla
  • Caseey Pabillaa
  • Caseey Pabella
  • Caseeey Pabilla
  • Caasaay Pabilla
  • Caasey Pabilla
  • Caseey Pabilla
  • Casey Pabillla
  • Casey Pabellla
  • Casey Pabella
  • Casiy Pabilla
  • Casii Pabilla
  • Casey Pabiilla
  • Casey Pabilla
  • Caseeyy Pabilla
  • Caassey Pabilla
Address - First Example
This is one of the best examples of Address de-duplication, highlighted by CDQ within a massive CRM database in India. Not only are there inconsistent abbreviations and spelling errors, but the old and new official names of the city have also been detected as duplicates:
  • 43/2, Industrial Road, Sector 65, Bangalore
  • sector 65, bangalore - 43/2 (industrial) rd
  • 43\2, sectorr 65 - industrial rd,, BENGALURU
  • 43 2 indl rd sec 65 bangalore 
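The following Python sketch shows one possible way to bring these four variants to a common form. It is a toy under explicit assumptions: the alias table, the vocabulary and the 0.85 similarity cutoff are hypothetical and cover only this example, and the spelling repair uses Python's standard `difflib` rather than anything CDQ is known to use.

```python
import difflib
import re

# Hypothetical lookup tables sized for this one example.
ALIASES = {"rd": "road", "indl": "industrial", "sec": "sector", "bengaluru": "bangalore"}
VOCABULARY = ["road", "industrial", "sector", "bangalore"]

def address_key(address):
    """Return an order-independent set of normalized address tokens."""
    # Split into runs of letters or digits, so "43/2", "43\2" and "43 2" all give 43 and 2.
    tokens = re.findall(r"[a-z]+|\d+", address.lower())
    normalized = set()
    for token in tokens:
        token = ALIASES.get(token, token)
        if token.isalpha():
            # Snap near-misses such as "sectorr" onto a known word.
            close = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.85)
            token = close[0] if close else token
        normalized.add(token)
    return frozenset(normalized)
```

With these tables, all four addresses above produce the same token set, which covers the abbreviations, the "sectorr" typo and the Bangalore/Bengaluru rename in one comparison.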
Address - Second Example
An example of Address de-duplication, from Singapore:
  • #01-33, 92 Whampoa Annexe, Causeway Drive
  • 01 33, 92-whampoa annx, causeway dr
  • 01--33 whampoa anx #92, causeway drv
  • #92. Whampoa Anex. Causeway= Drive, 01,33
Address - Third Example
An example of a duplicate address cluster from Australia. Note the abbreviations and variations of the state name captured in the duplicate cluster:
  • 7th Floor, 43/2 Miller Plaza, Industrial Highway, Sydney, New South Wales
  • 7th fl miller plz, (industrial) hw – 43 2, sydney, nsw
  • Seventh Floor. Indl Hway. 43-2 Plaza. Miller. Sydney. N.S.W.
  • Flr 7th, #43—2, miller pz, sydney indl hwye, ns.w
Mobile Numbers
CDQ finds mobile numbers in multiple formats and groups the duplicates together. Here's an example of such a group:
  • +63-906-222-1520
  • 0063 9 06 22 21 520
  • +(906).222.1520
  • 0-906-22-21-520
  • 9 0 6 2 2 2 1 5 2 0
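A simple way to group such numbers, sketched in Python below, is to discard everything except digits and compare only the trailing subscriber number. The 10-digit length and the function name `mobile_key` are assumptions that happen to fit this example. Numbering plans differ by country, so this is not a general rule and not necessarily how CDQ does it.

```python
def mobile_key(raw, subscriber_digits=10):
    """Keep ASCII digits only and return the trailing subscriber number."""
    digits = "".join(ch for ch in raw if "0" <= ch <= "9")
    # Country codes, trunk prefixes and leading zeros fall away with the slice.
    return digits[-subscriber_digits:]
```

All five numbers in the group above reduce to `9062221520` under this sketch.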
Company Name - First Example
An example of Company Name de-duplication, from India:
  • HPE India Private Limited
  • HPE India Private Ltd.
  • HPE Pvt. Ltd. – India
  • H.P.E. Pvt. Ltd.
  • HPE Limited
Company Name - Second Example
Another similar example of Company Name de-duplication, from the Philippines:
  • HPE Philippines Incorporated
  • HPE Philippines Inc.
  • HPE Phils, Inc.
  • H.P.E. Incorporated
  • HPE Inc
Name + Mobile Number
Example of 5 duplicate Name and Mobile Number combinations as found by CDQ:

Name: Mohammad Kasim 
Mobile: +91-98336-90611

Name: Mohd. Kasim
Mobile: 0091 98 33 69 06 11

Name: Mhd. Kasim
Mobile: (9833) 690-611

Name: Md. Kasim
Mobile: 0-98336-90611

Name: Muhammad Kasim
Mobile: 9 8 3 3 6 9 0 6 1 1
Name + Date of Birth
Example of 4 duplicate combinations of Person's Name and Date of Birth as found by CDQ:

Person's Name: Narendra Bajpayee
Date of Birth: 15/11/1984

Person's Name: Narindir Bajpayee
Date of Birth: 11-15-1984

Person's Name: Narender Bejpeyee
Date of Birth: 15.11.84

Person's Name: Nariinder Baajpayii
Date of Birth: 1984, novembr 15
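The four date formats above can be reconciled with a small amount of parsing logic. The Python sketch below is an illustration under stated assumptions: two-digit years are placed in the 1900s, a misspelled month is recognized by its first three letters, and a numeric date with no value above 12 is read day-first. The function name `parse_birth_date` is hypothetical, and an unrecognized month word raises a `ValueError`.

```python
import re
from datetime import date

MONTH_PREFIXES = ["jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"]

def parse_birth_date(text, century=1900):
    """Parse a loosely formatted date; ambiguous numeric dates are read day-first."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    word = re.search(r"[A-Za-z]{3,}", text)
    month = None
    if word:
        # "novembr" still starts with "nov", so a prefix lookup tolerates the typo.
        month = MONTH_PREFIXES.index(word.group().lower()[:3]) + 1
    # The year is the 4-digit number if present, otherwise the last number.
    year = next((n for n in numbers if n >= 1000), None)
    if year is None:
        year = century + numbers[-1]
        numbers = numbers[:-1]
    else:
        numbers.remove(year)
    if month is None:
        first, second = numbers
        # A value above 12 can only be a day, which settles the order.
        day, month = (second, first) if second > 12 else (first, second)
    else:
        day = numbers[0]
    return date(year, month, day)
```

Under these assumptions, each of the four Date of Birth values above parses to 15 November 1984, so the records differ only in the spelling of the name.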
Name + Company
Example of 4 duplicate combinations of Person's and Company Names as found by CDQ:

Person's Name: Sanjiv Kumar
Company's Name: HPE India Private Limited

Person's Name: Sanjeve Kumarr
Company's Name: HPE India Private Ltd.

Person's Name: Sanjeev Qumar
Company's Name: HPE Pvt. Ltd

Person's Name: Sanjive Koomar
Company's Name: HPE Limited
Website URL
CDQ groups different Website URLs that refer to the same page into a single cluster. Here's an example of such a group:
  • contactous.com
  • http://www.contactous.com/index.htm
  • www4.contactous.com/?query=malaysia
  • https://contactous.com:8080/
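As a rough illustration, the grouping above can be reproduced in Python by reducing each URL to its bare host name. This sketch simplifies heavily: it treats every page of a site as the same entry and strips any `www` or `wwwN` prefix, which is an assumption made for this example. The function name `site_key` is hypothetical.

```python
import re
from urllib.parse import urlsplit

def site_key(url):
    """Reduce a URL to its bare host name, ignoring scheme, www prefix, port, path and query."""
    url = url.strip()
    # urlsplit only recognizes the host when a scheme or "//" precedes it.
    candidate = url if "://" in url else "//" + url
    host = urlsplit(candidate).hostname or ""
    return re.sub(r"^www\d*\.", "", host)
```

Each of the four URLs in the group above yields `contactous.com` with this function.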

Frequently Discussed Topics

On location of hosted data
CDQ's computational servers run on AWS Singapore and are separate from the servers where databases are kept. Our compute servers are fixed, but the database could reside in one of the following 3 configurations:

1) Data on Contactous' Servers - Here, we create a separate environment for the customer's data on one of our AWS Singapore instances. This is a good option if the data is of low volume or for a pilot project. Our customers also use this option as a pre-CRM space, to scrub and clean the data before transferring it to other systems.

2) Data on Customer's own AWS Instance - We recommend this option as it is fast to set up and gives our customers assurance, since database access is managed by them. We help to set up the system and establish the API access, after which the customer has full control of the database.

3) Data on Customer's own data center - This on-premise database location is also possible. The setup is like that in #2, but co-ordination and testing usually take more time.
Security of application and platform
Contactous' CDQ has a 100% SaaS architecture and is hosted on AWS Singapore. Compliance certificates are available at:
https://aws.amazon.com/compliance/programs/

The entire Contactous application (including all web services) is at https://web.contactous.com/. All external access to our platform is protected by HTTPS over TLS 1.2. Strong encryption algorithms (AES 128 GCM) are used. Validated by SSLLabs (rated A).

The risk of data being intercepted by a third party during transmission is minimal. Validate at: https://www.ssllabs.com/ssltest/analyze.html?d=web.contactous.com

Every instance used by a customer runs a separate set of programs isolated from others. 

We are listed on the Application Exchange of Salesforce and Microsoft, both of which have the highest level of security standards, with which we comply. Check:
https://itunes.apple.com/sg/app/contactous/id1161332503?mt=8 and 
https://play.google.com/store/apps/details?id=com.contactous&hl=en
De-duplication speed and data volume
CDQ has been designed to give fast de-duplication results. In our sample database with over 2 million records, the duplicate pattern is found within a second. This speed has enabled our customers to use the system in real-time.

When the records are read by CDQ, they get indexed in multiple ways through its complex algorithms. For a smaller number of records, such as a couple of thousand, this indexing takes place automatically. When large datasets are imported, the indexing can be manually triggered.

Large datasets can be uploaded into CDQ as CSV files. Our customers frequently upload datasets of half a million records as CSVs. In case a very large database of several million records needs to be indexed, our consultants can help to ensure that the data upload is done successfully.
Algorithms in use and customization
The algorithms used for de-duplication and entity resolution within CDQ build on the foundation of proven data matching frameworks, but have been developed from scratch. These algorithms make heavy use of fuzzy logic and other probabilistic methods to arrive at decisions. A big emphasis has been on the quality of output, which will remain the guiding principle in future. The algorithms in the current version of CDQ have been proven by running on millions of actual records.

Our algorithms give 3 types of results:

1) Exact - The output is the result of a direct match. This functionality is comparable to the de-duplication offered by many CRMs.
2) Fuzzy - Based on our own AI-based algorithms. This is probabilistic, whereas an Exact match is deterministic.
3) Smart match - Shows results where our confidence level is very high. It combines the deterministic and probabilistic approaches.
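To make the distinction between the three result types concrete, here is a hedged Python sketch. It is not CDQ's algorithm: the name similarity comes from Python's standard `difflib.SequenceMatcher`, the 0.8 threshold is arbitrary, and "smart" is modelled, as an assumption, as an exact email match confirmed by a fuzzy name match.

```python
from difflib import SequenceMatcher

def classify_match(a, b, threshold=0.8):
    """Return 'exact', 'smart', 'fuzzy' or None for two records with name and email."""
    if a == b:
        # Deterministic: every field is identical.
        return "exact"
    # Probabilistic: similarity ratio between the two names, from 0.0 to 1.0.
    name_score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    if name_score < threshold:
        return None
    # A deterministic email match on top of the fuzzy name match raises confidence.
    same_email = bool(a["email"]) and a["email"].lower() == b["email"].lower()
    return "smart" if same_email else "fuzzy"
```

For instance, "Casey Pabilla" and "Cassey Pabilla" with the same email address would be classed as a smart match here, and as a fuzzy match if the email addresses differ.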
On Pricing
CDQ follows a yearly subscription model, and its pricing is based on the number of records that are stored or indexed by the application.

There is no setup cost if the database is kept on Contactous' own servers. However, there is a one-time setup charge if the database is kept on the customer's own AWS instance or on-premise. Consulting services, if taken, are charged separately.

There are no other charges. 
Consulting services and partners
8 out of 10 users of CDQ use it the way it is designed. All frequent combinations of de-duplication keys, which are used across industries in both B2C and B2B configurations, have been programmed into the system. Creation of the data dictionary is also always handled by users themselves, due to its simplicity.

Still, there are 7 areas where our services have been requested for more specialized tasks:

1) Creation of special de-duplication criteria
2) Implementation of user's own de-duplication algorithm
3) Custom logic of entity resolution methods
4) Special logic to merge duplicate records
5) Large data scrubbing and readiness
6) Extraction programs for unstructured data
7) Consulting services on data ETL

Our experienced consultants and network of partners are available for such tasks. With proper non-disclosure and confidentiality agreements in place, our team is ready to work with our users on such requirements.

Ask for a Demonstration of CDQ

© 2025 CONTACTOUS PTE LTD | ALL RIGHTS RESERVED

Address

24 Raffles Place, #25-02A
Singapore 048621.