Enterprise Data Hygiene (EDH)
EDH cleans large contact databases by finding complex patterns of duplicate data, achieves single version of truth through its Entity Resolution algorithms and provides real-time de-duplication checks to applications within the enterprise.
Assimilate data from multiple files and systems
EDH enables a data dictionary to be defined using the field constructor option. Hundreds of fields can be defined about the contact data which can be strings, numeric, text or date types. This definition can be modified at any time.
Data can be uploaded from Comma Separated Values (CSV) or text files. The upload feature automatically attempts to map the incoming data to the fields in the data dictionary. Fields can be re-mapped or skipped if needed.
Dozens of encoding options are available during import, with UTF-8 selected by default. During the import process, checks are performed on the completeness of data. The data can be added to an existing dataset or by default a new one will be created after a successful upload.
External data can also be imported by connecting to applications like Salesforce, Eloqua and Zoho CRM. Any system with an API access can be integrated to EDH.
Create single contact records from disconnected datasets
Data from multiple datasets can be merged after a common factor between them has been discovered by EDH. Using its de-duplication rules, EDH forms a cluster of records with this common criteria. These records can then be collapsed into a single golden record. The data custodian can do this merging manually or instruct EDH to perform it automatically.
The following example will make it clearer. EDH has found 4 common records which have been collapsed into a single golden record within CRM by the data custodian.
Identify duplicate records across databases
Using its pattern matching algorithms, EDH can find common records across databases and multiple datasets. While the CSV and text files is usually directly uploaded into a dataset within EDH, the data from external systems can be configured by either importing into EDH or by storing the indexed keys of the records of that database.
Storing the data within EDH is optional. Essentially, it stores the key which it constructs after indexing the datasets. An external data system can be indexed by EDH and that would be sufficient for it to compare that database with others or perform a real-time de-duplication on it.
Automatically standardize, enhance and repair data
EDH has dozens of functions to manually or automatically repair, standardize and enhance the data within its datasets. The strength of these functions is shown in the example below in which an incoming record on the left is passed through them, resulting in the final record on the right.
Check for duplicates in real-time from millions of records
EDH can find duplicates within seconds, across tens of millions of records. This de-duplication in real-time is being used by our customers from data entry to identification and in other innovative ways like explained in the example below:
Before EDH integration, the data is coming from a customer's web submission form and is being passed to CRM, telesales or lead-nurturing programs. In example above, EDH is made to tap into this email (without integration) and it starts to perform the de-duplication for incoming records in real-time and sends that information to telesales. The actions (and their effectiveness) on original mail as compared to one sent by EDH become very different.
Resolve entity names and cluster within large bulk files
From merchant data records in banks to epidemiological data in hospitals and product descriptions in warranty management, large files with millions of records need to be quickly resolved to determine the right entity to be used through some agreed logic.
In the example above, a million records are received in text file by a financial institution comprising of merchant names. EDH can take this entire data set and provide an entity resolution map within an hour about identity of the main entity and which other records in the incoming file correspond to that entity. Here, EDH shows (based on configured rules) that Starbucks (S) Pte Ltd is the main entity and the 5 other names should correspond to this main record within the text file. With this resolution, the file can now be easily processed.
Merge duplicate contact records automatically or manually
Finding the duplicate records is part of the solution, merging them into a single version of truth is the next step. Once the cluster of duplicate records is found, it is important to determine the master record which will lead the merging process. EDH has a rich set of configurable rules to determine which record would be the master record within a cluster. The user can always over-ride this suggestion.
What happens to the records which get merged can be configured too. They can get deleted or moved to a special data set to be analysed later. Then there are values that can get overwritten or ones which get appended.
Lastly, the merging process itself can be automatic and EDH can take responsibility of determining the master record and merge duplicate records of tens of thousands of clusters into their master record within an hour.
Examples of Duplicate Clusters
EDH extracts clusters of duplicate records from single or multiple datasets, which could comprise of millions of records. Here are some examples. Majority of them are close to real cases of duplicate data found by the product during its usage. The real data has been changed for confidentiality, but the discovered pattern is intact.
Full Name - First Example
Example of Name de-duplication taken from a medical institution in India. Combination of salutations, qualifications and swap of first name and surnames were considered by EDH:
Full Name - Second Example
Example of Name de-duplication taken from a warranty registration database in Philippines:
Full Name - Third Example
A powerful example of EDH's capabilities. Example of Name de-duplication taken from a database of a South Asian country:
Full Name - Fourth Example
Example of variations of a name considered in a suspected duplicate cluster by EDH:
Address - First Example
This is one of the best example of Address de-duplication, highlighted by EDH within a massive CRM database in India. Not only there are inconsistent abbreviations and spelling errors, the old and new official name of the city has been detected as duplicate:
Address - Second Example
An example of Address de-duplication, from Singapore:
Address - Third Example
An example of duplicate address cluster from Australia. Note the abbreviations and variations of state name captured in duplicate cluster:
EDH finds mobile numbers in multiple formats and groups the duplicate together. Here's an example of such a group:
Company Name - First Example
An example of Company Name de-duplication, from India:
Company Name - Second Example
Another similar example of Company Name de-duplication from Philippines:
Name + Mobile Number
Example of 5 duplicate Name and Mobile Number combinations as found by EDH:
Name: Mohammad Kasim
Name: Mohd. Kasim
Mobile: 0091 98 33 69 06 11
Name: Mhd. Kasim
Mobile: (9833) 690-611
Name: Md. Kasim
Name: Muhammad Kasim
Mobile: 9 8 3 3 6 9 0 6 1 1
Name + Date of Birth
Example of 4 duplicate Person's Name and Company Name combinations as found by EDH:
Person's Name: Narendra Bajpayee
Date of Birth: 15/11/1984
Person's Name: Narindir Bajpayee
Date of Birth: 11-15-1984
Person's Name: Narender Bejpeyee
Date of Birth: 15.11.84
Person's Name: Nariinder Baajpayii
Date of Birth: 1984, novembr 15
Name + Company
Example of 4 duplicate combinations of Person's and Company Names as found by EDH:
Person's Name: Sanjiv Kumar
Company's Name: HPE India Private Limited
Person's Name: Sanjeve Kumarr
Company's Name: HPE India Private Ltd.
Person's Name: Sanjeev Qumar
Company's Name: HPE Pvt. Ltd
Person's Name: Sanjive Koomar
Company's Name: HPE Limited
EDH groups different Website URLs which refer to the same page in a single cluster. Here's an example of such a group:
Frequently Discussed Topics
On location of hosted data
EDH's computational servers run on AWS Singapore and they are different from the servers where databases are kept. Our compute servers are fixed, but the database could reside in following 3 configurations -
1) Data on Contactous' Servers - Here, we create a separate environment for customer's data on one of our AWS Singapore instance. This is a good option if the data is of low volume or for a pilot project. This option is also used by our customers as a pre-CRM space, to scrub and clean the data before transferring it to other systems.
2) Data on Customer's own AWS Instance - We recommend this option as it is fast to setup and gives our customers assurance as the database access is managed by them. We help to set the system and establish the API access, after which the full control of database is with customer.
3) Data on Customer's own data center - This on-premise database location is possible. The setup is like in #2. Co-ordination and testing usually takes more time than #2.
Security of application and platform
Contactous' EDH has 100% SaaS architecture, and is hosted on AWS Singapore. Compliance certificates are available at:
Entire application of Contactous (including all web services) are at
https://web.contactous.com/. All external accesses to our platform are
protected by the TLS 1.2 – HTTPS protocol. Strong encryption algorithms (AES 128 GCM) are used. Validated by SSLLabs (rated A).
The risk of data being intercepted by a third party during transmission is minimal. Validate at: https://www.ssllabs.com/ssltest/analyze.html?d=web.contactous.com
Every instance used by a customer runs a separate set of programs isolated from others.
We are listed on Application Exchange of Salesforce and Microsoft - both of which have the highest level of security standards that we
comply to. Check:
De-duplication speed and data volume
EDH has been designed to give fast results of de-duplication. In our sample database with over 2 million records, the duplicate pattern is found within a second. This speed has enabled our customers to use the system in real-time.
When the records are read by EDH, they get indexed in multiple ways through its complex algorithms. For smaller number of records like a couple of thousands, this indexing takes place automatically. When large datasets are imported, the indexing can be manually triggered.
Large datasets can be uploaded into EDH as CSV files. It is frequent for our customers to upload datasets of half million records as CSVs. In case there is a very large database of several million records that need to be indexed, our consultants can help to ensure that data upload is done successfully.
Algorithms in use and customization
The algorithms used for de-duplication and entity resolution within EDH have the foundation of proven data matching frameworks, but have been developed from scratch. These algorithms heavily use fuzzy logic and other probabilistic methods to arrive at decisions. A big emphasis has been on the quality of output, which will remain as the guiding principle in future. The algorithms in current version of EDH have been proven after running on millions of actual records.
Our algorithms give 3 types of results: 1) Exact - The output is a result of direct match and this functionality can be compared to de-duplication by many CRMs. 2) Fuzzy - Based on our own AI based algorithms. This is probabilistic as compared to Exact match, which is deterministic. Then there is 3) Smart match - It shows results where our confidence level is very high and is a combination of deterministic and probabilistic approaches.
EDH's follows a yearly subscription model and its pricing is based on number of records that are stored or indexed by the application.
There is no setup cost if the database is kept on Contactous' own servers. However, there is a one-time setup charge for a separate user's own AWS instance or on-premise database. Consulting services are charged separately, if taken.
There are no other charges.
Consulting services and partners
8 out of 10 users of EDH are using it the way it is designed. All frequent combinations of de-duplication keys have been programmed in the system, which are used across industries in both B2C and B2B configurations. Creation of data dictionary too is always handled by users themselves, due to its simplicity.
Still, there are 7 areas where our services have been asked, for more specialized tasks.
1) Creation of special de-duplication criteria
2) Implementation of user's own de-duplication algorithm
3) Custom logic of entity resolution methods
4) Special logic to merge duplicate records
5) Large data scrubbing and readiness
6) Extraction programs for unstructured data
7) Consulting services on data ETL
Our experienced consultants and network of partners are available for such tasks. With proper Non-disclosure and confidentiality agreements in place, our team is ready to work with our user's on such requirements.