Friday, July 15, 2016

Data Cleansing Explained

Data cleansing, also called data cleaning, is the process of detecting and then removing or correcting inaccurate or corrupt records from a record set, database, or table. The term is used mainly in the context of databases and refers to identifying incorrect, inaccurate, irrelevant, or incomplete parts of the data and then correcting, modifying, deleting, or replacing them. Data cleansing can be performed with scripts, data wrangling tools, or as a batch process.
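
To make this concrete, here is a minimal sketch in Python of what record-level cleansing can look like. The customer-record fields (name, email, age) and the validation rules are hypothetical examples chosen for illustration, not a prescribed schema:

    import re

    # Hypothetical raw customer records with typical quality problems:
    # stray whitespace, inconsistent casing, an invalid email, a bad age.
    RAW_RECORDS = [
        {"name": "  Alice ", "email": "ALICE@EXAMPLE.COM", "age": "34"},
        {"name": "Bob",      "email": "not-an-email",      "age": "-1"},
        {"name": "",         "email": "carol@example.com", "age": "29"},
    ]

    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def cleanse(record):
        """Return a corrected copy of the record, or None if it is unusable."""
        name = record["name"].strip()
        email = record["email"].strip().lower()
        try:
            age = int(record["age"])
        except ValueError:
            age = None

        # Delete records that cannot be repaired, correct the rest.
        if not name or not EMAIL_RE.match(email):
            return None
        if age is None or not (0 < age < 130):
            age = None  # keep the record but null out the implausible value
        return {"name": name, "email": email, "age": age}

    cleaned = [r for r in (cleanse(rec) for rec in RAW_RECORDS) if r is not None]
    print(cleaned)  # only Alice's record survives; the other two are dropped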

Incorrect or inconsistent data can lead to false conclusions and misdirected investments in both the public and private sectors. In the business world, incorrect data can cost millions or even billions of dollars. Many companies rely on databases of customer records such as contact information, addresses, and preferences.

Data cleansing is a fairly complex process that involves multiple stages:

  • Data auditing – The data is audited using database and statistical methods to detect incorrect data and contradictions. This gives an indication of the characteristics of the incorrect data and its locations. Some software lets you specify constraints and then generates code that checks the data for violations of these constraints; a sketch of this constraint-driven check appears after the list. 
  • Workflow specification – The detection and removal of incorrect data is performed by a sequence of operations on the data known as a workflow. Specifying this workflow correctly is of utmost importance for achieving high-quality data. 
  • Workflow execution – The workflow is executed once its specification is complete and its correctness has been verified. The implementation of the workflow needs to be efficient even on large amounts of data.
  • After-process and control – After execution, the results are inspected, and data that could not be corrected automatically is corrected manually. This leads to a new cycle of the data-cleansing process in which the data is audited again, allowing an additional workflow to be specified to further cleanse the data.
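
The auditing step above can be illustrated with a short sketch: constraints are declared once as data, and a generic checker reports each violation together with its location. The field names and the specific constraints here are assumptions made for illustration only:

    # Hypothetical constraints on customer records; each maps a field to a
    # predicate that must hold for the value to be considered valid.
    CONSTRAINTS = {
        "age":   lambda v: v is not None and 0 < v < 130,
        "email": lambda v: isinstance(v, str) and "@" in v,
        "state": lambda v: v in {"CA", "NY", "TX"},
    }

    def audit(records):
        """Yield (row_index, field, value) for every constraint violation."""
        for i, record in enumerate(records):
            for field, is_valid in CONSTRAINTS.items():
                if not is_valid(record.get(field)):
                    yield i, field, record.get(field)

    records = [
        {"age": 34,  "email": "alice@example.com", "state": "CA"},
        {"age": 250, "email": "bob#example.com",   "state": "ZZ"},
    ]
    for row, field, value in audit(records):
        print(f"row {row}: {field} fails its constraint (value={value!r})")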


The cleansing process also draws on several core methods:

  • Parsing – The detection of syntax errors; a parser decides whether the data is acceptable within the allowed data specification. 
  • Data transformation – This maps the data from its given format into the expected format. It includes translation functions and value conversions, as well as the normalization of numeric values to conform to minimum and maximum values.
  • Duplicate elimination – This requires an algorithm to determine whether the data contains duplicate representations of the same entity. The data is usually sorted by a key that brings duplicate entries together for quick identification; a sketch of this approach follows the list. 
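
The transformation and duplicate-elimination methods can be sketched together: records are first normalized into a canonical form, then sorted by a key so that duplicates become adjacent and only the first of each run is kept. The choice of email as the key is an assumption made for this example:

    def normalize(record):
        """Map a record into the expected format (trimmed name, lower-cased email)."""
        return {
            "name": record["name"].strip().title(),
            "email": record["email"].strip().lower(),
        }

    def deduplicate(records, key=lambda r: r["email"]):
        """Sort by key so duplicates become adjacent, then keep one record per key."""
        out, last_key = [], object()  # sentinel never equals a real key
        for rec in sorted(records, key=key):
            k = key(rec)
            if k != last_key:
                out.append(rec)
                last_key = k
        return out

    raw = [
        {"name": "alice smith", "email": "Alice@Example.com "},
        {"name": "Alice Smith", "email": "alice@example.com"},
        {"name": "bob jones",   "email": "bob@example.com"},
    ]
    print(deduplicate([normalize(r) for r in raw]))  # two unique records remain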
