Incorrect or inconsistent data can lead to false conclusions and misdirected investments in both the public and private sectors. In the business world, incorrect data can cost millions or even billions of dollars. Many companies rely on databases that hold customer records such as contact information, addresses, and preferences.
Data cleansing is a fairly complex process that involves multiple stages:
- Data auditing – The data is audited using database and statistical methods that detect incorrect data and contradictions. This gives an indication of the characteristics of the incorrect data and its location. Some software lets you specify constraints and then generates code that checks the data for violations of those constraints.
- Workflow specification – The detection and removal of incorrect data is performed by a sequence of operations on the data known as a workflow. Specifying this workflow well is of utmost importance in achieving high-quality data.
- Workflow execution – The workflow is executed once it has been specified and verified. The implementation of the workflow needs to be efficient even on large amounts of data.
- After-process and control – After the workflow has run, the results are inspected, and data that could not be corrected automatically is corrected manually. This starts a new cycle in the data-cleansing process: the data is audited again, which allows an additional workflow to be specified to cleanse the data further.
- Parsing – Parsing detects syntax errors by deciding whether a piece of data is acceptable within the allowed data specification.
- Data Transformation – This maps data from its given format into the expected format. It includes translation functions and value conversions, as well as the normalization of numeric values to conform to minimum and maximum values.
- Duplicate Elimination – This requires an algorithm to determine whether the data contains duplicate representations of a single entity. Data is usually sorted by a key, which brings duplicate entries together for quick identification.
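
Three of the stages above (constraint-based auditing, numeric normalization, and sort-based duplicate elimination) can be illustrated with a small sketch. This is only one possible reading, not a reference implementation: the sample records, field names, constraint names, and helper functions (`audit`, `min_max`, `dedupe`) are all hypothetical, and "normalization" is interpreted here as min-max scaling to the range 0 to 1.

```python
from itertools import groupby

# Hypothetical customer records; the field names are illustrative only.
records = [
    {"id": 1, "email": "ann@example.com", "age": 34},
    {"id": 2, "email": "bob@example", "age": 51},
    {"id": 3, "email": "ANN@example.com ", "age": 34},
    {"id": 4, "email": "cy@example.com", "age": 27},
]

# Auditing: each constraint is a name plus a predicate that a valid record satisfies.
constraints = {
    "email_has_domain": lambda r: "." in r["email"].split("@")[-1],
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
}


def audit(rows):
    """Return (record id, violated constraint name) pairs."""
    return [
        (r["id"], name)
        for r in rows
        for name, is_valid in constraints.items()
        if not is_valid(r)
    ]


def min_max(values):
    """Transformation: rescale numbers so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def dedupe(rows, key):
    """Duplicate elimination: sort by key so duplicates become adjacent, keep the first of each group."""
    ordered = sorted(rows, key=key)
    return [next(group) for _, group in groupby(ordered, key=key)]


violations = audit(records)
scaled_ages = min_max([r["age"] for r in records])
unique = dedupe(records, key=lambda r: r["email"].strip().lower())
```

With the sample data, `violations` flags record 2 for its malformed email address, and `unique` keeps records 1, 2, and 4, because records 1 and 3 share the same email once whitespace and letter case are ignored. The choice of matching key is the main design decision in the last step: a key that is too strict misses duplicates, while one that is too loose merges distinct entities.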