Making sense of data cleaning: Variability, Structured & unstructured data

Today’s technology generates huge amounts of data, and that data has become an essential business driver. It requires efficient protection and management to effectively support business continuity.

Big data is a term we often hear nowadays in combination with data analytics or Business Intelligence (BI). Big data refers to data sets, or combinations of data sets, whose size (volume), complexity (variability), and rate of growth (velocity) make them difficult to capture, manage, process, or analyze with traditional methods and technologies.

The complexity of this “big data” is driven primarily by the mix of structured and unstructured data. Social media alone generates a large amount of unstructured data every second, and the volume is growing at a rapid pace. Anything you do on the Internet today generates data. This data can be useful for driving business decisions; when it is not, it is termed “dirty data”.

With data inputs and updates changing every second, it is important to weed out errors and duplicates. Using correct data is essential for meaningful analytics, which in turn supports data-driven business decisions. Systems need to incorporate this critical process of continuously cleaning the incoming data.

Transforming the growing amount of data into actionable information to support strategic and tactical decision-making has become critical for organizations seeking a competitive edge.

Variability in Big Data refers to changes in data rate, format/structure, semantics, and/or quality that affect the applications, analytics, or problems being supported. Specifically, variability is a change in one or more of the other Big Data characteristics.

In information technology, separating the wheat from the chaff is known as data cleaning, also called data cleansing or scrubbing. It deals with detecting and removing errors and inconsistencies from data in order to improve data quality. In a single data collection, invalid data can be cleaned relatively easily, though it is still a time-consuming process. When multiple data sources are integrated, as in a global web-based information system, the need for data cleaning grows significantly because the sources may contain redundant data in different representations. To provide access to accurate and consistent data, the different representations must be consolidated and duplicate information eliminated.
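As a minimal sketch of what consolidating representations and eliminating duplicates can look like in practice, consider merging customer records from two hypothetical sources. The field names ("name", "email") and the normalization rules are illustrative assumptions, not a prescribed standard:

```python
# Sketch: consolidate differing representations and drop duplicates
# when merging records from two hypothetical sources.

def normalize(record):
    """Bring one record into a canonical representation."""
    return {
        "name": " ".join(record.get("name", "").split()).title(),
        "email": record.get("email", "").strip().lower(),
    }

def consolidate(*sources):
    """Merge sources, keeping the first occurrence of each e-mail address."""
    seen = {}
    for source in sources:
        for record in source:
            clean = normalize(record)
            key = clean["email"]
            if key and key not in seen:   # skip duplicates and empty keys
                seen[key] = clean
    return list(seen.values())

crm = [{"name": "jane  DOE", "email": "Jane.Doe@Example.com "}]
web = [{"name": "Jane Doe", "email": "jane.doe@example.com"}]
print(consolidate(crm, web))   # one consolidated record, not two
```

Even this toy example shows the core idea: pick one canonical representation, map every source into it, and only then decide what counts as a duplicate.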

In data warehouses, data cleaning is a major part of the ETL (extract, transform, load) process and is considered one of the biggest problems in data warehousing. Data from multiple sources must be extracted, transformed, and combined; when that work is deferred to query runtime, it can cause significant processing delays and make acceptable response times difficult to achieve.
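The sketch below illustrates the idea of cleaning data as part of the transform-and-load step rather than at query time. The CSV layout, column names, and SQLite target are assumptions chosen only to keep the example self-contained:

```python
# Sketch: a tiny ETL pipeline that cleans rows while loading them,
# so queries later hit already-consistent data.

import csv
import sqlite3

def extract(path):
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for row in rows:
        amount = row.get("amount", "").strip()
        if not amount:                 # drop incomplete rows here,
            continue                   # not while answering queries
        yield (row["order_id"].strip(), float(amount))

def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, amount REAL)"
    )
    con.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?)", rows)
    con.commit()
    con.close()

# Usage (assuming an orders.csv with order_id and amount columns):
# load(transform(extract("orders.csv")))
```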

What is a satisfactory data cleaning approach?

It should help detect and remove all major inconsistencies both in individual data sources and when integrating multiple sources.

It should employ tools that limit manual inspection, reduce programming time, and are extensible enough to accommodate additional sources.

Data cleaning should not be performed in isolation; it should be combined with schema-related data transformations based on comprehensive metadata.

To reduce cost and effort, mapping functions for data cleaning and other data transformations should be specified declaratively so they can be reused, especially during query processing (a sketch follows this list of criteria).

A workflow infrastructure should be established to execute all data transformation steps for multiple sources and large data sets in a reliable and efficient way.
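As promised above, here is a minimal sketch of what a declaratively specified, reusable set of cleaning mappings could look like. The field names and the specific rules are illustrative assumptions; the point is that the rules live in data, separate from the code that applies them:

```python
# Sketch: cleaning rules declared as data, so the same mappings can be
# reused across sources and transformation steps.

RULES = {
    "email":   [str.strip, str.lower],
    "country": [str.strip, str.upper],
    "name":    [str.strip, str.title],
}

def apply_rules(record, rules=RULES):
    """Apply the declared transformations to whichever fields are present."""
    cleaned = dict(record)
    for field, steps in rules.items():
        if field in cleaned and isinstance(cleaned[field], str):
            for step in steps:
                cleaned[field] = step(cleaned[field])
    return cleaned

print(apply_rules({"email": " Jane.Doe@Example.COM ", "country": "us"}))
# {'email': 'jane.doe@example.com', 'country': 'US'}
```

Because the mappings are plain data, the same declaration can be reused by a batch ETL job, a streaming cleaner, or a query-time wrapper without rewriting any logic.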

Dirty data can be transformed into clean data using a combination of automatic and manual data cleansing techniques. Extensive validation processes can be run, including matching data against known reference sets, applying full regular-expression validation, and checking that files are well formed and UTF-8 encoded.
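The following sketch shows the kinds of checks just mentioned: a regular-expression test, a lookup against a known reference set, and a UTF-8 encoding check. The e-mail pattern and the country reference set are simplified assumptions for illustration, not production-grade validators:

```python
# Sketch: three common validation checks on incoming records and files.

import re

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")     # simplified pattern
KNOWN_COUNTRIES = {"US", "GB", "DE", "IN"}                # sample reference set

def validate_record(record):
    """Return a list of validation errors for one record."""
    errors = []
    if not EMAIL_RE.match(record.get("email", "")):
        errors.append("malformed email")
    if record.get("country") not in KNOWN_COUNTRIES:
        errors.append("unknown country code")
    return errors

def is_utf8(path):
    """Return True if the whole file decodes cleanly as UTF-8."""
    try:
        with open(path, "rb") as f:
            f.read().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(validate_record({"email": "jane.doe@example.com", "country": "US"}))  # []
print(validate_record({"email": "not-an-email", "country": "XX"}))
# ['malformed email', 'unknown country code']
```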

It should be remembered that outdated, inaccurate, or duplicate data will not drive optimal decisions. Markets are in constant flux, and data is generated in large quantities, producing the complexities of Big Data. If data is inaccurate, leads are harder to track and nurture, and the resulting insights may be flawed. Business strategy should be based on up-to-date, accurate data, not duplicate information. Better decision-making is the result of clean and accurate data!