A Taxonomy of Dirty Data

Won Kim, Byoung Ju Choi, Eui Kyeong Hong, Soo Kyung Kim, Doheon Lee

Research output: Contribution to journal › Article › peer-review

226 Scopus citations

Abstract

Today, large corporations are constructing enterprise data warehouses from disparate data sources in order to run enterprise-wide data analysis applications, including decision support systems, multidimensional online analytical applications, data mining, and customer relationship management systems. A major problem that is only beginning to be recognized is that the data in data sources are often "dirty". Broadly, dirty data include missing data, wrong data, and non-standard representations of the same data. The results of analyzing a database/data warehouse of dirty data can be damaging and, at best, unreliable. In this paper, a comprehensive classification of dirty data is developed for use as a framework for understanding how dirty data arise, manifest themselves, and may be cleansed to ensure proper construction of data warehouses and accurate data analysis. The impact of dirty data on data mining is also explored.
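
As a rough illustration of the three broad categories named in the abstract (missing data, wrong data, and non-standard representations of the same data), the following minimal Python sketch labels the fields of a single record. It is not the paper's taxonomy, which is considerably more detailed. The function name `classify_record`, the field names `age` and `country`, the plausible age range, and the `CANONICAL_COUNTRY` lookup are all hypothetical and chosen only for this example.

```python
# Hypothetical lookup of accepted spellings mapped to one canonical code.
# Assumption for this sketch only; it is not taken from the paper.
CANONICAL_COUNTRY = {
    "us": "US",
    "usa": "US",
    "u.s.a.": "US",
    "united states": "US",
}


def classify_record(record: dict) -> dict:
    """Label each field as 'missing', 'wrong', 'non-standard', or 'clean'."""
    labels = {}

    # Missing data: the value is absent altogether.
    age = record.get("age")
    if age is None:
        labels["age"] = "missing"
    # Wrong data: a value is present but violates an assumed plausible range.
    elif not (0 <= age <= 130):
        labels["age"] = "wrong"
    else:
        labels["age"] = "clean"

    country = record.get("country")
    if not country or not country.strip():
        labels["country"] = "missing"
    else:
        key = country.strip().lower()
        if key not in CANONICAL_COUNTRY:
            # Treated as wrong under the assumption that the lookup is exhaustive.
            labels["country"] = "wrong"
        elif country != CANONICAL_COUNTRY[key]:
            # Non-standard representation: a valid value, but not in canonical form.
            labels["country"] = "non-standard"
        else:
            labels["country"] = "clean"

    return labels
```

For example, `classify_record({"age": None, "country": "U.S.A."})` labels `age` as missing and `country` as non-standard, since "U.S.A." refers to a recognized value but is not written in the canonical form.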

Original language: English
Pages (from-to): 81-99
Number of pages: 19
Journal: Data Mining and Knowledge Discovery
Volume: 7
Issue number: 1
State: Published - Jan 2003

Bibliographical note

Funding Information:
∗This research was partially supported by Korea’s Brain Korea-21 grant. †This research was partially supported by Korea’s KISTEP grant.

Keywords

  • Data cleansing
  • Data mining
  • Data quality
  • Data warehousing
  • Dirty data
