On Memory and I/O Efficient Duplication Detection for Multiple Self-clean Data Sources

Zhang, J, Shu, Y and Wang, Hua ORCID: 0000-0002-8465-0996 (2010) On Memory and I/O Efficient Duplication Detection for Multiple Self-clean Data Sources. In: Database Systems for Advanced Applications, 01 April 2010-04 April 2010, Tsukuba, Japan.

Abstract

In this paper, we propose efficient algorithms for duplicate detection across multiple data sources that are themselves duplicate-free. In developing these algorithms, we fully consider the various cases that can arise given the workload of the data sources to be cleaned and the available memory. The algorithms are memory- and I/O-efficient: they reduce the number of pairwise record comparisons and minimize the total page access cost of the cleaning process. Experimental evaluation demonstrates that the proposed algorithms are efficient and outperform SNM and random access methods.
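As a rough illustration of why duplicate-free ("self-clean") sources allow fewer pairwise comparisons, the sketch below compares records only across different sources and never within one. It is not the algorithm from the paper: the function names, the toy records, and the case-insensitive matching rule are all assumptions made for illustration, and the sketch does not model memory limits or page access cost.

```python
from itertools import combinations


def all_pairs(sources):
    """Number of comparisons if the union is treated as one unclean set."""
    n = sum(len(s) for s in sources)
    return n * (n - 1) // 2


def cross_source_pairs(sources):
    """Number of comparisons when within-source pairs are skipped."""
    return sum(len(a) * len(b) for a, b in combinations(sources, 2))


def find_cross_duplicates(sources, is_duplicate):
    """Return (i, rec_a, j, rec_b) for duplicates found across sources i < j.

    Records inside the same source are never compared, because each
    source is assumed to be duplicate-free already.
    """
    found = []
    for (i, a), (j, b) in combinations(enumerate(sources), 2):
        for rec_a in a:
            for rec_b in b:
                if is_duplicate(rec_a, rec_b):
                    found.append((i, rec_a, j, rec_b))
    return found


# Hypothetical toy data: three self-clean sources of name strings.
sources = [["ann lee", "bob ray"], ["Ann Lee", "cy dunn"], ["BOB RAY"]]


# Hypothetical matching rule: case-insensitive equality.
def same_name(x, y):
    return x.lower() == y.lower()


naive = all_pairs(sources)            # every pair in the union
reduced = cross_source_pairs(sources)  # cross-source pairs only
duplicates = find_cross_duplicates(sources, same_name)
```

With this toy input, treating the union as a single unclean set needs 10 comparisons, while restricting to cross-source pairs needs 8; the saving grows with the size of each individual source.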

Additional Information	Series title: Lecture Notes in Computer Science, vol. 6193
Item type	Conference or Workshop Item (Paper)
URI	https://vuir.vu.edu.au/id/eprint/27938
DOI	10.1007/978-3-642-14589-6_14
Official URL	http://link.springer.com/chapter/10.1007/978-3-642...
ISBN	9783642145889
Subjects	Historical > FOR Classification > 0806 Information Systems; Historical > Faculty/School/Research Centre/Department > Centre for Applied Informatics
Keywords	algorithms; duplicate detection; data sources; data management