Advances in cluster editing: linear FPT kernels and comparative implementations

Shaw, Peter E.

Title: Advances in cluster editing: linear FPT kernels and comparative implementations
Creator: Shaw, Peter E.
Relation: University of Newcastle Research Higher Degree Thesis
Resource Type: thesis
Date: 2010
Description: Research Doctorate - Doctor of Philosophy (PhD)
Description: Experience has shown that clustering objects into groups is a useful way to analyze and order information. It turns out that many clustering problems are intractable. Several heuristic and approximation algorithms exist; however in many applications what is desired is an optimum solution. Finding an optimum result for the Cluster Edit problem has proven non-trivial, as Cluster Edit is NP-Hard [KM86], and APX-Hard, and therefore cannot be approximated within a factor of (1 + ϵ) unless Poly = NP [SST04]. The algorithmic technique of Parameterized Complexity has proven an effective tool to address hard problems. Recent publications have shown that the Cluster Edit problem is Fixed Parameter Tractable (FPT ). That is, there is a fixed parameter algorithm that can be used to solve the Cluster Edit problem. Traditionally, algorithms, in computer science, are evaluated in terms of the time needed to determine the output as a function of input size only. However, typically in science most real datum contains inherent structure. For Fixed Parameter Tractable (FPT) algorithms, permitting one or more parameters to be given in the input, to further define the question, allows the algorithm to take advantage of any inherit structure in the data [ECFLR05]. A key concept of FPT is kernelization; that is, reducing a problem instance to a core hard sub-problem. The previous best kernelization technique for Cluster Edit was able to reduce the input to within k2 [GGHN05] vertices, when parameterized by k, the edit distance. The edit distance is the number of edit operations required to transform the input graph into a cluster graph (a disjoint union of cliques). Experimental comparisons in [DLL+06] showed that significant improvements were obtained using this reduction rule for the Cluster Edit problem. The study reported in this thesis presents three polynomialtime, many-to-one kernelization algorithms for the Cluster Edit problem, the best of these algorithms produces a linear kernel of at most 6k vertices. In this thesis, we discuss how using new FPT techniques including extremal method compression routine and modelled crown reductions [DFRS04] can be used to kernelize the input for the Cluster Edit problem. Using these new kernelization techniques, it has been possible to improve the number of vertices in the data sets that can be solved optimally, from the previous maximum of around 150 vertices to over 900. More importantly, the edit distance of the graphs that could be solved as also increased from around k = 40 to more than k = 400. This study also provides a comparison of three inductive algorithmic techniques: i) compression routine using a constant factor approximation – Compression Crown Rule Search Algorithm; ii) extremal method (coordinatized kernel) [PR05], using a constructive form of the boundary lemma – Greedy Crown Rule Search Algorithm; iii) extremal method, using an auxiliary (TWIN) graph structure – Crown Rule TWIN Search Algorithm. Algorithms derived using each of the above techniques to obtain linear kernels for the Cluster Edit problem have been evaluated using a variety of data with different exploratory properties. Comparisons have been made in terms of reduction in kernel size, lower bounds obtained and execution time. Novel solutions have been required to obtain approximations within a reasonable time, for the Cluster Edit problem that is within a factor of four of the edit distance (minimum solution size). Most approximation methods performed very badly for some graphs and well for others. Without any guide regarding the quality of the result, a very bad result can be assumed to be close to optimum. Our study has found that just using the highest available lower bound for the approximation is insufficient to improve the result. However, by combining both the highest lower bound obtained and the reduction obtained using kernelization, a 30-fold improvement in the approximation performance ratio is achieved.
Subject: combinatorial problems; data reduction; preprocessing; computational biology; fixed-parameter tractability
Identifier: http://hdl.handle.net/1959.13/928253
Identifier: uon:10368
Language: eng
Full Text

Hits: 747
Visitors: 1594
Downloads: 492

		Thumbnail	File	Description	Size	Format
View Details Download			ATTACHMENT01	Abstract	296 KB	Adobe Acrobat PDF	View Details Download
View Details Download			ATTACHMENT02	Thesis	8 MB	Adobe Acrobat PDF	View Details Download