neal young / publications
© Copyrights are reserved by the publishers.
Download for personal and limited academic use only.

We propose a method for redistricting, decomposing a geographical area into subareas, called districts, so that the populations of the districts are as close as possible and the districts are compact and contiguous. Each district is the intersection of a polygon with the geographical area. The polygons are convex and the average number of sides per polygon is less than six. The polygons tend to be quite compact. With each polygon is associated a center. The center is the centroid of the locations of the residents associated with the polygon. The algorithm can be viewed as a heuristic for finding centers and a balanced assignment of residents to centers so as to minimize the sum of squared distances of residents to centers; hence the solution can be said to have low dispersion.

Data Mining and Knowledge Discovery 30(1):243281(2016)Over the past decade, time series clustering has become an increasingly important research topic in data mining community. Most existing methods for time series clustering rely on distances calculated from the entire raw data using the Euclidean distance or Dynamic Time Warping distance as the distance measure. However, the presence of significant noise, dropouts, or extraneous data can greatly limit the accuracy of clustering in this domain. Moreover, for most real world problems, we cannot expect objects from the same class to be equal in length. As a consequence, most work on time series clustering only considers the clustering of individual time series ``behaviors,'' e.g., individual heart beats or individual gait cycles, and contrives the time series in some way to make them all equal in length. However, automatically formatting the data in such a way is often a harder problem than the clustering itself. In this work, we show that by using only some local patterns and deliberately ignoring the rest of the data, we can mitigate the above problems and cluster time series of different lengths, e.g., cluster one heartbeat with multiple heartbeats. To achieve this, we exploit and extend a recently introduced concept in time series data mining called shapelets. Unlike existing work, our work demonstrates the unintuitive fact that shapelets can be learned from unlabeled time series. We show, with extensive empirical evaluation in diverse domains, that our method is more accurate than existing methods. Moreover, in addition to accurate clustering results, we show that our work also has the potential to give insight into the domains to which it is applied. While a bruteforce algorithm to discover shapelets in an unsupervised way could be untenably slow, we introduce two novel optimization procedures to significantly speed up the unsupervisedshapelet discovery process and allow it to be cast as an anytime algorithm.

work in progress(2016)We empirically evaluate various stackbased LSM mergecompaction policies, including the default policy for Google Bigtable, Cassandra's sizetiered policy, HBase's exploring policy, and new policies proposed in "Online Bigtable merge compaction". We consider in depth the setting where, to keep readresponse time low, the policy must manage compactions to keep the stack size below a specified parameter K. We evaluate the policies by implementing them within the AsterixDB NoSQL database. We also empirically partially validate the theoretical model for stackbased LSM mergecompaction proposed in "Online Bigtable merge compaction".

Springer Encyclopedia of Algorithms:886889(2016)A brief summary of greedy setcover algorithms.
INDEX TERMS: dominating set, greedy algorithm, hitting set, set cover, minimizing a linear function subject to a submodular constraint. 
working paper(2016)We describe the first nearly lineartime approximation algorithms for explicitly given mixed packing/covering linear programs, and for (nonmetric) fractional facility location. We also describe the first parallel algorithms requiring only nearlinear total work and finishing in polylog time. The algorithms compute $(1+\epsilon)$approximate solutions in time (and work) $\tilde O(N/\epsilon^2)$, where $N$ is the number of nonzeros in the constraint matrix. For facility location, $N$ is the number of eligible client/facility pairs.

working paper(2016)NoSQL databases are widely used for massive data storage and realtime web applications. Yet important aspects of these data structures are not well understood. For example, NoSQL databases write most of their data to a collection of files on disk, meanwhile periodically compacting subsets of these files. A compaction policy must choose which files to compact, and when to compact them, without knowing the future workload. Although these choices can affect computational efficiency by orders of magnitude, existing literature lacks tools for designing and analyzing online compaction policies  policies are now chosen largely by trial and error.
Here we introduce tools for the design and analysis of compaction policies for Google Bigtable, propose new policies, give averagecase and worstcase competitive analyses, and present preliminary empirical benchmarks. 
Springer Encyclopedia of Algorithms:14571461(2016)A brief summary of online paging and caching results, 19852013.
INDEX TERMS: paging, caching, weighted caching, weighted paging, file caching, least recently used (paging algorithm), first in first out (paging algorithm), flush when full (paging algorithm), the Marking algorithm (paging algorithm), Balance algorithm (weighted caching algorithm), Greedy Dual (weighted caching algorithm), Landlord (file caching algorithm), Squid (file caching software), kserver problem, primaldual algorithms, randomized algorithms, online algorithms, competitive analysis, competitive ratio, loose competitiveness, accessgraph model, Markov paging. 
working paper(2016)The paper considers the problem of summarising a collection of reviews for an item; the application considered is that of reviews for doctors. We consider the case where there are aspects about the item that are captured in the review, and are organized in an ontology in the form of a DAG. There is also sentiment about the aspects, so the reviews can be thought of as aspect/sentiment pairs. The goal is to find a subset of them to cover all the pairs, where coverage takes into account the ontology structure. We propose and analyse algorithms, and perform quantitative and qualitative experiments on real data.

Journal of Scheduling 18(6):545560(2015); (ICALP '13);The Joint Replenishment Problem (JRP) is a fundamental optimization problem in supplychain management, concerned with optimizing the flow of goods over time from a supplier to retailers. Over time, in response to demands at the retailers, the supplier sends shipments, via a warehouse, to the retailers. The objective is to schedule shipments to minimize the sum of shipping costs and retailers' waiting costs. We study the approximability of JRP with deadlines, where instead of waiting costs the retailers impose strict deadlines. We study the integrality gap of the standard linearprogram (LP) relaxation, giving a lower bound of 1.207, and an upper bound and approximation ratio of 1.574. The best previous upper bound and approximation ratio was 1.667; no lower bound was previously published. For the special case when all demand periods are of equal length we give an upper bound of 1.5, a lower bound of 1.2, and show APXhardness.Journal version of [2013].

SIAM Journal on Computing 44(4):11541172(2015); (IPCO'99)This paper gives a lower bound on the complexity of $(1 \pm \epsilon)$approximately solving a packing or covering problem using a certain class of Lagrangianrelaxation algorithms: any such algorithm, given a problem formed by a random {0,1}matrix, requires (with high probability) a number of iterations proportional to $(\rho \epsilon)^2\log m$. (Here $\rho$ is a technical parameter, the``width'' of the problem instance.)
The class of algorithms in question includes DantzigWolfe decomposition, Benders decomposition, the Lagrangianrelaxation method developed by Held and Karp [1971] for lowerbounding TSP, and algorithms recently studied by many authors (including Serge Plotkin, David Shmoys, and Eva Tardos [1988]; Mike Grigoriadis and L.G. Khachiyan [1996]; Michael Luby and Noam Nisan [1993]; Naveen Garg and Jochen Konemann [1998]; and Lisa Fleischer [1999]). The lower bound matches the known upper bounds within a constant factor. The lower bound is useful because, in practice, the dependence on $\epsilon$ limits the applicability of these algorithms. The lower bound provides some insight into what is necessary to surmount this dependence.Journal version of [1999]. Published online Aug. 2015. 
In 1971, Knuth gave an $O(n^2)$time algorithm for the classic problem of finding an optimal binary search tree. Knuth's algorithm works only for search trees based on 3way comparisons, while most modern computers support only 2way comparisons (e.g., $<$, $\le$, $=$, $\ge$, and $>$). Until this paper, the problem of finding an optimal search tree using 2way comparisons remained open  polytime algorithms were known only for restricted variants. We solve the general case, giving (i) an $O(n^4)$time algorithm and (ii) an $O(n\log n)$time additive3 approximation algorithm.
For finding optimal binary split trees, we (iii) obtain a linear speedup and (iv) prove some previous work incorrect. 
A brief summary of greedy algorithms for set cover and vertex cover.
INDEX TERMS: greedy algorithm, set cover, vertex cover. 
Algorithmica 70(4):648674(2014); FOCS'07We give an approximation algorithm for packing and covering linear programs (linear programs with nonnegative coefficients). Given a constraint matrix with $N$ nonzeros, $r$ rows, and $c$ columns, the algorithm (with high probability) computes feasible primal and dual solutions whose costs are within a factor of $1+\epsilon$ of the optimal cost in time $O((r+c)\log(N)/\epsilon^2 + N)$.Journal version of [2007].

Can one choose a good Huffman code without knowing the underlying distribution? The following Online Slot Allocation (OSA) problem models this and similar problems. Given $n$ nonnegative slot costs $c(1), c(2), ..., c(n)$, requests for items are drawn from a hidden probability distribution $p$ on $[n]$, and revealed online one by one. In response to each request, if the item has not been previously requested, the algorithm (knowing only $c$ and the items requested so far) must allocate some unused slot for the item. The objective is to minimize the cost of the slot allocation, which is the sum, over the items, of the probability of the item times the cost of its assigned slot.
The online FirstComeFirstServed algorithm (FCFS) simply allocates the cheapest slot available when an item needing a slot arrives. FCFS is optimal among online algorithms. For general costs, we show that its competitive ratio is $1+\ln n$. For concave costs, its competitive ratio is 2. For logarithmic costs, the ratio is, asymptotically, 1: the algorithm's expected cost is $\mbox{opt} + O(\log \mbox{opt})$.
A practical application of OSA is Online Huffman Coding (OHC), a variant of Huffman coding in which the codewords must be allocated on demand, without knowing the underlying probability distribution. OHC reduces to OSA, yielding a firstcomefirstserved algorithm for OHC that guarantees nearoptimal expected cost, at most $\mbox{opt} + 2 \log(1+\mbox{opt}) + 2$. 
SIAM Journal on Computing 43(1):2551(2014); SODA'12Minimumweight triangulation (MWT) is NPhard. It has a polynomialtime constantfactor approximation algorithm, and a variety of effective polynomial time heuristics that, for many instances, can find the exact MWT. Linear programs (LPs) for MWT are wellstudied, but previously no connection was known between any LP and any approximation algorithm or heuristic for MWT. Here we show the first such connections: for an LP formulation due to Dantzig et al. (1985): (i) the integrality gap is bounded by a constant; (ii) given any instance, if the aforementioned heuristics find the MWT, then so does the LP. Our analysis improves the best approximation ratio known for MWT (which is an astronomically large constant) by an order of magnitude.Journal version of [2012].

Conference version of [2014].

Algorithmica 66(1):113152(2013); ICALP'09This paper describes a simple greedy Δapproximation algorithm for any covering problem whose objective function is submodular and nondecreasing, and whose feasible region can be expressed as the intersection of arbitrary (closed upwards) covering constraints, each of which constrains at most Δ variables. (A simple example is Vertex Cover, with Δ = 2.) The algorithm generalizes previous approximation algorithms for fundamental covering problems and online paging and caching problems.
(For distributed implementations of these algorithms, see Distributed algorithms for covering, packing and maximum weighted matching .)Journal version of [2009]. 
Theory of Computing 9(22):685702(2013)Given a satisfiable 3SAT formula, how hard is it to find an assignment to the variables that has Hamming distance at most $n/2$ to a satisfying assignment? More generally, consider any polynomialtime verifier for any NPcomplete language. A $d(n)$Hammingapproximation algorithm for the verifier is one that, given any member $x$ of the language, outputs in polynomial time a string $a$ with Hamming distance at most $d(n)$ to some witness $w$, where $(x,w)$ is accepted by the verifier. Previous results have shown that, if P$\ne$NP, then every NPcomplete language has a verifier for which there is no $(n/2n^{2/3+\delta})$Hammingapproximation algorithm, for various constants $\delta\ge 0$.
Our main result is that, if P$\ne$NP, then every paddable NPcomplete language has a verifier that admits no $(n/2+O(\sqrt{n\log n}))$Hammingapproximation algorithm. That is, one cannot get even half the bits right. We also consider natural verifiers for various wellknown NPcomplete problems. They do have $n/2$Hammingapproximation algorithms, but, if P$\ne$NP, have no $(n/2n^\epsilon)$Hammingapproximation algorithms for any constant $\epsilon>0$.
We show similar results for randomized algorithms.First draft circulated in 2003. 
SIAM Journal on Computing 41(3):684713(2012); STOC'02We study the generalization of Huffman Coding in which codeword letters have nonuniform costs (as in Morse code, where the dash is twice as long as the dot). Despite previous work by many authors including Richard Karp [1961] and Kurt Mehlhorn [1980], the problem is not known to be NPhard, nor was it previously known to have a constantfactor approximation algorithm. The paper describes a polynomialtime approximation scheme (PTAS) for the problem. The algorithm computes a $(1+\epsilon)$approximate solution in time $O(n + f(\epsilon) \log^3 n)$, where $n$ is the input size.Journal version of [2002].

Conference version of [2013].

Distributed Computing 24(1):4563(2011); PODC'09, DISC'09This paper gives polylogarithmicround, distributed δapproximation algorithms for covering problems with submodular cost and monotone covering constraints (Submodularcost Covering). The approximation ratio δ is the maximum number of variables in any constraint. Special cases include Covering Mixed Integer Linear Programs (CMIP), and Weighted Vertex Cover (with δ = 2).
Via duality, the paper also gives polylogarithmicround, distributed δapproximation algorithms for Fractional Packing linear programs (where δ is the maximum number of constraints in which any variable occurs), and for Max Weighted cMatching in hypergraphs (where δ is the maximum size of any of the hyperedges; for graphs δ = 2).
The paper also gives parallel (RNC) 2approximation algorithms for CMIP with two variables per constraint and Weighted Vertex Cover.
The algorithms are randomized. All of the approximation ratios exactly match those of comparable centralized algorithms.
(This paper gives distributed implementations of algorithms from Greedy δapproximation algorithm for covering with arbitrary constraints and submodular cost .)Journal version of two conference papers, on [covering] and [packing]. 
Time series shapelets are small, local patterns in a time series that are highly predictive of a class and are thus very useful features for building classifiers and for certain visualization and summarization tasks. While shapelets were introduced only recently, they have already seen significant adoption and extension in the community.
Despite their immense potential as a data mining primitive, there are two important limitations of shapelets. First, their expressiveness is limited to simple binary presence/absence questions. Second, even though shapelets are computed offline, the time taken to compute them is significant.
In this work, we address the latter problem by introducing a novel algorithm that finds shapelets in less time than current methods by an order of magnitude. Our algorithm is based on intelligent caching and reuse of computations, and the admissible pruning of the search space. Because our algorithm is so fast, it creates an opportunity to consider more expressive shapelet queries. In particular, we show for the first time an augmented shapelet representation that distinguishes the data based on conjunctions or disjunctions of shapelets. We call our novel representation LogicalShapelets. We demonstrate the efficiency of our approach on the classic benchmark datasets used for these problems, and show several case studies where logical shapelets significantly outperform the original shapelet representation and other time series classification techniques. We demonstrate the utility of our ideas in domains as diverse as gesture recognition, robotics, and biometrics. 
A 40page sketch of basic techniques in the design and analysis of combinatorial approximation algorithms. Sections: Introduction, Underlying principles, Approximation algorithms with small additive error, Performance guarantees, Randomized rounding and linear programming, Performance ratios and capproximation, Polynomial approximation schemes, Constantfactor performance guarantees, Logarithmic performance guarantees, Multicriteria problems, Hardtoapproximate problems, Research Issues and Summary, Defining Terms, Further Information.

The paper presents distributed and parallel δapproximation algorithms for covering problems, where δ is the maximum number of variables on which any constraint depends (for example, δ=2 for vertex cover).
Specific results include the following. For weighted vertex cover, the first distributed 2approximation algorithm taking O(log n) rounds and the first parallel 2approximation algorithm in RNC. The algorithms generalize to covering mixed integer linear programs (CMIP) with two variables per constraint (δ=2).
 For any covering problem with monotone constraints and submodular cost, a distributed $δ$approximation algorithm taking O(log^{2} C) rounds, where C is the number of constraints. (Special cases include CMIP, facility location, and probabilistic (twostage) variants of these problems.)
(This paper gives distributed implementations of algorithms from Greedy δapproximation algorithm for covering with arbitrary constraints and submodular cost .)Conference version of part of Distributed algorithms for covering, packing and maximum weighted matching. 
We present efficient distributed δapproximation algorithms for fractional packing and maximum weighted bmatching in hypergraphs, where δ is the maximum number of packing constraints in which a variable appears (for maximum weighted bmatching δ is the maximum edge degree  for graphs δ = 2). (a) For δ = 2 the algorithm runs in O(log m) rounds in expectation and with high probability. (b) For general δ, the algorithm runs in O(log^{2} m) rounds in expectation and with high probability.
(This paper gives distributed algorithms for the duals of problems considered in Greedy δapproximation algorithm for covering with arbitrary constraints and submodular cost .)Conference version of part of Distributed Algorithms for Covering, Packing and Maximum Weighted Matching. 
Conference version of [2013].

IEEE Transactions on Mobile Computing 8(5):590605(2009); SECON'06With fully directional communications, nodes must track and periodically update the positions of their discovered neighbors, so that communication with these neighbors is feasible when needed. If tracking fails, neighbors that move out of the directional footprint will need to be rediscovered. The tracking process introduces an overhead, which increases with the number of discovered neighbors. The overhead can be reduced if each node maintains only a subset of its neighbors; however, this may increase the hop count of paths between nodes, i.e., causes a hop stretch. In this work, we study the tradeoffs between node degree and hop stretch. We first design a topology control algorithm to optimize this tradeoff. For the purposes of this design, we assume that nodes perform circular directional transmissions to communicate with the nodes in their directional range; in this way, the network can be modeled as a unit disk graph (UDG). Given a UDG G, our algorithm finds a sparse subgraph G' with a maximum node degree of 6, and connecting each node pair u, v by a path of length hops_{G'}(u,v) = O(hops_{G}(u,v) + log Δ), where Δ is the maximum degree in G, and hops_{G}(u,v) denotes the length of the shortest path between u and v in graph G. We show that this result is near optimal. Based on the insights gained from the above design, we next construct a more practical scheme, which integrates topology control with fully directional neighbor discovery and maintenance. The simulated performance of our practical scheme is only slightly worse than the theoretical optimal performance in terms of node degree and path stretch.Journal version of [2006].

A brief summary of greedy setcover algorithms.
INDEX TERMS: dominating set, greedy algorithm, hitting set, set cover, minimizing a linear function subject to a submodular constraint. 
Algorithmica 50(4):455478(2008); Latin American Theoretical Informatics (LATIN'06; LNCS 3887:311322)Following Mettu and Plaxton, we study incremental algorithms for the $k$medians problem. Such an algorithm must produce a nested sequence $F_1 \subseteq F_2 \subseteq \cdots \subseteq F_n$ of sets of facilities. Mettu and Plaxton show that incremental metric medians has a (roughly) $40$competitive deterministic polynomialtime algorithm. We give improved algorithms, including a $(24+ε)$competitive deterministic polynomialtime algorithm and a $5.44$competitive, randomized, nonpolynomialtime algorithm.
We also consider the competitive ratio with respect to size. An algorithm is $s$sizecompetitive if, for each $k$, the cost of $F_k$ is at most the minimum cost of any set of $k$ facilities, while the size of $F_k$ is at most $s k$. We present optimally competitive algorithms for this problem.
Our proofs reduce incremental medians to the following online bidding problem: faced with some unknown threshold $T>0$, an algorithm must submit ``bids'' $b>0$ until it submits a bid as large as $T$. The algorithm pays the sum of its bids. We describe optimally competitive algorithms for online bidding.
Our results on costcompetitive incremental medians extend to approximately metric distance functions, incremental fractional medians, and incremental bicriteria approximation.Journal version of [2006]. 
A brief summary of online paging and caching results, 19852002.
INDEX TERMS: paging, caching, weighted caching, weighted paging, file caching, least recently used (paging algorithm), first in first out (paging algorithm), flush when full (paging algorithm), the Marking algorithm (paging algorithm), Balance algorithm (weighted caching algorithm), Greedy Dual (weighted caching algorithm), Landlord (file caching algorithm), Squid (file caching software), kserver problem, primaldual algorithms, randomized algorithms, online algorithms, competitive analysis, competitive ratio, loose competitiveness, accessgraph model, Markov paging. 
Journal of Bioinformatics and Computational Biology 5(4):937961(2007); APBC: AsiaPacific Bioinformatics ConferenceWe study the problem of selecting control clones in DNA array hybridization experiments. The problem arises in the OFRG method for analyzing microbial communities. The OFRG method performs classification of rRNA gene clones using binary fingerprints created from a series of hybridization experiments, where each experiment consists of hybridizing a collection of arrayed clones with a single oligonucleotide probe. This experiment produces analog signals, one for each clone, which then need to be classified, that is, converted into binary values 1 and 0 that represent hybridization and nonhybridization events. In addition to the sample rRNA gene clones, the array contains a number of control clones needed to calibrate the classification procedure of the hybridization signals. These control clones must be selected with care to optimize the classification process. We formulate this as a combinatorial optimization problem called Balanced Covering. We prove that the problem is NPhard, and we show some results on hardness of approximation. We propose approximation algorithms based on randomized rounding, and we show that, with high probability, our algorithms approximate well the optimum solution. The experimental results confirm that the algorithms find high quality control clones. The algorithms have been implemented and are publicly available as part of the software package called CloneTools.Journal version of [2007].

Conference version of [2007].

Conference version of [2013].

Given a rooted tree whose leaves have numeric labels, label the nodes so that each leaf label equals the sum of the node labels from the root to the leaf. Minimize the number of nonzero node labels. This problem models finding a parsimonius explanation of observed changes in hierarchical data. We give fast algorithms with applications.

An introduction to greedy approximation algorithms for combinatorial optimization problems. Informal definition of greedy algorithms. Greedy Set Cover, and other generalizations and applications. Greedy algorithms for other problems. Lagrangean relaxation based methods. Local ratio, greedy dual based schemes and greedy rounding. Connected Dominating Sets and other applications. Priority Based Approaches.

Poster in Proceedings of the IEEE International Conference on Database Engineering (ICDE),Poster presentation of [2007].

This paper is a followup to Topology control to simultaneously achieve nearoptimal node degree and low path stretch in ad hoc networks. It gives a related heuristic with more extensive empirical evaluation.

Conference version of [2008].

Information Processing Letters 97:6872(2006); COCOON'05The Reverse Greedy algorithm (RGreedy) for the $k$median problem works as follows. It starts by placing facilities on all nodes. At each step, it removes a facility to minimize the resulting total distance from the customers to the remaining facilities. It stops when $k$ facilities remain. We prove that, if the distance function is metric, then the approximation ratio of RGreedy is between $Ω(\log n/ \log \log n)$ and $O(\log n)$.Journal version of [2005].

SECON: IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and NetworksConference version of [2009].

Journal of Computer and System Sciences 71(4):495505(2005); FOCS'01Given matrices A and B and vectors a, b, c and d, all with nonnegative entries, we consider the problem of minimizing c$\cdot$x subject to x$\in Z^n_+$, Ax ≥ a, Bx ≤ b, and x ≤ d. We give a bicriteriaapproximation algorithm that, given ε$\in$(0, 1], finds a solution of cost O(log(m)/ε^{2}) times optimal, meeting the covering constraints Ax≥a and multiplicity constraints (x≤d), and satisfying Bx≤(1 + ε)b + β, where β is the vector of row sums, β_i=$\sum_j$ B_{ij}. Here m is the number of rows of A.
This gives an O(log m)approximation algorithm for CIP  minimumcost covering integer programs with multiplicity constraints, i.e., the special case when there are no packing constraints Bx ≤ b. The previous best approximation ratio has been O(log max_{j} $\sum_i$A_{ij} since 1982. CIP contains the set cover problem as a special case, so O(log m)approximation is the best possible unless P=NP.Journal version of Tight approximation results for general covering integer programs. 
Conference version of [2006].

Mathematics of Operations Research 29(3):04360461(2004); STOC'99The multiwaycut problem is, given a weighted graph and $k > 2$ terminal nodes, to find a minimumweight set of edges whose removal separates all the terminals. The problem is NPhard, and even NPhard to approximate within $1+δ$ for some small $δ > 0$.
This paper gives a 12/11approximation algorithm for $k=3$ and a 1.3438approximation algorithm in general. Working on this result was particularly fun because the problem of designing the algorithm was itself formulated as an optimization problem and approximately solved by computer. These computations guided the final (noncomputational) solutions.
The paper also gives a proof that for a general class of ``geometric embedding'' relaxations, there are always randomized rounding schemes that match the integrality gap. (Another example of such a geometricembedding relaxation is the semidefiniteprogramming relaxation of maxcut by Goemans and Williamson [1995].)
The geometric relaxation is due to Calinescu, Karloff, and Rabani [1998]. After our paper, the approximation ratio for general $k$ was subsequently improved to 1.3238 in [2013].Journal version of [1999]. 
The Astronomical Journal 125:22762286(2003)This paper describes a modification to the heuristic algorithm presented in ``Data collection for the Sloan Digital Sky Survey: a networkflow heuristic.'' The modification takes into account that galaxies that are too close together cannot be captured in the same snapshot. This was not considered in the original algorithm.

Conference version of [2012].

Algorithmica 33(3):371383(2002); SODA'98This paper introduced a new filecaching algorithm, called Landlord or GreedyDualsize, for use by web proxies and browsers. The paper gave theoretical analyses of the algorithm suggesting that it would perform well in practice.
A modified version of the algorithm is incorporated into the publicdomain Squid web proxy.
The algorithm was independently obtained by Pei Cao and Sandy Irani [1997]. Both they and John Dilley et al [1999] found that the algorithm worked well empirically.
The results in this paper strengthen and generalize those in [1994] and are further generalized in [2009].Journal version of [1998]. 
Congestion control in the current Internet is accomplished mainly by TCP/IP. To understand the macroscopic network behavior that results from TCP/IP and similar endtoend protocols, one main analytic technique is to show that the the protocol maximizes some global objective function of the network traffic.
Here we analyze a particular endtoend, MIMD (multiplicativeincrease, multiplicativedecrease) protocol. We show that if all users of the network use the protocol, and all connections last for at least logarithmically many rounds, then the total weighted throughput (value of all packets received) is near the maximum possible. Our analysis includes roundtriptimes, and (in contrast to most previous analyses) gives explicit convergence rates, allows connections to start and stop, and allows capacities to change.
An application explained in the paper is multipath bandwidth measurement. 
Mixed packing and covering problems are problems that can be formulated as linear programs using only nonnegative coefficients. Examples include multicommodity network flow, the HeldKarp lower bound on TSP, fractional relaxations of set cover, binpacking, knapsack, scheduling problems, minimumweight triangulation, etc. This paper gives approximation algorithms for the general class of problems. The sequential algorithm can be implemented to find an $(1\pmε)$approximate solution in $O(ε^{2}\log m)$ lineartime iterations.
For $ε = O(1)$, these algorithms are currently the fastest known for the general problem. The results generalize previous work on pure packing and covering (the special case when the constraints are all ``lessthan'' or all ``greaterthan'') by Michael Luby and Noam Nisan [1993]; and Naveen Garg and Jochen Konemann [1998]; and Lisa Fleischer [1999] 
Conference version of Approximation algorithms for covering/packing integer programs.

The paper gives approximation algorithms for the $k$medians and facilitylocation problems (both NPhard). For $k$medians, the algorithm returns a solution using at most $\ln(n+n/ε) k$ medians and having cost at most $(1+ε)$ times the cost of the best solution that uses at most $k$ medians. Here $ε>0$ is an input to the algorithm. In comparison, the best previous algorithm [Lin and Jeff Vitter, 1992] had a $(1+1/ε)\ln n$ term instead of the $\ln(n+n/ε)$ term in the performance guarantee. For facility location, the algorithm returns a solution of cost at most $d+\ln(n) k$, provided there exists a solution of cost $d+k$ where $d$ is the assignment cost and $k$ is the facility cost. In comparison, the best previous algorithm [Dorit Hochbaum, 1982] returned a solution of cost at most $\ln(n)(d+k)$. For both problems, the algorithms currently provide the best performance guarantee known for the general (nonmetric) problems.
The paper also introduces a new probabilistic bound (the socalled ChernoffWald bound) for bounding the expectation of the maximum of a collection of sums of random variables, when each sum contains a random number of terms. The bound is used to analyze the randomized rounding scheme that underlies the algorithms. 
Journal of Algorithms 37(1):218235(2000); SODA'98Elias Koutsoupias and Christos Papadimitriou introduced the socalled diffuseadversary model for analyzing paging strategies such as leastrecentlyused (LRU) [1994]. Briefly, the diffuse adversary model is as follows. Fix some $ε >0$ Run the algorithm, and, for each request, allow the adversary to assign a probability distribution to the items such that no item gets probability more than $ε$, then request an item randomly from the distribution. (When $ε$ is near 1, one has the standard competitiveanalysis model; when $ε$ is near 0, the adversary is severely handicapped.) The algorithm is $c$competitive if the expected cost of the algorithm on the adversarial sequence is at most $c$ times the expected cost of the optimal algorithm.
Koutsoupias and Papadimitriou proved that LRU was optimally competitive (among deterministic online algorithms) according to the model, but left open the question of what the performance guarantee for LRU actually was. This paper answers the question, giving an explicit formula for the performance guarantee within a factor of two. The paper also shows that the performance guarantee of firstinfirstout (FIFO) is much worse than that of LRU, and analyzes the performance guarantees of randomized online paging strategies.Journal version of Bounding the diffuse adversary. 
The data broadcast problem is to find a schedule for broadcasting a given set of messages over multiple channels. The goal is to minimize the cost of the broadcast plus the expected response time to clients who periodically and probabilistically tune in to wait for particular messages.
The problem models disseminating data to clients in asymmetric communication environments, where there is a much larger capacity from the information source to the clients than in the reverse direction. Examples include satellites, cable TV, Internet broadcast, and mobile phones. Such environments favor the ``pushbased'' model where the server broadcasts (pushes) its information on the communication medium and multiple clients simultaneously retrieve the specific information of individual interest.
This paper presents the first polynomialtime approximation scheme (PTAS) for data broadcast with $O(1)$ channels and when each message has arbitrary probability, unit length and bounded cost. The best previous polynomialtime approximation algorithm for this case has a performance ratio of 9/8 [BarNoy et al, 2000]. 
A 40page sketch of basic techniques in the design and analysis of combinatorial approximation algorithms. Sections: Introduction, Underlying principles, Approximation algorithms with small additive error, Performance guarantees, Randomized rounding and linear programming, Performance ratios and capproximation, Polynomial approximation schemes, Constantfactor performance guarantees, Logarithmic performance guarantees, Multicriteria problems, Hardtoapproximate problems, Research Issues and Summary, Defining Terms, Further Information.

Two common objectives for evaluating a schedule are the makespan, or schedule length, and the average completion time. This short note gives improved bounds on the existence of schedules that simultaneously optimize both criteria. In particular, for any ρ > 0, there exists a schedule of makespan at most 1+ρ times the minimum, with average completion time at most 1/(1e^{ρ}) times the minimum. The proof uses an infinitedimensional linear program to generalize and strengthen a previous analysis by Cliff Stein and Joel Wein [1997].

Conference version of [2015].

Conference version of [2004].

Conference version of Online paging against adversarially biased random inputs.

Journal of Algorithms 27(2):339356(1998); SODA'96The goal of the Sloan Digital Sky Survey is ``to map in detail onequarter of the entire sky, determining the positions and absolute brightnesses of more than 100 million celestial objects'' [1]. The survey will be performed by taking ``snapshots'' through a large telescope. Each snapshot can capture up to 600 objects from a small circle of the sky (as illustrated here). This paper describes the design and implementation of the algorithm that is being used to determine the snapshots so as to minimize their number. The problem is NPhard in general; the algorithm described is a heuristic, based on Lagrangianrelaxation and mincost network flow. It gets within 515% of a naive lower bound, whereas using a ``uniform'' cover only gets within 2535%.Journal version of [1996].

Conference version of [2002].

Patternmatchingbased documentcompression systems (e.g. for faxing) rely on finding a small set of patterns that can be used to represent all of the ink in the document. Finding an optimal set of patterns is NPhard; previous compression schemes have resorted to heuristics. This paper describes an extension of the crossentropy approach, used previously for measuring pattern similarity, to this problem. This approach reduces the problem to a kmedians problem, for which the paper gives a new algorithm with a provably good performance guarantee. In comparison to previous heuristics (First Fit, with and without generalized Lloyd's/kmeans postprocessing steps), the new algorithm generates a better codebook, resulting in an overall improvement in compression performance of almost 17%.

Journal of Algorithms 24(2):310324(1997)The problem considered is the following. Given a graph with edge weights satisfying the triangle inequality, and a degree bound for each vertex, compute a lowweight spanning tree such that the degree of each vertex is at most its specified bound. The problem is NPhard (it generalizes Traveling Salesman (TSP)). The paper describes a networkflow heuristic for modifying a given tree T to meet the constraints. Choosing T to be a minimum spanning tree (MST) yields approximation algorithms with performance guarantee less than 2 for the problem on geometric graphs with L_{p}norms. The paper also describes a Euclidean graph whose minimum TSP costs twice the MST, disproving a conjecture made in [Lowdegree spanning trees of small weight].

Information Processing Letters 63(5):229235(1997)The paper focuses on two problems: (i) how to orient the edges of an undirected graph in order to maximize the number of ordered vertex pairs $(x,y)$ such that there is a directed path from $x$ to $y$, and (ii) how to orient the edges so as to minimize the number of such pairs. The paper describes a quadratictime algorithm for the first problem, and a proof that the second problem is NPhard to approximate within some constant $1+ε > 1$. The latter proof also shows that the second problem is equivalent to ``comparability graph completion''; neither problem was previously known to be NPhard.

J. Combinatorial Theory, Series A 76(1):4454(1996)

Conference version of [1998].

SIAM Journal on Computing 25(2):355368(1996); STOC'94The degreed spanning tree problem asks for a minimumweight spanning tree in which the degree of each vertex is at most $d$. When $d=2$ the problem is TSP, and in this case, the wellknown Christofides algorithm provides a 1.5approximation algorithm (assuming the edge weights satisfy the triangle inequality).
Christos Papadimitriou and Umesh Vazirani posed the challenge of finding an algorithm with performance guarantee less than 2 for Euclidean graphs (points in $R^n$) and $d > 2$ [1984]. This paper gives the first answer to that challenge, presenting an algorithm to compute a degree3 spanning tree of cost at most 5/3 times the MST. For points in the plane, the ratio improves to 3/2 and the algorithm can also find a degree4 spanning tree of cost at most 5/4 times the MST.Journal version of [1994]. 
Discrete Applied Mathematics 69(3):281289(1996)The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NPhard; This paper gives a proof that, for graphs where each directed cycle has at most three edges, the MEG problem is equivalent to maximum bipartite matching, and therefore solvable in polynomial time. This leads to an improvement in the performance guarantee of the previously best approximation algorithm for the general problem, from ``Approximating the minimum equivalent digraph'' [1995].
(This result was improved by Berman et al, WADS, 2009.) 
An experimental implementation and evaluation of data structures from ``Approximate data structures with applications'' [1994]. To summarize: ``Our results suggest that the data structure is practical and can be faster than traditional priority queues when holding a large number of keys, and that tolerance for approximate answers can lead to significant increases in speed.''

SIAM Journal on Computing 25(6):12811304(1996); ICALP'94Describes a nearlineartime algorithm for a variant of Huffman coding, in which the letters may have nonuniform lengths (as in Morse code), but with the restriction that each word to be encoded has equal probability.
See also [Huffman coding with unequal letter costs].Journal version of [1994]. 
SIAM Journal on Computing 24(4):859872(1995); SODA'94The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NPhard. This paper gives an approximation algorithm with performance guarantee of pi^{2}/6 ~ 1.64. The algorithm and its analysis are based on the simple idea of contracting long cycles.
(This result was strengthened slightly in ``On strongly connected digraphs with bounded cycle length'' [1996].) The analysis applies directly to 2Exchange, a simple local improvement algorithm, showing that its performance guarantee is 1.75. It was further improved by Berman et al, WADS 2009.)Journal version of [1994]. 
Algorithmica 14(4):305322(1995); SODA'93This paper has a simple lineartime algorithm to find a spanning tree that simultaneously approximates a shortestpath tree and a minimum spanning tree. The algorithm provides a continuous tradeoff: given the two trees and $ε > 0$, the algorithm returns a spanning tree in which the distance between any vertex and the root of the shortestpath tree is at most $1+ε$ times the shortestpath distance, and yet the total weight of the tree is at most $1+2/ε$ times the weight of a minimum spanning tree. This is the best tradeoff possible.Journal version of [1993].

Randomized rounding is a standard method, based on the probabilistic method, for designing combinatorial approximation algorithms. This paper introduces a variant of randomized rounding that can be used to derive Lagrangianrelaxation algorithms and greedy algorithms. This was surprising  the approximation algorithms derived previously by the method require first solving a linear program (or some other optimization problem). Indeed, in Prabhakar Raghavan's seminal paper introducing randomized rounding [1988], he says: "The time taken to solve the linear program relaxations of the integer programs dominates the net running time theoretically (and, most likely, in practice as well)."
The variant of randomized rounding described here avoids this bottleneck, at least for packing/coveringtype problems. The resulting algorithms are Lagrangianrelaxation algorithms and greedy algorithms (in fact the greedy setcover algorithm is an example), and are faster and simpler to implement than standard randomizedrounding algorithms. This approach gives a systematic, coherent, and comprehensive understanding of Lagrangianrelaxation algorithms for packing and covering linear programs.
The ideas introduced here are also used in ``Sequential and parallel algorithms for mixed packing and covering'' [2001], in ``Kmedians, facility location, and the ChernoffWald bound'' [2000]), and in ``On the number of iterations for DantzigWolfe optimization and packingcovering approximation algorithms'' [1999].Lecture notes on this topic are here. 
Cornell University School of ORIE Technical Report 1103(1994)We consider the problem of choosing Euclidean points to maximize the sum of their weighted pairwise distances, when each point is constrained to a ball centered at the origin. We derive a dual minimization problem and show strong duality holds (i.e., the resulting upper bound is tight) when some locally optimal configuration of points is affinely independent. We sketch a polynomial time algorithm for finding a nearoptimal set of points.

Journal of Algorithms 17(2):280289(1994); IPCO'93The paper describes a simple deterministic parallel/distributed $(2+ε)$approximation algorithm for the minimumweight vertexcover problem and its dual (edge/element packing). This paper was one of the first to use the primaldual method for approximation in the distributed setting. This result was strengthened in `` Distributed and parallel algorithms for weighted vertex cover and other covering problems.''Journal version of [1993].

This paper explores the notion of approximate data structures, which return approximately correct answers to queries, but run faster than their exact counterparts. The paper describes approximate variants of the van Emde Boas data structure, which support the same dynamic operations as the standard van Emde Boas data structure (min, max, successor, predecessor, and existence queries, as well as insertion and deletion), except that answers to queries are approximate. The variants support all operations in constant time provided the performance guarantee is 1+1/polylog n, and in O(log log n) time provided the performance guarantee is 1+1/poly(n), for n elements in the data structure.
Applications described include Prim's minimumspanningtree algorithm, Dijkstra's singlesource shortest paths algorithm, and an online variant of Graham's convex hull algorithm. To obtain output which approximates the desired output with the performance guarantee tending to 1, Prim's algorithm requires only linear time, Dijkstra's algorithm requires O(m log log n) time, and the online variant of Graham's algorithm requires constant amortized time per operation. 
Conference version of [1995].

Information Processing Letters 50(1):4955(1994); WADS'93The traditional multicommodity flow problem assumes a given flow network in which multiple commodities are to be maximally routed in response to given demands. This paper considers the multicommodity flow networkdesign problem: given a set of multicommodity flow demands, find a network subject to certain constraints such that the commodities can be maximally routed.
This paper focuses on the case when the network is required to be a tree. The main result is an approximation algorithm for the case when the tree is required to be of constant degree. The algorithm reduces the problem to the minimumweight balancedseparator problem; the performance guarantee of the algorithm is within a factor of 4 of the performance guarantee of the balancedseparator procedure. If Tom Leighton and Satish Rao's balancedseparator procedure [1988] is used, the performance guarantee is $O(\log n)$. This improves the $O(\log^2 n)$ approximation factor that is trivial to obtain by a direct application of the balancedseparator method.Journal version of [1993]. 
Conference version of [1996].

Conference version of [1996].

Von Neumann's MinMax Theorem guarantees that each player of a zerosum matrix game has an optimal mixed strategy. This paper gives an elementary proof that each player has a nearoptimal mixed strategy that chooses uniformly at random from a multiset of pure strategies of size logarithmic in the number of pure strategies available to the opponent.
For exponentially large games, for which even representing an optimal mixed strategy can require exponential space, it follows that there are nearoptimal, linearsize strategies. These strategies are easy to play and serve as small witnesses to the approximate value of the game.
As a corollary, it follows that every language has small ``hard'' multisets of inputs certifying that small circuits can't decide the language. For example, if SAT does not have polynomialsize circuits, then, for each $n$ and $c$, there is a set of $n^{O(c)}$ Boolean formulae of size $n$ such that no circuit of size $n^c$ (or algorithm running in time $n^c$) classifies more than twothirds of the formulae successfully.
The "simple strategies" observation was generalized to Nash equilibrium by Lipton et al. in their paper Playing large games using simple strategies. 
Algorithmica 11(6):525541(1994); SODA'91This paper has two results. The first is based on the surprising observation that the wellknown ``leastrecentlyused'' paging algorithm and the ``balance'' algorithm for weighted caching are linearprogramming primaldual algorithms. This observation leads to a strategy (called ``GreedyDual'') that generalizes them both and has an optimal performance guarantee for weighted caching.
For the second set of results, the paper presents empirical studies of paging algorithms, documenting that in practice, on ``typical'' cache sizes and sequences, the performance of paging strategies are much better than their worstcase analyses in the standard model suggest. The paper then presents theoretical results that support and explain this. For example: on any input sequence, with almost all cache sizes, either the performance guarantee of leastrecentlyused is $O(\log k)$ or the fault rate (in an absolute sense) is insignificant.
These results are strengthened and/or generalized in [1998] and [ 2009].Journal version of [1991]. 
Conference version of [1994].

Conference version of [1995].

Conference version of [1994].

Journal of Algorithms 12(4):685699(1991)The paging problem is that of deciding which pages to keep in a memory of $k$ pages in order to minimize the number of page faults. This paper introduced the marking algorithm , a randomized online algorithm for the paging problem, and a proof that its performance guarantee is $O(\log k)$. This was one of the first results in the area showing that a randomized algorithm can have a strictly better performance guarantee than any deterministic online algorithm (no deterministic online algorithm can have a performance guarantee better than k).

Technical Report, Computer Science Department, Princeton University CSTR34891(1991)The thesis was about online algorithms. The main results were published in papers described above. The unpublished results were mainly exploring the role of linearprogramming duality in online algorithms, including a method for deriving a potentialfunction analysis from a linearprogramming duality analysis, and studies of some online matching algorithms. There were also some minor results analyzing paging strategies with lookahead.

Networks 21(2):205221(1991)The parametric shortest path problem is to find the shortest paths in graph where the edge costs are of the form $w_{ij}+\lambda$, where each $w_{ij}$ is constant and $\lambda$ is a parameter that varies. The problem is to find shortest path trees for every possible value of $\lambda$.
The minimumbalance problem is to find a ``weighting'' of the vertices so that adjusting the edge costs by the vertex weights yields a graph in which, for every cut, the minimum weight of any edge crossing the cut in one direction equals the minimum weight of any edge crossing the cut in the other direction.
The paper presents the fastest known algorithms for both problems. The algorithms run in $O(nm+n^2\log n)$ time. The paper also describes empirical studies of the algorithms on random graphs, suggesting that the expected time for finding minimummean cycles (an important special case of both problems) is $O(m + n \log n)$. 
Technical Report, Computer Science Department, Princeton University CSTR31791(1991)This technical report presented notes from the first eight lectures of the class Many Models of Complexity taught by László Lovász at Princeton University in the fall of 1990. The topic is evasiveness of graph properties: given a graph property, how many edges of the graph an algorithm must check in the worst case before it knows whether the property holds.

Conference version of The KServer Dual and Loose Competitiveness for Paging.