Publications in Selected Areas

High Performance Computing Systems

EuroSys (3), USENIX ATC (1), BigData (2), IPDPS (3), HPDC (3), ICS (5), SC (7), RTSS (2), IROS (1), SoCC (1)

SoCC	IncBoost: Scaling Incremental Graph Processing for Edge Deletions and Weight Updates (2024)
IROS	P4: Pruning and Prediction-based Priority Planning (2024)
EuroSys	Core Graph: Exploiting Edge Centrality to Speedup the Evaluation of Iterative Graph Queries (2024)
EuroSys	Tripoline: Generalized Incremental Graph Processing via Graph Triangle Inequality (2021)
EuroSys	Subway: Minimizing Data Transfer during Out-of-GPU-Memory Graph Processing (2020)
USENIX ATC	Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing (2016)
BigData	BEAD: Batched Evaluation of Iterative Graph-Queries with Evolving Analytics Demands (2020)
BigData	MultiLyra: Scalable Distributed Evaluation of Batches of Iterative Graph Queries (2019)
IPDPS	UVVs: Identifying Unchanged Vertex Values in Evolving Graphs via Intersection-Union Analysis (2026)
IPDPS	COMPI: Concolic Testing for MPI Applications (2018)
IPDPS	Eliminating Intra-warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement (2016)
HPDC	Efficient Processing of Large Graphs via Input Reduction (2016)
HPDC	Parallel Execution Profiles (2016)
HPDC	CuSha: Vertex-Centric Graph Processing on GPUs (2014)
ICS	DSGEN: Concolic Testing GPU Implementations of Concurrent Dynamic Data Structures (2021)
ICS	CuMAS: Data Transfer Aware Multi-Application Scheduling for Shared GPUs (2016)
ICS	PeerWave: Exploiting Wavefront Parallelism on GPUs with Peer-SM Synchronization (2015)
ICS	Address-aware Fences (2013)
ICS	Load and Store Reuse Using Register File Contents (2001)
SC	ParaStack: Efficient Hang Detection for MPI Programs at Large Scale (2017)
SC	Fence Scoping (2014)
SC	Compiled Communication for All-Optical TDM Networks (1996)
SC	Techniques for Integrating Parallelizing Transformations and Compiler Based Scheduling Methods (1992)
SC	Loop Displacement: An Approach for Transforming and Scheduling Loops for Parallel Execution (1990)
SC	Improving Instruction Cache Performance by Reducing Cache Pollution (1990)
SC	The Design of a RISC based Multiprocessor Chip (1990)
RTSS	Busy-Idle Profiles and Compact Task Graphs: Compile-time Support for ... Scheduling of Real-Time Tasks (1994)
RTSS	Applying Compiler Techniques to Scheduling in Real Time Systems (1990)

Programming Languages and Compilers

OOPSLA (3), PPoPP/PPEALS (5), POPL (3), PLDI/PLDI-20-Years (15), ICCL (3), CGO (6)

OOPSLA	DProf: Distributed Profiler with Strong Guarantees (2019)
OOPSLA	RAIVE: Runtime Assessment of Floating-Point Instability by Vectorization (2015)
OOPSLA	ASPIRE: Exploiting Asynchronous Parallelism in Iterative Algorithms using a Relaxed Consistency based DSM (2014)
PPoPP	PANNS: Enhancing Graph-based Approximate Nearest Neighbor Search through Recency-aware Construction and Parameterized Search (2025)
PPoPP	SpiceC: Scalable Parallelism via Implicit Copying and Explicit Commit (2011)
PPoPP	Enhanced Speculative Parallelization Via Incremental Recovery (2011)
PPoPP	Employing Register Channels for the Exploitation of Instruction Level Parallelism (1990)
PPEALS	Compile-time Techniques for Efficient Utilization of Parallel Memories (1988)
POPL	Bitwidth Aware Global Register Allocation (2003)
POPL	Demand-Driven Computation of Interprocedural Data Flow (1995)
POPL	Generalized Dominators and Post-Dominators (1992)
PLDI	Effective Parallelization of Loops in the Presence of I/O Operations (2012)
PLDI	Supporting Speculative Parallelization in the Presence of Dynamic Data Structures (2010)
PLDI	Towards Locating Execution Omission Errors (2007)
PLDI	Pruning Dynamic Slices With Confidence (2006)
PLDI	Cost Effective Dynamic Program Slicing (2004)
PLDI: 20 Years	Retrospective -- Complete Removal of Redundant Expressions
PLDI	Timestamped Whole Program Path Representation and its Applications (2001)
PLDI	ABCD: Eliminating Array Bounds Checks on Demand (2000)
PLDI	Load-Reuse Analysis: Design and Evaluation (1999)
PLDI	Complete Removal of Redundant Expressions (1998)
PLDI	Partial Dead Code Elimination using Slicing Transformations (1997)
PLDI	Interprocedural Conditional Branch Elimination (1997)
PLDI	A Practical Data Flow Framework for Array Reference Analysis and its Application in Optimizations (1993)
PLDI	A Fresh Look at Optimizing Array Bound Checks (1990)
PLDI	Register Allocation via Clique Separators (1989)
ICCL	Automatic Generation of Microarchitecture Simulators (1998)
ICCL	Path Profile Guided Partial Redundancy Elimination Using Speculation (1998)
ICCL	SPMD Execution of Programs with Dynamic Data Structures on Distributed Memory Machines (1992)
CGO	PreFix: Optimizing the Performance of Heap-Intensive Applications (2025)
CGO	White-Box Program Tuning (2019)
CGO	DrDebug: Deterministic Replay based Cyclic Debugging with Dynamic Slicing (2014)
CGO	Lightweight Fault Detection in Parallelized Programs (2013)
CGO	Extending Path Profiling across Loop Backedges and Procedure Boundaries (2004)
CGO	Hiding Program Slices for Software Security (2003)

Computer Architecture

ASPLOS (8), ISCA (2), MICRO (14), HPCA (3), PACT (12)

ASPLOS	Glign: Taming Misaligned Graph Traversals in Concurrent Graph Processing (2023)
ASPLOS	CommonGraph: Graph Analytics on Evolving Data (2023)
ASPLOS	PnP: Pruning and Prediction for Point-To-Point Iterative Graph Analytics (2019)
ASPLOS	KickStarter: Fast and Accurate Computations on Streaming Graphs via Trimmed Approximations (2017)
ASPLOS	CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing (2017)
ASPLOS	Efficient Sequential Consistency via Conflict Ordering (2012)
ASPLOS	Frequent Value Locality and Value-Centric Data Cache Design (2000)
ASPLOS	The Fuzzy Barrier: A Mechanism for High-Speed Synchronization of Processors (1989)
ISCA	ECMon: Exposing Cache Events for Monitoring (2009)
ISCA	Value Prediction in VLIW Machines (1999)
MICRO	MEGA Evolving Graph Accelerator (2023)
MICRO	JetStream: Graph Analytics on Streaming Data with Event-Driven Hardware Accelerator (2021)
MICRO	GraphPulse: An Event-Driven Hardware Accelerator for Asynchronous Graph Processing (2020)
MICRO	Efficient Warp Execution in Presence of Divergence with Collaborative Context Collection (2015)
MICRO	Copy Or Discard Execution Model For Speculative Parallelization On Multicores (2008)
MICRO	Efficient Use of Invisible Registers in Thumb Code (2005)
MICRO	Whole Execution Traces (2004)
MICRO	Energy Efficient Frequent Value Data Cache Design (2002)
MICRO	Frequent Value Compression in Data Caches (2000)
MICRO	Dynamic Memory Disambiguation in the Presence of Out-of-order Store Issuing (1999)
MICRO	Resource-Sensitive Profile-Directed Data Flow Analysis for Code Optimization (1997)
MICRO	A Shape Matching Approach for Scheduling Fine-Grained Parallelism (1992)
MICRO	Executing Loops on a Fine-Grained MIMD Architecture (1991)
MICRO	A Fine-grained MIMD Architecture based upon Register Channels (1990)
HPCA	SENSS: Security Enhancement to Symmetric Shared Memory Multiprocessors (2005)
HPCA	Global Context-based Value Prediction (1999)
HPCA	Distributed Path Reservation Algorithms for Multiplexed All-Optical Interconnection Networks (1997)
PACT	Scalable SIMD-Efficient Graph Processing on GPUs (2015)
PACT	Stadium Hashing: Scalable and Flexible Hashing on GPUs (2015)
PACT	Shuffling: A Framework for Lock Contention Aware Thread Scheduling for Multicore Multiprocessor Systems (2014)
PACT	No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs (2011)
PACT	Efficient Sequential Consistency Using Conditional Fences (2010), Recipient of a PACT 2010 Best Paper Award
PACT	Extended Whole Program Paths (2005)
PACT	Caching and Predicting Branch Sequences for Improved Fetch Effectiveness (1999)
PACT	Superscalar Execution with Direct Data Forwarding (1998)
PACT	Capturing the Effects of Code Improving Transformations (1998)
PACT	Path Profile Guided Partial Dead Code Elimination Using Predication (1997)
PACT	Resource Spackling: A Framework for Integrating Register Allocation in Local and Global Schedulers (1994)
PACT	URSA: A Unified ReSource Allocator for Registers and Functional Units in VLIW Architectures (1993)

Software Engineering

ICSE (5), ASE (1), ESEC-FSE/FSE (5), ISSTA/ISTAV (4), ICSM (10)

ICSE	Dynamic Slicing for Android (2019)
ICSE	Locating Faults Through Automated Predicate Switching (2006)
ICSE	Effective Forward Computation of Dynamic Slices Using Reduced Ordered Binary Decision Diagrams (2004)
ICSE	Precise Dynamic Slicing Algorithms (2003), Recipient of ICSE 2003 Distinguished Paper Award
ICSE	A Demand-Driven Analyzer for Data Flow Testing at the Integration Level (1996)
ASE	Locating Faulty Code Using Failure-Inducing Chops (2005)
FSE	Dynamic Slicing Long Running Programs through Execution Fast Forwarding (2006)
ESEC-FSE	Matching Execution Histories of Program Versions (2005)
ESEC-FSE	Comparison Checking: An Approach to Avoid Debugging of Optimized Code (1999)
ESEC-FSE	Refining Data Flow Information using Infeasible Paths (1997)
FSE	Hybrid Slicing: An Approach for Refining Static Slices using Dynamic Information (1995)
ISSTA	Fault Localization Using Value Replacement (2008)
ISSTA	Dynamic Recognition of Synchronization Operations for Improved Data Race Detection (2008)
ISSTA	Enabling Tracing of Long-Running Multithreaded Programs via Dynamic Execution Reduction (2007)
ISTAV	Loop Monotonic Computations: An Approach for the Efficient Run-time Detection of Races (1991)
ICSM	Detecting Virus Mutations Via Dynamic Matching (2009)
ICSM	Effective and Efficient Localization of Multiple Faults Using Value Replacement (2009)
ICSM	Identifying the Root Causes of Memory Bugs Using Corrupted Memory Location Suppression (2008)
ICSM	Dynamic Slicing of Multithreaded Programs for Race Detection (2008)
ICSM	ONTRAC: A System for Efficient ONline TRACing for Debugging (2007)
ICSM	Matching Control Flow of Program Versions (2007)
ICSM	Priority Based Data Flow Testing (1995)
ICSM	A Framework for Partial Data Flow Analysis (1994)
ICSM	An Approach to Regression Testing using Slicing (1992)
ICSM	A Methodology for Controlling the Size of a Test Suite (1990)

ACM Transactions

TOPLAS/LOPLAS (5), TOSEM (2), TACO (9), TECS (2), TODAES (1)

ACM TOPLAS	Execution Suppression: An Automated Iterative Technique for Locating Memory Errors (2010)
ACM TOPLAS	Cost and Precision Tradeoffs of Dynamic Data Slicing Algorithms (2005)
ACM TOPLAS	A Practical Framework for Demand-Driven Interprocedural Data Flow Analysis (1997)
ACM TOPLAS	Efficient Register Allocation Via Coloring Using Clique Separators (1994)
ACM LOPLAS	Optimizing Array Bound Checks Using Flow Analysis (1994)
ACM TOSEM	Hybrid Slicing: Integrating Dynamic Information with Static Analysis (1997)
ACM TOSEM	A Methodology for Controlling the Size of a Test Suite (1993)
ACM TACO	Synergistic Analysis of Evolving Graphs (2016)
ACM TACO	Tumbler: An Effective Load Balancing Technique for MultiCPU Multicore Systems (2016)
ACM TACO	ADAPT: A Framework for Coscheduling Multithreaded Programs (2013)
ACM TACO	A Dynamic Self Scheduling Scheme for Heterogeneous Multiprocessor Architectures (2013)
ACM TACO	PLDS: Partitioning Linked Data Structures for Parallelism (2012)
ACM TACO	Thread Tranquilizer: Dynamically Reducing Performance Variation (2012)
ACM TACO	Dynamic Access Distance Driven Cache Replacement (2011)
ACM TACO	Unified Control Flow and Dependence Traces (2007)
ACM TACO	Whole Execution Traces and their Applications (2005)
ACM TECS	Dynamic Coalescing for 16-bit Instructions (2005)
ACM TECS	Frequent Value Locality and its Applications (2002)
ACM TODAES	Frequent Value Encoding for Low Power Data Buses (2004)