Photo by Amos Nachoum
UCR Matrix Profile Page
Funded by NSF
IIS 1161997 II, IIS 1510741,
The Matrix Profile (and the algorithms to compute it: STAMP,
STOMP, SCRIMP, SCRIMP++ and GPU-STOMP), has the potential to revolutionize time series
data mining because of its generality, versatility, simplicity and
scalability. In particular it has implications for time
motif discovery, time series joins, shapelet discovery (classification), density
estimation, semantic segmentation, visualization, rule discovery, clustering etc (note, for pure similarity search, we suggest you see MASS for Euclidean Distance, and the UCR Suite for DTW)
Our overarching claim has three parts:
1) Given only the Matrix
Profile, most time series data mining tasks are trivial.
2) The Matrix Profile can be computed very efficiently.
3) Algorithms that are built on top the Matrix Profile inherit all its
desirable properties. For
example, we can use the Matrix Profile to find time series motifs.
Because the Matrix Profile can be incrementally computed, we have the
first incremental time series motif discovery algorithm. Because the
Matrix Profile can be computed in an anytime fashion, we have the first
anytime time series motif discovery algorithm. Because the Matrix
Profile can leverage GPUs, we have..
advantages of using the Matrix Profile (over hashing, indexing, brute
forcing a dimensionality reduced representation etc.) for most time
series data mining tasks include:
- It is exact:
For motif discovery, discord discovery, time series joins etc., the
Matrix Profile based methods provide no false positives or false
It is simple and parameter-free:
In contrast, the more general spatial access method algorithms
typically require building and tuning spatial access methods and/or
It is space efficient:
Matrix Profile construction algorithms requires an inconsequential
space overhead, just linear in the time series length with a small
constant factor, allowing massive datasets to be processed in main memory.
It allows anytime algorithms:
While our exact algorithms are extremely scalable, for extremely large
datasets we can compute the Matrix Profile in an anytime fashion,
allowing ultra-fast approximate solutions.
It is incrementally maintainable:
Having computed the Matrix Profile for a dataset, we can incrementally
update it very efficiently. In many domains this means we can
effectively maintain exact joins/motifs/discords on streaming data
not require the user to set similarity/distance thresholds:
For time series joins, the Matrix Profile provides full joins,
eliminating the need to specify a similarity threshold, which is an
unintuitive task for time series.
It can leverage hardware: Matrix Profile construction is embarrassingly parallelizable, both on multicore
processors and in distributed systems.
It has time complexity that is constant
in subsequence length:
This is a very unusual and desirable property; all known time series
join/motif/discord algorithms scale poorly as the subsequence length
grows. In contrast, we have shown time series joins/motifs with
subsequences lengths up to 100,000, at least two orders of magnitude
longer than any other work we are aware of.
It can be constructed in deterministic time:
All join/motif/discord algorithms we are aware of can radically
different times to finish on two (even slightly) different datasets. In
contrast, given only the length of the time series, we can precisely
predict in advance how long it will take to compute the Matrix Profile
It can handle missing data: Even
in the presence of missing data, we can provide answers which are
guaranteed to have no false negatives.
Given all these features, the Matrix Profile has implications for many,
perhaps most, time series data mining tasks.
Profile I: All Pairs Similarity Joins for Time Series: A Unifying View
that Includes Motifs, Discords and Shapelets. Chin-Chia
Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding,
Hoang Anh Dau, Diego Furtado Silva, Abdullah Mueen, Eamonn Keogh
(2016). IEEE ICDM
2016. [pdf] [slides]
Profile II: Exploiting a Novel Algorithm and GPUs to break the one
Hundred Million Barrier for Time Series Motifs and Joins. Yan
Zhu, Zachary Zimmerman, Nader Shakibay Senobari, Chin-Chia Michael Yeh,
Gareth Funning, Abdullah Mueen, Philip Berisk and Eamonn Keogh (2016). EEE ICDM 2016. [pdf] [slides] Shortlisted for best paper award.
- Matrix Profile III: The
Matrix Profile allows Visualization of Salient Subsequences in Massive
Time Series. Chin-Chia Michael Yeh, Helga Van Herle, Eamonn Keogh (2016). IEEE ICDM 2016. [pdf] [slides] Supporting Page.
- Matrix Profile IIII: Using Weakly Labeled Time Series to Predict Outcomes. Chin-Chia Michael Yeh, Nickolas Kavantzas and; Eamonn Keogh. VLDB 2017,, Munich Germany.
- Matrix Profile V: A Generic Technique to Incorporate Domain Knowledge into Motif Discovery. Hoang Anh Dau and Eamonn Keogh. [pdf] KDD'17, Halifax, Canada.
- Matrix Profile
VI: Meaningful Multidimensional Motif Discovery. Chin-Chia Michael Yeh, Nickolas Kavantzas, Eamonn Keogh. [pdf] ICDM 2017.
Profile VII: Time Series Chains: A New Primitive for Time Series Data
Mining. Yan Zhu, Makoto Imamura, Daniel Nikovski, and Eamonn Keogh. [pdf] ICDM 2017. Winner best paper award. [slides]
- Matrix Profile VIII: Domain Agnostic
Online Semantic Segmentation at Superhuman Performance Levels. Shaghayegh Gharghabi, Yifei Ding, Chin-Chia Michael Yeh, Kaveh Kamgar,
Liudmila Ulanova, and Eamonn Keogh. [pdf] ICDM
- Matrix Proﬁle IX: (TBA)
- Matrix Proﬁle X: VALMOD - Scalable Discovery of Variable-Length Motifs in Data Series. Michele Linardi ,Yan Zhu ,Themis Palpanas and Eamonn Keogh. SIGMOD 2018.
- Matrix Proﬁle XI: (TBA)
- Time Series Joins, Motifs, Discords and Shapelets: a Unifying View that Exploits the Matrix Profile. Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum,
Yifei Ding, Hoang Anh Dau, Zachary Zimmerman, Diego Furtado Silva,
Abdullah Mueen, Eamonn Keogh. [pdf] Data Mining and Knowledge Discovery.
- Exploiting a Novel Algorithm and GPUs to Break the Ten Quadrillion Pairwise Comparisons Barrier for Time Series Motifs and Joins. Yan
Zhu, Zachary Zimmerman, Nader Shakibay Senobari, Chin-Chia Michael Yeh,
Gareth Funning, Abdullah Mueen, Philip Brisk and Eamonn Keogh. KAIS Journal [pdf]
- SiMPle: Assessing Music Similarity Using Subsequences. Diego Silva, Chin-Chia Michael Yeh, Gustavo
Batista and Eamonn Keogh (2016).
ISMIR 2016. [pdf]
- Fast Similarity Matrix Profile for Music Analysis and
Exploration. Diego Furtado
Silva, Chin-Chia Michael Yeh, Yan Zhu, Gustavo E. A. P. A. Batista, Eamonn J.
Keogh: (2017). IEEE Transactions on Multimedia [pdf].
- Time Series Chains: A Novel
Tool for Time Series Data Mining.Yan Zhu, Makoto Imamura, Daniel Nikovski, and Eamonn
Keogh. Best of track in IJCAI 2018.
- VALMOD: A Suite for Easy and Exact Detection of Variable
Length Motifs in Data Series (2018). Michele Linardi, Yan Zhu, Themis Palpanas,
Eamonn J. Keogh: Demonstration track. SIGMOD Conference 2018: 1757-1760.
- Domain Agnostic Online Semantic Segmentation for
Multi-Dimensional Time Series. Shaghayegh Gharghabi, Chin-Chia Michael Yeh,
Yifei Ding, Wei Ding, Paul Hibbing, Samuel LaMunion, Andrew Kaplan, Scott E.
Crouter, Eamonn Keogh. Data Mining and Knowledge Discovery.
The Matrix Profile Tutorial
Part 1: PPT or PDF
Part 2: PPT or PDF
Code: Version 2.0
note this code is not STAMP or STOMP, but a new algorithm, SCRIMP
(which has yet to appear in a publication), which is as fast as STOMP,
but also an anytime algorithm. The code is wrapped inside a simple Matlab GUI
(that adds a lot of time overhead), to allow non-specailists to
interact with their data. If you are writing a paper, please do not
comparing timing results to this version, it would not be fair to us (contact us for the faster, but less user friendly code).
First read this, then download this data (some penguin data) and download the code into your Matlab path:
At the command line, type...
>> load MP_first_test_penguin_sample
>> [matrixProfile, profileIndex, motifIndex, discordIndex] = interactiveMatrixProfileVer2(smooth(penguin_sample,10) ,800);
you can see from the image below, even though the data has 109,043
datapoints, and the subsequence is high dimensional (800 datapoints),
in about a second we have already found some very interesting motifs:
The first is a valley happens during a dive. The third motif, a W
shape followed by a long constant, happens when the penguin.
Happy motif hunting, from the Matrix Profile team.