DAME : Disk-Aware Motif Enumeration
This is a supporting page for our paper, A Disk-Aware Algorithm for Time Series Motif Discovery, by Abdullah Mueen, Eamonn Keogh and Nima Bigdely-Shamlo.
The paper is here.
Code and Executables 
We have two versions of DAME: one that finds motifs in a database of time series and one that finds subsequence motifs. We also provide code for three Brute Force motif finding algorithms, which are a lot slower than DAME. Click here to download the code. The code is written in C and compiled in a Windows environment. Please follow the instructions in the Readme.txt in each subfolder for detailed information.

Gunnar Waterstraat from Charité - Universitätsmedizin Berlin has made a Linux version of the code, which is available here. Thanks to Gunnar Waterstraat.

Spreadsheet of Experimental Results 
We have compiled the results of all the experiments in a spreadsheet. In the subsequent sections, we will refer to sheets from this document. Execution times may not be exactly the same as in the spreadsheet because of the random referencing, but they should be very close and representative of the claims in our paper. A few notes about the document are worth mentioning.
  • All the real datasets are single time series and processed by the damesub.cpp code.
  • Locations of motifs are all indexed from 0, as in the C programming language.
  • The directory named "temp" holds the sorted database.
  • All the times are in seconds unless specified otherwise.
  • All the distances are in z-normalized space.
  • All the Brute Force algorithms use "Early Abandoning".
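The notes on 0-based motif locations, z-normalized distances and early abandoning can be illustrated with a minimal Python sketch. It is not the released C code: the function names znorm, early_abandon_dist and brute_force_motif are hypothetical, and mapping a constant series to all zeros is an assumption made here only to avoid a division by zero.

```python
import math

def znorm(ts):
    """Rescale a series to zero mean and unit standard deviation."""
    n = len(ts)
    mean = sum(ts) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in ts) / n)
    if std == 0:
        return [0.0] * n  # assumption: a constant series maps to all zeros
    return [(v - mean) / std for v in ts]

def early_abandon_dist(a, b, best_so_far):
    """Euclidean distance, or inf as soon as it cannot beat best_so_far."""
    limit = best_so_far ** 2
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total >= limit:
            return math.inf  # abandon: this pair cannot be the motif
    return math.sqrt(total)

def brute_force_motif(database):
    """Return (distance, i, j) for the closest pair, with 0-based indices."""
    normed = [znorm(ts) for ts in database]
    best = (math.inf, -1, -1)
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            d = early_abandon_dist(normed[i], normed[j], best[0])
            if d < best[0]:
                best = (d, i, j)
    return best

database = [[1, 2, 3, 4], [10, 0, 5, 7], [11, 12, 13, 14], [4, 1, 0, 9]]
motif = brute_force_motif(database)
```

In this toy database, series 0 and 2 differ only by an offset, so they become identical after z-normalization and are reported as the motif pair.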
Synthetic Data
The synthetic data we use is a large random walk dataset that is too large to download, so we provide the following MATLAB code snippet, which reproduces the dataset exactly. The dataset contains 20 files of 50,000 random walks each. The random walks are of length 1024. In the code snippet, the files are named "u1.txt", "u2.txt" and so on. Make sure rootDir is set to your desired directory. Although saving the files in ASCII mode wastes space, DAME reads them only once and uses its own temporary database in binary format while running.

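% Seed the legacy generator so that the same random walks are produced every time.
randn('state',[362436069;521288629]);
rootDir = 'u';   % output directory; set this to your desired location
for i=1:80
    % random_walk is a helper function that is not listed on this page.
    c = random_walk(1024,50000);
    % Write file i as plain ASCII text: u1.txt, u2.txt, ...
    save(sprintf('%s\\u%d.txt',rootDir,i),'c','-ascii');
end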
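For readers without MATLAB, the following Python sketch shows the general shape of such data. It does not reproduce the dataset above: the random_walk helper is not listed on this page, so this version assumes that each walk is the cumulative sum of independent standard normal steps, and it uses Python's own random generator and a much smaller count.

```python
import random

def random_walk(length, count, rng):
    """Return `count` walks, each a running sum of `length` Gaussian steps.

    Assumption: standard normal steps starting from 0; the helper used by
    the MATLAB snippet may differ.
    """
    walks = []
    for _ in range(count):
        position = 0.0
        walk = []
        for _ in range(length):
            position += rng.gauss(0.0, 1.0)
            walk.append(position)
        walks.append(walk)
    return walks

rng = random.Random(0)             # fixed seed so repeated runs agree
walks = random_walk(1024, 5, rng)  # 5 walks of length 1024 for illustration
```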
EEG Dataset
IC17 is the Independent Component that we use in the Brain Activity experiment. It consists of 118 epochs. Each epoch contains 640 data points recorded at 256 Hz and spans from -1000 ms to 1500 ms relative to Target presentation. We vectorize these data into a single column and then look for 600 ms (153 data points) motifs. We run DAME five times in succession; after each run, we separate out the subsequences that are inside the hypersphere of radius 2 from the current motif pair. For each of the motif clusters, we go back to the original latencies of the 600 ms subsequences in the 118 epochs.
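Going back from a position in the vectorized column to an epoch and a latency is simple arithmetic. The sketch below assumes that the epochs are concatenated in order and that the first sample of every epoch sits at -1000 ms; the function name locate and the boundary check are illustrative additions, not part of DAME.

```python
EPOCH_LEN = 640          # data points per epoch
SAMPLE_RATE_HZ = 256     # recording rate
EPOCH_START_MS = -1000.0 # latency of the first sample in each epoch
MOTIF_LEN = 153          # about 600 ms at 256 Hz

def locate(index):
    """Map a 0-based start index in the vectorized data to
    (epoch number, latency in ms, whether the subsequence crosses epochs)."""
    epoch, offset = divmod(index, EPOCH_LEN)
    latency_ms = EPOCH_START_MS + offset * 1000.0 / SAMPLE_RATE_HZ
    crosses_boundary = offset + MOTIF_LEN > EPOCH_LEN
    return epoch, latency_ms, crosses_boundary

example = locate(3 * EPOCH_LEN + 282)  # (3, 101.5625, False)
```

Under these assumptions, a subsequence starting 282 points into epoch 3 begins about 100 ms after target presentation. A start offset beyond 487 would make the subsequence run into the next epoch, which the third return value flags.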

The subsequences that are within the hypersphere of radius 2 from Motif 1 have a dense distribution around 100 ms after target presentation (shown in the next plot).


"Tiny Image" Dataset
In this experiment we use the database from "80 million tiny images: a large dataset for non-parametric object and scene recognition" by A. Torralba, R. Fergus and W. T. Freeman, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30(11), pp. 1958-1970, 2008. For the database, please contact the authors of the above paper.

DAME finds 3,836,902 images that have at least one duplicate in the first 40 million images of the above database. DAME also finds 542,603 images that have at least one near-duplicate in the database with distance less than 0.1. We use tiny2ts.m and im2ts.m to convert the image database into databases of time series of length 768 points. Some representative duplicates and motifs are presented here. For the complete list of duplicates and near-duplicates, please email us.
EOG Dataset
We have collected an EOG trace from the Sleep Heart Health Study Polysomnography Database. The full trace is here. DAME finds the following motif of length 4.0 seconds in this trace.

The trace is oversampled: it has a 250 Hz sampling rate and over 8 million data points. We have downsampled the trace to about 1 million points using an 8:1 sampling ratio. Using DAME, we have found the following motif, which comes from times when the patient went back and forth between arousal and sleep stage 1.
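The arithmetic behind the downsampling can be sketched as follows. The page does not say how the 8:1 reduction was performed, so this example assumes plain decimation (keeping every 8th sample with no filtering); it also assumes that the 4.0-second motif length is measured on the downsampled trace.

```python
def downsample(trace, ratio=8):
    """Keep every `ratio`-th sample (assumption: no smoothing filter)."""
    return trace[::ratio]

ORIGINAL_RATE_HZ = 250
RATIO = 8

reduced_rate_hz = ORIGINAL_RATE_HZ / RATIO       # 31.25 samples per second
motif_len_points = round(4.0 * reduced_rate_hz)  # 125 points for 4.0 seconds

short_trace = downsample(list(range(20)), RATIO)  # [0, 8, 16]
```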
Insect Dataset
In this experiment, we have done conditional motif discovery. We have collected two time series, A and B, from two different insects using a specific sensor setup that records insect behavior while the insects suck sap from leaves. We have run DAME on the concatenated time series (AB) in such a way that the resulting motif pair contains subsequences from both A and B. This is an example of generalizing DAME to other definitions of motifs.
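The condition itself can be illustrated with a brute-force Python sketch. This is not how DAME enforces it; the sketch simply restricts the search over AB so that one subsequence lies entirely in A and the other entirely in B. The names cross_motif, znorm and dist are hypothetical, and early abandoning is left out for brevity.

```python
import math

def znorm(seq):
    """Rescale a subsequence to zero mean and unit standard deviation."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / n)
    if std == 0:
        return [0.0] * n  # assumption: constant subsequences map to zeros
    return [(v - mean) / std for v in seq]

def dist(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cross_motif(a, b, m):
    """Closest pair of length-m subsequences, one from a and one from b.

    Returns (distance, i, j) where i and j are 0-based start positions
    in the concatenated series a + b.
    """
    ab = a + b
    boundary = len(a)
    best = (math.inf, -1, -1)
    for i in range(0, boundary - m + 1):            # entirely inside A
        p = znorm(ab[i:i + m])
        for j in range(boundary, len(ab) - m + 1):  # entirely inside B
            d = dist(p, znorm(ab[j:j + m]))
            if d < best[0]:
                best = (d, i, j)
    return best

a = [0, 1, 3, 1, 0, 2, 9, 4]
b = [5, 8, 20, 22, 26, 22, 20, 7]
pair = cross_motif(a, b, 5)
```

In this toy input, the first five points of A reappear in B at a different offset and scale, so the reported pair starts at position 0 (in A) and position 10 of AB (position 2 in B).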

The following figure also shows the motif in the original space. This demonstrates that z-normalization removes the offset errors before the Euclidean distance is computed, which is why the two subsequences of the motif pair can differ so much in their offset from 0.
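This offset invariance follows directly from the definition of z-normalization. As a short derivation (the symbols x, y, a and b are introduced here for illustration only), suppose one subsequence is a shifted and positively scaled copy of the other:

```latex
\hat{x}_i = \frac{x_i - \mu_x}{\sigma_x}, \qquad
y_i = a x_i + b \quad (a > 0)
\;\Longrightarrow\;
\mu_y = a\mu_x + b, \quad \sigma_y = a\sigma_x
\;\Longrightarrow\;
\hat{y}_i = \frac{(a x_i + b) - (a\mu_x + b)}{a\sigma_x}
          = \frac{x_i - \mu_x}{\sigma_x} = \hat{x}_i .
```

The offset b cancels in the numerator, so the Euclidean distance between the two z-normalized subsequences is 0 no matter how far apart they sit in the original space.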
World Cup Dataset
We have collected the access logs of the official website for the 1998 FIFA World Cup. We then extracted the number of requests for the "index.html" page in every minute and generated the minute-access log for "index.html" from the beginning to the end of the World Cup.
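The per-minute aggregation can be sketched as below. The input format is an assumption made for illustration: each request is a (timestamp in seconds, requested path) pair, which is not necessarily how the original logs are stored, and the function name minute_access_log is hypothetical.

```python
from collections import Counter

def minute_access_log(requests, target="index.html"):
    """Count requests for `target` per minute.

    `requests` is an iterable of (timestamp_seconds, path) pairs (assumed
    format). Returns one count per minute, from the first to the last
    minute with a matching request; minutes with no request count as 0.
    """
    counts = Counter()
    for timestamp, path in requests:
        if path.rsplit("/", 1)[-1] == target:
            counts[timestamp // 60] += 1
    if not counts:
        return []
    first, last = min(counts), max(counts)
    return [counts.get(minute, 0) for minute in range(first, last + 1)]

requests = [
    (0, "/index.html"),
    (30, "/index.html"),
    (61, "/images/logo.gif"),
    (125, "/index.html"),
]
series = minute_access_log(requests)  # [2, 0, 1]
```

The resulting list is a single time series with one value per minute, which is the kind of input a subsequence motif search expects.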

We have run DAME on this log and found the motif pair shown in the right column. Here the motif length is more than 24 hours (1500 minutes). The days that popped up as the motif are the last two days of the first round of the World Cup, when 8 teams that still had some chance of moving to the next round played their last first-round matches.
This page was created by
Abdullah Mueen
Department of Computer Science and Engineering,
University of California - Riverside.