CS204 Fall 2005
Learning to Use Traffic Models
Due: 11:59pm Friday November 18, 2005
I. Summary
The original dataset used by Leland et
al. to support their groundbreaking work on self-similarity in
Ethernet traffic is publically available on the web through the Internet Traffic Archive.
You will be applying some of the network traffic analysis methods we
covered in lecture to a small subset of these traces to see first-hand
what real data analysis looks like
II. Description of the Trace files
A. The BC-pAug89 trace file
was collected at the Bellcore Morristown Research and
Engineering facility, on the computing lab's Ethernet, which
carried primarily local traffic, but also all traffic between
Bellcore and the Internet. These traces each contain a million packet
arrivals seen on the Ethernet. The trace began at 11:25 on August 29,
1989, and ran for
about 3142.82 seconds (until 1,000,000 packets had been
captured). One million packets is too much to deal with for a homework
assignment,
so I have created two smaller files, each with 50,000 sample points,
covering two consecutive time intervals
- segment1 covers the period from 673
seconds
to 824 seconds since the start of the trace
- segment2 covers the period from 824
seconds
to 961 seconds since the start of the trace
B. The Caida trace file
was collected two years ago by the Cooperative Association for Internet
Data Analysis (CAIDA) using a special-purpose hardware monitor
connected to a 2.5 Gbps OC-48 SONET link belonging to the tier-1
Internet Service provider MFN. This data set was included in a recent paper about traffic
modeling.
- caida_trace contains a short
segment from this trace with 100,000 sample points.
C. Trace format
Each trace file is in 2-column ASCII format. The first column gives a
floating-point time stamp representing the time in seconds since
the start of a trace. The second column gives the packet
length in bytes. Notice that the times are expressed to 6 places
after the decimal point, giving the appearance of microsecond
resolution. However, the hardware clock may not be that accurate in all
cases.
III. Tasks
- Estimate the Network
Calculus-style minimal arrival
curves for the three segments, say alphai(t)
for the ith trace segment,
which is an upper bound to the
cumulative arrivals over any interval of length t from the ith segment of trace data, which
we will call Ai(t). Compare the minimal arrival
curves for the two segments to see how much the data in the trace has
changed over this time period. NOTE:
Recall that you
will need to create a cumulative input function, Ai(t),
then perform the sub-additive deconvolution of Ai(t)
with itself to find the minimal arrival curve. Since (de-)convolution
is an O(n2) operation for a
sequence of length n, you
should carry out the calculation using various discrete step sizes
(eg., look Ai(t) only when t is a multiple of 25 seconds, or
5
seconds, or 1 second, say, rather than all possible times, say).
- Let us now define gammai(t)
to be the trailing
arrival curve that forms a lower bound to the cumulative
arrivals over any interval of length t from the ith segment of trace data. In
other words, for every interval (s,t) in the ith segment, we now require
that [Ai(t) - Ai(s)] <= gammai(t-s).
- Modify the
expressions for finding the minimal arrival curve to handle the
trailing arrival curve.
- Modify the
program you used in task 1 to find the trailing arrival curves for each
of the two segments. Finally, plot the minimal arrival curve and
trailing arrival curve on the same graph, to compare their shapes.
- What happens to
the difference between alphai(t) and gammai(t)
as t increases?
- Use at
least three of the SELFIS estimators to calculate the Hurst parameter,H, for this data. Recall that the
analysis of self-similarity and/or long-range dependence assumes
stationarity in the data set, so this time you must work with a time
series of incremental, rather
than cumulative values. In
other words, you need to work with the input during the jth fixed-length interval, aj
=A(j·c) - A((j-1)·c), rather than the cumulative
input up to time j·c. This requires you to partition
the trace into a sequence of "bins", each with a fixed length of c seconds. This requires some
caution, especially when c is
small (10 msec or 100 msec, say), because all the bits that belong to
large packet do not arrive instantaneously! Since the data rate
for a 10 Mbps Ethernet is 107 bits/sec, the elapsed time
between the start and end of transmission for a 1500-byte packet is
(12,000 bits)/(107 bits/sec) = 1.2 msec. However, since
there
is only one timestamp given for each packet, you need to determine
whether this represents the time at which its first or last bit was
transmitted, and if the transmission interval crosses the boundary
between "bins" how many of the bits should be credited to each of them.
(HINT: Look in the trace files for closely-spaced packet arrivals and
see if you can find places where one interpretation makes sense and the
other is impossible. For example, does the start of transmission for
the (n+1)st packet seem
to happen before the end of transmission for the nth packet?)
IV. What to turn in
- For
the minimal arrival curve, give me a graph (PDF preferred)
for each of the three trace segments showing the cumulative input
function, Ai(t),
together with the minimal arrival curve that you computed with
different discrete step sizes. Also, give me a copy of the program you
used to generate the minimal arrival curve (C/C++, matlab, MS excel,
etc) together with a README file on how to use it.
- For
the minimal arrival curve, give me the same information as task
1, together with your written answers to questions a - c.
- For
the Hurst parameter estimation, give me a README file that
explains: whether the timestamps mark the beginning or end of each
packet and how you determined this; which estimators you used; the
value(s) of H you found using
each method;
sample input/output files and instructions for generating the results
using SELFIS. Be sure to test the estimators on the full 100,000 sample
dataset along with several smaller datasets to see how stable the
estimate of H is over time.
You should also experiment with several choices for the "bin" size, c.