CS204 Fall 2008

Learning to Use Traffic Models

Due: 11:59pm Monday November 24, 2008

I. Summary

The original dataset used by Leland et al. to support their groundbreaking work on self-similarity in Ethernet traffic is publicly available on the web through the Internet Traffic Archive. You will apply some of the network traffic analysis methods we covered in lecture to a small subset of these traces, to see first-hand what real data analysis looks like.

II. Description of the Trace files

A. The BC-pAug89 trace file
was collected at the Bellcore Morristown Research and Engineering facility, on the computing lab's Ethernet, which carried primarily local traffic, but also all traffic between Bellcore and the Internet. The trace contains a million packet arrivals seen on the Ethernet. It began at 11:25 on August 29, 1989, and ran for about 3142.82 seconds (until 1,000,000 packets had been captured). One million packets is too many to deal with for a homework assignment, so I have created two smaller files, each with 50,000 sample points, covering two consecutive time intervals.
NOTE: This data was collected on a 10 Mbps Ethernet segment, which uses a transmission rate of exactly 10^7 bits/sec. Also, Ethernet framing includes a mandatory inter-packet gap (96 bit-times, minimum) and a preamble/start-frame delimiter (64 bits, constant), neither of which is counted in the reported packet lengths in these traces. For example, the sample data in the following table indicates that the last packet was transmitted immediately after the second-last packet, whereas the earlier packets followed an idle period on the link.
Timestamp     Length   Transmit Time   Gap (end to previous end)
673.718780
673.720164    1090     8.88E-04        4.96E-04
673.721056    1090     8.88E-04        4.00E-06
673.721944    1090     8.88E-04        3.18E-14
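As a concrete check of the numbers above, here is a minimal Python sketch (function and variable names are illustrative, not required; you may of course work in C/C++, Matlab, etc.) that reproduces the Transmit Time and Gap columns from timestamp/length pairs, assuming each timestamp marks the end of a packet:

```python
# Hypothetical helper for the BC-pAug89 trace: reproduce the Transmit Time
# and Gap columns above.  Assumes each timestamp marks the END of a packet.
LINK_RATE = 1e7       # 10 Mbps Ethernet
PREAMBLE_BITS = 64    # preamble + start-frame delimiter
IFG_BITS = 96         # mandatory minimum inter-packet gap

def transmit_time(length_bytes):
    """Wire time for one packet, including preamble and inter-packet gap."""
    return (8 * length_bytes + PREAMBLE_BITS + IFG_BITS) / LINK_RATE

def idle_gaps(samples):
    """For each packet after the first, the extra idle time on the link
    beyond the packet's own wire time.
    samples: list of (end_timestamp, length_bytes), in time order."""
    return [(t_cur - t_prev) - transmit_time(n_cur)
            for (t_prev, _), (t_cur, n_cur) in zip(samples, samples[1:])]

rows = [(673.718780, 1090), (673.720164, 1090),
        (673.721056, 1090), (673.721944, 1090)]
print(idle_gaps(rows))   # roughly [4.96e-04, 4.00e-06, 0.0]
```

A gap of (essentially) zero means the packet followed its predecessor with only the mandatory 96-bit inter-packet gap, i.e., back-to-back.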

B. The Caida trace file
was collected two years ago by the Cooperative Association for Internet Data Analysis (CAIDA) using a special-purpose hardware monitor connected to a 2.5 Gbps OC-48 SONET link belonging to the tier-1 Internet Service provider MFN. This data set was included in a recent paper about traffic modeling.
NOTE: This data was collected on an OC-48 link, which operates at a raw data rate of 2,488.32 * 10^6 bits/sec. However, Packet-Over-SONET (POS) framing is very complicated, which makes it difficult to decide whether or not two packets really were transmitted "back-to-back" based on their timestamps. First, POS uses the Point-to-Point Protocol (PPP), rather than Ethernet or ATM, as the data link protocol for the SONET connection between two adjacent routers. PPP adds 10 bytes of framing overhead to each IP datagram, which is not counted in the packet sizes reported in the Caida trace. Second, the low-level SONET optical transport imposes a separate 3-dimensional "row/column/layer" tabular structure on the raw bit stream, which repeats according to a fixed schedule that is completely unrelated to the PPP start-packet/end-packet framing. (For more information, read section 4.2 of Stephen Donnelly's thesis for a careful description of the difficulties of timestamping packets travelling over an ATM or POS link.) In other words, SONET partitions the physical-layer data stream into fixed-size "blocks" that contain 160 bytes of overhead columns along with 4160 bytes of payload columns, which reduces the net data rate of the physical link by about 3.7%, down to 2,396.16 * 10^6 bits/sec. Moreover, the 160 bytes of SONET overhead are transmitted all at once, rather than spread evenly across the "block", so the timestamps for two "back-to-back" packets would differ by an amount that only approximates the net-rate transmit time, depending on where the overhead columns happen to fall relative to the packets.
In addition, the discussion in Donnelly's thesis suggests that they prefer to generate timestamps at the start (rather than the end) of each packet event. Thus, in the sample data below we calculate the gap both ways: from the end of this packet to the end of the previous packet (assuming timestamps occur at the end of packet reception), and from the start of this packet to the start of the next packet (assuming timestamps occur at the start of packet reception). Unfortunately, the timestamps only show 6 decimal places, even though the hardware they used can produce nanosecond timestamps with an accuracy of +/- 11 nsec. Nevertheless, the sample data indicates that timestamps at the start of packet arrival provide a better fit to the data, and that the first 3 packets shown here were probably transmitted back-to-back, while the 4th packet is clearly separated from the others by an idle gap.

Timestamp   Length   ------------- Raw rate --------------   ------------- Net rate --------------
                     Transmit    Gap (end to     Gap (front   Transmit    Gap (end to     Gap (front
                     Time        previous end)   to next      Time        previous end)   to next
                                                 front)                                   front)
5.680568    40       1.61E-07    1.08E-05        -1.61E-07    1.67E-07    1.08E-05        -1.67E-07
5.680568    1500     4.85E-06    -4.85E-06       1.45E-07     5.04E-06    -5.04E-06       -4.14E-08
5.680573    1500     4.85E-06    1.45E-07        5.15E-06     5.04E-06    -4.14E-08       4.96E-06
5.680583    1500     4.85E-06    5.15E-06        4.51E-05     5.04E-06    4.96E-06        4.50E-05
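The raw and net transmit times in the table can be reproduced with a short sketch like the following (it assumes 10 bytes of PPP overhead per datagram and a 4160/4320 payload fraction, as described above; the names are illustrative):

```python
# Hypothetical helper for the Caida trace: raw vs. net transmit times,
# following the POS/SONET framing discussion above.
PPP_OVERHEAD_BYTES = 10            # PPP framing added to each IP datagram
RAW_RATE = 2488.32e6               # OC-48 line rate, bits/sec
NET_RATE = RAW_RATE * 4160 / 4320  # after the 160-byte SONET overhead columns

def transmit_times(length_bytes):
    """Return (raw, net) wire times for one packet, PPP overhead included."""
    bits = 8 * (length_bytes + PPP_OVERHEAD_BYTES)
    return bits / RAW_RATE, bits / NET_RATE

print(transmit_times(1500))   # roughly (4.85e-06, 5.04e-06)
print(transmit_times(40))     # roughly (1.61e-07, 1.67e-07)
```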


C. Trace format
Each trace file is in 2-column ASCII format. The first column gives a floating-point timestamp representing the time in seconds since the start of the trace. The second column gives the packet length in bytes. Notice that the times are expressed to 6 places after the decimal point, giving the appearance of microsecond resolution. However, the hardware clock may not be that accurate in all cases.
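A reader for this format might look like the sketch below (the function name is hypothetical, and any of the languages listed in Section IV would do equally well):

```python
# Hypothetical reader for the 2-column trace format: one packet per line,
# "<timestamp in seconds> <length in bytes>".
def read_trace(path):
    samples = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2:      # skip blank or malformed lines
                samples.append((float(fields[0]), int(fields[1])))
    return samples
```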

III. Tasks

  1. Estimate the Network Calculus-style minimal arrival curves for the three segments, say alpha_i(t) for the ith trace segment, which is an upper bound on the cumulative arrivals over any interval of length t from the ith segment of trace data, which we will call A_i(t). Compare the minimal arrival curves for the two BC-pAug89 segments to see how much the data in the trace has changed over this time period. NOTE: Recall that you will need to create a cumulative input function, A_i(t), then perform the sub-additive deconvolution of A_i(t) with itself to find the minimal arrival curve. Since (de-)convolution is an O(n^2) operation for a sequence of length n, you should carry out the calculation using various discrete step sizes (e.g., evaluate A_i(t) only when t is a multiple of 25 seconds, or 5 seconds, or 1 second, rather than at all possible times).
  2. Let us now define gamma_i(t) to be the trailing arrival curve that forms a lower bound to the cumulative arrivals over any interval of length t from the ith segment of trace data. In other words, for every interval (s,t) in the ith segment, we now require that [A_i(t) - A_i(s)] >= gamma_i(t-s).
    a. Modify the expressions for finding the minimal arrival curve to handle the trailing arrival curve.
    b. Modify the program you used in task 1 to find the trailing arrival curves for each of the three segments. Finally, plot the minimal arrival curve and trailing arrival curve on the same graph, to compare their shapes.
    c. What happens to the difference between alpha_i(t) and gamma_i(t) as t increases?
  3. Define a link busy period of size K, for K >= 1, to be an interval of time that starts and ends with an idle period on the link (i.e., the end time of the previous packet is strictly less than the start time of the next packet), and in between there are K consecutive packet transmissions with no intermediate idle periods (i.e., the end time of each packet equals the start time of the next). For each trace, estimate the following:
 P_K = Prob [ randomly chosen packet was sent in a link busy period of size K ] for all 0 < K < 25.
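For Tasks 1 and 2, one possible structure for the computation is sketched below in Python (you may of course use C/C++, Matlab, etc., as Section IV allows). The grid-based cumulative input function and the brute-force O(n^2) deconvolution follow the NOTE in Task 1; the names and the binning convention are illustrative choices, not requirements:

```python
import math

def cumulative_arrivals(samples, step, duration):
    """A[k] = total bytes with timestamp <= k*step (a discretized A_i(t)).
    samples: list of (timestamp, length_bytes)."""
    n = int(duration / step) + 1
    A = [0.0] * n
    for ts, length in samples:
        k = math.ceil(ts / step)      # count the packet at the next grid point
        if k < n:
            A[k] += length
    for k in range(1, n):             # prefix-sum the per-bin totals
        A[k] += A[k - 1]
    return A

def minimal_arrival_curve(A):
    """alpha[k] = max over s of A[s+k] - A[s] (sub-additive deconvolution)."""
    n = len(A)
    return [max(A[s + k] - A[s] for s in range(n - k)) for k in range(n)]

def trailing_arrival_curve(A):
    """gamma[k] = min over s of A[s+k] - A[s] (the Task 2 lower bound)."""
    n = len(A)
    return [min(A[s + k] - A[s] for s in range(n - k)) for k in range(n)]
```

Note that the inner max/min runs over every start index s for every lag k, which is exactly the O(n^2) cost the NOTE warns about; coarser step sizes shrink n.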
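For Task 3, once you have converted each trace entry into (start_time, end_time) form, using the framing rules from Section II and your chosen timestamp convention, the busy-period bookkeeping itself is simple. The sketch below is one way to do it; the tolerance eps and all names are illustrative:

```python
from collections import Counter

def busy_period_sizes(packets, eps=1e-7):
    """Return the size of each link busy period (maximal run of back-to-back
    packets).  packets: list of (start_time, end_time), in time order."""
    sizes, run = [], 1
    for (_, end_prev), (start_cur, _) in zip(packets, packets[1:]):
        if start_cur - end_prev <= eps:   # back-to-back: the run continues
            run += 1
        else:                             # idle gap: the run ends here
            sizes.append(run)
            run = 1
    sizes.append(run)
    return sizes

def packet_fraction_by_size(packets, eps=1e-7, k_max=24):
    """P_K = (packets sent in busy periods of size K) / (total packets)."""
    counts = Counter(busy_period_sizes(packets, eps))
    total = len(packets)
    # a busy period of size K contributes K packets to the numerator
    return {k: counts.get(k, 0) * k / total for k in range(1, k_max + 1)}
```

Because the timestamps have limited precision, the choice of eps matters; it is worth checking how sensitive your {P_K} estimates are to it.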

IV. What to turn in

  1. For the minimal arrival curve, give me a graph (PDF preferred) for each of the three trace segments showing the cumulative input function, A_i(t), together with the minimal arrival curve that you computed with different discrete step sizes. Also, give me a copy of the program you used to generate the minimal arrival curve (C/C++, Matlab, MS Excel, etc.) together with a README file on how to use it.
  2. For the trailing arrival curve, give me the same information as task 1, together with your written answers to questions a - c.
  3. For the link busy period estimation, give me a copy of the program you used to generate the data, together with a table (or graph) showing your results for each data set, where you have separated each data set into chunks of 10,000 samples each. Thus your program should calculate {Pk} for five separate chunks from each segment of the BC-pAug89 data set, and for 10 separate chunks for the Caida data set. How consistent are your results across different data sets and/or different chunks within a data set? What does this tell you about the validity of the results?