CS204 Fall 2007

Group Project

Due: 11:59pm Sunday December 16, 2007

I. Summary

The project will be a non-trivial original piece of work, which is produced by an individual student or by 2  students working together as a team. The project could be structured in any of the following ways: 

The end result should be at the level of a decent workshop paper. I will offer several ideas for projects related to traffic characterization issues we have been covering in lecture. In general, these ideas are about using the archive of network traces in the Wide Project in a variety of different ways. However, you are welcome to select a topic of your own choosing if you prefer. Please let me know ASAP when you have chosen a topic so I can offer suggestions on how to proceed before you (attempt to) do all the work.

II. Some Sample Topics

(1) Study the feasibility of network calculus (as an alternative to trace-driven simulation) for estimating queueing delays at a backbone router port.
    Earlier in the quarter, we covered how to estimate queueing delays at a router port using a simple M/M/1 queueing model based on Kleinrock's Independence Assumption applied to Jacksonian product-form queueing networks. Many people have dismissed the results obtained from such analytical models because packets don't randomly change their size as they cross the network (contradicting Kleinrock's assumption), and network measurements have shown that packet interarrival times don't fit a simple exponential model (contradicting the product-form results). The preferred alternative is trace-driven simulation, where you read the sequence of packet arrival times / packet length pairs from the trace file and use them to represent the arrival time and service times for each packet in your simulator. For each packet, you can then calculate its waiting time, as the difference between its arrival time from the trace file and the end of its transmission on the output link, and tabulate mean, variance and distribution of waiting times over all packets served in one run of the simulator.
    Please note that in general a network trace records the end-of-transmission time for every packet leaving a router port, rather than the time of its arrival to the port's transmit queue. Hence, the minimum time delay between consecutive packets in the trace is the transmit time for the second packet, so all queueing delays have been eliminated from the system. Thus, you need to overlay multiple traces (which might represent packets entering the router from different ports that are all routed to "our" chosen output port, as discussed in detail in section IV.e of this technical report.
    Think of the raw network trace data as a representation of the cumulative arrival process, A(t), from the network calculus approach. When A(t) is fed through the trace-driven simulator of a network port, the result is basically the same as forming the min-plus convolution of A(t) and the obvious linear service curve, beta(t) obtained from the data rate of the output port. But what happens if we apply network calculus to obtain the minimal arrival curve, alpha(t), and work with alpha(t) instead of A(t)? The delay theorem says that the maximum delay in the original system cannot be larger than the maximum delay in the modified system using alpha(t) instead of A(t) as the input. But what about the rest of the information in the graph of the input to the modified system, alpha(t), and the corresponding output, {alpha convolved with beta}(t)? In particular, can you find an interpretation for the average separation, or the distribution of the separation, between the two curves? How does the magnitude of the delay information obtained from network calculus compare with the corresponding delay information obtained from the trace-driven simulation?


(2) Design and implement an efficient online algorithm for generating minimum arrival curves to be used as a network monitoring tool.
    Imagine that your job is to build a tool that will help
network administrators visualize how the traffic on some critical link changes over time. Currently, most network monitoring systems (such as the public domain MRTG) are basically just a graphical display of the traffic volume as a function of time, i.e., using the terminology of Network Calculus, they provide a "moving window" view of a(t) as a function of t showing, for example, the amount of traffic that arrived during consecutive 30-second time intervals during the past 15 minutes.

Your task is to provide a method for efficiently converting raw network measurement data into the form of a cumulative arrival curve, alpha(t), that is displayed as a similar time-varying graphical image as the existing tools. Because of the nature of cumulative arrivals, I recommend that you aim for a log-log plot showing alpha(t) as a function of t, so that both fine details (fractions of a millisecond) and long-term trends (minutes) are visible on a single graph. Moreover, since a straight line through the origin still looks like a straight line on a log-log graph, a "well-behaved" system (i.e., smooth, fluid-flow at some fixed average rate) will be easy to recognize from the shape of the graph.
In addition to the minimal arrival curve, your program should also compute a lower bound to the cumulative arrivals, delta(t), which is defined as the minimum (rather than maximum) of all intervals of length t.
    In MRTG, the entire graph represents a "moving window" view of an infinitely long measurement experiment, in which you only get to see a fixed number of the most recent samples.  For example, if the "moving window" displays network activity during the last 15 minutes in 30-second increments, then once every 30 seconds, the graph is updated by dropping the oldest sample, shifting the remaining samples down by one, and adding the new sample to the front. Providing the same functionality in a plot of alpha(t) would be much harder, because a point alpha(x), representimg the highest amount of traffic during any interval of length "x" within the 15-minute moving window at its current position, could represent activity that occurred anywhere within the current window. Thus, it is easy to decide when to increase the value of alpha(x) (i.e., an interval of length x starting from the beginning of the newly added sample has more arrivals than the current value of alpha(x)), knowing when to reduce alpha(x) (and by how much!) is much harder because you need to know when the current interval defining alpha(x) falls out the end of the sliding window, and then what is the next-best value that should be used to replace it. The important part of the problem is to develop an efficient "online" algorithm for updating the set of alpha(t) values each time the moving window advances by one sample. In particular, after each time step, one new candidate becomes eligible (which terminates at the end of the new time window) and one previously-eligible candidate is no longer valid (which starts as the beginning of the previous time window).
    A simpler alternative to online updating of alpha(t) would be to periodically recalculate the entire function. Going back to the example above, where MRTG is displaying data about network activity in the most-recent 15 minute period, with updates happening every 30 seconds, you could recompute the curve once every 5 minutes and include the most recent and, say, 5 previous versions of alpha(t) so you can see how the shape of the curve has been changing over time. In either case, I suggest you design your program to pull data from the WIDE archive, since it is easily available.
    Once you have gotten your graphical presentation working, it is time to do some experiments to see how it could be used by a network administrator. Look at a number of images of "normal" activity and see what features they have in common. Now inject various types of "problems" into the trace data, such as traffic overload due to a denial-of-service attack (eg, superimpose muliple traces to increase the load), link failure (chop out a piece of the trace), etc. How easy are those graphs to interpret? How hard are they to compute? Would you recommend using them to a friend? Is there any value to displaying high-resolution information (remember the log;-log scale allows you to easily display information down to sub-millisecond levels) that is too noisy to display on an MRTG graph, where 30-second averages are as fine as they go?

(3) Use the WIDE data archive to study long-range dependence.
   
Recall that in our discussion of self-similarity and long-range dependence in traffic models, I pointed out the practical problems of trying to measure the Hurst exponent (H) from a network trace file. Obviously, larger values of the Hurst exponent mean that the data in trace file is more strongly correlated, and hence we need a much longer trace file to get a meaningful estimate for H.
    More specifically, since the autocorrelation coefficient, rho(k), for self-similar traffic has the form k(-beta)*C, if we want the correlation between to samples to be at most P, then those samples must be separated by (C/P)
(-1/beta). Thus if we assume P=0.01 and H=.9 then this expression tells us we shouldn't treat two samples from the trace file as independent unless they are separated by a distance of 1010 , which is really difficult!
    Fortunately, the nature of the WIDE archive makes it possible to achieve this by using traces from different days for independent replications of the measurement-of-H experiment. In this way, you can generate multiple sample values for H, and apply the standard results from elementary statistics to find confidence intervals. Try it, and see how large are the confidence intervals in comparison
to the measured value of H.  NOTE: in the formulas for estimating the Hurst exponent we talked about in lecture, the difference from the mean shows up as part of the computation of the variance and/or covariance terms. The mean here is supposed to be the "ideal" mean "mu", not the our sample-mean "guess" X_bar (n). Thus, in your calculations you should be sure to calculate the overall mean across all data from different days first, then plug it into the computation of ACF or aggregate variance for each repetition of the experiement. That is likely to make your (co)variance numbers get bigger, and hence the decay slope more flat.  Look at this issue carefully and try to make sense of this.

    In addition to using data from different days as independent replications of the calculation of the Hurst exponent over relatively short time scales, you can also use the WIDE trace to estimate H from very large time scales (i.e., days, weeks, months). If the network data is truly self-similar (as opposed to merely long-range dependent), then we should get a similar value for H across these very large time scales compared to what we get from short time scales. How consistent is the estimate of H from real data?

(4) See how queue inferencing could be used to estimate the packet waiting times inside the router by analysing the interdeparture times in traffic leaving the router.

   
As I mentioned in class, queue inferencing is a technique for estimating the waiting times that customers experience in a queue by looking only at the sequence of departures. The technique originated in operations research literature, with application to placement of automated teller machines by banks who only had access to the log of customer transactions at the ATM. It was later adapted to network monitoring by Manjunath and myself in this paper. The implementation of the algorithm is a bit messy, but it could be applied to any network trace, such as the ones at the WIDE project. If you are interested in this, I would be happy to talk to you off line.