CS177 Project Part 2: Data Collection and Distribution Selection

In this part of the project, I want you to do the following:
  1. Collect some empirical data about passenger loading and unloading times from "the real world"
  2. Find a suitable theoretical distribution family for representing this data and choose its parameters via the MLE method
  3. Test your theoretical distribution against the empirical data for "goodness of fit"

1. Data collection

Choose one or more locations in Riverside that are popular for passenger pickup/dropoff, such as:
Plan to spend at least one hour doing data collection, although it doesn't need to be a single block of time. It is important to do it when there is a significant amount of activity at the target location. NOTE: you can share raw data with your classmates, so it is a good idea to coordinate your schedules so different people aren't taking the same measurements at the same time. (Why bother?)

For each vehicle, you will want to record information such as the following:
Ideally, you would like hundreds of data samples, so that when you split up the data into categories (pickup vs. dropoff, how many passengers involved, etc.) you still have plenty of data for each specific category. In particular, remember that some of the "goodness of fit" tests do NOT allow you to use the same set of empirical data for both MLE parameter fitting and testing, so you really need twice as much data as you think you do.
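
One way to keep the fitting data and the testing data separate is a random 50/50 split. The sketch below is only an illustration: the function name split_for_fit_and_test and the sample values are made up, not part of the assignment or real measurements.

```python
import random

def split_for_fit_and_test(samples, seed=0):
    """Randomly split samples into two halves: one for MLE fitting, one for testing."""
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical service times in seconds (made-up numbers for illustration).
times = [12.5, 30.1, 8.4, 45.0, 22.3, 17.8, 60.2, 9.9, 27.4, 33.6]
fit_set, test_set = split_for_fit_and_test(times)
print(len(fit_set), "samples for fitting,", len(test_set), "samples for testing")
```

A random split (rather than "first half, second half") avoids putting all of one observation session into the same set.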

2. Choosing a theoretical distribution

Apply the techniques we talked about in lecture to pick a suitable theoretical distribution family, then use Maximum Likelihood to optimize the parameters of the distribution. Since you are allowed to share data with other students, it is instructive to see if the same theoretical distribution works equally well for data collected by different individuals, or obtained from different locations.
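
As a minimal sketch of the Maximum Likelihood step, the code below fits two candidate families that happen to have closed-form MLEs. The choice of exponential and lognormal is an assumption made for illustration, not a statement of which families were covered in lecture or which one fits your data; the function names and sample values are likewise made up.

```python
import math

def fit_exponential(samples):
    """MLE of the exponential rate: n / sum(x)."""
    return len(samples) / sum(samples)

def exponential_loglik(samples, rate):
    """Log-likelihood of the samples under Exponential(rate)."""
    return len(samples) * math.log(rate) - rate * sum(samples)

def fit_lognormal(samples):
    """MLE of lognormal (mu, sigma): mean and (biased) std dev of log(x)."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)  # divide by n, not n - 1
    return mu, math.sqrt(var)

def lognormal_loglik(samples, mu, sigma):
    """Log-likelihood of the samples under Lognormal(mu, sigma)."""
    total = 0.0
    for x in samples:
        z = (math.log(x) - mu) / sigma
        total += -math.log(x * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z
    return total

# Hypothetical service times in seconds (made-up numbers for illustration).
fit_times = [12.5, 30.1, 8.4, 45.0, 22.3, 17.8, 60.2, 9.9, 27.4, 33.6]

rate = fit_exponential(fit_times)
mu, sigma = fit_lognormal(fit_times)
print("exponential: rate =", rate, "loglik =", exponential_loglik(fit_times, rate))
print("lognormal: mu =", mu, "sigma =", sigma,
      "loglik =", lognormal_loglik(fit_times, mu, sigma))
```

A higher log-likelihood is only a rough guide when the families have different numbers of parameters; the goodness-of-fit tests are what actually decide whether a fit is acceptable.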

3. Test for "goodness of fit" using (at least) the Chi-Squared and K-S tests

How many distributions do you need to model the data? Does one size fit all? Or do you need to change parameters for each location? What about the type of passengers, presence of luggage, etc.? Is this something that can be modeled as an independent random selection, or is there evidence that these parameters are correlated (e.g., a car with many passengers is probably followed by another car with many passengers)?
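
The test statistics themselves are short to compute by hand. The sketch below assumes an exponential fit purely for illustration; the function names, the rate value, and all data are made up. It computes the K-S statistic D, a Chi-Squared statistic over k equiprobable bins, and a lag-1 autocorrelation as one simple check on whether consecutive cars look independent.

```python
import math

def exponential_cdf(x, rate):
    """CDF of Exponential(rate)."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def ks_statistic(samples, cdf):
    """K-S statistic D: largest gap between empirical and theoretical CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def chi_squared_statistic(samples, cdf, k):
    """Chi-Squared statistic using k bins of equal probability under cdf."""
    observed = [0] * k
    for x in samples:
        idx = min(int(cdf(x) * k), k - 1)  # the CDF value picks the bin
        observed[idx] += 1
    expected = len(samples) / k            # same expected count in every bin
    return sum((o - expected) ** 2 / expected for o in observed)

def lag1_autocorrelation(seq):
    """Sample correlation between each value and the one that follows it."""
    n = len(seq)
    mean = sum(seq) / n
    var = sum((v - mean) ** 2 for v in seq)
    cov = sum((seq[i] - mean) * (seq[i + 1] - mean) for i in range(n - 1))
    return cov / var

# Hypothetical held-out service times in seconds (made up for illustration).
test_times = [5.2, 14.8, 21.0, 33.5, 9.7, 48.1, 17.3, 26.4]
rate = 0.04  # assumed to come from an MLE fit on a separate set of samples
cdf = lambda x: exponential_cdf(x, rate)
print("K-S D:", ks_statistic(test_times, cdf))
# k=4 with 8 samples is far too few per bin for a real test; it only shows the mechanics.
print("Chi-Squared:", chi_squared_statistic(test_times, cdf, k=4))
# Hypothetical passenger counts per car, in arrival order.
print("lag-1 autocorrelation:", lag1_autocorrelation([1, 1, 2, 4, 3, 1, 1, 2]))
```

Compare D against a K-S table (for large n at the 5% level the critical value is roughly 1.36/sqrt(n), provided the parameters were not estimated from the same samples), and compare the Chi-Squared statistic against a chi-squared table with the appropriate degrees of freedom. A lag-1 autocorrelation near zero is consistent with independence; a clearly positive value suggests that similar cars tend to arrive back to back.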

4. Final thoughts

What do you think your data tells you about how to handle a much larger scale pickup/dropoff point? Instead of one car at a time, suppose you have a location (such as an airport terminal) where a dozen cars are trying to pick up and drop off passengers simultaneously. Which features from your data do you think will be preserved in this larger system? Which ones are going to change dramatically?