CS177_W16_Project2

CS177 Project Part 2: Data Collection and Distribution Selection

In this part of the project, I want you to do the following:

Collect some empirical data about passenger loading and unloading times from "the real world"
Find a suitable theoretical distribution family for representing this data and choose its parameters via the MLE method
Test your theoretical distribution against the empirical data for "goodness of fit"

1. Data collection

Choose one or more locations in Riverside that are popular locations for passenger pickup/dropoff, such as:

The curb in front of the Material Science & Engineering building along Aberdeen Drive
The bus stop area in front of Sproul Hall along West Campus Drive
The main entrance to the Mission Inn Hotel in downtown Riverside along Mission Inn Avenue
The main entrance to the Riverside Medical Clinic building at 7117 Brockton Avenue
Some other location(s) of your own choosing. I recommend places like mall entrances, health care facilities, restaurants, movie theaters, etc. The Metrolink station in downtown Riverside would be a great location IF you pay attention to the train schedule.

Plan to spend at least one hour doing data collection, although it doesn't need to be a single block of time. It is important to do it when there is a significant amount of activity at the target location. NOTE: you can share raw data with your classmates, so it is a good idea to coordinate your schedules so different people aren't taking the same measurements at the same time. (Why bother?)

For each vehicle, you will want to record information such as the following:

its arrival time at the target location
whether its purpose was to pick up or drop off passengers
the number of passengers who entered/left the vehicle
whether the driver was one of the passengers (sometimes the driver gets out of his/her car and goes into campus, and the passenger shifts into the driver's seat and takes the car away)
whether there were any packages or luggage involved, or just people
whether any of the people were elderly or disabled (requiring assistance from others, or needing crutches, a cane, a walker, a wheelchair, etc.)
its departure time from the target location
whether the vehicle waited at the curb for a long time until its passengers arrived, rather than swooping in to pick up a passenger already waiting at the rendezvous site (figure out something sensible to do about these cases)
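
One way to keep your field notes consistent is to log every vehicle as a fixed record. The sketch below is only a suggestion: the `VehicleObservation` class, its field names, the "HH:MM:SS" time format, and the `waited_for_passenger` flag are all assumptions made for illustration, not a required format. It also assumes arrival and departure fall on the same day.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import datetime


@dataclass
class VehicleObservation:
    """One vehicle at the curb (hypothetical record layout)."""
    location: str
    arrival: str               # wall-clock time, "HH:MM:SS"
    departure: str             # wall-clock time, "HH:MM:SS"
    purpose: str               # "pickup" or "dropoff"
    passengers: int            # number who entered or left the vehicle
    driver_swapped: bool       # driver got out and a passenger drove away
    luggage: bool              # any packages or luggage involved
    needs_assistance: bool     # elderly or disabled passenger present
    waited_for_passenger: bool # vehicle sat at the curb before the pickup

    def dwell_seconds(self) -> float:
        """Time spent at the curb, assuming both times are on the same day."""
        fmt = "%H:%M:%S"
        arrived = datetime.strptime(self.arrival, fmt)
        departed = datetime.strptime(self.departure, fmt)
        return (departed - arrived).total_seconds()


def to_csv(observations) -> str:
    """Serialize observations to CSV text so they can be shared with classmates."""
    buffer = io.StringIO()
    names = [f.name for f in fields(VehicleObservation)]
    writer = csv.DictWriter(buffer, fieldnames=names)
    writer.writeheader()
    for obs in observations:
        writer.writerow(asdict(obs))
    return buffer.getvalue()
```

Keeping the `waited_for_passenger` flag separate lets you decide later whether to exclude those vehicles or model them as their own category, instead of losing the information in the field.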

Ideally, you would like hundreds of data samples, so that when you split up the data into categories (pickup vs. dropoff, how many passengers involved, etc.) you still have plenty of data for each specific category. In particular, remember that some of the "goodness of fit" tests do NOT allow you to use the same set of empirical data for MLE parameter fitting and testing, so you really need twice as much data as you think you do.
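
To keep the fitting data and the testing data separate, you can shuffle the samples once and cut them in half. This is a minimal sketch; the function name `split_fit_test` and the fixed `seed` are assumptions chosen so the split is reproducible.

```python
import random


def split_fit_test(samples, seed=0):
    """Randomly split samples into a fitting half and a testing half."""
    shuffled = list(samples)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

Shuffling before splitting matters here: if you simply took the first half of your log, the two halves could differ by time of day rather than by chance.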

2. Choosing a theoretical distribution

Apply the techniques we talked about in lecture to pick a suitable theoretical distribution family, then use Maximum Likelihood to optimize the parameters of the distribution. Since you are allowed to share data with other students, it is instructive to see if the same theoretical distribution works equally well for data collected by different individuals, or obtained from different locations.
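
As an illustration of the MLE step, the sketch below fits two candidate families that happen to have closed-form estimators: the exponential and the lognormal. These two families are assumptions picked only because their MLEs are easy to write down; your data may call for something else (gamma, Weibull, etc.), in which case the parameters usually have to be found numerically. Samples are assumed to be strictly positive durations in seconds.

```python
import math


def fit_exponential(samples):
    """MLE of the exponential rate: 1 / (sample mean)."""
    return len(samples) / sum(samples)


def exponential_log_likelihood(samples, rate):
    """Sum of log(rate * exp(-rate * x)) over all samples."""
    return len(samples) * math.log(rate) - rate * sum(samples)


def fit_lognormal(samples):
    """MLE of (mu, sigma): mean and (biased) std. dev. of log(x)."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_log_likelihood(samples, mu, sigma):
    """Sum of the log of the lognormal density over all samples."""
    total = 0.0
    for x in samples:
        z = (math.log(x) - mu) / sigma
        total += -math.log(x * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z
    return total
```

Comparing the two log-likelihoods on the same samples gives a rough first indication of which family describes the data better, but it does not replace the goodness-of-fit tests in the next section.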

3. Test for "goodness of fit" using (at least) Chi-Squared and Kolmogorov-Smirnov (K-S) tests

How many distributions do you need to model the data? Does one size fit all? Or do you need to change parameters for each location? What about the type of passengers, presence of luggage, etc.? Is this something that can be modeled as an independent random selection, or is there evidence that these parameters are correlated (e.g., a car with many passengers is probably followed by another car with many passengers)?
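
The sketch below shows one way to compute the two test statistics and a simple check for correlation between consecutive vehicles. It is an illustration under stated assumptions, not the required method: the function names are made up, the Chi-Squared test uses equiprobable bins under the fitted CDF, and the exponential CDF is included only as an example of the `cdf` argument. You still have to compare the statistics against critical values: for Chi-Squared the degrees of freedom are (number of bins) - 1 - (number of estimated parameters), and for K-S with a fully specified distribution and large n the 5% critical value is approximately 1.36 / sqrt(n).

```python
import math


def exponential_cdf(x, rate):
    """Example CDF to pass in; any fitted CDF with the same shape works."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and the theoretical CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def chi_squared_statistic(samples, cdf, bins):
    """Chi-Squared statistic with bins that are equiprobable under cdf."""
    n = len(samples)
    observed = [0] * bins
    for x in samples:
        # cdf(x) is in [0, 1]; scale it to a bin index and clamp the top edge.
        index = min(int(cdf(x) * bins), bins - 1)
        observed[index] += 1
    expected = n / bins
    return sum((o - expected) ** 2 / expected for o in observed)


def lag1_autocorrelation(values):
    """Correlation between each value and the next one in arrival order."""
    n = len(values)
    mean = sum(values) / n
    variance_sum = sum((v - mean) ** 2 for v in values)
    covariance_sum = sum(
        (values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1)
    )
    return covariance_sum / variance_sum
```

For example, `ks_statistic(test_half, lambda x: exponential_cdf(x, rate))` evaluates a fitted exponential against held-out data, and `lag1_autocorrelation` applied to the passenger counts in arrival order gives a value near zero when consecutive vehicles look independent.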

4. Final thoughts

What do you think your data tell you about how to handle a much larger-scale pickup/dropoff point? Suppose that, instead of one car at a time, you have a location (such as an airport terminal) where a dozen cars are trying to pick up and drop off passengers simultaneously. Which features from your data do you think will be preserved in this larger system? Which ones are going to change dramatically?