Lab 5: The Scheduler
Feb. 7/8, 2007

User Mode Linux
New: Your cs153 directory is /class/cs153/cs153_07win/<login>
  1. Building the Kernel: starting from 2.6.9, an up-to-date UML is included in the kernel itself, so we can build UML without any patch.
    • Do it at school:
      1. Make sure you are in your cs153 directory:
                 cd /class/cs153/cs153_07win/<login>
      2. Untar the kernel source code into your directory:
                 tar zxvf /class/cs153/cs153_07win/lgao/linux-2.6.18-uml.tar.gz
      3. Go to the kernel source directory:
                 cd linux-2.6.18
      4. Compile the kernel:
                 ./compileuml
      5. Run the kernel:
                 ./linux
      6. Now you get the login prompt as root.
    • Do it at home:
      1. Download the 2.6.18 kernel and uncompress it:
                 tar xjvf linux-2.6.18.tar.bz2
      2. Download the config file and save it to the top kernel source directory as .config:
                 cp kernel32-2.6.18.config linux-2.6.18/.config
      3. Compile the kernel:
                 make oldconfig ARCH=um; make linux ARCH=um
      4. Download the filesystem image, uncompress it to the top kernel source directory, and rename it to be root_fs:
                 tar xjvf DSL-2.2-root_fs.bz2; mv DSL-2.2-root_fs linux-2.6.18/root_fs
      5. Run the kernel:
                 ./linux
      6. Now you get the login prompt as root.

  2. Exiting the UML environment:
    • halt or
    • shutdown -h now

  3. Accessing the host filesystem
    • Create the mount point if necessary:
               mkdir /mnt/host
    • Mount the appropriate file system:
               mount -t hostfs /home /mnt/host
    • You should now be able to access your host file system from UML at /mnt/host

  4. Test your own code:
    • In the host OS:
      • modify sched.h and sched.c in your cs153 directory, then compile and run the UML
                 ./compileuml; ./linux
      • write your own test program (create processes, specify the gid, and read /proc), compile it using -static:
                 gcc -static test.c -o test
    • In the guest OS: Make sure you can access the executable in your host filesystem, then run that executable.
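One piece such a test program needs is a way to read how much CPU time each process has received. The following is a minimal sketch, assuming /proc is mounted in the guest and that /proc/<pid>/stat uses the usual field order (utime and stime as the 14th and 15th fields). The helper names parse_stat_line() and cpu_ticks_of() are made up for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Extract utime and stime (in clock ticks) from one line of /proc/<pid>/stat.
 * Returns 0 on success, -1 if the line cannot be parsed.
 */
int parse_stat_line(const char *line, unsigned long *utime, unsigned long *stime)
{
    /* The command name is wrapped in parentheses and may contain spaces,
     * so start parsing after the LAST ')'. */
    const char *p = strrchr(line, ')');

    if (p == NULL)
        return -1;
    /* Skip 11 fields (state through cmajflt); utime and stime come next. */
    if (sscanf(p + 1,
               " %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %lu %lu",
               utime, stime) != 2)
        return -1;
    return 0;
}

/* Read the CPU time (user + system ticks) used so far by `pid`; -1 on failure. */
long cpu_ticks_of(pid_t pid)
{
    char path[64], line[1024];
    unsigned long utime, stime;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(line, sizeof(line), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    if (parse_stat_line(line, &utime, &stime) != 0)
        return -1;
    return (long)(utime + stime);
}
```

A parent process could fork several CPU-bound children, sleep for a few seconds, and then call cpu_ticks_of() on each child to compare the shares they received.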

What is fair-share scheduling?
100% / 3 groups = 33.3% per group
Group 1: (33.3% / 3 tasks) = 11.1% per task
Group 2: (33.3% / 2 tasks) = 16.7% per task
Group 3: (33.3% / 4 tasks) = 8.3% per task
vs. 100% / 9 tasks = 11.1% per task
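The arithmetic above can be written as two small functions. This is only an illustration of the calculation; the function names are invented here and are not part of the kernel.

```c
/* Per-task CPU share (in percent) when the CPU is split evenly among the
 * groups first, and then evenly among the tasks of one group. */
double fair_share_percent(int ngroups, int tasks_in_group)
{
    if (ngroups <= 0 || tasks_in_group <= 0)
        return 0.0;
    return 100.0 / ngroups / tasks_in_group;
}

/* Per-task share when all tasks are treated equally, ignoring groups. */
double per_task_percent(int ntasks)
{
    return ntasks > 0 ? 100.0 / ntasks : 0.0;
}
```

With 3 groups, fair_share_percent(3, 2) gives about 16.7 for a task in the two-task group, while per_task_percent(9) gives about 11.1 for every task.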


Kernel Analysis
  1. Data Structures Used by the Scheduler
    • struct task_struct
      Type                  Name               Description
      long                  state              TASK_RUNNING, TASK_(UN)INTERRUPTIBLE, ...
      int                   prio               dynamic priority, based on static_prio and sleep_avg
      int                   static_prio        static priority
      unsigned long         rt_priority        real-time priority
      unsigned long         policy             SCHED_NORMAL, SCHED_FIFO, SCHED_RR, SCHED_BATCH
      unsigned int          time_slice         ticks left in the time quantum of the process
      unsigned int          first_time_slice   1 if the process has never exhausted its quantum, otherwise 0
      unsigned long         sleep_avg          average sleep time
      unsigned long long    timestamp          time of the last insertion into the runqueue, or of the last context switch that replaced the process
      unsigned long long    last_ran           time of the last context switch that replaced the process
      struct prio_array *   array              pointer to the runqueue's priority array that includes the process
      struct list_head      run_list           pointers to the next and previous elements in the runqueue list to which the process belongs
      gid_t                 gid, egid, sgid    group IDs of the process

    • struct rq
      Type                  Name                  Description
      spinlock_t            lock                  only one task can modify the runqueue at any time
      unsigned long         nr_running            number of runnable tasks in the runqueue
      unsigned long         expired_timestamp     time at which a task last ran out of its time quantum
      unsigned long long    timestamp_last_tick   time of the last scheduler tick
      int                   best_expired_prio     the highest priority of any expired task
      struct task_struct *  curr                  pointer to the currently running process
      struct task_struct *  idle                  pointer to the idle process
      struct prio_array *   active                pointer to the lists of active processes
      struct prio_array *   expired               pointer to the lists of expired processes
      struct prio_array[2]  arrays                the two sets of active and expired processes

    • struct prio_array
      Type                        Name        Description
      unsigned int                nr_active   number of tasks in the array
      unsigned long[5]            bitmap      priority bitmap
      struct list_head[MAX_PRIO]  queue       an array of 140 priority queues (if MAX_PRIO = 140)

    • Question: What does p->array->queue + p->prio refer to, and which tasks does p->run_list point to?
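To experiment with these structures outside the kernel, the following userspace model may help. It is a simplified sketch, not kernel code: struct task is a hypothetical stand-in for task_struct, the bitmap uses one byte per priority instead of packed bits, and the function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PRIO 140

/* Userspace stand-in for the kernel's doubly linked list node. */
struct list_head { struct list_head *next, *prev; };

struct prio_array;

/* Hypothetical, stripped-down task: only the fields discussed above. */
struct task {
    int prio;
    struct prio_array *array;
    struct list_head run_list;
};

struct prio_array {
    unsigned int nr_active;
    unsigned char bitmap[MAX_PRIO];      /* simplified: one byte per priority */
    struct list_head queue[MAX_PRIO];    /* one circular list per priority */
};

void prio_array_init(struct prio_array *a)
{
    a->nr_active = 0;
    for (int i = 0; i < MAX_PRIO; i++) {
        a->bitmap[i] = 0;
        a->queue[i].next = a->queue[i].prev = &a->queue[i];
    }
}

/* Append p to the tail of the list whose head is a->queue + p->prio. */
void model_enqueue(struct task *p, struct prio_array *a)
{
    assert(p->prio >= 0 && p->prio < MAX_PRIO);
    struct list_head *head = a->queue + p->prio;

    p->run_list.prev = head->prev;
    p->run_list.next = head;
    head->prev->next = &p->run_list;
    head->prev = &p->run_list;
    a->bitmap[p->prio] = 1;
    a->nr_active++;
    p->array = a;
}

/* Unlink p and clear the bitmap entry if its priority list became empty. */
void model_dequeue(struct task *p)
{
    struct prio_array *a = p->array;
    struct list_head *head = a->queue + p->prio;

    p->run_list.prev->next = p->run_list.next;
    p->run_list.next->prev = p->run_list.prev;
    if (head->next == head)
        a->bitmap[p->prio] = 0;
    a->nr_active--;
    p->array = NULL;
}

/* Return the first task of the best (numerically lowest) non-empty priority. */
struct task *model_pick_next(struct prio_array *a)
{
    for (int i = 0; i < MAX_PRIO; i++) {
        if (a->bitmap[i]) {
            struct list_head *n = a->queue[i].next;
            return (struct task *)((char *)n - offsetof(struct task, run_list));
        }
    }
    return NULL;
}
```

In this model, tasks with the same prio are chained through their run_list members onto one list, and the head of that list is reached by indexing the queue array with the priority.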

  2. Functions Used by the Scheduler
    • schedule()
    • scheduler_tick()
    • effective_prio()

  3. How is time_slice changed?
    • In sched_fork(), time_slice is shared between parent and child.
    • In scheduler_tick(), time_slice is decremented. If it reaches 0, a new time_slice is calculated according to the task's scheduling policy, and the task may be moved within or between the priority queues.
    • In sched_exit(), when a process exits, its remaining time_slice is returned to its parent.
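The sharing and returning of time slices can be modeled in userspace as follows. This is a simplified sketch of the bookkeeping only: the struct and function names are invented, the cap passed to model_sched_exit() is an assumption standing in for the task's full time slice, and locking and other details of the real kernel functions are omitted.

```c
/* Hypothetical stand-in holding only the two fields involved. */
struct ts_task {
    unsigned int time_slice;        /* ticks left in the current quantum */
    unsigned int first_time_slice;  /* 1 until the first quantum is used up */
};

/* On fork, the parent's remaining ticks are split between parent and child,
 * so creating children does not create extra CPU time. */
void model_sched_fork(struct ts_task *parent, struct ts_task *child)
{
    child->time_slice = (parent->time_slice + 1) >> 1;  /* child gets the rounded-up half */
    child->first_time_slice = 1;
    parent->time_slice >>= 1;                           /* parent keeps the rest */
}

/* On exit, a child still in its first quantum hands its unused ticks back,
 * capped at max_slice (assumed to be the full time slice). */
void model_sched_exit(struct ts_task *parent, const struct ts_task *child,
                      unsigned int max_slice)
{
    if (child->first_time_slice) {
        parent->time_slice += child->time_slice;
        if (parent->time_slice > max_slice)
            parent->time_slice = max_slice;
    }
}
```

For example, a parent with 101 ticks left ends up with 50 after a fork, and the child starts with 51.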

  4. How is static_prio used?
    • The scheduler itself never changes it; it changes only when the nice value of the task is set (set_user_nice()).
    • It is used to calculate the nice value (TASK_NICE(p), TASK_USER_PRIO(p), set_user_nice()), the time slices (task_timeslice()), the interactivity (TASK_INTERACTIVE(p)), and the dynamic priority (__normal_prio()).
    • task_timeslice() calculates the time slice value based on static_prio:
      • if static_prio < 120, it returns (140-static_prio) * 20 milliseconds
      • if static_prio >= 120, it returns (140-static_prio) * 5 milliseconds
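The two cases above can be checked with a small userspace function. It only restates the formula given here; the name timeslice_ms() is made up, and the kernel itself works in ticks rather than milliseconds.

```c
/* Time slice in milliseconds as a function of static_prio (100..139),
 * following the two-case rule stated above. */
int timeslice_ms(int static_prio)
{
    if (static_prio < 120)
        return (140 - static_prio) * 20;
    return (140 - static_prio) * 5;
}
```

For instance, the default static priority 120 (nice 0) yields 100 ms, the highest priority 100 yields 800 ms, and the lowest priority 139 yields 5 ms.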

  5. How is prio (dynamic priority) used?
    • It determines which priority queue a task is added to or removed from:
      Related functions: dequeue_task(), enqueue_task(), requeue_task(), enqueue_task_head()
    • It is calculated based on the static_prio but is modified by bonuses/penalties according to sleep_avg:
      prio = max(100, min(static_prio - bonus + 5, 139))
      Related functions: __normal_prio(), normal_prio(), effective_prio(), recalc_task_prio()
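The clamping in the formula above is easy to misread, so here it is as a small userspace function. The name dynamic_prio() is invented for this sketch, and bonus is assumed to be a value from 0 to 10 derived from sleep_avg.

```c
/* prio = max(100, min(static_prio - bonus + 5, 139)) */
int dynamic_prio(int static_prio, int bonus)
{
    int prio = static_prio - bonus + 5;

    if (prio > 139)   /* never worse than the lowest normal priority */
        prio = 139;
    if (prio < 100)   /* never better than the highest normal priority */
        prio = 100;
    return prio;
}
```

A task with static_prio 120 therefore ends up anywhere between 115 (bonus 10, mostly sleeping) and 125 (bonus 0, CPU-bound).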

  6. likely/unlikely macros: defined in <include/linux/compiler.h>, used for branch prediction.
    if (likely(x)) // equivalent to "if (x)"
    { A; } // A is more probable
    else
    { B; }
    if (unlikely(x)) // equivalent to "if (x)"
    { A; }
    else
    { B; } // B is more probable
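To try the macros in a userspace program, they can be defined on top of the GCC builtin __builtin_expect, which is how the kernel header defines them for GCC. The sketch below assumes GCC or a compatible compiler such as Clang; safe_div() is a made-up example function.

```c
/* Hint to the compiler which outcome of the condition is expected.
 * The value of the condition itself is unchanged. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Divide a by b; the error path is marked as the improbable one. */
int safe_div(int a, int b, int *out)
{
    if (unlikely(b == 0))
        return -1;      /* rare case: division by zero */
    *out = a / b;       /* common case */
    return 0;
}
```

The macros affect only how the compiler lays out the branches; the program's results are the same with or without them.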

  7. HZ/jiffies: used to measure time in Linux.
    • System timers interrupt the processor at a certain frequency.
    • HZ is the number of timer ticks per second, i.e., the frequency of timer interrupts. It is defined in <include/asm/param.h>. On x86 systems it is set to 1000 in the 2.6 kernel, so there are 1000 timer interrupts per second, one every millisecond. In general, n * HZ / 1000 is the number of timer ticks in n milliseconds.
    • jiffies is the number of timer interrupts since the system booted. If HZ is 1000, jiffies is incremented every millisecond, i.e., one jiffy lasts 1 millisecond.
    • In sched.h, MIN_TIMESLICE is defined as max(5 * HZ / 1000, 1), which is 5 milliseconds when HZ is 1000; DEF_TIMESLICE is defined as (100 * HZ / 1000), which is 100 milliseconds.
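The conversions between milliseconds and ticks can be sketched as below. A duration of n milliseconds corresponds to n * HZ / 1000 ticks; the function names are invented, and HZ is passed as a parameter so that different tick rates can be compared.

```c
/* Number of timer ticks in `ms` milliseconds at a tick rate of `hz`. */
long ms_to_ticks(long ms, long hz)
{
    return ms * hz / 1000;
}

/* max(5 * HZ / 1000, 1): 5 ms worth of ticks, but never less than one tick. */
long min_timeslice_ticks(long hz)
{
    long t = 5 * hz / 1000;
    return t > 1 ? t : 1;
}

/* 100 * HZ / 1000: 100 ms worth of ticks. */
long def_timeslice_ticks(long hz)
{
    return 100 * hz / 1000;
}
```

With HZ = 1000 the default time slice is 100 ticks; with HZ = 100 it is 10 ticks, and the minimum time slice is rounded up to a single tick.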

How to implement?
  1. Focus on the scheduler code that is crucial for the assignment and ignore the rest. The files of interest are kernel/sched.c and include/linux/sched.h.
    • Reuse the existing data structures and functions as much as possible, for example dequeue_task(), enqueue_task(), requeue_task(), and enqueue_task_head().
    • Ignore everything between #ifdef CONFIG_SMP and #endif, and between #ifdef CONFIG_SMT and #endif.

  2. Your scheduler should work together with the existing Linux scheduler, so you should add a new scheduling policy: SCHED_GFS (Group-based Fair Sharing).
    • You can set the scheduling policy and the real-time priority of a task via the system call sched_setscheduler() in your test program.
    • You can set the static priority via the system call nice().
    • Other tasks should be scheduled using their default policies.
    • Tip: search for SCHED_BATCH, a policy added in 2.6.16. Processes in this class are scheduled normally, except that they get no "interactivity" bonus when they sleep. Following how SCHED_BATCH is handled shows where to add your own policy.
    • Tip: you can let the SCHED_GFS processes have the same static priority and ignore their dynamic priority, so that they are always put in the same priority queue and your scheduler can make the decision purely based on the time slice.
    • Tip: in the schedule() function, next is the process chosen to run next, and prev is the currently running process that is about to be replaced.
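In a test program, selecting the new policy could look like the sketch below. The numeric value of SCHED_GFS is an assumption (4 is simply the next number after SCHED_BATCH); it must match whatever you define in your modified sched.h. The wrapper set_policy() is a made-up helper around the real sched_setscheduler() call.

```c
#include <sched.h>
#include <sys/types.h>

/* ASSUMPTION: hypothetical policy number; use the value from your own
 * modified include/linux/sched.h. */
#ifndef SCHED_GFS
#define SCHED_GFS 4
#endif

/*
 * Set the scheduling policy and real-time priority of `pid`
 * (0 means the calling process). Returns 0 on success, -1 on error
 * with errno set by sched_setscheduler().
 */
int set_policy(pid_t pid, int policy, int rt_priority)
{
    struct sched_param param = { 0 };

    param.sched_priority = rt_priority;
    return sched_setscheduler(pid, policy, &param);
}
```

A child process would call set_policy(0, SCHED_GFS, 0) right after fork(); on an unmodified kernel that call is expected to fail, since the policy number is unknown there.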

  3. For simplicity, you can assume a fixed number of groups and let each group get an equal amount of CPU time (100 ms).
    • You should calculate the time slice correctly for each task within a group. time_slice is decremented on each timer interrupt, see the scheduler_tick() function.
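One possible way to derive a task's time slice from the group's budget is to divide the group's ticks evenly among its runnable tasks. The sketch below is an assumption about how this could be done, not a prescribed solution; gfs_task_ticks() is a hypothetical helper, and the one-tick minimum is a design choice to keep every task runnable.

```c
/*
 * Ticks granted to each task of a group that owns `group_ms` milliseconds
 * of CPU time and currently has `ntasks` runnable tasks.
 * Returns 0 for an empty group and at least 1 tick otherwise.
 */
unsigned int gfs_task_ticks(unsigned int group_ms, unsigned int ntasks,
                            unsigned int hz)
{
    unsigned int group_ticks = group_ms * hz / 1000;
    unsigned int ticks;

    if (ntasks == 0)
        return 0;
    ticks = group_ticks / ntasks;
    return ticks > 0 ? ticks : 1;
}
```

With HZ = 1000 and a 100 ms group budget, a group of 4 tasks gives each task 25 ticks, and a group of 2 tasks gives each task 50 ticks.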

  4. Use a round-robin scheme to decide which group to choose. Then decide which task to choose within that group. You can also use round-robin or FIFO to choose the task.
    • Search for the real-time scheduling policies SCHED_FIFO and SCHED_RR to get an idea of how FIFO and RR are implemented.
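The group-level round-robin step can be sketched in isolation as follows. NUM_GROUPS, the nr_runnable array, and next_group() are all hypothetical names for this example; how you track the number of runnable tasks per group inside the kernel is up to your design.

```c
/* ASSUMPTION: a fixed number of groups, as allowed by the assignment. */
#define NUM_GROUPS 3

/*
 * Pick the group to run after `current` in round-robin order, skipping
 * groups with no runnable tasks. nr_runnable[g] is the number of runnable
 * tasks in group g. Returns -1 if every group is empty.
 */
int next_group(int current, const int nr_runnable[NUM_GROUPS])
{
    for (int step = 1; step <= NUM_GROUPS; step++) {
        int g = (current + step) % NUM_GROUPS;

        if (nr_runnable[g] > 0)
            return g;
    }
    return -1;
}
```

Because the loop runs for NUM_GROUPS steps, the current group is chosen again when it is the only one with runnable tasks.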

Resources
  1. Online Source Code (2.6.18): Cross-Referencing Linux
  2. Online Book: Understanding the Linux Kernel, 3rd Edition
  3. Lab Manual: pp. 24-31, pp. 155-166
  4. User Mode Linux
  5. The DamnSmallLinux Filesystem Image
  6. The Kernel Source Code
  7. An excellent article: Understanding the Linux 2.6.8.1 CPU Scheduler