Contents
  1. 1. BSD Processes
  2. 2. Mach Tasks
  3. 3. Process Scheduling
    1. 3.1. Mach Thread Scheduling
  4. 4. Mach Scheduler Features
    1. 4.1. Handoff
    2. 4.2. Continuations
    3. 4.3. Asynchronous Software Traps (AST)
    4. 4.4. Context Switches
    5. 4.5. Priorities
    6. 4.6. Priority Shifts
    7. 4.7. Run Queues
    8. 4.8. Wait Queues
    9. 4.9. CPU Affinity

  macOS and Linux are both Unix-like systems. Unlike the open-source Linux, macOS itself is not open source, but its Unix-like core, Darwin, is: its source code can be downloaded from GitHub, which lets us look under the hood.
  Darwin is built on XNU, which is macOS's actual kernel, playing the role the Linux kernel plays on Linux. The difference is that Linux is a monolithic kernel, while XNU is a hybrid kernel built around a microkernel; the two figures below show the contrast.

[Figure: monolithic kernel architecture]
[Figure: XNU hybrid kernel architecture]

 XNU is a hybrid kernel built from the Mach kernel and BSD components; these two kernels are the main actors in process scheduling. To introduce XNU, here is the description quoted from Apple's GitHub page:

What is XNU?

XNU kernel is part of the Darwin operating system for use in macOS and iOS operating systems. XNU is an acronym for X is Not Unix. XNU is a hybrid kernel combining the Mach kernel developed at Carnegie Mellon University with components from FreeBSD and a C++ API for writing drivers called IOKit. XNU runs on x86_64 for both single processor and multi-processor configurations.

 In practice, Mach sits at the lower layer of XNU with BSD built on top of it, and the interfaces XNU exposes to user space are mostly BSD's. How does XNU schedule processes under this structure? To answer that, we first need the notions of a BSD process and a Mach task.

BSD Processes

 A BSD process maps one-to-one onto a Mach task, but it carries much richer information than the basic scheduling and accounting state a Mach task provides. A BSD process holds file descriptors and signal handlers, and supports a full genealogy linking a process to its parent, siblings, and children. BSD keeps most of a process's attributes in struct proc. Amusingly, the BSD source contains the following "diagram":

```c
/*
 * vfork
 *
 * Description:	vfork system call
 *
 * Parameters:	void			[no arguments]
 *
 * Retval:	0			(to child process)
 *		!0			pid of child (to parent process)
 *		-1			error (see "Returns:")
 *
 * Returns:	EAGAIN			Administrative limit reached
 *		EINVAL			vfork() called during vfork()
 *		ENOMEM			Failed to allocate new process
 *
 * Note:	After a successful call to this function, the parent process
 *		has its task, thread, and uthread lent to the child process,
 *		and control is returned to the caller; if this function is
 *		invoked as a system call, the return is to user space, and
 *		is effectively running on the child process.
 *
 *		Subsequent calls that operate on process state are permitted,
 *		though discouraged, and will operate on the child process; any
 *		operations on the task, thread, or uthread will result in
 *		changes in the parent state, and, if inheritable, the child
 *		state, when a task, thread, and uthread are realized for the
 *		child process at execve() time, will also be effected.  Given
 *		this, it's recemmended that people use the posix_spawn() call
 *		instead.
 *
 * BLOCK DIAGRAM OF VFORK
 *
 * Before:
 *
 *     ,----------------.         ,-------------.
 *     |                |  task   |             |
 *     | parent_thread  | ------> | parent_task |
 *     |                | <.list. |             |
 *     `----------------'         `-------------'
 *       uthread |  ^             bsd_info |  ^
 *               v  | vc_thread            v  | task
 *     ,----------------.         ,-------------.
 *     |                |         |             |
 *     | parent_uthread | <.list. | parent_proc |  <-- current_proc()
 *     |                |         |             |
 *     `----------------'         `-------------'
 *       uu_proc |
 *               v
 *              NULL
 *
 * After:
 *
 *           ,----------------.         ,-------------.
 *           |                |  task   |             |
 *     ,---> | parent_thread  | ------> | parent_task |
 *     |     |                | <.list. |             |
 *     |     `----------------'         `-------------'
 *     |       uthread |  ^             bsd_info |  ^
 *     |               v  | vc_thread            v  | task
 *     |     ,----------------.         ,-------------.
 *     |     |                |         |             |
 *     |     | parent_uthread | <.list. | parent_proc |
 *     |     |                |         |             |
 *     |     `----------------'         `-------------'
 *     |       uu_proc |                  . list
 *     |               v                  v
 *     |     ,----------------.
 *     `---- |                |
 * p_vforkact| child_proc     |  <-- current_proc()
 *           |                |
 *           `----------------'
 */
```

Mach Tasks

 Mach has no notion of a process; instead it uses a lighter-weight concept, the task. A task can be viewed as a machine-independent abstraction of a thread execution environment, or as a container holding a virtual address space, an IPC space, processor resources, scheduling control, and threads.
Strictly speaking, a Mach task is not what other operating systems call a process: Mach, being a microkernel, provides no "process" logic, only the minimal underlying mechanism. A task is inert; it exists purely to serve as a container for one or more threads. A task's threads are kept in its threads member, a queue of thread_count threads, and most operations on a task simply iterate over the threads of the given task and apply the corresponding thread operation to each.
The task structure is defined in osfmk/kern/task.h; an abridged (and here lightly cleaned-up) version of the definition looks like this:

```c
struct task {
	uint32_t vtimers;

#if defined(CONFIG_SCHED_MULTIQ)
	sched_group_t sched_group;
#endif

	queue_head_t threads;
	processor_set_t pset_hint;
	struct affinity_space *affinity_space;

	int thread_count;
	uint32_t active_thread_count;

	security_token_t sec_token;
	audit_token_t audit_token;

	uint64_t total_system_time;
	uint64_t total_ptime;
	uint64_t total_runnable_time;

	decl_lck_mtx_data(,itk_lock_data)
	struct exception_action exc_actions[EXC_TYPES_COUNT];
	struct ipc_port *itk_registered[TASK_PORT_REGISTER_MAX];
	struct ipc_space *itk_space;

	ledger_t ledger;
#define VM_BACKING_STORE_PRIV	0x1

	MACHINE_TASK

#ifdef MACH_BSD
	void *bsd_info;
#endif
	kcdata_descriptor_t corpse_info;
	uint64_t crashed_thread_id;
	queue_chain_t corpse_tasks;
#ifdef CONFIG_MACF
	struct label *crash_label;
#endif
	struct vm_shared_region *shared_region;

	uint32_t t_flags;	/* general-purpose task flags (TF_*) */
#define TF_NONE	0
/* Task is running within a 64-bit address space. */
#define task_has_64Bit_addr(task)	\
	(((task)->t_flags & TF_64B_ADDR) != 0)
#define task_set_64Bit_addr(task)	\
	((task)->t_flags |= TF_64B_ADDR)
#define task_clear_64Bit_addr(task)	\
	((task)->t_flags &= ~TF_64B_ADDR)
#define task_has_64Bit_data(task)	\
	(((task)->t_flags & TF_64B_DATA) != 0)
#define task_set_64Bit_data(task)	\
	((task)->t_flags |= TF_64B_DATA)
#define task_clear_64Bit_data(task)	\
	((task)->t_flags &= ~TF_64B_DATA)
#define task_is_a_corpse(task)	\
	(((task)->t_flags & TF_CORPSE) != 0)
#define task_set_corpse(task)	\
	((task)->t_flags |= TF_CORPSE)
#define task_corpse_pending_report(task)	\
	(((task)->t_flags & TF_PENDING_CORPSE) != 0)
#define task_set_corpse_pending_report(task)	\
	((task)->t_flags |= TF_PENDING_CORPSE)
#define task_clear_corpse_pending_report(task)	\
	((task)->t_flags &= ~TF_PENDING_CORPSE)
#define task_is_a_corpse_fork(task)	\
	(((task)->t_flags & TF_CORPSE_FORK) != 0)

	uint32_t t_procflags;	/* general-purpose process flags (TPF_*) */
#define TPF_NONE	0
#define task_did_exec_internal(task)	\
	(((task)->t_procflags & TPF_DID_EXEC) != 0)
#define task_is_exec_copy_internal(task)	\
	(((task)->t_procflags & TPF_EXEC_COPY) != 0)

	/* ... resource-usage policy fields ... */
	uint16_t policy_ru_cpu		:4,
	         policy_ru_cpu_ext	:4,
	         applied_ru_cpu		:4,
	         applied_ru_cpu_ext	:4;
	uint8_t  rusage_cpu_flags;
	uint64_t rusage_cpu_deadline;
	thread_call_t rusage_cpu_callt;
#if CONFIG_EMBEDDED
	int num_taskwatchers;
	int watchapplying;
#endif

	vm_extmod_statistics_data_t extmod_statistics;

	struct task_requested_policy requested_policy;
	struct task_effective_policy effective_policy;

	io_stat_info_t task_io_stats;
	uint64_t task_immediate_writes __attribute__((aligned(8)));
	uint64_t task_deferred_writes __attribute__((aligned(8)));
	uint64_t task_invalidated_writes __attribute__((aligned(8)));
	uint64_t task_metadata_writes __attribute__((aligned(8)));

	/* The cpu_time_qos_stats fields are protected by the task lock. */
	struct _cpu_time_qos_stats cpu_time_eqos_stats;
	struct _cpu_time_qos_stats cpu_time_rqos_stats;

	uint32_t task_timer_wakeups_bin_1;
	uint32_t task_timer_wakeups_bin_2;
	uint64_t task_gpu_ns;
	uint64_t task_energy;
#if MONOTONIC
	struct mt_task task_monotonic;
#endif

	int task_volatile_objects;
	int task_nonvolatile_objects;
	boolean_t task_purgeable_disowning;
	boolean_t task_purgeable_disowned;
	queue_head_t task_objq;

	unsigned int task_thread_limit:16;
#if __arm64__
	unsigned int task_legacy_footprint:1;
	unsigned int task_region_footprint:1;
#endif
	unsigned int task_has_crossed_thread_limit:1;
	uint32_t exec_token;

	coalition_t coalition[COALITION_NUM_TYPES];
	queue_chain_t task_coalition[COALITION_NUM_TYPES];
	uint64_t dispatchqueue_offset;

#if DEVELOPMENT || DEBUG
	boolean_t task_unnested;
	int task_disconnected_count;
#endif

#if CONFIG_SECLUDED_MEMORY
	uint8_t task_can_use_secluded_mem;
	uint8_t task_could_use_secluded_mem;
	uint8_t task_could_also_use_secluded_mem;
	uint8_t task_suppressed_secluded;
#endif

	uint32_t task_exc_guard;
	queue_head_t io_user_clients;
};
```

Threads

 Unlike Linux, XNU has true threads.
 A thread is the basic unit of CPU execution, while a process is the basic unit of resource ownership. Threads were introduced to make the most of a process's time slice: with multiple threads, a program's execution can be split into subtasks that appear to run concurrently. Switching between threads is cheap, since only the registers need to be saved and restored. Multi-core processors suit threads especially well, because the cores share the same caches and RAM, which provides the basis for sharing virtual memory between threads. A process generally contains several threads.
 The thread is Mach's smallest unit of execution. A thread represents the underlying machine register state plus various scheduling statistics. By design it carries the large amount of information the scheduler needs while keeping its overhead as small as possible.
The thread data structure in Mach is nevertheless very large, so most threads are created by copying a common template, thread_template, which fills the structure with default values. thread_bootstrap(), invoked during kernel boot, populates this template; thread_create_internal() then allocates a new thread data structure and copies the template's contents into it.
The Mach API thread_create() is implemented on top of thread_create_internal().
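The template-copy pattern is easy to picture in plain C. The sketch below is a simplified user-space illustration, not the kernel code: the `toy_` names and field set are invented for the example, though the roles mirror thread_bootstrap() and thread_create_internal().

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for Mach's thread structure (the real one in
 * osfmk/kern/thread.h is far larger). */
struct toy_thread {
    int   sched_mode;
    int   base_pri;
    int   max_priority;
    void *kernel_stack;
};

/* Filled once during boot, cf. thread_bootstrap(). */
static struct toy_thread thread_template;

static void toy_thread_bootstrap(void)
{
    thread_template.sched_mode   = 1;    /* e.g. timeshare */
    thread_template.base_pri     = 31;   /* default base priority */
    thread_template.max_priority = 63;
    thread_template.kernel_stack = NULL;
}

/* Cf. thread_create_internal(): allocate, then copy the template's
 * defaults into the fresh structure with one structure assignment. */
static struct toy_thread *toy_thread_create(void)
{
    struct toy_thread *t = malloc(sizeof *t);
    if (t != NULL)
        *t = thread_template;
    return t;
}
```

Copying a pre-filled template is cheaper and less error-prone than initializing dozens of fields one by one on every thread creation.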

Process Scheduling

 The BSD process and the Mach task described above are scheduled in a one-to-one mapping: every BSD process is backed underneath by a Mach task object. The mapping is implemented through an opaque pointer, bsd_info, about whose contents Mach knows nothing. Most of the actual scheduling is done by Mach operating on tasks.
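The opaque-pointer linkage can be sketched minimally as follows. The `toy_` types are simplified stand-ins invented for this illustration; only the bsd_info field name follows the XNU source.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified shapes; the real struct task and struct proc are far larger. */
struct toy_task {
    void *bsd_info;          /* opaque to Mach: actually a struct proc * */
};

struct toy_proc {
    struct toy_task *task;   /* the Mach task backing this BSD process */
    int p_pid;
};

/* Mach never dereferences bsd_info itself; the BSD layer casts it back. */
static struct toy_proc *toy_get_bsdtask_info(struct toy_task *t)
{
    return (struct toy_proc *)t->bsd_info;
}
```

Because the pointer is a plain `void *`, the Mach layer needs no knowledge of BSD's types, which keeps the layering clean in both directions.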
[Figure: one-to-one mapping between BSD processes and Mach tasks]

 Like virtually every modern user-facing operating system, macOS uses preemptive scheduling. The main preemption criteria are:

  • priority
  • shortest job first
  • time slice

 The process state transitions look roughly as shown in the figure below.
[Figure: process state transition diagram]

First the system calls task_create_internal() to create a task; once creation completes, the task is added to the global task queue. From then on, signals issued through the kern_sig machinery drive the process through its scheduling rounds.

```c
kern_return_t
task_create_internal(
	task_t		parent_task,
	coalition_t	*parent_coalitions __unused,
	boolean_t	inherit_memory,
	__unused boolean_t	is_64bit,
	boolean_t	is_64bit_data,
	uint32_t	t_flags,
	uint32_t	t_procflags,
	task_t		*child_task)
{
	task_t			new_task;
	vm_shared_region_t	shared_region;
	ledger_t		ledger = NULL;

	new_task = (task_t) zalloc(task_zone);
	/* ... */
	ipc_task_enable(new_task);

	lck_mtx_lock(&tasks_threads_lock);
	queue_enter(&tasks, new_task, task_t, tasks);
	tasks_count++;
	if (tasks_suspend_state) {
		task_suspend_internal(new_task);
	}
	lck_mtx_unlock(&tasks_threads_lock);

	*child_task = new_task;
	return(KERN_SUCCESS);
}
```

Mach Thread Scheduling

Mach's thread scheduling is highly extensible. Normally only one scheduler is enabled, but the architecture allows additional schedulers to be defined, selected at build time through the CONFIG_SCHED_* options. Each scheduler maintains a sched_dispatch_table structure that holds its operations as function pointers, and all the main scheduling operations go through this table. Mach keeps the currently active scheduling algorithm in a global, sched_current_dispatch, which in principle allows the scheduler to be switched at run time.

```c
struct sched_dispatch_table {
	const char *sched_name;

	void	(*init)(void);				/* Init global state */
	void	(*timebase_init)(void);			/* Timebase-dependent initialization */
	void	(*processor_init)(processor_t processor);	/* Per-processor scheduler init */
	void	(*pset_init)(processor_set_t pset);	/* Per-processor set scheduler init */

	void	(*maintenance_continuation)(void);	/* Function called regularly */

	thread_t (*choose_thread)(
		processor_t processor,
		int priority,
		ast_t reason);

	boolean_t steal_thread_enabled;
	thread_t (*steal_thread)(
		processor_set_t pset);

	int	(*compute_timeshare_priority)(thread_t thread);

	processor_t (*choose_processor)(
		processor_set_t pset,
		processor_t processor,
		thread_t thread);

	boolean_t (*processor_enqueue)(
		processor_t processor,
		thread_t thread,
		integer_t options);

	void	(*processor_queue_shutdown)(
		processor_t processor);

	boolean_t (*processor_queue_remove)(
		processor_t processor,
		thread_t thread);

	boolean_t (*processor_queue_empty)(processor_t processor);
	boolean_t (*priority_is_urgent)(int priority);

	ast_t	(*processor_csw_check)(processor_t processor);

	boolean_t (*processor_queue_has_priority)(processor_t processor,
		int priority,
		boolean_t gte);

	uint32_t (*initial_quantum_size)(thread_t thread);
	sched_mode_t (*initial_thread_sched_mode)(task_t parent_task);

	boolean_t (*can_update_priority)(thread_t thread);
	void	(*update_priority)(thread_t thread);
	void	(*lightweight_update_priority)(thread_t thread);

	void	(*quantum_expire)(thread_t thread);

	/* Runnable threads on per-processor runqueue. Should only
	 * be used for relative comparisons of load between processors. */
	int	(*processor_runq_count)(processor_t processor);
	uint64_t (*processor_runq_stats_count_sum)(processor_t processor);

	boolean_t (*processor_bound_count)(processor_t processor);
	void	(*thread_update_scan)(sched_update_scan_context_t scan_context);

	boolean_t direct_dispatch_to_idle_processors;
	boolean_t multiple_psets_enabled;
	boolean_t sched_groups_enabled;
	boolean_t avoid_processor_enabled;

	bool	(*thread_avoid_processor)(processor_t processor, thread_t thread);
	void	(*processor_balance)(processor_t processor, processor_set_t pset);

	rt_queue_t (*rt_runq)(processor_set_t pset);
	void	(*rt_init)(processor_set_t pset);
	void	(*rt_queue_shutdown)(processor_t processor);
	void	(*rt_runq_scan)(sched_update_scan_context_t scan_context);
	int64_t	(*rt_runq_count_sum)(void);

	uint32_t (*qos_max_parallelism)(int qos, uint64_t options);
	void	(*check_spill)(processor_set_t pset, thread_t thread);
	sched_ipi_type_t (*ipi_policy)(processor_t dst, thread_t thread, boolean_t dst_idle, sched_ipi_event_t event);
	bool	(*thread_should_yield)(processor_t processor, thread_t thread);
};
```
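The dispatch-through-a-table pattern itself is worth seeing in miniature. The sketch below is a user-space toy, assuming invented scheduler names and a single hook; only the idea of a global pointer plus a SCHED()-style macro mirrors the kernel.

```c
#include <assert.h>

/* Two toy schedulers exposing the same operation. */
static int traditional_priority(int usage) { return 31 - usage / 8; }
static int multiq_priority(int usage)      { return 31 - usage / 4; }

struct toy_dispatch_table {
    const char *sched_name;
    int (*compute_timeshare_priority)(int usage);
};

static const struct toy_dispatch_table sched_traditional_toy = {
    "traditional", traditional_priority,
};
static const struct toy_dispatch_table sched_multiq_toy = {
    "multiq", multiq_priority,
};

/* Cf. sched_current_dispatch: one global pointer names the active
 * scheduler, and every call site goes through a SCHED()-style macro. */
static const struct toy_dispatch_table *toy_current_dispatch =
    &sched_traditional_toy;

#define TOY_SCHED(f) (toy_current_dispatch->f)
```

Because every call site reads the table through the macro, swapping the scheduler is a single pointer assignment, with no changes at the call sites.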

When thread_create_internal() creates a thread, it internally sets the thread's scheduling policy and priority and chains the thread onto its task's thread list.

```c
/* Set the thread's scheduling parameters */
new_thread->sched_mode = SCHED(initial_thread_sched_mode)(parent_task);
new_thread->max_priority = parent_task->max_priority;
new_thread->task_priority = parent_task->priority;
```

```c
/* Chain the thread onto the task's list */
queue_enter(&parent_task->threads, new_thread, thread_t, task_threads);
parent_task->thread_count++;
```

The scheduler then calls thread_wakeup_prim() when a kernel event arrives (for example, a time-slice expiry).

```c
/*
 * thread_wakeup_prim:
 *
 * Common routine for thread_wakeup, thread_wakeup_with_result,
 * and thread_wakeup_one.
 */
kern_return_t thread_wakeup_prim(
	event_t		event,
	boolean_t	one_thread,
	wait_result_t	result)
{
	if (__improbable(event == NO_EVENT))
		panic("%s() called with NO_EVENT", __func__);

	struct waitq *wq = global_eventq(event);

	if (one_thread)
		return waitq_wakeup64_one(wq, CAST_EVENT64_T(event), result,
					  WAITQ_ALL_PRIORITIES);
	else
		return waitq_wakeup64_all(wq, CAST_EVENT64_T(event), result,
					  WAITQ_ALL_PRIORITIES);
}
```
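Note the global_eventq(event) step: the event's address selects one of a fixed set of shared global wait queues. A sketch of that idea (the hash form and queue count here are illustrative, not XNU's actual function):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_GLOBAL_WAITQS 64          /* illustrative; a power of two */

struct toy_waitq { int waiter_count; };

static struct toy_waitq toy_global_waitqs[NUM_GLOBAL_WAITQS];

/* Hash the event's address to pick one of the shared global queues,
 * so unrelated events usually land on different queues. */
static struct toy_waitq *toy_global_eventq(const void *event)
{
    uintptr_t e = (uintptr_t)event;
    return &toy_global_waitqs[(e >> 4) % NUM_GLOBAL_WAITQS];
}
```

The mapping must be deterministic: a sleeper and its waker, given the same event, have to reach the same queue.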

Execution then enters the wait-queue code (waitq.c): waitq_wakeup64_one() is called to wake one thread, and it in turn calls waitq_wakeup64_one_locked(), where we can see that waitq_select_max_locked() is used to select the highest-priority thread.

```c
kern_return_t waitq_wakeup64_one_locked(struct waitq *waitq,
					event64_t wake_event,
					wait_result_t result,
					uint64_t *reserved_preposts,
					int priority,
					waitq_lock_state_t lock_state)
{
	thread_t thread;
	spl_t th_spl;

	assert(waitq_held(waitq));

	if (priority == WAITQ_SELECT_MAX_PRI) {
		thread = waitq_select_max_locked(waitq, wake_event,
						 reserved_preposts,
						 &th_spl);
	} else {
		thread = waitq_select_one_locked(waitq, wake_event,
						 reserved_preposts,
						 priority, &th_spl);
	}

	if (thread != THREAD_NULL)
		waitq_stats_count_wakeup(waitq);
	else
		waitq_stats_count_fail(waitq);

	if (lock_state == WAITQ_UNLOCK)
		waitq_unlock(waitq);

	if (thread != THREAD_NULL) {
		maybe_adjust_thread_pri(thread, priority, waitq);
		kern_return_t ret = thread_go(thread, result);
		assert(ret == KERN_SUCCESS);
		thread_unlock(thread);
		splx(th_spl);
		return ret;
	}

	return KERN_NOT_WAITING;
}
```

Whichever of these paths is taken, they all end up in do_waitq_select_n_locked(), which performs the actual unblocking of threads, for example by calling thread_clear_waitq_state().

```c
static void do_waitq_select_n_locked(struct waitq_select_args *args)
{
	struct waitq *waitq = args->waitq;
	int max_threads = args->max_threads;
	thread_t first_thread = THREAD_NULL;
	struct waitq *safeq;
	uint32_t remaining_eventmask = 0;
	uint32_t eventmask;
	int *nthreads = args->nthreads;
	spl_t spl = 0;

	assert(max_threads != 0);

	if (!waitq_irq_safe(waitq)) {
		/* JMM - add flag to waitq to avoid global lookup if no waiters */
		eventmask = _CAST_TO_EVENT_MASK(waitq);
		safeq = waitq_get_safeq(waitq);
		if (*nthreads == 0)
			spl = splsched();
		waitq_lock(safeq);
	} else {
		eventmask = _CAST_TO_EVENT_MASK(args->event);
		safeq = waitq;
	}

	/*
	 * If the safeq doesn't have an eventmask (not global) or the event
	 * we're looking for IS set in its eventmask, then scan the threads
	 * in that queue for ones that match the original <waitq,event> pair.
	 */
	if (!waitq_is_global(safeq) ||
	    (safeq->waitq_eventmask & eventmask) == eventmask) {

		if (waitq_is_turnstile_queue(safeq)) {
			first_thread = waitq_prioq_iterate_locked(safeq, waitq,
								  spl, args,
								  &remaining_eventmask);
		} else {
			first_thread = waitq_queue_iterate_locked(safeq, waitq,
								  spl, args,
								  &remaining_eventmask);
		}

		/*
		 * Update the eventmask of global queues we just scanned:
		 * - If we selected all the threads in the queue, we can clear its
		 *   eventmask.
		 *
		 * - If we didn't find enough threads to fill our needs, then we can
		 *   assume we looked at every thread in the queue and the mask we
		 *   computed is complete - so reset it.
		 */
		if (waitq_is_global(safeq)) {
			if (waitq_empty(safeq))
				safeq->waitq_eventmask = 0;
			else if (max_threads < 0 || *nthreads < max_threads)
				safeq->waitq_eventmask = remaining_eventmask;
		}
	}

	/*
	 * Grab the first thread in the queue if no other thread was selected.
	 * We can guarantee that no one has manipulated this thread because
	 * it's waiting on the given waitq, and we have that waitq locked.
	 */
	if (*nthreads == 0 && first_thread != THREAD_NULL && args->threadq) {
		/* we know this is the first (and only) thread */
		++(*nthreads);
		*(args->spl) = (safeq != waitq) ? spl : splsched();
		thread_lock(first_thread);
		thread_clear_waitq_state(first_thread);
		waitq_thread_remove(safeq, first_thread);
		enqueue_tail(args->threadq, &(first_thread->wait_links));

		/* update the eventmask on [now] empty global queues */
		if (waitq_is_global(safeq) && waitq_empty(safeq))
			safeq->waitq_eventmask = 0;
	}

	/* unlock the safe queue if we locked one above */
	if (safeq != waitq) {
		waitq_unlock(safeq);
		if (*nthreads == 0)
			splx(spl);
	}

	if (max_threads > 0 && *nthreads >= max_threads)
		return;

	/*
	 * wait queues that are not in any sets
	 * are the bottom of the recursion
	 */
	if (!waitq->waitq_set_id)
		return;

	/* check to see if the set ID for this wait queue is valid */
	struct waitq_link *link = wql_get_link(waitq->waitq_set_id);
	if (!link) {
		/* the waitq set to which this waitq belonged, has been invalidated */
		waitq->waitq_set_id = 0;
		return;
	}

	wql_put_link(link);

	/*
	 * If this waitq is a member of any wait queue sets, we need to look
	 * for waiting thread(s) in any of those sets, and prepost all sets that
	 * don't have active waiters.
	 *
	 * Note that we do a local walk of this waitq's links - we manually
	 * recurse down wait queue set's with non-zero wqset_q.waitq_set_id
	 */
	(void)walk_waitq_links(LINK_WALK_ONE_LEVEL, waitq, waitq->waitq_set_id,
			       WQL_WQS, (void *)args, waitq_select_walk_cb);
}
```

Mach Scheduler Features

Handoff

Mach allows a thread to voluntarily give up the CPU, not to whatever other thread the scheduler would pick, but handed off to a specific thread of its own choosing. Because Mach is a message-passing kernel in which threads communicate by exchanging messages, this feature is especially useful: message-handling latency can be minimized, instead of speculatively waiting for the message-handling thread (sender or receiver) to be scheduled next. This feature is specific to Mach.
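Conceptually, a handoff bypasses the scheduler's normal "pick the best runnable thread" step. A toy model of the difference (entirely illustrative; the real user-visible mechanism is Mach's thread_switch() trap):

```c
#include <assert.h>

struct toy_thread { int id; int pri; };

/* Normal selection: scan the run queue for the highest priority. */
static struct toy_thread *toy_pick_next(struct toy_thread *rq, int n)
{
    struct toy_thread *best = &rq[0];
    for (int i = 1; i < n; i++)
        if (rq[i].pri > best->pri)
            best = &rq[i];
    return best;
}

/* Handoff: the yielding thread names its successor directly, so the
 * scheduler's selection step is skipped entirely. */
static struct toy_thread *toy_handoff(struct toy_thread *target)
{
    return target;
}
```

In the message-passing case, the sender hands off to the receiver (or vice versa), so the reply can start immediately even if a higher-priority but unrelated thread is runnable.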

Continuations

A continuation is an abstract representation of a program's control state: it reifies the program state as data. A continuation captures, in a data structure, the state of the computation at some point during execution, in a form the program itself can access rather than something hidden inside the runtime.
With continuations, a thread does not have to manage its own stack: a blocking thread can discard its stack, and the system need not restore that stack when the thread resumes. Continuations are a simple and effective mechanism for reducing context-switch overhead. This feature is specific to Mach.
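In XNU, a blocking thread can pass a continuation function to thread_block(); on wakeup the kernel simply calls that function on a fresh stack instead of restoring the old one. A minimal user-space sketch of the idea (the `toy_` names are invented for the illustration):

```c
#include <assert.h>

typedef void (*toy_continuation_t)(void *param);

struct toy_blocked_thread {
    toy_continuation_t continuation;  /* where to resume; no saved stack */
    void *parameter;
};

static int toy_resumed_with = 0;

/* The continuation: called afresh on wakeup, with no stack to restore. */
static void toy_after_wait(void *param)
{
    toy_resumed_with = *(int *)param;
}

/* Blocking with a continuation: record the function and its argument;
 * the thread's stack could now be discarded (cf. thread_block()). */
static void toy_block(struct toy_blocked_thread *t,
                      toy_continuation_t cont, void *param)
{
    t->continuation = cont;
    t->parameter = param;
}

/* Wakeup: instead of restoring registers and a stack, just call the
 * recorded continuation. */
static void toy_resume(struct toy_blocked_thread *t)
{
    t->continuation(t->parameter);
}
```

The trade-off is that any state the thread needs after waking must be packed into the parameter, since the blocking function's local variables die with the discarded stack.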

Asynchronous Software Traps (AST)

Asynchronous software traps complement the underlying hardware trap mechanism in software: using ASTs, the kernel can respond to out-of-band events that need attention, such as scheduling events.
An AST is an artificially triggered, non-hardware trap. ASTs are a key part of kernel operation: they are the low-level mechanism behind scheduling events and the foundation on which BSD signals are implemented. They are realized as a field of flag bits in the thread control block, and individual bits can be set with thread_ast_set(). This feature is specific to Mach.
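The flag-bits idea can be sketched as follows. The bit names echo those in osfmk/kern/ast.h, but the values and `toy_` macros here are placeholders for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* A few illustrative AST reason bits; the real set (AST_PREEMPT,
 * AST_QUANTUM, AST_BSD, ...) lives in osfmk/kern/ast.h, and the
 * values here are placeholders. */
typedef uint32_t toy_ast_t;
#define TOY_AST_NONE    0x00u
#define TOY_AST_PREEMPT 0x01u
#define TOY_AST_QUANTUM 0x02u
#define TOY_AST_BSD     0x80u    /* used when delivering BSD signals */

struct toy_thread { toy_ast_t ast; };

/* Cf. thread_ast_set(): each pending reason is just a bit ORed in,
 * to be examined and acted on at the next safe point. */
#define toy_thread_ast_set(t, reason)   ((t)->ast |= (reason))
#define toy_thread_ast_clear(t, reason) ((t)->ast &= ~(reason))
#define toy_thread_ast_peek(t, reason)  ((t)->ast & (reason))
```

Because setting a bit is cheap and asynchronous, any context (an interrupt handler, another CPU) can mark a thread as needing attention, and the thread handles the pending reasons at its next return toward user space.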

Context Switches

A context switch suspends a thread's execution and records its register state at a predefined memory location; the register state is machine-dependent. When a thread is preempted, the CPU registers are loaded with another thread's saved state, thereby resuming that other thread's execution.
A thread can execute on a CPU for an arbitrarily long time. To "execute" means exactly this: the CPU registers are filled with the thread's state, so the CPU (through the EIP/RIP instruction pointer, or PC program counter) runs the thread's code. Execution continues until one of the following happens:

  • the thread terminates
  • the thread yields voluntarily
  • an external interrupt preempts the thread, forcing the CPU to save the thread's state and execute the interrupt handler immediately
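The "save registers here, load registers there" step can be modeled in data alone. The register set below is an abridged, x86_64-flavoured assumption, not XNU's actual machine-dependent layout, and real kernels do the switch in assembly:

```c
#include <assert.h>
#include <stdint.h>

/* Abridged saved register set; the real layout is machine-dependent
 * (cf. the pcb in osfmk's machine-dependent code). */
struct toy_saved_state {
    uint64_t rip;   /* instruction pointer: where execution resumes */
    uint64_t rsp;   /* stack pointer */
    uint64_t rbx, rbp, r12, r13, r14, r15;  /* callee-saved registers */
};

/* A context switch records the outgoing thread's registers and loads
 * the incoming thread's. This models only the data movement. */
static void toy_context_switch(struct toy_saved_state *out,
                               const struct toy_saved_state *in,
                               struct toy_saved_state *cpu)
{
    *out = *cpu;   /* save the preempted thread's state */
    *cpu = *in;    /* resume the next thread where it left off */
}
```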

Priorities

Every Mach thread carries priority information, which directly affects how often the thread is scheduled. Mach has 128 priority levels.
Kernel threads have a minimum priority of 80, higher than any user-space thread, which guarantees that kernel (and user-maintenance) threads can preempt user-mode threads. The priority levels are defined mainly in osfmk/kern/sched.h.

```c
/*
 *	High-level priority assignments
 *
 *************************************************************************
 * 127		Reserved (real-time)
 *		A
 *		+
 *	  (32 levels)
 *		+
 *		V
 * 96		Reserved (real-time)
 * 95		Kernel mode only
 *		A
 *		+
 *	  (16 levels)
 *		+
 *		V
 * 80		Kernel mode only
 * 79		System high priority
 *		A
 *		+
 *	  (16 levels)
 *		+
 *		V
 * 64		System high priority
 * 63		Elevated priorities
 *		A
 *		+
 *	  (12 levels)
 *		+
 *		V
 * 52		Elevated priorities
 * 51		Elevated priorities (incl. BSD +nice)
 *		A
 *		+
 *	  (20 levels)
 *		+
 *		V
 * 32		Elevated priorities (incl. BSD +nice)
 * 31		Default (default base for threads)
 * 30		Lowered priorities (incl. BSD -nice)
 *		A
 *		+
 *	  (20 levels)
 *		+
 *		V
 * 11		Lowered priorities (incl. BSD -nice)
 * 10		Lowered priorities (aged pri's)
 *		A
 *		+
 *	  (11 levels)
 *		+
 *		V
 * 0		Lowered priorities (aged pri's / idle)
 *************************************************************************
 */
```

Priority Shifts

Assigning a priority to a thread is only the beginning; priorities frequently need adjusting at run time. Mach dynamically adjusts each thread's priority according to that thread's CPU usage and the overall system load.
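The adjustment can be approximated as subtracting a scaled CPU-usage term from the base priority. This is a simplified model: the real computation lives in the scheduler's compute_timeshare_priority hook and uses a load-dependent shift, and the shift value below is illustrative.

```c
#include <assert.h>

/* Simplified timeshare priority: recent CPU usage, scaled by a shift
 * that grows smaller as system load grows (so busy systems penalize
 * usage more), is subtracted from the base priority. */
static int toy_timeshare_priority(int base_pri, int cpu_usage, int pri_shift)
{
    int pri = base_pri - (cpu_usage >> pri_shift);
    if (pri < 0)
        pri = 0;   /* clamp to the lowest priority level */
    return pri;
}
```

A CPU-hungry thread thus sinks toward the "lowered priorities" bands of the table above, while a thread that mostly sleeps keeps its base priority, which favors interactive work.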

Run Queues

Threads are managed through run queues. A run queue is a multilevel list, i.e. an array of lists with one queue for each of the 128 priority levels. To find the highest-priority runnable thread, Mach consults a bitmap, which lets it test 32 queues per machine word; at most four word tests are needed, so the scan cost is constant.

```c
struct run_queue {
	int		highq;				/* highest runnable queue */
	bitmap_t	bitmap[BITMAP_LEN(NRQS)];	/* run queue bitmap array */
	int		count;				/* # of threads total */
	int		urgency;			/* level of preemption urgency */
	queue_head_t	queues[NRQS];			/* one for each priority */

	struct runq_stats runq_stats;
};
```
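Finding the highest non-empty queue then reduces to scanning that bitmap word by word. The sketch below assumes the simple mapping "word w, bit b = priority 32*w + b", which is an illustration rather than XNU's exact bit layout:

```c
#include <assert.h>
#include <stdint.h>

#define NRQS 128   /* number of priority levels */

/* Return the highest priority whose bit is set, or -1 if none.
 * At most NRQS/32 == 4 word tests, so the scan cost is constant. */
static int toy_runq_highest(const uint32_t bitmap[NRQS / 32])
{
    for (int w = NRQS / 32 - 1; w >= 0; w--) {
        if (bitmap[w] == 0)
            continue;                     /* 32 empty queues skipped at once */
        for (int b = 31; b >= 0; b--)
            if (bitmap[w] & (1u << b))
                return w * 32 + b;
    }
    return -1;
}
```

Real implementations replace the inner bit loop with a count-leading-zeros instruction, but the word-at-a-time skipping is the point: 128 queues cost no more to scan than four integers.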

Wait Queues

When a process or thread blocks, there is no point in considering it for scheduling: it cannot continue until the object it is waiting on becomes available, its I/O operation completes, or the awaited event occurs. Such threads are therefore placed on a wait queue. Once the awaited condition is satisfied, one or more waiting threads are unblocked and dispatched for execution again.

CPU Affinity

On modern multi-core, SMP, or hyper-threaded architectures, a thread can also be given affinity for one or more specific CPUs. Affinity benefits both the thread and the system: when the thread runs on the same CPU again, its data may still be in that CPU's caches, improving performance.
In Mach, a thread's affinity for a CPU takes the form of binding. thread_bind() exists to bind a thread, and it does nothing more than update the bound_processor field of the thread_t: if this field is set to anything other than PROCESSOR_NULL, future scheduling decisions will dispatch the thread onto that processor's run queue.
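The dispatch rule implied by bound_processor can be sketched in a few lines. The `toy_` types and the hint parameter are simplifications invented for the example; only the bound_processor/PROCESSOR_NULL convention follows the XNU source.

```c
#include <assert.h>
#include <stddef.h>

typedef struct processor { int cpu_id; } *processor_t;
#define PROCESSOR_NULL ((processor_t)NULL)

struct toy_thread {
    processor_t bound_processor;   /* set by thread_bind(), or PROCESSOR_NULL */
};

/* Dispatch: a bound thread always goes to its bound processor's run
 * queue; otherwise the scheduler is free to choose (here: a hint). */
static processor_t toy_choose_processor(struct toy_thread *t, processor_t hint)
{
    if (t->bound_processor != PROCESSOR_NULL)
        return t->bound_processor;
    return hint;
}
```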
