CPU MEMORY BARRIERS
-------------------

The Linux kernel has eight basic CPU memory barriers:

    TYPE             MANDATORY                SMP CONDITIONAL
    ===============  =======================  ========================
    GENERAL          mb()                     smp_mb()
    WRITE            wmb()                    smp_wmb()
    READ             rmb()                    smp_rmb()
    DATA DEPENDENCY  read_barrier_depends()   smp_read_barrier_depends()
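As a quick, hedged illustration of how the SMP conditional pair is used (the variables and functions below are invented for this sketch, not taken from the kernel source): a producer publishes a payload and then a flag, and the consumer reads them in the opposite order.

#include <linux/kernel.h>
#include <asm/barrier.h>

/* Hypothetical example: publish 'msg' before 'data_ready', consume in reverse. */
static int msg;
static int data_ready;

static void producer(void)
{
    msg = 42;
    smp_wmb();          /* order the payload store before the flag store */
    data_ready = 1;
}

static void consumer(void)
{
    if (data_ready) {
        smp_rmb();      /* pairs with the smp_wmb() in producer() */
        pr_info("msg=%d\n", msg);
    }
}

Real code of this era would additionally wrap the flag accesses in ACCESS_ONCE()/READ_ONCE() so the compiler cannot cache or tear them; the sketch keeps only the barrier pairing.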
struct mm_struct {
    struct vm_area_struct *mmap;        /* list of VMAs (linked list of virtual memory areas) */
    struct rb_root mm_rb;               /* red-black tree of VMAs */
    u32 vmacache_seqnum;                /* per-thread vmacache */
#ifdef CONFIG_MMU
    /* find an unmapped hole in the memory-mapping region */
    unsigned long (*get_unmapped_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);
#endif
    unsigned long mmap_base;            /* base of mmap area */
    unsigned long mmap_legacy_base;     /* base of mmap area in bottom-up allocations */
    unsigned long task_size;            /* size of task vm space (length of the user virtual address space) */
    unsigned long highest_vm_end;       /* highest vma end address */
    pgd_t *pgd;                         /* page global directory, i.e. the first-level page table */
    atomic_t mm_users;                  /* How many users with user space?
                                         * (processes sharing this user virtual address space) */
    atomic_t mm_count;                  /* How many references to "struct mm_struct"
                                         * (users count as 1) */
    atomic_long_t nr_ptes;              /* PTE page table pages */
    atomic_long_t nr_pmds;              /* PMD page table pages */
    int map_count;                      /* number of VMAs */

    spinlock_t page_table_lock;         /* Protects page tables and some counters */
    struct rw_semaphore mmap_sem;

    struct list_head mmlist;            /* List of maybe swapped mm's. These are globally strung
                                         * together off init_mm.mmlist, and are protected
                                         * by mmlist_lock */

    unsigned long saved_auxv[AT_VECTOR_SIZE];   /* for /proc/PID/auxv */

    /*
     * Special counters, in some configurations protected by the
     * page_table_lock, in other configurations by being atomic.
     */
    struct mm_rss_stat rss_stat;

    struct linux_binfmt *binfmt;

    cpumask_var_t cpu_vm_mask_var;

    /* Architecture-specific MM context */
    mm_context_t context;               /* processor-architecture-specific memory management context */

    unsigned long flags;                /* Must use atomic bitops to access the bits */

    struct core_state *core_state;      /* coredumping support */
#ifdef CONFIG_AIO
    spinlock_t ioctx_lock;
    struct kioctx_table __rcu *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
    /*
     * "owner" points to a task that is regarded as the canonical
     * user/owner of this mm. All of the following must be true in
     * order for it to be changed:
     *
     * current == mm->owner
     * current->mm != mm
     * new_owner->mm == mm
     * new_owner->alloc_lock is held
     */
    struct task_struct __rcu *owner;
#endif

    /* store ref to file /proc/<pid>/exe symlink points to */
    struct file *exe_file;
#ifdef CONFIG_MMU_NOTIFIER
    struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
    pgtable_t pmd_huge_pte;             /* protected by page_table_lock */
#endif
#ifdef CONFIG_CPUMASK_OFFSTACK
    struct cpumask cpumask_allocation;
#endif
#ifdef CONFIG_NUMA_BALANCING
    /*
     * numa_next_scan is the next time that the PTEs will be marked
     * pte_numa. NUMA hinting faults will gather statistics and migrate
     * pages to new nodes if necessary.
     */
    unsigned long numa_next_scan;

    /* Restart point for scanning and setting pte_numa */
    unsigned long numa_scan_offset;

    /* numa_scan_seq prevents two threads setting pte_numa */
    int numa_scan_seq;
#endif
#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
    /*
     * An operation with batched TLB flushing is going on. Anything that
     * can move process memory needs to flush the TLB when moving a
     * PROT_NONE or PROT_NUMA mapped page.
     */
    bool tlb_flush_pending;
#endif
    struct uprobes_state uprobes_state;
#ifdef CONFIG_X86_INTEL_MPX
    /* address of the bounds directory */
    void __user *bd_addr;
#endif
};
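To see how these fields are typically consumed, here is a small hedged sketch (the helper name dump_vmas is invented) that takes mmap_sem for reading and walks the VMA list rooted at mm->mmap:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/rwsem.h>

/* Hypothetical helper: print every VMA of an mm.  Assumes the linked-list
 * layout shown above (mm->mmap / vma->vm_next). */
static void dump_vmas(struct mm_struct *mm)
{
    struct vm_area_struct *vma;

    down_read(&mm->mmap_sem);           /* the VMA list/tree is protected by mmap_sem */
    for (vma = mm->mmap; vma; vma = vma->vm_next)
        pr_info("%#lx-%#lx flags=%#lx\n",
                vma->vm_start, vma->vm_end, vma->vm_flags);
    pr_info("map_count=%d\n", mm->map_count);
    up_read(&mm->mmap_sem);
}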
struct vm_area_struct {
    /* The first cache line has the info for VMA tree walking. */

    /* start address of the region, and the address of the first byte after its end */
    unsigned long vm_start;             /* Our start address within vm_mm. */
    unsigned long vm_end;               /* The first byte after our end address within vm_mm. */

    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct *vm_next, *vm_prev;   /* doubly linked list of VMAs */

    /* a plain list would make lookups slow, so each VMA is also a node in a red-black tree */
    struct rb_node vm_rb;

    /*
     * Largest free memory gap in bytes to the left of this VMA.
     * Either between this VMA and vma->vm_prev, or between one of the
     * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
     * get_unmapped_area find a free area of the right size.
     */
    unsigned long rb_subtree_gap;

    /* Second cache line starts here. */

    /* the user virtual address space this VMA belongs to */
    struct mm_struct *vm_mm;            /* The address space we belong to. */
    /* protection bits (access permissions) */
    pgprot_t vm_page_prot;              /* Access permissions of this VMA. */
    /* rwx and other flags */
    unsigned long vm_flags;             /* Flags, see mm.h. */

    /*
     * For areas with an address space and backing store,
     * linkage into the address_space->i_mmap interval tree.
     * Supports asking which VMAs a given file range is mapped into.
     */
    struct {
        struct rb_node rb;
        unsigned long rb_subtree_last;
    } shared;

    /*
     * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
     * list, after a COW of one of the file pages. A MAP_SHARED vma
     * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
     * or brk vma (with NULL file) can only be in an anon_vma list.
     */
    /*
     * Strings together all anon_vma instances associated with this VMA;
     * a VMA may be linked both to its parent process's anon_vma and to its own.
     */
    struct list_head anon_vma_chain;    /* Serialized by mmap_sem &
                                         * page_table_lock */
    /* points to an anon_vma; struct anon_vma organizes all the virtual
     * address spaces an anonymous page is mapped into */
    struct anon_vma *anon_vma;          /* Serialized by page_table_lock */

    /* Function pointers to deal with this struct. */
    /*
     * The virtual memory operations set:
     *
     * struct vm_operations_struct {
     *     void (*open)(struct vm_area_struct *area);
     *         // called when the VMA is created
     *     void (*close)(struct vm_area_struct *area);
     *         // called when the VMA is deleted
     *     int (*mremap)(struct vm_area_struct *area);
     *         // called when the VMA is moved with the mremap system call
     *     int (*fault)(struct vm_fault *vmf);
     *         // when a file-backed virtual page is accessed without a physical
     *         // page behind it, a page fault is raised and the handler calls
     *         // fault() to read the file data into the page cache
     *     int (*huge_fault)(struct vm_fault *vmf, enum page_entry_size pe_size);
     *         // like fault(), but for file mappings that use transparent huge pages
     *     void (*map_pages)(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff);
     *         // on a read fault the handler not only reads the faulting file page
     *         // but also reads ahead the following pages, calling map_pages() to
     *         // allocate physical pages in the file's page cache
     *     int (*page_mkwrite)(struct vm_fault *vmf);
     *         // on the first write to a private file mapping, the fault handler
     *         // performs copy-on-write and calls page_mkwrite() to notify the
     *         // filesystem that the page is about to become writable, so it can
     *         // check whether the write is allowed or wait for the page to reach
     *         // a suitable state
     * };
     */
    const struct vm_operations_struct *vm_ops;

    /* Information about our backing store: */
    unsigned long vm_pgoff;             /* offset within the file, in units of pages */

    struct file *vm_file;               /* backing file; NULL for an anonymous mapping */
    void *vm_private_data;              /* private data of this memory region */
#ifndef CONFIG_MMU
    struct vm_region *vm_region;        /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
    struct mempolicy *vm_policy;        /* NUMA policy for the VMA */
#endif
};
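These start/end semantics are what find_vma() relies on: it returns the first VMA whose vm_end lies above the given address, so the caller still has to check vm_start. A minimal hedged sketch (the helper addr_is_mapped is invented):

#include <linux/mm.h>
#include <linux/rwsem.h>

/* Hypothetical helper: check whether addr is covered by a VMA of mm. */
static bool addr_is_mapped(struct mm_struct *mm, unsigned long addr)
{
    struct vm_area_struct *vma;
    bool mapped = false;

    down_read(&mm->mmap_sem);
    vma = find_vma(mm, addr);               /* first VMA with vm_end > addr */
    if (vma && vma->vm_start <= addr)       /* addr may sit in a gap below vma */
        mapped = true;
    up_read(&mm->mmap_sem);

    return mapped;
}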
    error = mlock_future_check(mm, mm->def_flags, len);
    if (error)
        return error;

    /*
     * mm->mmap_sem is required to protect against another thread
     * changing the mappings in case we sleep.
     */
    verify_mm_writelocked(mm);

    /*
     * Clear old maps. this also does some error checking for us
     */
    /* walk the process's VMA rbtree to find the insertion point for addr */
 munmap_back:
    if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) {
        if (do_munmap(mm, addr, len))
            return -ENOMEM;
        goto munmap_back;
    }

    /* Check against address space limits *after* clearing old maps... */
    /* check whether the address space may grow by this much */
    if (!may_expand_vm(mm, len >> PAGE_SHIFT))
        return -ENOMEM;

    if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;

    /* check whether the system has enough memory */
    if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
        return -ENOMEM;

    /* Can we just expand an old private anonymous mapping? */
    /* try to merge with an existing VMA */
    vma = vma_merge(mm, prev, addr, addr + len, flags,
                    NULL, NULL, pgoff, NULL);
    if (vma)
        goto out;       /* merged: nothing more to create */

    /*
     * create a vma struct for an anonymous mapping
     */
    /* no merge possible: allocate a new VMA covering [addr, addr + len) */
    vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
    if (!vma) {
        vm_unacct_memory(len >> PAGE_SHIFT);
        return -ENOMEM;
    }
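For orientation, this fragment follows the anonymous-mapping path (it matches do_brk() in mm/mmap.c). After the allocation succeeds, the kernel fills in the new VMA and links it into both the list and the red-black tree, roughly as paraphrased below (a sketch, not copied verbatim):

    INIT_LIST_HEAD(&vma->anon_vma_chain);
    vma->vm_mm = mm;
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma->vm_pgoff = pgoff;
    vma->vm_flags = flags;
    vma->vm_page_prot = vm_get_page_prot(flags);
    vma_link(mm, vma, prev, rb_link, rb_parent);    /* insert into mm->mmap and mm->mm_rb */
out:
    mm->total_vm += len >> PAGE_SHIFT;              /* account the new pages */
    if (flags & VM_LOCKED)
        mm->locked_vm += (len >> PAGE_SHIFT);
    return addr;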
//mmzone.h
typedef struct pglist_data {
    struct zone node_zones[MAX_NR_ZONES];           /* array of memory zones in this node */
    struct zonelist node_zonelists[MAX_ZONELISTS];  /* fallback (backup) zone lists */
    int nr_zones;                                   /* number of zones in this node */
#ifdef CONFIG_FLAT_NODE_MEM_MAP     /* all memory models except sparse memory */
    struct page *node_mem_map;      /* array of page descriptors, one per physical page */
#ifdef CONFIG_PAGE_EXTENSION
    struct page_ext *node_page_ext; /* extended page attributes */
#endif
#endif
#ifndef CONFIG_NO_BOOTMEM
    struct bootmem_data *bdata;     /* bootmem boot-time allocator */
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
    /*
     * Must be held any time you expect node_start_pfn, node_present_pages
     * or node_spanned_pages stay constant. Holding this will also
     * guarantee that any pfn_valid() stays that way.
     *
     * pgdat_resize_lock() and pgdat_resize_unlock() are provided to
     * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG.
     *
     * Nests above zone->lock and zone->span_seqlock
     */
    spinlock_t node_size_lock;
#endif
    unsigned long node_start_pfn;       /* first physical page frame number of this node */
    unsigned long node_present_pages;   /* total number of physical pages */
    unsigned long node_spanned_pages;   /* total size of the physical page range, including holes */
    int node_id;                        /* node identifier */
    wait_queue_head_t kswapd_wait;
    wait_queue_head_t pfmemalloc_wait;
    struct task_struct *kswapd;         /* Protected by mem_hotplug_begin/end() */
    int kswapd_max_order;
    enum zone_type classzone_idx;
#ifdef CONFIG_NUMA_BALANCING
    /* Lock serializing the migrate rate limiting window */
    spinlock_t numabalancing_migrate_lock;

    /* Rate limiting time interval */
    unsigned long numabalancing_migrate_next_window;

    /* Number of pages migrated during the rate limiting time interval */
    unsigned long numabalancing_migrate_nr_pages;
#endif
} pg_data_t;
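A small hedged sketch (the helper dump_nodes is invented) of how pg_data_t is usually reached through NODE_DATA() and traversed zone by zone:

#include <linux/kernel.h>
#include <linux/mmzone.h>
#include <linux/nodemask.h>

/* Hypothetical helper: print basic information about every online node. */
static void dump_nodes(void)
{
    int nid;

    for_each_online_node(nid) {
        pg_data_t *pgdat = NODE_DATA(nid);
        int i;

        pr_info("node %d: start_pfn=%lu present=%lu spanned=%lu\n",
                pgdat->node_id, pgdat->node_start_pfn,
                pgdat->node_present_pages, pgdat->node_spanned_pages);

        for (i = 0; i < pgdat->nr_zones; i++) {
            struct zone *z = &pgdat->node_zones[i];

            if (populated_zone(z))
                pr_info("  zone %s: present=%lu\n", z->name, z->present_pages);
        }
    }
}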
//mmzone.h
enum zone_type {
#ifdef CONFIG_ZONE_DMA
    /*
     * ZONE_DMA is used when there are devices that are not able
     * to do DMA to all of addressable memory (ZONE_NORMAL). Then we
     * carve out the portion of memory that is needed for these devices.
     * The range is arch specific.
     *
     * Some examples
     *
     * Architecture     Limit
     * ---------------------------
     * parisc, ia64, sparc  <4G
     * s390                 <2G
     * arm                  Various
     * alpha                Unlimited or 0-16MB.
     *
     * i386, x86_64 and multiple other arches
     *                      <16M.
     */
    /*
     * DMA zone. DMA stands for Direct Memory Access. Devices that cannot
     * reach all of memory directly (e.g. legacy ISA devices) must be served
     * from this zone.
     */
    ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
    /*
     * x86_64 needs two ZONE_DMAs because it supports devices that are
     * only able to do DMA to the lower 16M but also 32 bit devices that
     * can only do DMA areas below 4G.
     */
    /*
     * DMA32 zone. A 64-bit system that must support both devices that can only
     * address memory below 16MB and 32-bit devices that can only address memory
     * below 4GB needs this zone.
     */
    ZONE_DMA32,
#endif
    /*
     * Normal addressable memory is in ZONE_NORMAL. DMA operations can be
     * performed on pages in ZONE_NORMAL if the DMA devices support
     * transfers to all addressable memory.
     */
    /*
     * Normal zone: memory directly mapped into the kernel virtual address space,
     * also called the direct-mapped or linearly-mapped zone.
     */
    ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
    /*
     * A memory area that is only addressable by the kernel through
     * mapping portions into its own address space. This is for example
     * used by i386 to allow the kernel to address the memory beyond
     * 900MB. The kernel will set up special mappings (page
     * table entries on i386) for each page that the kernel needs to
     * access.
     */
    /*
     * Highmem zone: a leftover from the 32-bit era. With the 3:1 user/kernel
     * split the kernel address space is only 1GB, so memory above roughly 1GB
     * cannot be mapped into the kernel address space permanently.
     */
    ZONE_HIGHMEM,
#endif
    /* Movable zone: a pseudo zone used to prevent memory fragmentation */
    ZONE_MOVABLE,
    __MAX_NR_ZONES
};
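Callers rarely name a zone directly; the GFP flags passed to the page allocator pick the highest zone that may satisfy the request (GFP_DMA restricts the allocation to ZONE_DMA, plain GFP_KERNEL stops at ZONE_NORMAL, __GFP_HIGHMEM allows ZONE_HIGHMEM). A hedged sketch with invented helper names:

#include <linux/gfp.h>
#include <linux/highmem.h>

/* Hypothetical helpers: allocate one page that may live in ZONE_HIGHMEM
 * (GFP_HIGHUSER includes __GFP_HIGHMEM) and map it temporarily. */
static void *grab_high_page(struct page **pagep)
{
    struct page *page = alloc_pages(GFP_HIGHUSER, 0);   /* order-0 page */

    if (!page)
        return NULL;
    *pagep = page;
    return kmap(page);      /* highmem pages need a temporary kernel mapping */
}

static void drop_high_page(struct page *page)
{
    kunmap(page);
    __free_pages(page, 0);
}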
struct zone {
    /* Read-mostly fields */

    /* zone watermarks, access with *_wmark_pages(zone) macros */
    unsigned long watermark[NR_WMARK];      /* watermarks used by the page allocator */

    /*
     * We don't know if the memory that we're going to allocate will be freeable
     * or/and it will be released eventually, so to avoid totally wasting several
     * GB of ram we must reserve some of the lower zone memory (otherwise we risk
     * to run OOM on the lower zones despite there's tons of freeable ram
     * on the higher zones). This array is recalculated at runtime if the
     * sysctl_lowmem_reserve_ratio sysctl changes.
     */
    /* used by the page allocator: how many pages this zone keeps in reserve
     * and will not lend to higher zone types */
    long lowmem_reserve[MAX_NR_ZONES];
#ifdef CONFIG_NUMA
    int node;
#endif

    /*
     * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
     * this zone's LRU. Maintained by the pageout code.
     */
    unsigned int inactive_ratio;

    /*
     * This is a per-zone reserve of pages that should not be
     * considered dirtyable memory.
     */
    unsigned long dirty_balance_reserve;

#ifndef CONFIG_SPARSEMEM
    /*
     * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
     * In SPARSEMEM, this map is stored in struct mem_section
     */
    unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

#ifdef CONFIG_NUMA
    /*
     * zone reclaim becomes active if more unmapped pages exist.
     */
    unsigned long min_unmapped_pages;
    unsigned long min_slab_pages;
#endif /* CONFIG_NUMA */
    /*
     * spanned_pages is the total pages spanned by the zone, including
     * holes, which is calculated as:
     *     spanned_pages = zone_end_pfn - zone_start_pfn;
     *
     * present_pages is physical pages existing within the zone, which
     * is calculated as:
     *     present_pages = spanned_pages - absent_pages(pages in holes);
     *
     * managed_pages is present pages managed by the buddy system, which
     * is calculated as (reserved_pages includes pages allocated by the
     * bootmem allocator):
     *     managed_pages = present_pages - reserved_pages;
     *
     * So present_pages may be used by memory hotplug or memory power
     * management logic to figure out unmanaged pages by checking
     * (present_pages - managed_pages). And managed_pages should be used
     * by page allocator and vm scanner to calculate all kinds of watermarks
     * and thresholds.
     *
     * Locking rules:
     *
     * zone_start_pfn and spanned_pages are protected by span_seqlock.
     * It is a seqlock because it has to be read outside of zone->lock,
     * and it is done in the main allocator path. But, it is written
     * quite infrequently.
     *
     * The span_seq lock is declared along with zone->lock because it is
     * frequently read in proximity to zone->lock. It's good to
     * give them a chance of being in the same cacheline.
     *
     * Write access to present_pages at runtime should be protected by
     * mem_hotplug_begin/end(). Any reader who can't tolerant drift of
     * present_pages should get_online_mems() to get a stable value.
     *
     * Read access to managed_pages should be safe because it's unsigned
     * long. Write access to zone->managed_pages and totalram_pages are
     * protected by managed_page_count_lock at runtime. Ideally only
     * adjust_managed_page_count() should be used instead of directly
     * touching zone->managed_pages and totalram_pages.
     */
    unsigned long managed_pages;    /* number of physical pages managed by the buddy allocator */
    unsigned long spanned_pages;    /* total pages spanned by this zone, including holes */
    unsigned long present_pages;    /* physical pages present in this zone, excluding holes */

    const char *name;               /* zone name */
    /*
     * Number of MIGRATE_RESERVE page block. To maintain for just
     * optimization. Protected by zone->lock.
     */
    int nr_migrate_reserve_block;

#ifdef CONFIG_MEMORY_ISOLATION
    /*
     * Number of isolated pageblock. It is used to solve incorrect
     * freepage counting problem due to racy retrieving migratetype
     * of pageblock. Protected by zone->lock.
     */
    unsigned long nr_isolate_pageblock;
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
    /* see spanned/present_pages for more description */
    seqlock_t span_seqlock;
#endif

    /*
     * wait_table                  -- the array holding the hash table
     * wait_table_hash_nr_entries  -- the size of the hash table array
     * wait_table_bits             -- wait_table_size == (1 << wait_table_bits)
     *
     * The purpose of all these is to keep track of the people
     * waiting for a page to become available and make them
     * runnable again when possible. The trouble is that this
     * consumes a lot of space, especially when so few things
     * wait on pages at a given time. So instead of using
     * per-page waitqueues, we use a waitqueue hash table.
     *
     * The bucket discipline is to sleep on the same queue when
     * colliding and wake all in that wait queue when removing.
     * When something wakes, it must check to be sure its page is
     * truly available, a la thundering herd. The cost of a
     * collision is great, but given the expected load of the
     * table, they should be so rare as to be outweighed by the
     * benefits from the saved space.
     *
     * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
     * primary users of these fields, and in mm/page_alloc.c
     * free_area_init_core() performs the initialization of them.
     */
    wait_queue_head_t *wait_table;
    unsigned long wait_table_hash_nr_entries;
    unsigned long wait_table_bits;
    ZONE_PADDING(_pad1_)

    /* free areas of different sizes (one free list per allocation order) */
    struct free_area free_area[MAX_ORDER];
    /* zone flags, see below */
    unsigned long flags;

    /* Write-intensive fields used from the page allocator */
    spinlock_t lock;
ZONE_PADDING(_pad2_)
/* Write-intensive fields used by page reclaim */
    /* Fields commonly accessed by the page reclaim scanner */
    spinlock_t lru_lock;
    struct lruvec lruvec;

    /* Evictions & activations on the inactive file list */
    atomic_long_t inactive_age;

    /*
     * When free pages are below this point, additional steps are taken
     * when reading the number of free pages to avoid per-cpu counter
     * drift allowing watermarks to be breached
     */
    unsigned long percpu_drift_mark;

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
    /* pfn where compaction free scanner should start */
    unsigned long compact_cached_free_pfn;
    /* pfn where async and sync compaction migration scanner should start */
    unsigned long compact_cached_migrate_pfn[2];
#endif

#ifdef CONFIG_COMPACTION
    /*
     * On compaction failure, 1<<compact_defer_shift compactions
     * are skipped before trying again. The number attempted since
     * last failure is tracked with compact_considered.
     */
    unsigned int compact_considered;
    unsigned int compact_defer_shift;
    int compact_order_failed;
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
    /* Set to true when the PG_migrate_skip bits should be cleared */
    bool compact_blockskip_flush;
#endif

    ZONE_PADDING(_pad3_)

    /* Zone statistics */
    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;
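watermark[] and lowmem_reserve[] meet in the allocator's zone_watermark_ok() test; the following is a deliberately simplified, hedged restatement of that logic (it ignores the per-order free-list checks), with the helper name zone_has_room invented:

#include <linux/mmzone.h>
#include <linux/vmstat.h>

/* Simplified sketch of the watermark test: a zone may serve an allocation for
 * class zone 'classzone_idx' only if its free pages stay above the chosen
 * watermark plus the reserve it keeps back from higher zone types. */
static bool zone_has_room(struct zone *z, int classzone_idx, unsigned long mark)
{
    long free = zone_page_state(z, NR_FREE_PAGES);

    return free > (long)(mark + z->lowmem_reserve[classzone_idx]);
}

It would typically be called with one of the watermark macros, for example zone_has_room(z, 0, low_wmark_pages(z)).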
struct page {
    /* First double word block */
    unsigned long flags;                /* Atomic flags, some possibly
                                         * updated asynchronously */
    union {
        struct address_space *mapping;  /* If low bit clear, points to
                                         * inode address_space, or NULL.
                                         * If page mapped as anonymous
                                         * memory, low bit is set, and
                                         * it points to anon_vma object:
                                         * see PAGE_MAPPING_ANON below.
                                         */
        void *s_mem;                    /* slab first object */
    };

    /* Second double word */
    struct {
        union {
            pgoff_t index;              /* Our offset within mapping. */
            void *freelist;             /* sl[aou]b first free object */
            bool pfmemalloc;            /* If set by the page allocator,
                                         * ALLOC_NO_WATERMARKS was set
                                         * and the low watermark was not
                                         * met implying that the system
                                         * is under some pressure. The
                                         * caller should try ensure
                                         * this page is only used to
                                         * free other pages.
                                         */
        };

        union {
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
            /* Used for cmpxchg_double in slub */
            unsigned long counters;
#else
            /*
             * Keep _count separate from slub cmpxchg_double data.
             * As the rest of the double word is protected by
             * slab_lock but _count is not.
             */
            unsigned counters;
#endif

            struct {
                union {
                    /*
                     * Count of ptes mapped in
                     * mms, to show when page is
                     * mapped & limit reverse map
                     * searches.
                     *
                     * Used also for tail pages
                     * refcounting instead of
                     * _count. Tail pages cannot
                     * be mapped and keeping the
                     * tail page _count zero at
                     * all times guarantees
                     * get_page_unless_zero() will
                     * never succeed on tail
                     * pages.
                     */
                    atomic_t _mapcount;
                };
            };
        };
    };

    /* Third double word block */
    union {
        struct list_head lru;           /* Pageout list, eg. active_list
                                         * protected by zone->lru_lock !
                                         * Can be used as a generic list
                                         * by the page owner.
                                         */
        struct {                        /* slub per cpu partial pages */
            struct page *next;          /* Next partial slab */
#ifdef CONFIG_64BIT
            int pages;                  /* Nr of partial slabs left */
            int pobjects;               /* Approximate # of objects */
#else
            short int pages;
            short int pobjects;
#endif
        };

        struct slab *slab_page;         /* slab fields */
        struct rcu_head rcu_head;       /* Used by SLAB
                                         * when destroying via RCU */
        /* First tail page of compound page */
        struct {
            compound_page_dtor *compound_dtor;
            unsigned long compound_order;
        };
    };

    /* Remainder is not double word aligned */
    union {
        unsigned long private;          /* Mapping-private opaque data:
                                         * usually used for buffer_heads
                                         * if PagePrivate set; used for
                                         * swp_entry_t if PageSwapCache;
                                         * indicates order in the buddy
                                         * system if PG_buddy is set.
                                         */
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
        spinlock_t *ptl;
#else
        spinlock_t ptl;
#endif
#endif
        struct kmem_cache *slab_cache;  /* SL[AU]B: Pointer to slab */
        struct page *first_page;        /* Compound tail pages */
    };

    /*
     * On machines where all RAM is mapped into kernel address space,
     * we can simply calculate the virtual address. On machines with
     * highmem some memory is mapped into kernel virtual memory
     * dynamically, so we need a place to store that address.
     * Note that this field could be 16 bits on x86 ... ;)
     *
     * Architectures with slow multiplication can define
     * WANT_PAGE_VIRTUAL in asm/page.h
     */
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;                      /* Kernel virtual address (NULL if
                                           not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */

#ifdef CONFIG_KMEMCHECK
    /*
     * kmemcheck wants to track the status of each byte in a page; this
     * is a pointer to such a status block. NULL if not tracked.
     */
    void *shadow;
#endif

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
    int _last_cpupid;
#endif
};
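The mapping comment at the top of the struct is exactly what PageAnon() tests: the low bit of page->mapping marks an anon_vma pointer rather than an address_space. A small hedged sketch (classify_page is an invented helper):

#include <linux/kernel.h>
#include <linux/mm.h>

/* Hypothetical helper: classify a page the way the mapping comment describes. */
static void classify_page(struct page *page)
{
    if (PageAnon(page))         /* low bit of page->mapping set: anon_vma */
        pr_info("pfn %lu: anonymous, mapcount=%d\n",
                page_to_pfn(page), page_mapcount(page));
    else if (page->mapping)     /* low bit clear: inode address_space */
        pr_info("pfn %lu: file-backed\n", page_to_pfn(page));
    else
        pr_info("pfn %lu: not mapped to user space\n", page_to_pfn(page));
}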
Big Kernel Lock (BKL)
The Big Kernel Lock (BKL) is a lock in the Linux kernel whose basic principle matches that of an ordinary lock. Once a process acquires the BKL it enters the critical section the lock protects, but not only is that critical section locked: every critical section protected by the BKL is locked at the same time, because there is only this one kernel-wide lock.
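Historically the BKL was taken and released with lock_kernel()/unlock_kernel() from <linux/smp_lock.h>; both the API and the lock itself were removed around the 2.6.39 time frame. A hedged sketch of the old pattern (the function name is invented):

#include <linux/smp_lock.h>     /* historical header; gone in modern kernels */

static void legacy_ioctl_body(void)
{
    lock_kernel();      /* acquire the single, kernel-wide BKL */
    /* ... touch state that was historically guarded only by the BKL ... */
    unlock_kernel();    /* release it; the BKL was also dropped implicitly when the holder slept */
}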
//include/linux/percpu_counter.h
struct percpu_counter {
    raw_spinlock_t lock;        /* spinlock, used when an accurate value is needed */
    s64 count;                  /* the accurate value of the counter */
#ifdef CONFIG_HOTPLUG_CPU
    struct list_head list;      /* All percpu_counters are on a list */
#endif
    s32 __percpu *counters;     /* per-CPU array caching local updates to the counter */
};
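A minimal hedged usage sketch of the percpu_counter API that goes with this structure (the counter name nr_widgets and the functions around it are invented; note that older kernels' percpu_counter_init() takes no GFP argument):

#include <linux/kernel.h>
#include <linux/percpu_counter.h>

static struct percpu_counter nr_widgets;

static int widgets_init(void)
{
    return percpu_counter_init(&nr_widgets, 0, GFP_KERNEL);
}

static void widgets_track(void)
{
    percpu_counter_inc(&nr_widgets);    /* cheap: usually touches only the local CPU slot */
    pr_info("approx=%lld exact=%lld\n",
            (long long)percpu_counter_read(&nr_widgets),    /* may lag by the per-CPU deltas */
            (long long)percpu_counter_sum(&nr_widgets));    /* folds in every CPU under the lock */
}

static void widgets_exit(void)
{
    percpu_counter_destroy(&nr_widgets);
}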