Linux Proc Filesystem

1 Introduction

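The proc virtual filesystem provides a way for user space and the kernel to exchange data. It is normally mounted at /proc, and the mount is performed during boot initialization (for example by the initrd). The vast majority of files under /proc export kernel information, but some can also be modified; such writes are generally used to control the features or behavior of kernel modules.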

2 Files by category

2.1 Memory-related

  • /proc/buddyinfo
  • /proc/slabinfo
  • /proc/meminfo
  • /proc/iomem
  • /proc/ioports

3 File list (in alphabetical order)

3.1 proc/buddyinfo

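buddyinfo shows the state of the memory managed by the Linux buddy allocator. For each zone, the 11 columns give the number of free blocks of 2^0 to 2^10 contiguous pages. The following is sample output read from one system: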

~$ cat /proc/buddyinfo
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone    DMA32    645    775    517    199     83      7     20     93     50      8    294
Node 0, zone   Normal    143    100     14     14      5      1      3      2      1      2      0
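
The total free memory of a zone follows from these columns by weighting each count with its block size (count × 2^order pages). A minimal Python sketch, assuming the column layout shown above with order 0 first; the helper name free_pages_per_zone is hypothetical and the embedded sample is the DMA line from the output above:

```python
def free_pages_per_zone(text):
    """Map (node, zone) -> total free pages, summing count * 2**order."""
    totals = {}
    for line in text.splitlines():
        parts = line.replace(",", " ").split()
        # Expected layout: Node <n> zone <name> <order-0 count> <order-1 count> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node, zone = int(parts[1]), parts[3]
        counts = [int(c) for c in parts[4:]]
        totals[(node, zone)] = sum(c << order for order, c in enumerate(counts))
    return totals


# DMA line copied from the sample output above.
SAMPLE = ("Node 0, zone      DMA      1      1      1      0      2"
          "      1      1      0      1      1      3\n")

for (node, zone), pages in sorted(free_pages_per_zone(SAMPLE).items()):
    print(f"node {node} zone {zone}: {pages} free pages")
```

On a live system the same function can be fed the contents of /proc/buddyinfo, for example free_pages_per_zone(open("/proc/buddyinfo").read()).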

Kernel code in linux/mm/vmstat.c:

static const struct file_operations buddyinfo_file_operations = {
        .open           = fragmentation_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = seq_release,
};

3.2 proc/cgroups

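cgroups (Control Groups) is a Linux resource-management feature that organizes processes into hierarchical groups whose resource usage can be limited and monitored. The interface is provided through the pseudo-filesystem cgroupfs. See man 7 cgroups for details.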

~$ cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset  2       1       1
cpu     3       1       1
cpuacct 3       1       1
memory  0       1       0
devices 4       76      1
freezer 5       1       1
net_cls 6       1       1
blkio   7       1       1
perf_event      8       1       1
net_prio        6       1       1
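
The four columns can be read into records to find, for example, which controllers are disabled. A minimal Python sketch, assuming the whitespace-separated four-column layout shown above; parse_cgroups is a hypothetical helper name and the sample is a shortened copy of the output above:

```python
def parse_cgroups(text):
    """Parse /proc/cgroups text into a list of dicts, one per controller."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip the header line and blank lines
        name, hierarchy, num_cgroups, enabled = line.split()
        rows.append({
            "subsys_name": name,
            "hierarchy": int(hierarchy),
            "num_cgroups": int(num_cgroups),
            "enabled": enabled == "1",
        })
    return rows


# Shortened copy of the sample output above.
SAMPLE = """#subsys_name\thierarchy\tnum_cgroups\tenabled
cpuset\t2\t1\t1
cpu\t3\t1\t1
memory\t0\t1\t0
devices\t4\t76\t1
"""

controllers = parse_cgroups(SAMPLE)
disabled = [c["subsys_name"] for c in controllers if not c["enabled"]]
print("disabled controllers:", disabled)
```

Controllers that share a hierarchy number (cpu and cpuacct in the output above) are attached to the same hierarchy, so grouping the records by the hierarchy field shows which controllers are co-mounted.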

Excerpt from the cgroups(7) man page:

Control cgroups, usually referred to as cgroups, are a Linux kernel feature which allow processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored. The kernel's cgroup interface is provided through a pseudo-filesystem called cgroupfs. Grouping is implemented in the core cgroup kernel code, while resource tracking and limits are implemented in a set of per-resource-type subsystems (memory, CPU, and so on).

Terminology

A cgroup is a collection of processes that are bound to a set of limits or parameters defined via the cgroup filesystem.

A subsystem is a kernel component that modifies the behavior of the processes in a cgroup. Various subsystems have been implemented, making it possible to do things such as limiting the amount of CPU time and memory available to a cgroup, accounting for the CPU time used by a cgroup, and freezing and resuming execution of the processes in a cgroup. Subsystems are sometimes also known as resource controllers (or simply, controllers).

The cgroups for a controller are arranged in a hierarchy. This hierarchy is defined by creating, removing, and renaming subdirectories within the cgroup filesystem. At each level of the hierarchy, attributes (e.g., limits) can be defined. The limits, control, and accounting provided by cgroups generally have effect throughout the subhierarchy underneath the cgroup where the attributes are defined. Thus, for example, the limits placed on a cgroup at a higher level in the hierarchy cannot be exceeded by descendant cgroups.

Cgroups version 1 controllers

Each of the cgroups version 1 controllers is governed by a kernel configuration option (listed below). Additionally, the availability of the cgroups feature is governed by the CONFIG_CGROUPS kernel configuration option.

cpu (since Linux 2.6.24; CONFIG_CGROUP_SCHED) Cgroups can be guaranteed a minimum number of "CPU shares" when a system is busy. This does not limit a cgroup's CPU usage if the CPUs are not busy. For further information, see Documentation/scheduler/sched-design-CFS.txt.

In Linux 3.2, this controller was extended to provide CPU "bandwidth" control. If the kernel is configured with CONFIG_CFS_BANDWIDTH, then within each scheduling period (defined via a file in the cgroup directory), it is possible to define an upper limit on the CPU time allocated to the processes in a cgroup. This upper limit applies even if there is no other competition for the CPU. Further information can be found in the kernel source file Documentation/scheduler/sched-bwc.txt.

cpuacct (since Linux 2.6.24; CONFIG_CGROUP_CPUACCT) This provides accounting for CPU usage by groups of processes.

Further information can be found in the kernel source file Documentation/cgroup-v1/cpuacct.txt.

cpuset (since Linux 2.6.24; CONFIG_CPUSETS) This cgroup can be used to bind the processes in a cgroup to a specified set of CPUs and NUMA nodes.

Further information can be found in the kernel source file Documentation/cgroup-v1/cpusets.txt.

memory (since Linux 2.6.25; CONFIG_MEMCG) The memory controller supports reporting and limiting of process memory, kernel memory, and swap used by cgroups.

Further information can be found in the kernel source file Documentation/cgroup-v1/memory.txt.

devices (since Linux 2.6.26; CONFIG_CGROUP_DEVICE) This supports controlling which processes may create (mknod) devices as well as open them for reading or writing. The policies may be specified as whitelists and blacklists. Hierarchy is enforced, so new rules must not violate existing rules for the target or ancestor cgroups.

Further information can be found in the kernel source file Documentation/cgroup-v1/devices.txt.

freezer (since Linux 2.6.28; CONFIG_CGROUP_FREEZER) The freezer cgroup can suspend and restore (resume) all processes in a cgroup. Freezing a cgroup /A also causes its children, for example, processes in /A/B, to be frozen.

Further information can be found in the kernel source file Documentation/cgroup-v1/freezer-subsystem.txt.

net_cls (since Linux 2.6.29; CONFIG_CGROUP_NET_CLASSID) This places a classid, specified for the cgroup, on network packets created by a cgroup. These classids can then be used in firewall rules, as well as used to shape traffic using tc(8). This applies only to packets leaving the cgroup, not to traffic arriving at the cgroup.

Further information can be found in the kernel source file Documentation/cgroup-v1/net_cls.txt.

blkio (since Linux 2.6.33; CONFIG_BLK_CGROUP) The blkio cgroup controls and limits access to specified block devices by applying IO control in the form of throttling and upper limits against leaf nodes and intermediate nodes in the storage hierarchy.

Two policies are available. The first is a proportional-weight time-based division of disk implemented with CFQ. This is in effect for leaf nodes using CFQ. The second is a throttling policy which specifies upper I/O rate limits on a device.

Further information can be found in the kernel source file Documentation/cgroup-v1/blkio-controller.txt.

perf_event (since Linux 2.6.39; CONFIG_CGROUP_PERF) This controller allows perf monitoring of the set of processes grouped in a cgroup.

Further information can be found in the kernel source file tools/perf/Documentation/perf-record.txt.

net_prio (since Linux 3.3; CONFIG_CGROUP_NET_PRIO) This allows priorities to be specified, per network interface, for cgroups.

Further information can be found in the kernel source file Documentation/cgroup-v1/net_prio.txt.

hugetlb (since Linux 3.5; CONFIG_CGROUP_HUGETLB) This supports limiting the use of huge pages by cgroups.

Further information can be found in the kernel source file Documentation/cgroup-v1/hugetlb.txt.

pids (since Linux 4.3; CONFIG_CGROUP_PIDS) This controller permits limiting the number of processes that may be created in a cgroup (and its descendants).

Further information can be found in the kernel source file Documentation/cgroup-v1/pids.txt.

rdma (since Linux 4.11; CONFIG_CGROUP_RDMA) The RDMA controller permits limiting the use of RDMA/IB-specific resources per cgroup.

Further information can be found in the kernel source file Documentation/cgroup-v1/rdma.txt.

3.3 proc/cmdline

cmdline shows the boot parameters passed to the Linux kernel:

~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.16.0-4-amd64 root=UUID=62c7ce2d-be7b-43d1-b9e1-6002ee63577d ro quiet
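
User-space tools often need individual parameters rather than the whole line. A minimal Python sketch that splits the command line into key/value pairs; parse_cmdline is a hypothetical helper name, and treating bare flags such as ro as True is a simplifying assumption (a repeated parameter simply keeps its last value here):

```python
import shlex


def parse_cmdline(text):
    """Split a kernel command line into a dict; bare flags map to True."""
    params = {}
    # shlex.split keeps double-quoted values containing spaces in one token.
    for token in shlex.split(text):
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


# Sample line copied from the output above.
SAMPLE = ("BOOT_IMAGE=/vmlinuz-3.16.0-4-amd64 "
          "root=UUID=62c7ce2d-be7b-43d1-b9e1-6002ee63577d ro quiet\n")

params = parse_cmdline(SAMPLE)
print(params["root"])
```

Only the first "=" separates key from value, so root keeps the full UUID=... string as its value.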

Kernel code in fs/proc/cmdline.c:

static int cmdline_proc_show(struct seq_file *m, void *v)
{
        seq_printf(m, "%s\n", saved_command_line);
        return 0;
}

static int cmdline_proc_open(struct inode *inode, struct file *file)
{
        return single_open(file, cmdline_proc_show, NULL);
}

static const struct file_operations cmdline_proc_fops = {
        .open           = cmdline_proc_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = single_release,
};

static int __init proc_cmdline_init(void)
{
        proc_create("cmdline", 0, NULL, &cmdline_proc_fops);
        return 0;
}
fs_initcall(proc_cmdline_init);

3.4 proc/consoles

consoles shows information about every console visible to the system.

Kernel code in fs/proc/consoles.c:

static const struct seq_operations consoles_op = {
        .start  = c_start,
        .next   = c_next,
        .stop   = c_stop,
        .show   = show_console_dev
};

static int consoles_open(struct inode *inode, struct file *file)
{
        return seq_open(file, &consoles_op);
}

static const struct file_operations proc_consoles_operations = {
        .open           = consoles_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = seq_release,
};

static int __init proc_consoles_init(void)
{
        proc_create("consoles", 0, NULL, &proc_consoles_operations);
        return 0;
}
fs_initcall(proc_consoles_init);

3.5 proc/cpuinfo

cpuinfo shows CPU-related information:

~$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
stepping        : 3
microcode       : 0x1c
cpu MHz         : 3200.390
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
bugs            :
bogomips        : 6384.87
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
stepping        : 3
...
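
Each logical CPU is reported as a block of "key : value" lines, with blocks separated by a blank line. A minimal Python sketch that turns this layout into one dictionary per CPU; parse_cpuinfo is a hypothetical helper name, the sample is an abbreviated copy of the x86 output above, and other architectures use different keys:

```python
def parse_cpuinfo(text):
    """Return one dict per logical CPU from /proc/cpuinfo-style text."""
    cpus, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current processor block.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


# Abbreviated copy of the sample output above.
SAMPLE = """processor\t: 0
model name\t: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
cpu cores\t: 4
flags\t\t: fpu vme sse sse2 avx2

processor\t: 1
model name\t: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
cpu cores\t: 4
flags\t\t: fpu vme sse sse2 avx2
"""

cpus = parse_cpuinfo(SAMPLE)
has_avx2 = "avx2" in cpus[0]["flags"].split()
print(len(cpus), "logical CPUs, avx2 supported:", has_avx2)
```

Splitting the flags value on whitespace before testing membership avoids false matches on substrings (for example "avx" inside "avx2").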

The kernel code consists of a platform-independent part, fs/proc/cpuinfo.c, and a platform-specific part (arch/x86/kernel/cpu/proc.c):

static int cpuinfo_open(struct inode *inode, struct file *file)
{
        arch_freq_prepare_all();
        return seq_open(file, &cpuinfo_op);
}

static const struct file_operations proc_cpuinfo_operations = {
        .open           = cpuinfo_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = seq_release,
};

static int __init proc_cpuinfo_init(void)
{
        proc_create("cpuinfo", 0, NULL, &proc_cpuinfo_operations);
        return 0;
}
fs_initcall(proc_cpuinfo_init);

3.6 proc/crypto

crypto lists the cryptographic algorithms supported by the system. Partial sample output:

~$ cat /proc/crypto
name         : crct10dif
driver       : crct10dif-pclmul
module       : crct10dif_pclmul
priority     : 200
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 1
digestsize   : 2
...
name         : sha1
driver       : sha1-generic
module       : kernel
priority     : 0
refcnt       : 1
selftest     : passed
internal     : no
type         : shash
blocksize    : 64
digestsize   : 20

Kernel code in crypto/proc.c:

static const struct seq_operations crypto_seq_ops = {
        .start          = c_start,
        .next           = c_next,
        .stop           = c_stop,
        .show           = c_show
};

static int crypto_info_open(struct inode *inode, struct file *file)
{
        return seq_open(file, &crypto_seq_ops);
}

static const struct file_operations proc_crypto_ops = {
        .open           = crypto_info_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = seq_release
};

void __init crypto_init_proc(void)
{
        proc_create("crypto", 0, NULL, &proc_crypto_ops);
}

void __exit crypto_exit_proc(void)
{
        remove_proc_entry("crypto", NULL);
}

3.7 proc/devices

devices lists the major numbers of the registered character and block devices, for example:

~$ cat /proc/devices
Character devices:
  1 mem
  4 /dev/vc/0
  4 tty
  4 ttyS
  5 /dev/tty
  5 /dev/console
  5 /dev/ptmx
  6 lp
  7 vcs
 10 misc
 13 input
 21 sg
 29 fb
 81 video4linux
 99 ppdev
116 alsa
128 ptm
136 pts
180 usb
189 usb_device
216 rfcomm

...

253 tpm
254 gpiochip

Block devices:
259 blkext
  8 sd
 65 sd
 66 sd
 67 sd
 68 sd
 69 sd
 70 sd
 71 sd
128 sd
129 sd
130 sd
131 sd
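
A script that needs to look up the major numbers of a driver can split the file at its two section headers. A minimal Python sketch, assuming the "Character devices:" and "Block devices:" headers shown above; parse_devices is a hypothetical helper name and the sample is a shortened copy of the output:

```python
def parse_devices(text):
    """Return {"char": [(major, name), ...], "block": [(major, name), ...]}."""
    result = {"char": [], "block": []}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Character devices:":
            section = "char"
        elif stripped == "Block devices:":
            section = "block"
        elif stripped and section:
            major, name = stripped.split(None, 1)
            result[section].append((int(major), name))
    return result


# Shortened copy of the sample output above.
SAMPLE = """Character devices:
  1 mem
  4 tty
  4 ttyS

Block devices:
259 blkext
  8 sd
 65 sd
"""

devices = parse_devices(SAMPLE)
sd_majors = [major for major, name in devices["block"] if name == "sd"]
print("sd majors:", sd_majors)
```

The entries are kept as (major, name) pairs rather than a dictionary because one major can appear under several names (4 tty and 4 ttyS) and one name under several majors (sd).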

Kernel code in fs/proc/devices.c:

static const struct seq_operations devinfo_ops = {
        .start = devinfo_start,
        .next  = devinfo_next,
        .stop  = devinfo_stop,
        .show  = devinfo_show
};

static int devinfo_open(struct inode *inode, struct file *filp)
{
        return seq_open(filp, &devinfo_ops);
}

static const struct file_operations proc_devinfo_operations = {
        .open           = devinfo_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = seq_release,
};

static int __init proc_devices_init(void)
{
        proc_create("devices", 0, NULL, &proc_devinfo_operations);
        return 0;
}
fs_initcall(proc_devices_init);

static int devinfo_show(struct seq_file *f, void *v)
{
        int i = *(loff_t *) v;

        if (i < CHRDEV_MAJOR_MAX) {
                if (i == 0)
                        seq_puts(f, "Character devices:\n");
                chrdev_show(f, i);
        }
#ifdef CONFIG_BLOCK
        else {
                i -= CHRDEV_MAJOR_MAX;
                if (i == 0)
                        seq_puts(f, "\nBlock devices:\n");
                blkdev_show(f, i);
        }
#endif
        return 0;
}

void chrdev_show(struct seq_file *f, off_t offset)
{
        struct char_device_struct *cd;

        mutex_lock(&chrdevs_lock);
        for (cd = chrdevs[major_to_index(offset)]; cd; cd = cd->next) {
                if (cd->major == offset)
                        seq_printf(f, "%3d %s\n", cd->major, cd->name);
        }
        mutex_unlock(&chrdevs_lock);
}

void blkdev_show(struct seq_file *seqf, off_t offset)
{
        struct blk_major_name *dp;

        mutex_lock(&block_class_lock);
        for (dp = major_names[major_to_index(offset)]; dp; dp = dp->next)
                if (dp->major == offset)
                        seq_printf(seqf, "%3d %s\n", dp->major, dp->name);
        mutex_unlock(&block_class_lock);
}

3.8 proc/diskstats

diskstats shows disk I/O statistics. Sample output follows; for the meaning of each field see Documentation/iostats.txt.

~$ cat /proc/diskstats
   8      16 sdb 476 0 37508 15096 0 0 0 0 0 2608 15096
   8      17 sdb1 50 0 4160 1780 0 0 0 0 0 1780 1780
   8      18 sdb2 48 0 4144 1400 0 0 0 0 0 1380 1400
   8      19 sdb3 48 0 4144 1952 0 0 0 0 0 1884 1952
   8      20 sdb4 2 0 4 176 0 0 0 0 0 176 176
   8      21 sdb5 46 0 4128 2588 0 0 0 0 0 1640 2588
   8      22 sdb6 46 0 4128 1052 0 0 0 0 0 1004 1052
   8      23 sdb7 48 0 4144 984 0 0 0 0 0 980 984
   8      24 sdb8 50 0 4160 2536 0 0 0 0 0 2016 2536
   8      25 sdb9 48 0 4144 1188 0 0 0 0 0 1188 1188
   8       0 sda 28874 1231 1655786 10164 6376 11927 340456 6204 0 5348 16352
   8       1 sda1 52 0 4176 24 0 0 0 0 0 16 24
   8       2 sda2 28795 1231 1649538 10136 6225 11927 340456 6068 0 5216 16188

Summary of the columns:

 1 - major number
 2 - minor number
 3 - device name
 4 - reads completed successfully
 5 - reads merged
 6 - sectors read
 7 - time spent reading (ms)
 8 - writes completed
 9 - writes merged
10 - sectors written
11 - time spent writing (ms)
12 - I/Os currently in progress
13 - time spent doing I/Os (ms)
14 - weighted time spent doing I/Os (ms)

Columns 4 to 14 are described below; they are numbered Field 1 to Field 11, as in Documentation/iostats.txt:

Field 1 – # of reads completed
    This is the total number of reads completed successfully.

Field 2 – # of reads merged, field 6 – # of writes merged
    Reads and writes which are adjacent to each other may be merged for
    efficiency. Thus two 4K reads may become one 8K read before it is
    ultimately handed to the disk, and so it will be counted (and queued)
    as only one I/O. This field lets you know how often this was done.

Field 3 – # of sectors read
    This is the total number of sectors read successfully.

Field 4 – # of milliseconds spent reading
    This is the total number of milliseconds spent by all reads (as
    measured from __make_request() to end_that_request_last()).

Field 5 – # of writes completed
    This is the total number of writes completed successfully.

Field 6 – # of writes merged
    See the description of field 2.

Field 7 – # of sectors written
    This is the total number of sectors written successfully.

Field 8 – # of milliseconds spent writing
    This is the total number of milliseconds spent by all writes (as
    measured from __make_request() to end_that_request_last()).

Field 9 – # of I/Os currently in progress
    The only field that should go to zero. Incremented as requests are
    given to appropriate struct request_queue and decremented as they finish.

Field 10 – # of milliseconds spent doing I/Os
    This field increases so long as field 9 is nonzero.

Field 11 – weighted # of milliseconds spent doing I/Os
    This field is incremented at each I/O start, I/O completion, I/O
    merge, or read of these stats by the number of I/Os in progress
    (field 9) times the number of milliseconds spent doing I/O since the
    last update of this field. This can provide an easy measure of both
    I/O completion time and the backlog that may be accumulating.
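
The field descriptions above map directly onto the columns after the device name. A minimal Python sketch that names the 11 fields and derives the average time per read; the field names in FIELDS and the helper parse_diskstats are hypothetical labels chosen for this illustration, and any columns beyond the 14 shown in this document are ignored:

```python
# Hypothetical labels for Field 1 to Field 11, in file order.
FIELDS = (
    "reads_completed", "reads_merged", "sectors_read", "ms_reading",
    "writes_completed", "writes_merged", "sectors_written", "ms_writing",
    "ios_in_progress", "ms_doing_io", "weighted_ms_doing_io",
)


def parse_diskstats(text):
    """Map device name -> dict holding major, minor and the 11 fields."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 + len(FIELDS):
            continue  # not a complete statistics line
        values = [int(v) for v in parts[3:3 + len(FIELDS)]]
        stats[parts[2]] = dict(zip(FIELDS, values),
                               major=int(parts[0]), minor=int(parts[1]))
    return stats


# The sda line copied from the sample output above.
SAMPLE = ("   8       0 sda 28874 1231 1655786 10164 "
          "6376 11927 340456 6204 0 5348 16352\n")

sda = parse_diskstats(SAMPLE)["sda"]
avg_read_ms = sda["ms_reading"] / sda["reads_completed"]
print(f"sda: {sda['sectors_read']} sectors read, {avg_read_ms:.2f} ms per read")
```

Because all counters except the in-progress count are cumulative since boot, rates such as reads per second are usually obtained by sampling the file twice and dividing the difference by the interval.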

Kernel code in block/genhd.c:

static const struct seq_operations diskstats_op = {
        .start  = disk_seqf_start,
        .next   = disk_seqf_next,
        .stop   = disk_seqf_stop,
        .show   = diskstats_show
};

static int diskstats_open(struct inode *inode, struct file *file)
{
        return seq_open(file, &diskstats_op);
}

static const struct file_operations proc_diskstats_operations = {
        .open           = diskstats_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = seq_release,
};

static int __init proc_genhd_init(void)
{
        proc_create("diskstats", 0, NULL, &proc_diskstats_operations);
        proc_create("partitions", 0, NULL, &proc_partitions_operations);
        return 0;
}
module_init(proc_genhd_init);

static int diskstats_show(struct seq_file *seqf, void *v)
{
        struct gendisk *gp = v;
        struct disk_part_iter piter;
        struct hd_struct *hd;
        char buf[BDEVNAME_SIZE];
        unsigned int inflight[2];
        int cpu;

        /*
        if (&disk_to_dev(gp)->kobj.entry == block_class.devices.next)
                seq_puts(seqf,  "major minor name"
                                "     rio rmerge rsect ruse wio wmerge "
                                "wsect wuse running use aveq"
                                "\n\n");
        */

        disk_part_iter_init(&piter, gp, DISK_PITER_INCL_EMPTY_PART0);
        while ((hd = disk_part_iter_next(&piter))) {
                cpu = part_stat_lock();
                part_round_stats(gp->queue, cpu, hd);
                part_stat_unlock();
                part_in_flight(gp->queue, hd, inflight);
                seq_printf(seqf, "%4d %7d %s %lu %lu %lu "
                           "%u %lu %lu %lu %u %u %u %u\n",
                           MAJOR(part_devt(hd)), MINOR(part_devt(hd)),
                           disk_name(gp, hd->partno, buf),
                           part_stat_read(hd, ios[READ]),
                           part_stat_read(hd, merges[READ]),
                           part_stat_read(hd, sectors[READ]),
                           jiffies_to_msecs(part_stat_read(hd, ticks[READ])),
                           part_stat_read(hd, ios[WRITE]),
                           part_stat_read(hd, merges[WRITE]),
                           part_stat_read(hd, sectors[WRITE]),
                           jiffies_to_msecs(part_stat_read(hd, ticks[WRITE])),
                           inflight[0],
                           jiffies_to_msecs(part_stat_read(hd, io_ticks)),
                           jiffies_to_msecs(part_stat_read(hd, time_in_queue))
                        );
        }
        disk_part_iter_exit(&piter);

        return 0;
}

3.9 proc/dma

dma lists the registered ISA DMA (direct memory access) channels in use.

~$ cat /proc/dma
 4: cascade

Kernel code in kernel/dma.c:

static int proc_dma_open(struct inode *inode, struct file *file)
{
        return single_open(file, proc_dma_show, NULL);
}

static const struct file_operations proc_dma_operations = {
        .open           = proc_dma_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = single_release,
};

static int __init proc_dma_init(void)
{
        proc_create("dma", 0, NULL, &proc_dma_operations);
        return 0;
}

__initcall(proc_dma_init);

#ifdef MAX_DMA_CHANNELS
static int proc_dma_show(struct seq_file *m, void *v)
{
        int i;

        for (i = 0 ; i < MAX_DMA_CHANNELS ; i++) {
                if (dma_chan_busy[i].lock) {
                        seq_printf(m, "%2d: %s\n", i,
                                   dma_chan_busy[i].device_id);
                }
        }
        return 0;
}
#else
static int proc_dma_show(struct seq_file *m, void *v)
{
        seq_puts(m, "No DMA\n");
        return 0;
}
#endif /* MAX_DMA_CHANNELS */

3.10 proc/execdomains

execdomains shows the execution domains (ABIs) currently supported for executables. At present it always shows Linux [kernel]:

~$ cat /proc/execdomains
0-0     Linux                   [kernel]

Kernel code in kernel/exec_domain.c:

#ifdef CONFIG_PROC_FS
static int execdomains_proc_show(struct seq_file *m, void *v)
{
        seq_puts(m, "0-0\tLinux           \t[kernel]\n");
        return 0;
}

static int execdomains_proc_open(struct inode *inode, struct file *file)
{
        return single_open(file, execdomains_proc_show, NULL);
}

static const struct file_operations execdomains_proc_fops = {
        .open           = execdomains_proc_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = single_release,
};

static int __init proc_execdomains_init(void)
{
        proc_create("execdomains", 0, NULL, &execdomains_proc_fops);
        return 0;
}
module_init(proc_execdomains_init);
#endif

SYSCALL_DEFINE1(personality, unsigned int, personality)
{
        unsigned int old = current->personality;

        if (personality != 0xffffffff)
                set_personality(personality);

        return old;
}

3.11 proc/fb

fb (frame buffer) shows the frame buffer devices currently present in the system:

~$ cat /proc/fb
0 inteldrmfb

Kernel code in drivers/video/fbdev/core/fbmem.c:

static int fb_seq_show(struct seq_file *m, void *v)
{
        int i = *(loff_t *)v;
        struct fb_info *fi = registered_fb[i];

        if (fi)
                seq_printf(m, "%d %s\n", fi->node, fi->fix.id);
        return 0;
}

static const struct seq_operations proc_fb_seq_ops = {
        .start  = fb_seq_start,
        .next   = fb_seq_next,
        .stop   = fb_seq_stop,
        .show   = fb_seq_show,
};

static int proc_fb_open(struct inode *inode, struct file *file)
{
        return seq_open(file, &proc_fb_seq_ops);
}

static const struct file_operations fb_proc_fops = {
        .owner          = THIS_MODULE,
        .open           = proc_fb_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = seq_release,
};

3.12 proc/filesystems

filesystems lists the filesystem types currently supported. Entries marked nodev do not require a block device:

~$ cat /proc/filesystems
nodev   sysfs
nodev   rootfs
nodev   ramfs
nodev   bdev
nodev   proc
nodev   cpuset
nodev   cgroup
nodev   cgroup2
nodev   tmpfs
nodev   devtmpfs
nodev   debugfs
nodev   tracefs
nodev   securityfs
nodev   sockfs
nodev   bpf
nodev   pipefs
nodev   hugetlbfs
nodev   devpts
nodev   pstore
nodev   mqueue
        ext3
        ext2
        ext4
nodev   autofs
nodev   binfmt_misc
        fuseblk
nodev   fuse
nodev   fusectl
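
The first column separates filesystems that need a backing block device from those that do not. A minimal Python sketch of that split, assuming the two-column layout shown above; parse_filesystems is a hypothetical helper name and the sample is a shortened copy of the output:

```python
def parse_filesystems(text):
    """Return (needs_device, nodev): two lists of filesystem names."""
    needs_device, nodev = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if len(parts) == 2 and parts[0] == "nodev":
            nodev.append(parts[1])
        else:
            # First column is empty: the filesystem requires a block device.
            needs_device.append(parts[-1])
    return needs_device, nodev


# Shortened copy of the sample output above.
SAMPLE = "nodev\tsysfs\nnodev\tproc\n\text3\n\text4\nnodev\tfuse\n\tfuseblk\n"

needs_device, nodev = parse_filesystems(SAMPLE)
print("block-device filesystems:", needs_device)
```

The list reflects filesystems registered at the moment of reading, so a type provided by a module that is not loaded yet will not appear.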

Kernel code in fs/filesystems.c:

static int filesystems_proc_show(struct seq_file *m, void *v)
{
        struct file_system_type * tmp;

        read_lock(&file_systems_lock);
        tmp = file_systems;
        while (tmp) {
                seq_printf(m, "%s\t%s\n",
                        (tmp->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
                        tmp->name);
                tmp = tmp->next;
        }
        read_unlock(&file_systems_lock);
        return 0;
}

static int filesystems_proc_open(struct inode *inode, struct file *file)
{
        return single_open(file, filesystems_proc_show, NULL);
}

static const struct file_operations filesystems_proc_fops = {
        .open           = filesystems_proc_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = single_release,
};

static int __init proc_filesystems_init(void)
{
        proc_create("filesystems", 0, NULL, &filesystems_proc_fops);
        return 0;
}
module_init(proc_filesystems_init);

3.13 proc/fs

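The fs directory contains subdirectories, each named after a filesystem currently in use (for example ext4). Under each filesystem subdirectory there is a second-level subdirectory for every mounted block device (for example sda2), and the files inside it expose that mount's attributes.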

proc/fs is created by fs/proc/root.c:

void __init proc_root_init(void)
{
        int err;

        proc_init_inodecache();
        set_proc_pid_nlink();
        err = register_filesystem(&proc_fs_type);
        if (err)
                return;

        proc_self_init();
        proc_thread_self_init();
        proc_symlink("mounts", NULL, "self/mounts");

        proc_net_init();

#ifdef CONFIG_SYSVIPC
        proc_mkdir("sysvipc", NULL);
#endif
        proc_mkdir("fs", NULL);
        proc_mkdir("driver", NULL);
        proc_create_mount_point("fs/nfsd"); /* somewhere for the nfsd filesystem to be mounted */
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
        /* just give it a mountpoint */
        proc_create_mount_point("openprom");
#endif
        proc_tty_init();
        proc_mkdir("bus", NULL);
        proc_sys_init();
}

Each subdirectory is created by the corresponding filesystem, for example ext4 in fs/ext4/sysfs.c:

static const char proc_dirname[] = "fs/ext4";
static struct proc_dir_entry *ext4_proc_root;

int __init ext4_init_sysfs(void)
{
        int ret;

        kobject_set_name(&ext4_kset.kobj, "ext4");
        ext4_kset.kobj.parent = fs_kobj;
        ret = kset_register(&ext4_kset);
        if (ret)
                return ret;

        ret = kobject_init_and_add(&ext4_feat, &ext4_feat_ktype,
                                   NULL, "features");
        if (ret)
                kset_unregister(&ext4_kset);
        else
                ext4_proc_root = proc_mkdir(proc_dirname, NULL);
        return ret;
}

static const struct ext4_proc_files {
        const char *name;
        const struct file_operations *fops;
} proc_files[] = {
        PROC_FILE_LIST(options),
        PROC_FILE_LIST(es_shrinker_info),
        PROC_FILE_LIST(mb_groups),
        { NULL, NULL },
};

int ext4_register_sysfs(struct super_block *sb)
{
        struct ext4_sb_info *sbi = EXT4_SB(sb);
        const struct ext4_proc_files *p;
        int err;

        sbi->s_kobj.kset = &ext4_kset;
        init_completion(&sbi->s_kobj_unregister);
        err = kobject_init_and_add(&sbi->s_kobj, &ext4_sb_ktype, NULL,
                                   "%s", sb->s_id);
        if (err)
                return err;

        if (ext4_proc_root)
                sbi->s_proc = proc_mkdir(sb->s_id, ext4_proc_root);

        if (sbi->s_proc) {
                for (p = proc_files; p->name; p++)
                        proc_create_data(p->name, S_IRUGO, sbi->s_proc,
                                         p->fops, sb);
        }
        return 0;
}

3.14 proc/interrupts

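interrupts shows the number of interrupts per CPU for each device. The information in this file can be used to check whether interrupts are balanced across CPUs: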

~$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:         18          0          0          0  IR-IO-APIC   2-edge      timer
  1:          3          1          1          8  IR-IO-APIC   1-edge      i8042
  8:          1          0          0          0  IR-IO-APIC   8-edge      rtc0
  9:       1076       2273        812        836  IR-IO-APIC   9-fasteoi   acpi
 12:         42        475         34         51  IR-IO-APIC  12-edge      i8042
 16:          2         23          2          2  IR-IO-APIC  16-fasteoi   ehci_hcd:usb1
 18:          0          0          1          0  IR-IO-APIC  18-fasteoi   i801_smbus
 23:          3         28          0          2  IR-IO-APIC  23-fasteoi   ehci_hcd:usb2
 24:          0          0          0          0  DMAR-MSI   0-edge      dmar0
 25:          0          0          0          0  DMAR-MSI   1-edge      dmar1
 28:         17          1          3          1  IR-PCI-MSI 1572864-edge      rtsx_pci
 29:        160       1281        141        110  IR-PCI-MSI 409600-edge      enp0s25
 30:       7815      43472       5581       5237  IR-PCI-MSI 327680-edge      xhci_hcd
 31:      29423      41453      16894      16308  IR-PCI-MSI 512000-edge      ahci[0000:00:1f.2]
 32:         21          0          0          0  IR-PCI-MSI 360448-edge      mei_me
 33:        234        129         18          5  IR-PCI-MSI 442368-edge      snd_hda_intel:card1
 34:         17         25          9          6  IR-PCI-MSI 1048576-edge
 35:      21776      66376      16760      12078  IR-PCI-MSI 32768-edge      i915
 36:       7089      20303       4455       3968  IR-PCI-MSI 2097152-edge      iwlwifi
 37:        116        855        136        140  IR-PCI-MSI 49152-edge      snd_hda_intel:card0
NMI:          0          0          0          0   Non-maskable interrupts
LOC:     310517     250892     326730     272651   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0   Performance monitoring interrupts
IWI:          0          0          2          0   IRQ work interrupts
RTR:          0          0          0          0   APIC ICR read retries
RES:      32446      33569      28799      18822   Rescheduling interrupts
CAL:      50777      60493      62390      53925   Function call interrupts
TLB:      48661      58330      60778      51797   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
DFR:          0          0          0          0   Deferred Error APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:         12         12         12         12   Machine check polls
ERR:          0
MIS:          0
PIN:          0          0          0          0   Posted-interrupt notification event
PIW:          0          0          0          0   Posted-interrupt wakeup event
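
Checking the balance amounts to comparing the per-CPU counters of one row. A minimal Python sketch, assuming the layout shown above (a header naming the CPUs, then "label: counters description" rows); parse_interrupts is a hypothetical helper name and the sample is a shortened copy of the output:

```python
def parse_interrupts(text):
    """Map row label -> (list of per-CPU counts, description string)."""
    lines = text.splitlines()
    if not lines:
        return {}
    cpus = lines[0].split()  # header: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        # Leading numeric fields are counters, at most one per CPU;
        # rows such as ERR and MIS carry a single global counter.
        while fields and len(counts) < len(cpus) and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        table[label.strip()] = (counts, " ".join(fields))
    return table


# Shortened copy of the sample output above.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  0:         18          0          0          0  IR-IO-APIC   2-edge      timer
 29:        160       1281        141        110  IR-PCI-MSI 409600-edge      enp0s25
ERR:          0
"""

table = parse_interrupts(SAMPLE)
counts, description = table["29"]
busiest = counts.index(max(counts))
share = max(counts) / sum(counts)
print(f"IRQ 29 ({description}): CPU{busiest} handled {share:.0%} of interrupts")
```

A share close to 1/N on an N-CPU system suggests an even distribution; in the sample, IRQ 29 (the enp0s25 network interface) is handled mostly by one CPU.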

Kernel code in fs/proc/interrupts.c and kernel/irq/proc.c:

static const struct seq_operations int_seq_ops = {
        .start = int_seq_start,
        .next  = int_seq_next,
        .stop  = int_seq_stop,
        .show  = show_interrupts
};

static int interrupts_open(struct inode *inode, struct file *filp)
{
        return seq_open(filp, &int_seq_ops);
}

static const struct file_operations proc_interrupts_operations = {
        .open           = interrupts_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = seq_release,
};

static int __init proc_interrupts_init(void)
{
        proc_create("interrupts", 0, NULL, &proc_interrupts_operations);
        return 0;
}
fs_initcall(proc_interrupts_init);

int show_interrupts(struct seq_file *p, void *v)
{
        static int prec;

        unsigned long flags, any_count = 0;
        int i = *(loff_t *) v, j;
        struct irqaction *action;
        struct irq_desc *desc;

        if (i > ACTUAL_NR_IRQS)
                return 0;

        if (i == ACTUAL_NR_IRQS)
                return arch_show_interrupts(p, prec);

        /* print header and calculate the width of the first column */
        if (i == 0) {
                for (prec = 3, j = 1000; prec < 10 && j <= nr_irqs; ++prec)
                        j *= 10;

                seq_printf(p, "%*s", prec + 8, "");
                for_each_online_cpu(j)
                        seq_printf(p, "CPU%-8d", j);
                seq_putc(p, '\n');
        }

        irq_lock_sparse();
        desc = irq_to_desc(i);
        if (!desc)
                goto outsparse;

        raw_spin_lock_irqsave(&desc->lock, flags);
        for_each_online_cpu(j)
                any_count |= kstat_irqs_cpu(i, j);
        action = desc->action;
        if ((!action || irq_desc_is_chained(desc)) && !any_count)
                goto out;

        seq_printf(p, "%*d: ", prec, i);
        for_each_online_cpu(j)
                seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));

        if (desc->irq_data.chip) {
                if (desc->irq_data.chip->irq_print_chip)
                        desc->irq_data.chip->irq_print_chip(&desc->irq_data, p);
                else if (desc->irq_data.chip->name)
                        seq_printf(p, " %8s", desc->irq_data.chip->name);
                else
                        seq_printf(p, " %8s", "-");
        } else {
                seq_printf(p, " %8s", "None");
        }
        if (desc->irq_data.domain)
                seq_printf(p, " %*d", prec, (int) desc->irq_data.hwirq);
        else
                seq_printf(p, " %*s", prec, "");
#ifdef CONFIG_GENERIC_IRQ_SHOW_LEVEL
        seq_printf(p, " %-8s", irqd_is_level_type(&desc->irq_data) ? "Level" : "Edge");
#endif
        if (desc->name)
                seq_printf(p, "-%-8s", desc->name);

        if (action) {
                seq_printf(p, "  %s", action->name);
                while ((action = action->next) != NULL)
                        seq_printf(p, ", %s", action->name);
        }

        seq_putc(p, '\n');
out:
        raw_spin_unlock_irqrestore(&desc->lock, flags);
outsparse:
        irq_unlock_sparse();
        return 0;
}
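
The width of the IRQ-number column is decided by the short loop at the top of show_interrupts(). As a minimal sketch, the same computation can be reproduced in user space. The name irq_col_width is hypothetical, and nr_irqs is passed in as a parameter here instead of being the kernel variable:

```c
/* Hypothetical user-space copy of the column-width loop in show_interrupts(). */
static int irq_col_width(unsigned long long nr_irqs)
{
        int prec;
        unsigned long long j;

        /* start at 3 digits and widen once per power of ten that nr_irqs reaches */
        for (prec = 3, j = 1000; prec < 10 && j <= nr_irqs; ++prec)
                j *= 10;
        return prec;
}
```

With a width of 3, the "%*d: " format prints IRQ 33 as " 33: ", which matches the sample output above.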

3.15 proc/iomem

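The I/O memory map. It covers reserved regions, BIOS, video, system RAM, PCI address space, and so on. For security reasons the BIOS does not expose some PCIe address ranges to the OS; those ranges show up as reserved.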

~$ cat /proc/iomem
00000000-00000fff : reserved
00001000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c7fff : Video ROM
000c8000-000cbfff : pnp 00:00
000cc000-000cffff : pnp 00:00
000d0000-000d3fff : pnp 00:00
000d4000-000d7fff : pnp 00:00
000d8000-000dbfff : pnp 00:00
000dc000-000dffff : pnp 00:00
000e0000-000fffff : reserved
  000f0000-000fffff : System ROM
00100000-1fffffff : System RAM
  01000000-01519c00 : Kernel code
  01519c01-018ecdff : Kernel data
  01a21000-01af2fff : Kernel bss
20000000-201fffff : reserved
20200000-40003fff : System RAM
40004000-40004fff : reserved
40005000-cdba6fff : System RAM
cdba7000-dae9efff : reserved
dae9f000-daf9efff : ACPI Non-volatile Storage
daf9f000-daffefff : ACPI Tables
dafff000-df9fffff : reserved
  dba00000-df9fffff : Graphics Stolen Memory
dfa00000-febfffff : PCI Bus 0000:00
  e0000000-efffffff : 0000:00:02.0
  f0000000-f03fffff : 0000:00:02.0
  f0400000-f0bfffff : PCI Bus 0000:02
  f0c00000-f13fffff : PCI Bus 0000:04
  f1400000-f1bfffff : PCI Bus 0000:04
  f1c00000-f1cfffff : PCI Bus 0000:03
    f1c00000-f1c01fff : 0000:03:00.0
      f1c00000-f1c01fff : iwlwifi
  f1d00000-f24fffff : PCI Bus 0000:02
    f1d00000-f1d000ff : 0000:02:00.0
      f1d00000-f1d000ff : mmc0
  f2500000-f251ffff : 0000:00:19.0
    f2500000-f251ffff : e1000e
  f2520000-f252ffff : 0000:00:14.0
    f2520000-f252ffff : xhci_hcd
  f2530000-f2533fff : 0000:00:1b.0
    f2530000-f2533fff : ICH HD audio
  f2534000-f25340ff : 0000:00:1f.3
  f2535000-f253500f : 0000:00:16.0
    f2535000-f253500f : mei_me
  f2538000-f25387ff : 0000:00:1f.2
    f2538000-f25387ff : ahci
  f2539000-f25393ff : 0000:00:1d.0
    f2539000-f25393ff : ehci_hcd
  f253a000-f253a3ff : 0000:00:1a.0
    f253a000-f253a3ff : ehci_hcd
  f253b000-f253bfff : 0000:00:19.0
    f253b000-f253bfff : e1000e
  f253c000-f253cfff : 0000:00:16.3
  f8000000-fbffffff : PCI MMCONFIG 0000 [bus 00-3f]
    f8000000-fbffffff : reserved
      f8000000-fbffffff : pnp 00:01
fec00000-fec00fff : reserved
  fec00000-fec003ff : IOAPIC 0
fed00000-fed003ff : HPET 0
  fed00000-fed003ff : PNP0103:00
fed08000-fed08fff : reserved
fed10000-fed19fff : reserved
  fed10000-fed17fff : pnp 00:01
  fed18000-fed18fff : pnp 00:01
  fed19000-fed19fff : pnp 00:01
fed1c000-fed1ffff : reserved
  fed1c000-fed1ffff : pnp 00:01
    fed1f410-fed1f414 : iTCO_wdt
      fed1f410-fed1f414 : iTCO_wdt
fed40000-fed4bfff : PCI Bus 0000:00
  fed45000-fed4bfff : pnp 00:01
fed90000-fed90fff : dmar0
fed91000-fed91fff : dmar1
fee00000-fee00fff : Local APIC
  fee00000-fee00fff : reserved
ffc00000-ffffffff : reserved
  fffff000-ffffffff : pnp 00:01
100000000-41e5fffff : System RAM
41e600000-41effffff : reserved
41f000000-41fffffff : RAM buffer

Kernel code, kernel/resource.c:

#ifdef CONFIG_PROC_FS

enum { MAX_IORES_LEVEL = 5 };

static void *r_start(struct seq_file *m, loff_t *pos)
        __acquires(resource_lock)
{
        struct resource *p = m->private;
        loff_t l = 0;
        read_lock(&resource_lock);
        for (p = p->child; p && l < *pos; p = r_next(m, p, &l))
                ;
        return p;
}

static void r_stop(struct seq_file *m, void *v)
        __releases(resource_lock)
{
        read_unlock(&resource_lock);
}

static int r_show(struct seq_file *m, void *v)
{
        struct resource *root = m->private;
        struct resource *r = v, *p;
        unsigned long long start, end;
        int width = root->end < 0x10000 ? 4 : 8;
        int depth;

        for (depth = 0, p = r; depth < MAX_IORES_LEVEL; depth++, p = p->parent)
                if (p->parent == root)
                        break;

        if (file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)) {
                start = r->start;
                end = r->end;
        } else {
                start = end = 0;
        }

        seq_printf(m, "%*s%0*llx-%0*llx : %s\n",
                        depth * 2, "",
                        width, start,
                        width, end,
                        r->name ? r->name : "<BAD>");
        return 0;
}

static const struct seq_operations resource_op = {
        .start  = r_start,
        .next   = r_next,
        .stop   = r_stop,
        .show   = r_show,
};

static int ioports_open(struct inode *inode, struct file *file)
{
        int res = seq_open(file, &resource_op);
        if (!res) {
                struct seq_file *m = file->private_data;
                m->private = &ioport_resource;
        }
        return res;
}

static int iomem_open(struct inode *inode, struct file *file)
{
        int res = seq_open(file, &resource_op);
        if (!res) {
                struct seq_file *m = file->private_data;
                m->private = &iomem_resource;
        }
        return res;
}

static const struct file_operations proc_ioports_operations = {
        .open           = ioports_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = seq_release,
};

static const struct file_operations proc_iomem_operations = {
        .open           = iomem_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = seq_release,
};

static int __init ioresources_init(void)
{
        proc_create("ioports", 0, NULL, &proc_ioports_operations);
        proc_create("iomem", 0, NULL, &proc_iomem_operations);
        return 0;
}
__initcall(ioresources_init);

#endif /* CONFIG_PROC_FS */
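
r_show() above prints every entry as two spaces of indentation per nesting level, then "start-end : name" in hexadecimal, and it prints zeros for both addresses when the reader lacks CAP_SYS_ADMIN. The following is a minimal sketch of parsing one such line in user space; struct res_line, parse_res_line and res_size are hypothetical names, and the 63-character name limit is an arbitrary choice for the example:

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/iomem or /proc/ioports (hypothetical helper type). */
struct res_line {
        int depth;                      /* nesting level: leading spaces / 2 */
        unsigned long long start, end;  /* inclusive range, as printed by r_show() */
        char name[64];
};

/* Parse e.g. "  f1c00000-f1cfffff : PCI Bus 0000:03".
 * Returns 0 on success, -1 on malformed input. */
static int parse_res_line(const char *line, struct res_line *out)
{
        int indent = 0;

        while (line[indent] == ' ')
                indent++;
        out->depth = indent / 2;

        /* %63[^\n] keeps spaces inside the name but stops at the end of the line */
        if (sscanf(line + indent, "%llx-%llx : %63[^\n]",
                   &out->start, &out->end, out->name) != 3)
                return -1;
        return out->end >= out->start ? 0 : -1;
}

/* Size of the inclusive range, in bytes (iomem) or ports (ioports). */
static unsigned long long res_size(const struct res_line *r)
{
        return r->end - r->start + 1;
}
```

For the iwlwifi window f1c00000-f1c01fff in the sample this gives depth 2 under the PCI bus entry and a size of 0x2000 bytes. The same parser applies to /proc/ioports, which shares the seq_operations shown above.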

3.16 proc/ioports

Registered I/O port ranges, which are accessed with inb and outb:

~$ cat /proc/ioports
0000-0cf7 : PCI Bus 0000:00
  0000-001f : dma1
  0020-0021 : pic1
  0040-0043 : timer0
  0050-0053 : timer1
  0060-0060 : keyboard
  0061-0061 : PNP0800:00
  0062-0062 : PNP0C09:00
    0062-0062 : EC data
  0064-0064 : keyboard
  0066-0066 : PNP0C09:00
    0066-0066 : EC cmd
  0070-0071 : rtc0
  0080-008f : dma page reg
  00a0-00a1 : pic2
  00c0-00df : dma2
  00f0-00ff : fpu
    00f0-00f0 : PNP0C04:00
  03c0-03df : vga+
  0400-0403 : ACPI PM1a_EVT_BLK
  0404-0405 : ACPI PM1a_CNT_BLK
  0408-040b : ACPI PM_TMR
  0410-0415 : ACPI CPU throttle
  0420-042f : ACPI GPE0_BLK
  0430-0433 : iTCO_wdt
    0430-0433 : iTCO_wdt
  0450-0450 : ACPI PM2_CNT_BLK
  0460-047f : iTCO_wdt
    0460-047f : iTCO_wdt
  0500-057f : pnp 00:01
  0800-080f : pnp 00:01
0cf8-0cff : PCI conf1
0d00-ffff : PCI Bus 0000:00
  15e0-15ef : pnp 00:01
  1600-167f : pnp 00:01
  3000-3fff : PCI Bus 0000:04
  4000-4fff : PCI Bus 0000:02
  5000-503f : 0000:00:02.0
  5060-507f : 0000:00:1f.2
    5060-507f : ahci
  5080-509f : 0000:00:19.0
  50a0-50a7 : 0000:00:1f.2
    50a0-50a7 : ahci
  50a8-50af : 0000:00:1f.2
    50a8-50af : ahci
  50b0-50b7 : 0000:00:16.3
    50b0-50b7 : serial
  50b8-50bb : 0000:00:1f.2
    50b8-50bb : ahci
  50bc-50bf : 0000:00:1f.2
    50bc-50bf : ahci
  efa0-efbf : 0000:00:1f.3
    efa0-efbf : i801_smbus

Kernel code: kernel/resource.c; see the code listed under iomem.

3.17 proc/kallsyms

Symbol definitions exported by the kernel, used by the module(X) tools for linking and binding:

~$ cat /proc/kallsyms
0000000000000000 A irq_stack_union
0000000000000000 A __per_cpu_start
ffffffff810002b8 T _stext
ffffffff81001000 T hypercall_page
ffffffff81001000 T xen_hypercall_set_trap_table
ffffffff81001020 T xen_hypercall_mmu_update
ffffffff81001040 T xen_hypercall_set_gdt
ffffffff81001060 T xen_hypercall_stack_switch
ffffffff81001080 T xen_hypercall_set_callbacks
ffffffff810010a0 T xen_hypercall_fpu_taskswitch
ffffffff810010c0 T xen_hypercall_sched_op_compat
ffffffff810010e0 T xen_hypercall_platform_op
ffffffff81001100 T xen_hypercall_set_debugreg
ffffffff81001120 T xen_hypercall_get_debugreg
ffffffff81001140 T xen_hypercall_update_descriptor
ffffffff81001160 T xen_hypercall_ni
ffffffff81001180 T xen_hypercall_memory_op
...
ffffffff8113fb60 T find_get_entries
ffffffff8113fc90 T find_get_pages
ffffffff8113fdd0 T mempool_kfree
ffffffff8113fde0 T mempool_alloc_slab
ffffffff8113fe00 T mempool_free_slab
ffffffff8113fe20 T mempool_alloc_pages
ffffffff8113fe30 T mempool_free_pages
ffffffff8113fe40 t remove_element.isra.1
ffffffff8113fe60 T mempool_destroy
ffffffff8113fec0 T mempool_alloc
ffffffff81140010 t add_element
ffffffff81140030 T mempool_free
ffffffff811400c0 T mempool_create_node
ffffffff81140200 T mempool_create
ffffffff81140220 T mempool_resize
ffffffff811403c0 T mempool_kmalloc
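
Each line of the listing above has three whitespace-separated fields: a hexadecimal address, a one-letter type, and the symbol name. A minimal parsing sketch follows; struct ksym and both function names are hypothetical, and the upper/lower-case reading of the type letter is assumed to follow the nm(1) convention (uppercase for global, lowercase for local):

```c
#include <ctype.h>
#include <stdio.h>

/* One parsed /proc/kallsyms entry (hypothetical helper type). */
struct ksym {
        unsigned long long addr;
        char type;              /* nm(1)-style type letter, e.g. 'T' or 't' */
        char name[128];
};

/* Parse "<hex address> <type> <name>". Returns 0 on success, -1 otherwise. */
static int parse_kallsyms_line(const char *line, struct ksym *out)
{
        if (sscanf(line, "%llx %c %127s", &out->addr, &out->type, out->name) != 3)
                return -1;
        return 0;
}

/* Assumption: as in nm(1), an uppercase type letter marks a global symbol. */
static int ksym_is_global(const struct ksym *s)
{
        return isupper((unsigned char)s->type) != 0;
}
```

Applied to the sample, mempool_alloc (type T) is reported as global, while remove_element.isra.1 (type t) is local.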

3.18 proc/meminfo

~$ cat /proc/meminfo
MemTotal:       16145220 kB
MemFree:        14074612 kB
MemAvailable:   14689140 kB
Buffers:           88384 kB
Cached:           878512 kB
SwapCached:            0 kB
Active:          1272364 kB
Inactive:         597024 kB
Active(anon):     903844 kB
Inactive(anon):   217676 kB
Active(file):     368520 kB
Inactive(file):   379348 kB
Unevictable:         120 kB
Mlocked:             120 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               100 kB
Writeback:             0 kB
AnonPages:        902604 kB
...
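
Every line of the output above has the form "Key: value kB", so a single field can be picked out with a small line scanner. The sketch below works on text that has already been read into memory; meminfo_kb is a hypothetical name, and -1 is used as the "not found" value on the assumption that real entries are never negative:

```c
#include <stdio.h>
#include <string.h>

/* Look up one "Key:   value kB" entry in meminfo-formatted text.
 * Returns the value in kB, or -1 if the key is absent or malformed. */
static long long meminfo_kb(const char *text, const char *key)
{
        size_t klen = strlen(key);
        const char *p = text;

        while (p && *p) {
                /* a key only matches at the start of a line, directly before ':' */
                if (strncmp(p, key, klen) == 0 && p[klen] == ':') {
                        long long v;

                        if (sscanf(p + klen + 1, "%lld", &v) == 1)
                                return v;
                        return -1;
                }
                p = strchr(p, '\n');    /* advance to the next line */
                if (p)
                        p++;
        }
        return -1;
}
```

Requiring the ':' right after the key keeps "Active" from matching "Active(anon)". In real use the text would come from reading /proc/meminfo into a buffer with fopen() and fread().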

3.19 proc/<pid>

3.19.1 proc/<pid>/oom_adj

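Adjusts how readily the OOM killer picks this process. The valid range is [-17, +15]; -17 is a special value that disables OOM killing for the process. The larger the value, the more likely the process is to be selected when memory runs out. The default is 0, and changing the value requires the CAP_SYS_RESOURCE capability. This interface is deprecated since Linux 2.6.36; use /proc/<pid>/oom_score_adj instead.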

3.19.2 proc/<pid>/oom_score

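Shows the current OOM-killer score of the process; the higher the score, the more likely the OOM killer is to select it. The base score depends on the memory the process uses and is then adjusted for the number of forked children, CPU time, nice value, privilege, and direct hardware access. The man page describes it as follows: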

This file displays the current score that the kernel gives to this process for the purpose of selecting a process for the OOM-killer. A higher score means that the process is more likely to be selected by the OOM-killer. The basis for this score is the amount of memory used by the process, with increases (+) or decreases (-) for factors including:

  • whether the process creates a lot of children using fork(2) (+);
  • whether the process has been running a long time, or has used a lot of CPU time (-);
  • whether the process has a low nice value (i.e., > 0) (+);
  • whether the process is privileged (-); and
  • whether the process is making direct hardware access (-).

The oom_score also reflects the adjustment specified by the oom_score_adj or oom_adj setting for the process.

3.19.3 proc/<pid>/oom_score_adj

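Adjusts the badness score the OOM killer uses when deciding whether to select this process. The valid range is [-1000, +1000]; -1000 means the process is never selected. An important long-running daemon can write -1000 to keep the OOM killer from killing it: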

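#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Write -1000 to /proc/self/oom_score_adj so the OOM killer never selects
 * this process. Returns 0 on success, -1 on failure.
 */
static int disable_oom(void)
{
        FILE *fp = fopen("/proc/self/oom_score_adj", "w");
        int failed;

        if (!fp) {
                fprintf(stderr, "open oom_score_adj failed: %s\n", strerror(errno));
                return -1;
        }

        /* stdio buffers the write, so a rejected value may only surface at fclose() */
        failed = fprintf(fp, "%d", -1000) < 0;
        if (fclose(fp) != 0)
                failed = 1;
        if (failed) {
                fprintf(stderr, "write oom_score_adj failed\n");
                return -1;
        }

        return 0;
}

int main(void)
{
        if (disable_oom() == 0)
                printf("disable oom success\n");

        /* do post work ... */

        return 0;
}

Output when the write is permitted:

disable oom success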
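
To confirm that a setting took effect, the value can be read back from the same file; oom_score and oom_score_adj each hold a single decimal integer. A minimal sketch, with parse_proc_int and read_proc_int as hypothetical helper names:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the single decimal integer that a file such as oom_score_adj holds.
 * Returns 0 on success, -1 if the text is not a clean integer. */
static int parse_proc_int(const char *text, long *out)
{
        char *end;

        errno = 0;
        *out = strtol(text, &end, 10);
        if (errno != 0 || end == text)
                return -1;
        /* accept a trailing newline or end of string, reject anything else */
        return (*end == '\n' || *end == '\0') ? 0 : -1;
}

/* Read an integer from a /proc file, e.g. "/proc/self/oom_score_adj". */
static int read_proc_int(const char *path, long *out)
{
        char buf[32];
        FILE *fp = fopen(path, "r");
        int ok;

        if (!fp)
                return -1;
        ok = fgets(buf, sizeof(buf), fp) != NULL;
        fclose(fp);
        return ok ? parse_proc_int(buf, out) : -1;
}
```

After a successful write of -1000, read_proc_int("/proc/self/oom_score_adj", &v) would be expected to store -1000 in v.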

3.19.4 proc/<pid>/stack

A symbolic trace of the function calls on the process's kernel-mode stack.

~$ cat /proc/self/stack
[<ffffffff810695d9>] do_wait+0x1d9/0x230
[<ffffffff8106a637>] SyS_wait4+0x67/0xe0
[<ffffffff81068430>] child_wait_callback+0x0/0x60
[<ffffffff81514a0d>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff

~$ cat /proc/self/stack
[<ffffffff81021bae>] save_stack_trace_tsk+0x1e/0x40
[<ffffffff81208ebd>] proc_pid_stack+0x8d/0xe0
[<ffffffff81209ac7>] proc_single_show+0x47/0x80
[<ffffffff811ca132>] seq_read+0xe2/0x360
[<ffffffff811a8723>] vfs_read+0x93/0x170
[<ffffffff811a9352>] SyS_read+0x42/0xa0
[<ffffffff81516a28>] page_fault+0x28/0x30
[<ffffffff81514a0d>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff

~$ sudo cat /proc/1/stack
[<ffffffff811e8730>] ep_send_events_proc+0x0/0x1b0
[<ffffffff811e9079>] ep_scan_ready_list.isra.7+0x199/0x1c0
[<ffffffff811e931a>] ep_poll+0x25a/0x340
[<ffffffff810970a0>] default_wake_function+0x0/0x10
[<ffffffff811ea7a4>] SyS_epoll_wait+0xb4/0xe0
[<ffffffff81514a0d>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff
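
Each frame in the traces above reads "[<address>] symbol+offset/size", where offset is the distance of the address from the start of the function and size is the length of the function; the final line of each trace carries only a raw address. A minimal parsing sketch, with struct kframe and parse_stack_frame as hypothetical names:

```c
#include <stdio.h>

/* One parsed /proc/<pid>/stack line (hypothetical helper type). */
struct kframe {
        unsigned long long addr;    /* address recorded in the trace */
        char sym[128];              /* function name, or the raw address text */
        unsigned long long off;     /* offset of addr from the start of sym */
        unsigned long long size;    /* size of the function */
};

/* Returns 0 on success, -1 on malformed input. */
static int parse_stack_frame(const char *line, struct kframe *out)
{
        int n;

        out->off = out->size = 0;
        n = sscanf(line, "[<%llx>] %127[^+\n]+%llx/%llx",
                   &out->addr, out->sym, &out->off, &out->size);
        /* 4 fields: symbolized frame; 2 fields: terminator line without +off/size */
        return (n == 4 || n == 2) ? 0 : -1;
}
```

For the first sample frame, addr minus off gives ffffffff81069400, the start of do_wait.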

3.19.5 proc/<pid>/stat

Process status information, used by ps. The corresponding code is in fs/proc/array.c; the relevant excerpt:

/*
 * The task state array is a strange "bitmap" of
 * reasons to sleep. Thus "running" is zero, and
 * you can test for combinations of others with
 * simple bit tests.
 */
static const char * const task_state_array[] = {

        /* states in TASK_REPORT: */
        "R (running)",          /* 0x00 */
        "S (sleeping)",         /* 0x01 */
        "D (disk sleep)",       /* 0x02 */
        "T (stopped)",          /* 0x04 */
        "t (tracing stop)",     /* 0x08 */
        "X (dead)",             /* 0x10 */
        "Z (zombie)",           /* 0x20 */
        "P (parked)",           /* 0x40 */

        /* states beyond TASK_REPORT: */
        "I (idle)",             /* 0x80 */
};

The man page describes each field in detail:

~$ cat /proc/self/stat
1897 (bash) S 1892 1897 1897 34817 6762 4202496 76561 174332 1 115 113 36 134 64 20 0 1 0 3871 31305728 1893 18446744073709551615 4194304 5184116 140734690585584 140734690584264 140388838861628 0 65536 3670020 1266777851 0 0 0 17 2 0 0 2 0 0 7282144 7319112 11919360 140734690589286 140734690589292 140734690589292 140734690590702 0

/proc/[pid]/stat Status information about the process. This is used by ps(1). It is defined in the kernel source file fs/proc/array.c.

The fields, in order, with their proper scanf(3) format specifiers, are:

(1) pid %d The process ID.

(2) comm %s The filename of the executable, in parentheses. This is visible whether or not the executable is swapped out.

(3) state %c One of the following characters, indicating process state:

R Running

S Sleeping in an interruptible wait

D Waiting in uninterruptible disk sleep

Z Zombie

T Stopped (on a signal) or (before Linux 2.6.33) trace stopped

t Tracing stop (Linux 2.6.33 onward)

W Paging (only before Linux 2.6.0)

X Dead (from Linux 2.6.0 onward)

x Dead (Linux 2.6.33 to 3.13 only)

K Wakekill (Linux 2.6.33 to 3.13 only)

W Waking (Linux 2.6.33 to 3.13 only)

P Parked (Linux 3.9 to 3.13 only)

(4) ppid %d The PID of the parent of this process.

(5) pgrp %d The process group ID of the process.

(6) session %d The session ID of the process.

(7) tty_nr %d The controlling terminal of the process. (The minor device number is contained in the combination of bits 31 to 20 and 7 to 0; the major device number is in bits 15 to 8.)

(8) tpgid %d The ID of the foreground process group of the controlling terminal of the process.

(9) flags %u The kernel flags word of the process. For bit meanings, see the PF_* defines in the Linux kernel source file include/linux/sched.h. Details depend on the kernel version.

The format for this field was %lu before Linux 2.6.

(10) minflt %lu The number of minor faults the process has made which have not required loading a memory page from disk.

(11) cminflt %lu The number of minor faults that the process's waited-for children have made.

(12) majflt %lu The number of major faults the process has made which have required loading a memory page from disk.

(13) cmajflt %lu The number of major faults that the process's waited-for children have made.

(14) utime %lu Amount of time that this process has been scheduled in user mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)). This includes guest time, guest_time (time spent running a virtual CPU, see below), so that applications that are not aware of the guest time field do not lose that time from their calculations.

(15) stime %lu Amount of time that this process has been scheduled in kernel mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).

(16) cutime %ld Amount of time that this process's waited-for children have been scheduled in user mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)). (See also times(2).) This includes guest time, cguest_time (time spent running a virtual CPU, see below).

(17) cstime %ld Amount of time that this process's waited-for children have been scheduled in kernel mode, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).

(18) priority %ld (Explanation for Linux 2.6) For processes running a real-time scheduling policy (policy below; see sched_setscheduler(2)), this is the negated scheduling priority, minus one; that is, a number in the range -2 to -100, corresponding to real-time priorities 1 to 99. For processes running under a non-real-time scheduling policy, this is the raw nice value (setpriority(2)) as represented in the kernel. The kernel stores nice values as numbers in the range 0 (high) to 39 (low), corresponding to the user-visible nice range of -20 to 19.

Before Linux 2.6, this was a scaled value based on the scheduler weighting given to this process.

(19) nice %ld The nice value (see setpriority(2)), a value in the range 19 (low priority) to -20 (high priority).

(20) num_threads %ld Number of threads in this process (since Linux 2.6). Before kernel 2.6, this field was hard coded to 0 as a placeholder for an earlier removed field.

(21) itrealvalue %ld The time in jiffies before the next SIGALRM is sent to the process due to an interval timer. Since kernel 2.6.17, this field is no longer maintained, and is hard coded as 0.

(22) starttime %llu The time the process started after system boot. In kernels before Linux 2.6, this value was expressed in jiffies. Since Linux 2.6, the value is expressed in clock ticks (divide by sysconf(_SC_CLK_TCK)).

The format for this field was %lu before Linux 2.6.

(23) vsize %lu Virtual memory size in bytes.

(24) rss %ld Resident Set Size: number of pages the process has in real memory. This is just the pages which count toward text, data, or stack space. This does not include pages which have not been demand-loaded in, or which are swapped out.

(25) rsslim %lu Current soft limit in bytes on the rss of the process; see the description of RLIMIT_RSS in getrlimit(2).

(26) startcode %lu The address above which program text can run.

(27) endcode %lu The address below which program text can run.

(28) startstack %lu The address of the start (i.e., bottom) of the stack.

(29) kstkesp %lu The current value of ESP (stack pointer), as found in the kernel stack page for the process.

(30) kstkeip %lu The current EIP (instruction pointer).

(31) signal %lu The bitmap of pending signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.

(32) blocked %lu The bitmap of blocked signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.

(33) sigignore %lu The bitmap of ignored signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.

(34) sigcatch %lu The bitmap of caught signals, displayed as a decimal number. Obsolete, because it does not provide information on real-time signals; use /proc/[pid]/status instead.

(35) wchan %lu This is the "channel" in which the process is waiting. It is the address of a location in the kernel where the process is sleeping. The corresponding symbolic name can be found in /proc/[pid]/wchan.

(36) nswap %lu Number of pages swapped (not maintained).

(37) cnswap %lu Cumulative nswap for child processes (not maintained).

(38) exit_signal %d (since Linux 2.1.22) Signal to be sent to parent when we die.

(39) processor %d (since Linux 2.2.8) CPU number last executed on.

(40) rt_priority %u (since Linux 2.5.19) Real-time scheduling priority, a number in the range 1 to 99 for processes scheduled under a real-time policy, or 0, for non-real-time processes (see sched_setscheduler(2)).

(41) policy %u (since Linux 2.5.19) Scheduling policy (see sched_setscheduler(2)). Decode using the SCHED_* constants in linux/sched.h.

The format for this field was %lu before Linux 2.6.22.

(42) delayacct_blkio_ticks %llu (since Linux 2.6.18) Aggregated block I/O delays, measured in clock ticks (centiseconds).

(43) guest_time %lu (since Linux 2.6.24) Guest time of the process (time spent running a virtual CPU for a guest operating system), measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).

(44) cguest_time %ld (since Linux 2.6.24) Guest time of the process's children, measured in clock ticks (divide by sysconf(_SC_CLK_TCK)).

(45) start_data %lu (since Linux 3.3) Address above which program initialized and uninitialized (BSS) data are placed.

(46) end_data %lu (since Linux 3.3) Address below which program initialized and uninitialized (BSS) data are placed.

(47) start_brk %lu (since Linux 3.3) Address above which program heap can be expanded with brk(2).

(48) arg_start %lu (since Linux 3.5) Address above which program command-line arguments (argv) are placed.

(49) arg_end %lu (since Linux 3.5) Address below which program command-line arguments (argv) are placed.

(50) env_start %lu (since Linux 3.5) Address above which program environment is placed.

(51) env_end %lu (since Linux 3.5) Address below which program environment is placed.

(52) exit_code %d (since Linux 3.5) The thread's exit status in the form reported by waitpid(2).
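
The one pitfall when parsing this line is field (2): comm is wrapped in parentheses but may itself contain spaces or parentheses, so splitting on whitespace alone is unreliable. A common approach is to locate the last ')' and parse the numeric fields after it. The sketch below takes that approach and extracts only a handful of fields; struct proc_stat and parse_proc_stat are hypothetical names, and the 63-character comm buffer is an arbitrary choice:

```c
#include <stdio.h>
#include <string.h>

/* A few fields of /proc/[pid]/stat (hypothetical helper type). */
struct proc_stat {
        int pid;                        /* (1)  */
        char comm[64];                  /* (2), without the parentheses */
        char state;                     /* (3)  */
        int ppid;                       /* (4)  */
        unsigned long utime, stime;     /* (14), (15), in clock ticks */
        long num_threads;               /* (20) */
        unsigned long vsize;            /* (23), in bytes */
        long rss;                       /* (24), in pages */
};

/* Returns 0 on success, -1 on malformed input. */
static int parse_proc_stat(const char *line, struct proc_stat *out)
{
        const char *open = strchr(line, '(');
        const char *close = strrchr(line, ')');
        size_t len;

        if (!open || !close || close < open)
                return -1;
        if (sscanf(line, "%d", &out->pid) != 1)
                return -1;

        /* comm may contain spaces or ')', so take everything up to the LAST ')' */
        len = (size_t)(close - open - 1);
        if (len >= sizeof(out->comm))
                len = sizeof(out->comm) - 1;
        memcpy(out->comm, open + 1, len);
        out->comm[len] = '\0';

        /* fields 3..24; %*s skips the ones this sketch does not keep */
        if (sscanf(close + 1,
                   " %c %d %*s %*s %*s %*s %*s %*s %*s %*s %*s"
                   " %lu %lu %*s %*s %*s %*s %ld %*s %*s %lu %ld",
                   &out->state, &out->ppid, &out->utime, &out->stime,
                   &out->num_threads, &out->vsize, &out->rss) != 7)
                return -1;
        return 0;
}
```

For the sample line this yields pid 1897, comm "bash", state S, utime 113 and stime 36 ticks. Dividing the tick counts by sysconf(_SC_CLK_TCK) converts them to seconds, as the field descriptions note.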

3.19.6 proc/<pid>/statm

Memory usage of the process, measured in pages.

~$ cat statm
7643 1893 850 242 0 1040 0

/proc/[pid]/statm Provides information about memory usage, measured in pages. The columns are:

size       (1) total program size (same as VmSize in /proc/[pid]/status)
resident   (2) resident set size (same as VmRSS in /proc/[pid]/status)
share      (3) shared pages (i.e., backed by a file)
text       (4) text (code)
lib        (5) library (unused in Linux 2.6)
data       (6) data + stack
dt         (7) dirty pages (unused in Linux 2.6)
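
Because the columns are page counts, they have to be multiplied by the page size before they can be compared with byte or kB values elsewhere in /proc. A minimal sketch; struct statm, parse_statm and pages_to_kib are hypothetical names, and the page size is passed in explicitly (in real use it would come from sysconf(_SC_PAGESIZE)):

```c
#include <stdio.h>

/* The seven columns of /proc/[pid]/statm, all in pages (hypothetical type). */
struct statm {
        unsigned long size, resident, share, text, lib, data, dt;
};

/* Returns 0 on success, -1 if fewer than seven numbers were found. */
static int parse_statm(const char *line, struct statm *out)
{
        int n = sscanf(line, "%lu %lu %lu %lu %lu %lu %lu",
                       &out->size, &out->resident, &out->share, &out->text,
                       &out->lib, &out->data, &out->dt);

        return n == 7 ? 0 : -1;
}

/* Convert a page count to KiB for a page size given in bytes. */
static unsigned long pages_to_kib(unsigned long pages, long page_size)
{
        return pages * ((unsigned long)page_size / 1024);
}
```

Assuming 4096-byte pages, the sample's size of 7643 pages is 30572 KiB, that is 31305728 bytes, which agrees with the vsize field in the stat sample above.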

3.19.7 proc/<pid>/status

Provides more process information in a human-readable form:

~$ cat /proc/self/status
Name:   cat
State:  R (running)
Tgid:   6955
Ngid:   0
Pid:    6955
PPid:   1897
TracerPid:      0
Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000
FDSize: 256
Groups: 24 25 29 30 44 46 108 110 113 118 1000
VmPeak:    11012 kB
VmSize:    11012 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:       712 kB
VmRSS:       712 kB
VmData:      324 kB
VmStk:       136 kB
VmExe:        48 kB
VmLib:      1796 kB
VmPTE:        40 kB
VmSwap:        0 kB
Threads:        1
SigQ:   0/63000
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
Seccomp:        0
Cpus_allowed:   ff
Cpus_allowed_list:      0-7
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        0
nonvoluntary_ctxt_switches:     1
  • Name: Command run by this process.
  • State: Current state of the process. One of "R (running)", "S (sleeping)", "D (disk sleep)", "T (stopped)", "t (tracing stop)", "Z (zombie)", or "X (dead)".
  • Tgid: Thread group ID (i.e., Process ID).
  • Pid: Thread ID (see gettid(2)).
  • PPid: PID of parent process.
  • TracerPid: PID of process tracing this process (0 if not being traced).
  • Uid, Gid: Real, effective, saved set, and filesystem UIDs (GIDs).
  • FDSize: Number of file descriptor slots currently allocated.
  • Groups: Supplementary group list.
  • VmPeak: Peak virtual memory size.
  • VmSize: Virtual memory size.
  • VmLck: Locked memory size (see mlock(3)).
  • VmHWM: Peak resident set size ("high water mark").
  • VmRSS: Resident set size.
  • VmData, VmStk, VmExe: Size of data, stack, and text segments.
  • VmLib: Shared library code size.
  • VmPTE: Page table entries size (since Linux 2.6.10).
  • Threads: Number of threads in process containing this thread.
  • SigQ: This field contains two slash-separated numbers that relate to queued signals for the real user ID of this process. The first of these is the number of currently queued signals for this real user ID, and the second is the resource limit on the number of queued signals for this process (see the description of RLIMIT_SIGPENDING in getrlimit(2)).
  • SigPnd, ShdPnd: Number of signals pending for thread and for process as a whole (see pthreads(7) and signal(7)).
  • SigBlk, SigIgn, SigCgt: Masks indicating signals being blocked, ignored, and caught (see signal(7)).
  • CapInh, CapPrm, CapEff: Masks of capabilities enabled in inheritable, permitted, and effective sets (see capabilities(7)).
  • CapBnd: Capability Bounding set (since Linux 2.6.26, see capabilities(7)).
  • Cpus_allowed: Mask of CPUs on which this process may run (since Linux 2.6.24, see cpuset(7)).
  • Cpus_allowed_list: Same as previous, but in "list format" (since Linux 2.6.26, see cpuset(7)).
  • Mems_allowed: Mask of memory nodes allowed to this process (since Linux 2.6.24, see cpuset(7)).
  • Mems_allowed_list: Same as previous, but in "list format" (since Linux 2.6.26, see cpuset(7)).
  • voluntary_ctxt_switches, nonvoluntary_ctxt_switches: Number of voluntary and involuntary context switches (since Linux 2.6.23).
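
The SigBlk, SigIgn and SigCgt values are hexadecimal bit masks in which bit n-1 stands for signal number n. The following sketch extracts one mask from status-formatted text and tests a signal against it; status_hex_field and mask_has_signal are hypothetical names:

```c
#include <stdio.h>
#include <string.h>

/* Find "Key:" at the start of a line and parse the hex value after it.
 * Returns 0 on success, -1 if the key is absent or malformed. */
static int status_hex_field(const char *text, const char *key,
                            unsigned long long *out)
{
        size_t klen = strlen(key);
        const char *p = text;

        while (p && *p) {
                if (strncmp(p, key, klen) == 0 && p[klen] == ':')
                        return sscanf(p + klen + 1, "%llx", out) == 1 ? 0 : -1;
                p = strchr(p, '\n');    /* advance to the next line */
                if (p)
                        p++;
        }
        return -1;
}

/* Signal n is represented by bit n-1 of the mask. */
static int mask_has_signal(unsigned long long mask, int sig)
{
        if (sig < 1 || sig > 64)
                return 0;
        return (int)((mask >> (sig - 1)) & 1ULL);
}
```

A SigBlk value of 0000000000010000, for example, has bit 16 set and therefore means that signal 17 is blocked.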

3.19.8 proc/<pid>/syscall

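Information about the system call the process is currently executing. The first field is the system call number, followed by the six argument registers, then the stack pointer and the program counter.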

~$ cat /proc/self/syscall
0 0x3 0x7f065d1ad000 0x20000 0x7ffcb1120dd0 0xffffffff 0x0 0x7ffcb1120f70 0x7f065cce0ba0
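
Per proc(5), the file can take three shapes: the nine-field line shown above, "-1" followed by only the stack pointer and program counter when the process is blocked outside a system call, or the single word "running". The sketch below handles all three; struct proc_syscall and parse_proc_syscall are hypothetical names:

```c
#include <stdio.h>
#include <string.h>

/* Parsed contents of /proc/<pid>/syscall (hypothetical helper type). */
struct proc_syscall {
        int in_syscall;                 /* 1 if a system call is in progress */
        long nr;                        /* system call number, or -1 */
        unsigned long long args[6];     /* the six argument registers */
        unsigned long long sp, pc;      /* stack pointer, program counter */
};

/* Returns 0 on success, 1 if the file says "running", -1 on malformed input. */
static int parse_proc_syscall(const char *line, struct proc_syscall *out)
{
        int n;

        memset(out, 0, sizeof(*out));
        if (strncmp(line, "running", 7) == 0)
                return 1;

        n = sscanf(line, "%ld %llx %llx %llx %llx %llx %llx %llx %llx",
                   &out->nr, &out->args[0], &out->args[1], &out->args[2],
                   &out->args[3], &out->args[4], &out->args[5],
                   &out->sp, &out->pc);
        if (n == 9) {
                out->in_syscall = 1;
                return 0;
        }
        /* blocked outside a system call: "-1 <sp> <pc>" */
        if (n == 3 && out->nr == -1) {
                out->sp = out->args[0];
                out->pc = out->args[1];
                out->args[0] = out->args[1] = 0;
                return 0;
        }
        return -1;
}
```

Read this way, the sample shows system call 0 with arguments 0x3, 0x7f065d1ad000 and 0x20000. Assuming an x86-64 system, where number 0 is read(2), that is cat reading up to 128 KiB from file descriptor 3.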

3.20 proc/self

proc/self is a symbolic link that always points to the /proc/<pid> directory of the process accessing it.

~$ ls -ld /proc/self
lrwxrwxrwx 1 root root 0 Dec 20 18:47 /proc/self -> 2289
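
Since the link target is simply the PID, a process can resolve /proc/self with readlink(2) and compare the result with getpid(2). A minimal sketch, with proc_link_pid as a hypothetical helper name; it returns -1 when the link cannot be read, for instance if /proc is not mounted:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Resolve a symlink such as /proc/self to the PID it names.
 * Returns -1 if the link cannot be read or is not a plain number. */
static long proc_link_pid(const char *path)
{
        char buf[32];
        char *end;
        long pid;
        ssize_t n = readlink(path, buf, sizeof(buf) - 1);

        if (n <= 0)
                return -1;
        buf[n] = '\0';          /* readlink() does not NUL-terminate */
        pid = strtol(buf, &end, 10);
        return (end != buf && *end == '\0') ? pid : -1;
}
```

Called as proc_link_pid("/proc/self"), the result is expected to equal getpid() in the calling process, which is why ls in the listing above sees its own PID rather than the shell's.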

3.21 oom_score


4 Kernel implementation

5 References

IBM developerWorks: procfs, seq_file, debugfs and relayfs
https://www.ibm.com/developerworks/cn/linux/l-kerns-usrs2/
LWN Driver porting: The seq_file interface
https://lwn.net/Articles/22355/