Documentation for /proc/sys/vm/*        Kernel version 2.4.28
=============================================================

 (c) 1998, 1999, Rik van Riel
     - Initial version

 (c) 2004, Marc-Christian Petersen
     - Removed non-existent knobs which were removed in early
       2.4 stages
     - Corrected values for bdflush
     - Documented missing tunables
     - Documented aa-vm tunables

For general info and legal blurb, please look in README.

=============================================================

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel v2.4.28.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel, and
three of the files (bdflush, max-readahead, min-readahead)
also have some influence on disk usage.

Default values and initialization routines for most of these
files can be found in mm/vmscan.c, mm/page_alloc.c and
mm/filemap.c.

Currently, these files are in /proc/sys/vm:
- bdflush
- block_dump
- kswapd
- laptop_mode
- max-readahead
- min-readahead
- max_map_count
- overcommit_memory
- page-cluster
- pagetable_cache
- vm_anon_lru
- vm_cache_scan_ratio
- vm_gfp_debug
- vm_lru_balance_ratio
- vm_mapped_ratio
- vm_passes
- vm_vfs_scan_ratio

=============================================================

bdflush:
--------
This file controls the operation of the bdflush kernel
daemon. The source code to this struct can be found in
fs/buffer.c. It currently contains 9 integer values,
of which 6 are actually used by the kernel.

nfract:
The first parameter governs the maximum percentage of the
buffer cache that may be dirty. Dirty means that the
contents of the buffer still have to be written to disk (as
opposed to a clean buffer, which can just be forgotten
about). Setting this to a high value means that Linux can
delay disk writes for a long time, but it also means that it
will have to do a lot of I/O at once when memory becomes
short.
A low value will spread out disk I/O more evenly, at the
cost of more frequent I/O operations. The default value is
30%, the minimum is 0%, and the maximum is 100%.

ndirty:
The second parameter (ndirty) gives the maximum number of
dirty buffers that bdflush can write to the disk at one
time. A high value will mean delayed, bursty I/O, while a
small value can lead to memory shortage when bdflush isn't
woken up often enough. The default value is 500 dirty
buffers, the minimum is 1, and the maximum is 50000.

dummy2:
The third parameter is not used.

dummy3:
The fourth parameter is not used.

interval:
The fifth parameter, interval, is the minimum rate at which
kupdate will wake and flush. The value is in jiffies (clock
ticks); the number of jiffies per second is normally 100
(Alpha is 1024). Thus, x*HZ is x seconds. The default value
is 5 seconds, the minimum is 0 seconds, and the maximum is
10,000 seconds.

age_buffer:
The sixth parameter, age_buffer, governs the maximum time
Linux waits before writing out a dirty buffer to disk. The
value is in jiffies. The default value is 30 seconds, the
minimum is 1 second, and the maximum is 10,000 seconds.

sync:
The seventh parameter, nfract_sync, governs the percentage
of buffer cache that is dirty before bdflush activates
synchronously. This can be viewed as the hard limit before
bdflush forces buffers to disk. The default is 60%, the
minimum is 0%, and the maximum is 100%.

stop_bdflush:
The eighth parameter, nfract_stop_bdflush, governs the
percentage of buffer cache that is dirty which will stop
bdflush. The default is 20%, the minimum is 0%, and the
maximum is 100%.

dummy5:
The ninth parameter is not used.

So the default is: 30 500 0 0 500 3000 60 20 0 for 100 HZ.

=============================================================

block_dump:
-----------
It can happen that the disk still keeps spinning up and you
don't quite know why or what causes it. The laptop mode patch
has a little helper for that as well.
When set to 1, it will dump info to the kernel message
buffer about what process caused the I/O. Be careful when
playing with this setting. It is advisable to shut down
syslog first! The default is 0.

=============================================================

kswapd:
-------
Kswapd is the kernel swapout daemon. That is, kswapd is that
piece of the kernel that frees memory when it gets fragmented
or full. Since every system is different, you'll probably
want some control over this piece of the system.

The numbers in this page correspond to the numbers in the
struct pager_daemon {tries_base, tries_min, swap_cluster};
The tries_base and swap_cluster probably have the largest
influence on system performance.

tries_base      The maximum number of pages kswapd tries to
                free in one round is calculated from this
                number. Usually this number will be divided
                by 4 or 8 (see mm/vmscan.c), so it isn't as
                big as it looks.
                When you need to increase the bandwidth
                to/from swap, you'll want to increase this
                number.
tries_min       This is the minimum number of times kswapd
                tries to free a page each time it is called.
                Basically it's just there to make sure that
                kswapd frees some pages even when it's being
                called with minimum priority.
swap_cluster    This is the number of pages kswapd writes in
                one turn. You want this large so that kswapd
                does its I/O in large chunks and the disk
                doesn't have to seek often, but you don't
                want it to be too large since that would
                flood the request queue.

The default value is: 512 32 8.

=============================================================

laptop_mode:
------------
Setting this to 1 switches the vm (and block layer) to laptop
mode. Leaving it at 0 makes the kernel work like before. When
in laptop mode, you also want to extend the intervals
described in Documentation/laptop-mode.txt.
See the laptop-mode.sh script for how to do that.

The default value is 0.

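The interval and age_buffer fields of bdflush, described
earlier in this file, are stored in jiffies, so the same nine
numbers mean different times on machines with a different HZ.
The Python sketch below splits a bdflush line into named
fields and converts those two fields to seconds. The field
names follow the parameter labels used in this file; the
helper name parse_bdflush and its hz argument are made up for
illustration and are not a kernel interface.

```python
# Sketch only: parse_bdflush() is a hypothetical helper, not a kernel
# or library interface. Field names follow the labels in this file.
BDFLUSH_FIELDS = (
    "nfract", "ndirty", "dummy2", "dummy3", "interval",
    "age_buffer", "nfract_sync", "nfract_stop_bdflush", "dummy5",
)


def parse_bdflush(line, hz=100):
    """Split a bdflush line into named fields and add values in seconds."""
    values = [int(v) for v in line.split()]
    if len(values) != len(BDFLUSH_FIELDS):
        raise ValueError("expected 9 integers, got %d" % len(values))
    fields = dict(zip(BDFLUSH_FIELDS, values))
    # interval and age_buffer are in jiffies; hz jiffies make one second.
    fields["interval_seconds"] = fields["interval"] / hz
    fields["age_buffer_seconds"] = fields["age_buffer"] / hz
    return fields


# The default line quoted in this file for 100 HZ.
defaults = parse_bdflush("30 500 0 0 500 3000 60 20 0", hz=100)
print(defaults["interval_seconds"], defaults["age_buffer_seconds"])
```

With the defaults quoted above and hz=100, the two converted
values match the 5 second and 30 second defaults given in the
bdflush section.
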
=============================================================

max-readahead:
--------------
This tunable affects how early the Linux VFS will fetch the
next block of a file from disk. File readahead values are
determined on a per file basis in the VFS and are adjusted
based on the behavior of the application accessing the file.
Anytime the current position being read in a file plus the
current read ahead value results in the file pointer pointing
to the next block in the file, that block will be fetched
from disk. By raising this value, the Linux kernel will allow
the readahead value to grow larger, resulting in more blocks
being prefetched from disk for applications which predictably
access files in a uniform linear fashion. This can result in
performance improvements, but can also result in excess (and
often unnecessary) memory usage. Lowering this value has the
opposite effect. By forcing readaheads to be less aggressive,
memory may be conserved at a potential performance impact.

The default value is 31.

=============================================================

min-readahead:
--------------
Like max-readahead, min-readahead places a floor on the
readahead value. Raising this number forces a file's
readahead value to be unconditionally higher, which can bring
about performance improvements, provided that all file access
in the system is predictably linear from the start to the end
of a file. This of course results in higher memory usage from
the pagecache. Conversely, lowering this value allows the
kernel to conserve pagecache memory, at a potential
performance cost.

The default value is 3.

=============================================================

max_map_count:
--------------
This file contains the maximum number of memory map areas a
process may have. Memory map areas are used as a side-effect
of calling malloc, directly by mmap and mprotect, and also
when loading shared libraries.
While most applications need less than a thousand maps,
certain programs, particularly malloc debuggers, may consume
lots of them, e.g. up to one or two maps per allocation.

The default value is 65536.

=============================================================

overcommit_memory:
------------------
This value contains a flag to enable memory overcommitment.
When this flag is 0, the kernel checks before each malloc()
to see if there's enough memory left. If the flag is nonzero,
the system pretends there's always enough memory.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it. The default value is 0.

Look at: mm/mmap.c::vm_enough_memory() for more information.

=============================================================

page-cluster:
-------------
The Linux VM subsystem avoids excessive disk seeks by reading
multiple pages on a page fault. The number of pages it reads
is dependent on the amount of memory in your machine.

The number of pages the kernel reads in at once is equal to
2 ^ page-cluster. Values above 2 ^ 5 don't make much sense
for swap because we only cluster swap data in 32-page groups.

=============================================================

pagetable_cache:
----------------
The kernel keeps a number of page tables in a per-processor
cache (this helps a lot on SMP systems). The cache size for
each processor will be between the low and the high value.

On a low-memory, single CPU system you can safely set these
values to 0 so you don't waste the memory. On SMP systems it
is used so that the system can do fast pagetable allocations
without having to acquire the kernel memory lock.

For large systems, the settings are probably OK. For normal
systems they won't hurt a bit. For small systems (<16MB ram)
it might be advantageous to set both values to 0.

The default value is: 25 50.

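The 2 ^ page-cluster rule from the page-cluster section can
be tabulated with a few lines of Python. This is only a
sketch: the helper name pages_per_fault is made up for
illustration, and the 4096-byte page size is an assumption
that holds on some architectures but not all.

```python
# Sketch only: PAGE_SIZE is an assumption (4096 bytes); the real
# page size depends on the architecture.
PAGE_SIZE = 4096


def pages_per_fault(page_cluster):
    """Number of pages read in at once: 2 ^ page-cluster."""
    if page_cluster < 0:
        raise ValueError("page-cluster must not be negative")
    return 1 << page_cluster


# Values up to 5 are the ones this file describes as sensible for swap.
for value in range(6):
    pages = pages_per_fault(value)
    print("page-cluster=%d -> %d pages (%d KiB)"
          % (value, pages, pages * PAGE_SIZE // 1024))
```

A page-cluster of 5 gives 32 pages, which is the size of the
swap clustering group mentioned above.
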
=============================================================

vm_anon_lru:
------------
selects whether to immediately insert anon pages in the lru.
Immediately means as soon as they're allocated during the
page faults. If this is set to 0, they're inserted only after
the first swapout.

Having anon pages immediately inserted in the lru allows the
VM to know better when it's worthwhile to start swapping
anonymous ram; it will start to swap earlier and it should
swap smoother and faster, but it will decrease scalability on
>16-way machines by an order of magnitude. Big SMP/NUMA
definitely can't take a hit on a global spinlock at every
anon page allocation.

Low ram machines that swap all the time want to turn this on
(i.e. set to 1).

The default value is 0.

=============================================================

vm_cache_scan_ratio:
--------------------
is how much of the inactive LRU queue we will scan in one go.
A value of 6 for vm_cache_scan_ratio implies that we'll scan
1/6 of the inactive lists during a normal aging round.

The default value is 6.

=============================================================

vm_gfp_debug:
-------------
is when __alloc_pages fails, dump us a stack. This will
mostly happen during OOM conditions (hopefully ;)

The default value is 0.

=============================================================

vm_lru_balance_ratio:
---------------------
controls the balance between active and inactive cache. The
bigger vm_lru_balance_ratio is, the easier the active cache
will grow, because we'll rotate the active list slowly. A
value of 2 means we'll go towards a balance of 1/3 of the
cache being inactive.

The default value is 2.

=============================================================

vm_mapped_ratio:
----------------
controls the pageout rate; the smaller, the earlier we'll
start to pageout.

The default value is 100.

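The two ratios described above, vm_cache_scan_ratio and
vm_lru_balance_ratio, are both divisors, which is easy to
misread. The Python sketch below works through the documented
cases. The function names are made up for illustration. Note
that this file only states the balance for a value of 2 (1/3
inactive); the 1/(ratio + 1) formula used for other values is
an assumption that reproduces that one documented case and
has not been checked against mm/vmscan.c.

```python
from fractions import Fraction


def scanned_per_round(inactive_pages, vm_cache_scan_ratio=6):
    """Inactive pages scanned in one normal aging round (1/ratio)."""
    return inactive_pages // vm_cache_scan_ratio


def inactive_target(vm_lru_balance_ratio=2):
    """Fraction of the cache the VM steers towards being inactive.

    ASSUMPTION: 1/(ratio + 1) is a guess that matches the single
    documented case (a ratio of 2 gives 1/3); other values are
    not confirmed by this file.
    """
    return Fraction(1, vm_lru_balance_ratio + 1)


print(scanned_per_round(60000))   # pages scanned with the default ratio
print(inactive_target(2))         # the documented case
```
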
=============================================================

vm_passes:
----------
is the number of vm passes before failing the memory
balancing. Take into account 3 passes are needed for a
flush/wait/free cycle and that we only scan
1/vm_cache_scan_ratio of the inactive list at each pass.

The default value is 60.

=============================================================

vm_vfs_scan_ratio:
------------------
is what proportion of the VFS queues we will scan in one go.
A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
unused-inode, dentry and dquot caches will be freed during a
normal aging round.
Big fileservers (NFS, SMB etc.) probably want to set this
value to 3 or 2.

The default value is 6.

=============================================================
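The vm_passes description combines two numbers: 3 passes per
flush/wait/free cycle, and 1/vm_cache_scan_ratio of the
inactive list per pass. The Python sketch below does that
arithmetic for the defaults. The function names are made up
for illustration, and the sweep count assumes that successive
passes cover different slices of the list, which this file
does not state.

```python
def flush_cycles(vm_passes=60, passes_per_cycle=3):
    """Complete flush/wait/free cycles that fit into vm_passes."""
    return vm_passes // passes_per_cycle


def inactive_list_sweeps(vm_passes=60, vm_cache_scan_ratio=6):
    """Times the whole inactive list could be covered before failing.

    ASSUMPTION: each pass scans a distinct 1/ratio slice of the list.
    """
    return vm_passes / vm_cache_scan_ratio


print(flush_cycles())           # cycles with the default values
print(inactive_list_sweeps())   # sweeps with the default values
```

With the defaults this gives 20 cycles and, under the stated
assumption, 10 sweeps of the inactive list.
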