Documentation for /proc/sys/vm/*	Kernel version 2.4.28
=============================================================

 (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
    - Initial version

 (c) 2004, Marc-Christian Petersen <m.c.p@linux-systeme.com>
    - Removed non-existent knobs which were removed in early
      2.4 stages
    - Corrected values for bdflush
    - Documented missing tunables
    - Documented aa-vm tunables


For general info and legal blurb, please look in README.
=============================================================

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel v2.4.28.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel, and
three of the files (bdflush, max-readahead, min-readahead)
also have some influence on disk usage.

Default values and initialization routines for most of these
files can be found in mm/vmscan.c, mm/page_alloc.c and
mm/filemap.c.

Currently, these files are in /proc/sys/vm:
- bdflush
- block_dump
- kswapd
- laptop_mode
- max-readahead
- min-readahead
- max_map_count
- overcommit_memory
- page-cluster
- pagetable_cache
- vm_anon_lru
- vm_cache_scan_ratio
- vm_gfp_debug
- vm_lru_balance_ratio
- vm_mapped_ratio
- vm_passes
- vm_vfs_scan_ratio
=============================================================


bdflush:
--------
This file controls the operation of the bdflush kernel
daemon. The source code to this struct can be found in
fs/buffer.c. It currently contains 9 integer values,
of which 6 are actually used by the kernel.

nfract:		The first parameter governs the maximum
		number of dirty buffers in the buffer
		cache. Dirty means that the contents of the
		buffer still have to be written to disk (as
		opposed to a clean buffer, which can just be
		forgotten about). Setting this to a high
		value means that Linux can delay disk writes
		for a long time, but it also means that it
		will have to do a lot of I/O at once when
		memory becomes short. A low value will
		spread out disk I/O more evenly, at the cost
		of more frequent I/O operations. The default
		value is 30%, the minimum is 0%, and the
		maximum is 100%.

ndirty:		The second parameter (ndirty) gives the
		maximum number of dirty buffers that bdflush
		can write to the disk in one time. A high
		value will mean delayed, bursty I/O, while a
		small value can lead to memory shortage when
		bdflush isn't woken up often enough. The
		default value is 500 dirty buffers, the
		minimum is 1, and the maximum is 50000.

dummy2:		The third parameter is not used.

dummy3:		The fourth parameter is not used.

interval:	The fifth parameter, interval, is the minimum
		rate at which kupdate will wake and flush.
		The value is in jiffies (clockticks), the
		number of jiffies per second is normally 100
		(Alpha is 1024). Thus, x*HZ is x seconds. The
		default value is 5 seconds, the minimum	is 0
		seconds, and the maximum is 10,000 seconds.

age_buffer:	The sixth parameter, age_buffer, governs the
		maximum time Linux waits before writing out a
		dirty buffer to disk. The value is in jiffies.
		The default value is 30 seconds, the minimum
		is 1 second, and the maximum 10,000 seconds.

sync:		The seventh parameter, nfract_sync, governs
		the percentage of buffer cache that is dirty
		before bdflush activates synchronously. This
		can be viewed as the hard limit before
		bdflush forces buffers to disk. The default
		is 60%,	the minimum is 0%, and the maximum
		is 100%.

stop_bdflush:	The eighth parameter, nfract_stop_bdflush,
		governs the percentage of buffer cache that
		is dirty which will stop bdflush. The default
		is 20%, the miniumum is 0%, and the maxiumum
		is 100%.

dummy5:		The ninth parameter is not used.

So the default is: 30 500 0 0 500 3000 60 20 0   for 100 HZ.
=============================================================


block_dump:
-----------
It can happen that the disk still keeps spinning up and you
don't quite know why or what causes it. The laptop mode patch
has a little helper for that as well. When set to 1, it will
dump info to the kernel message buffer about what process
caused the io. Be careful when playing with this setting.
It is advisable to shut down syslog first! The default is 0.
=============================================================


kswapd:
-------
Kswapd is the kernel swapout daemon. That is, kswapd is that
piece of the kernel that frees memory when it gets fragmented
or full. Since every system is different, you'll probably
want some control over this piece of the system.

The numbers in this page correspond to the numbers in the
struct pager_daemon {tries_base, tries_min, swap_cluster
}; The tries_base and swap_cluster probably have the
largest influence on system performance.

tries_base	The maximum number of pages kswapd tries to
		free in one round is calculated from this
		number. Usually this number will be divided
		by 4 or 8 (see mm/vmscan.c), so it isn't as
		big as it looks.
		When you need to increase the bandwidth to/
		from swap, you'll want to increase this
		number.

tries_min	This is the minimum number of times kswapd
		tries to free a page each time it is called.
		Basically it's just there to make sure that
		kswapd frees some pages even when it's being
		called with minimum priority.

swap_cluster	This is the number of pages kswapd writes in
		one turn. You want this large so that kswapd
		does it's I/O in large chunks and the disk
		doesn't have to seek often, but you don't
		want it to be too large since that would
		flood the request queue.

The default value is: 512 32 8.
=============================================================


laptop_mode:
------------
Setting this to 1 switches the vm (and block layer) to laptop
mode. Leaving it to 0 makes the kernel work like before. When
in laptop mode, you also want to extend the intervals
desribed in Documentation/laptop-mode.txt.
See the laptop-mode.sh script for how to do that.

The default value is 0.
=============================================================


max-readahead:
--------------
This tunable affects how early the Linux VFS will fetch the
next block of a file from memory. File readahead values are
determined on a per file basis in the VFS and are adjusted
based on the behavior of the application accessing the file.
Anytime the current position being read in a file plus the
current read ahead value results in the file pointer pointing
to the next block in the file, that block will be fetched
from disk. By raising this value, the Linux kernel will allow
the readahead value to grow larger, resulting in more blocks
being prefetched from disks which predictably access files in
uniform linear fashion. This can result in performance
improvements, but can also result in excess (and often
unnecessary) memory usage. Lowering this value has the
opposite affect. By forcing readaheads to be less aggressive,
memory may be conserved at a potential performance impact.

The default value is 31.
=============================================================


min-readahead:
--------------
Like max-readahead, min-readahead places a floor on the
readahead value. Raising this number forces a files readahead
value to be unconditionally higher, which can bring about
performance improvements, provided that all file access in
the system is predictably linear from the start to the end of
a file. This of course results in higher memory usage from
the pagecache. Conversely, lowering this value, allows the
kernel to conserve pagecache memory, at a potential
performance cost.

The default value is 3.
=============================================================


max_map_count:
--------------
This file contains the maximum number of memory map areas a
process may have. Memory map areas are used as a side-effect
of calling malloc, directly by mmap and mprotect, and also
when loading shared libraries.

While most applications need less than a thousand maps,
certain programs, particularly malloc debuggers, may consume 
lots of them, e.g. up to one or two maps per allocation.

The default value is 65536.
=============================================================


overcommit_memory:
------------------
This value contains a flag to enable memory overcommitment.
When this flag is 0, the kernel checks before each malloc()
to see if there's enough memory left. If the flag is nonzero,
the system pretends there's always enough memory.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it. The default value is 0.

Look at: mm/mmap.c::vm_enough_memory() for more information.
=============================================================


page-cluster:
-------------
The Linux VM subsystem avoids excessive disk seeks by reading
multiple pages on a page fault. The number of pages it reads
is dependent on the amount of memory in your machine.

The number of pages the kernel reads in at once is equal to
2 ^ page-cluster. Values above 2 ^ 5 don't make much sense
for swap because we only cluster swap data in 32-page groups.
=============================================================


pagetable_cache:
----------------
The kernel keeps a number of page tables in a per-processor
cache (this helps a lot on SMP systems). The cache size for
each processor will be between the low and the high value.

On a low-memory, single CPU system you can safely set these
values to 0 so you don't waste the memory. On SMP systems it
is used so that the system can do fast pagetable allocations
without having to acquire the kernel memory lock.

For large systems, the settings are probably OK. For normal
systems they won't hurt a bit. For small systems (<16MB ram)
it might be advantageous to set both values to 0.

The default value is: 25 50.
=============================================================


vm_anon_lru:
------------
select if to immdiatly insert anon pages in the lru.
Immediatly means as soon as they're allocated during the page
faults. If this is set to 0, they're inserted only after the
first swapout.
  
Having anon pages immediatly inserted in the lru allows the
VM to know better when it's worthwhile to start swapping
anonymous ram, it will start to swap earlier and it should
swap smoother and faster, but it will decrease scalability
on the >16-ways of an order of magnitude. Big SMP/NUMA
definitely can't take an hit on a global spinlock at
every anon page allocation.

Low ram machines that swaps all the time want to turn
this on (i.e. set to 1).

The default value is 0.
=============================================================


vm_cache_scan_ratio:
--------------------
is how much of the inactive LRU queue we will scan in one go.
A value of 6 for vm_cache_scan_ratio implies that we'll scan
1/6 of the inactive lists during a normal aging round.

The default value is 6.
=============================================================


vm_gfp_debug:
------------
is when __alloc_pages fails, dump us a stack. This will
mostly happen during OOM conditions (hopefully ;)

The default value is 0.
=============================================================


vm_lru_balance_ratio:
---------------------
controls the balance between active and inactive cache. The
bigger vm_balance is, the easier the active cache will grow,
because we'll rotate the active list slowly. A value of 2
means we'll go towards a balance of 1/3 of the cache being
inactive.

The default value is 2.
=============================================================


vm_mapped_ratio:
----------------
controls the pageout rate, the smaller, the earlier we'll
start to pageout.

The default value is 100.
=============================================================


vm_passes:
----------
is the number of vm passes before failing the memory
balancing. Take into account 3 passes are needed for a
flush/wait/free cycle and that we only scan
1/vm_cache_scan_ratio of the inactive list at each pass.

The default value is 60.
=============================================================


vm_vfs_scan_ratio:
------------------
is what proportion of the VFS queues we will scan in one go.
A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
unused-inode, dentry and dquot caches will be freed during a
normal aging round.
Big fileservers (NFS, SMB etc.) probably want to set this
value to 3 or 2.

The default value is 6.
=============================================================