Documentation/sysctl/vm.txt

   1 Documentation for /proc/sys/vm/*        Kernel version 2.4.28
   2 =============================================================
   3
   4  (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
   5     - Initial version
   6
   7  (c) 2004, Marc-Christian Petersen <m.c.p@linux-systeme.com>
   8     - Removed non-existent knobs which were removed in early
   9       2.4 stages
  10     - Corrected values for bdflush
  11     - Documented missing tunables
  12     - Documented aa-vm tunables
  13
  14
  15
  16 For general info and legal blurb, please look in README.
  17 =============================================================
  18
  19 This file contains the documentation for the sysctl files in
  20 /proc/sys/vm and is valid for Linux kernel v2.4.28.
  21
  22 The files in this directory can be used to tune the operation
  23 of the virtual memory (VM) subsystem of the Linux kernel, and
  24 three of the files (bdflush, max-readahead, min-readahead)
  25 also have some influence on disk usage.
  26
  27 Default values and initialization routines for most of these
  28 files can be found in mm/vmscan.c, mm/page_alloc.c and
  29 mm/filemap.c.
  30
  31 Currently, these files are in /proc/sys/vm:
  32 - bdflush
  33 - block_dump
  34 - kswapd
  35 - laptop_mode
  36 - max-readahead
  37 - min-readahead
  38 - max_map_count
  39 - overcommit_memory
  40 - page-cluster
  41 - pagetable_cache
  42 - vm_anon_lru
  43 - vm_cache_scan_ratio
  44 - vm_gfp_debug
  45 - vm_lru_balance_ratio
  46 - vm_mapped_ratio
  47 - vm_passes
  48 - vm_vfs_scan_ratio
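
As a minimal sketch of how these files can be read and written
from user space, the Python helpers below treat each file as one
line of whitespace-separated integers. The helper names
(read_tunable, write_tunable) and the scratch directory are
illustrative assumptions and not part of the kernel. Writing to
the real /proc/sys/vm normally requires root privileges.

```python
import os
import tempfile


def read_tunable(name, root="/proc/sys/vm"):
    """Return the whitespace-separated fields of one tunable as integers."""
    with open(os.path.join(root, name)) as f:
        return [int(field) for field in f.read().split()]


def write_tunable(name, values, root="/proc/sys/vm"):
    """Write the given integers back as one space-separated line."""
    with open(os.path.join(root, name), "w") as f:
        f.write(" ".join(str(v) for v in values) + "\n")


# Demonstration against a scratch directory, so that no live kernel
# setting is touched and no special privileges are needed.
scratch = tempfile.mkdtemp()
write_tunable("kswapd", [512, 32, 8], root=scratch)
kswapd_values = read_tunable("kswapd", root=scratch)
print(kswapd_values)
```

Passing root="/proc/sys/vm" (the default) would operate on the
running kernel instead of the scratch copy.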
  49 =============================================================
  50
  51
  52
  53 bdflush:
  54 --------
  55 This file controls the operation of the bdflush kernel
  56 daemon. The source code to this struct can be found in
  57 fs/buffer.c. It currently contains 9 integer values,
  58 of which 6 are actually used by the kernel.
  59
  60 nfract:         The first parameter governs the maximum
  61                 number of dirty buffers in the buffer
  62                 cache. Dirty means that the contents of the
  63                 buffer still have to be written to disk (as
  64                 opposed to a clean buffer, which can just be
  65                 forgotten about). Setting this to a high
  66                 value means that Linux can delay disk writes
  67                 for a long time, but it also means that it
  68                 will have to do a lot of I/O at once when
  69                 memory becomes short. A low value will
  70                 spread out disk I/O more evenly, at the cost
  71                 of more frequent I/O operations. The default
  72                 value is 30%, the minimum is 0%, and the
  73                 maximum is 100%.
  74
  75 ndirty:         The second parameter (ndirty) gives the
  76                 maximum number of dirty buffers that bdflush
  77                 can write to the disk in one time. A high
  78                 value will mean delayed, bursty I/O, while a
  79                 small value can lead to memory shortage when
  80                 bdflush isn't woken up often enough. The
  81                 default value is 500 dirty buffers, the
  82                 minimum is 1, and the maximum is 50000.
  83
  84 dummy2:         The third parameter is not used.
  85
  86 dummy3:         The fourth parameter is not used.
  87
  88 interval:       The fifth parameter, interval, is the minimum
  89                 rate at which kupdate will wake and flush.
  90                 The value is in jiffies (clockticks), the
  91                 number of jiffies per second is normally 100
  92                 (Alpha is 1024). Thus, x*HZ is x seconds. The
  93                 default value is 5 seconds, the minimum is 0
  94                 seconds, and the maximum is 10,000 seconds.
  95
  96 age_buffer:     The sixth parameter, age_buffer, governs the
  97                 maximum time Linux waits before writing out a
  98                 dirty buffer to disk. The value is in jiffies.
  99                 The default value is 30 seconds, the minimum
 100                 is 1 second, and the maximum 10,000 seconds.
 101
 102 sync:           The seventh parameter, nfract_sync, governs
 103                 the percentage of buffer cache that is dirty
 104                 before bdflush activates synchronously. This
 105                 can be viewed as the hard limit before
 106                 bdflush forces buffers to disk. The default
 107                 is 60%, the minimum is 0%, and the maximum
 108                 is 100%.
 109
 110 stop_bdflush:   The eighth parameter, nfract_stop_bdflush,
 111                 governs the percentage of buffer cache that
 112                 is dirty which will stop bdflush. The default
 113                 is 20%, the minimum is 0%, and the maximum
 114                 is 100%.
 115
 116 dummy5:         The ninth parameter is not used.
 117
 118 So the default is: 30 500 0 0 500 3000 60 20 0   for 100 HZ.
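
The nine values can be paired with the parameter names used
above, and the two jiffy-based fields converted to seconds. The
following Python sketch does this for the default line. The
function name parse_bdflush and the dictionary layout are
assumptions made for illustration, and hz=100 follows the
"for 100 HZ" note above.

```python
BDFLUSH_FIELDS = (
    "nfract", "ndirty", "dummy2", "dummy3", "interval",
    "age_buffer", "nfract_sync", "nfract_stop_bdflush", "dummy5",
)


def parse_bdflush(line, hz=100):
    """Split a bdflush line into named values and jiffy fields in seconds."""
    values = [int(v) for v in line.split()]
    if len(values) != len(BDFLUSH_FIELDS):
        raise ValueError(
            "expected %d values, got %d" % (len(BDFLUSH_FIELDS), len(values))
        )
    params = dict(zip(BDFLUSH_FIELDS, values))
    # interval and age_buffer are stored in jiffies; hz jiffies are one second.
    seconds = {name: params[name] / hz for name in ("interval", "age_buffer")}
    return params, seconds


params, seconds = parse_bdflush("30 500 0 0 500 3000 60 20 0")
print(seconds)  # {'interval': 5.0, 'age_buffer': 30.0}
```

On a machine with a different HZ (1024 on Alpha, as noted above)
the same number of seconds needs a correspondingly larger jiffy
value.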
 119 =============================================================
 120
 121
 122
 123 block_dump:
 124 -----------
 125 It can happen that the disk still keeps spinning up and you
 126 don't quite know why or what causes it. The laptop mode patch
 127 has a little helper for that as well. When set to 1, it will
 128 dump info to the kernel message buffer about what process
 129 caused the I/O. Be careful when playing with this setting.
 130 It is advisable to shut down syslog first! The default is 0.
 131 =============================================================
 132
 133
 134
 135 kswapd:
 136 -------
 137 Kswapd is the kernel swapout daemon. That is, kswapd is that
 138 piece of the kernel that frees memory when it gets fragmented
 139 or full. Since every system is different, you'll probably
 140 want some control over this piece of the system.
 141
 142 The numbers in this file correspond to the fields of
 143 struct pager_daemon {tries_base, tries_min, swap_cluster};
 144 Of these, tries_base and swap_cluster probably have the
 145 largest influence on system performance.
 146
 147 tries_base      The maximum number of pages kswapd tries to
 148                 free in one round is calculated from this
 149                 number. Usually this number will be divided
 150                 by 4 or 8 (see mm/vmscan.c), so it isn't as
 151                 big as it looks.
 152                 When you need to increase the bandwidth to/
 153                 from swap, you'll want to increase this
 154                 number.
 155
 156 tries_min       This is the minimum number of times kswapd
 157                 tries to free a page each time it is called.
 158                 Basically it's just there to make sure that
 159                 kswapd frees some pages even when it's being
 160                 called with minimum priority.
 161
 162 swap_cluster    This is the number of pages kswapd writes in
 163                 one turn. You want this large so that kswapd
 164                 does its I/O in large chunks and the disk
 165                 doesn't have to seek often, but you don't
 166                 want it to be too large since that would
 167                 flood the request queue.
 168
 169 The default value is: 512 32 8.
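
To illustrate why tries_base "isn't as big as it looks", this
short Python sketch applies the two divisors mentioned above to
the default of 512. The function name pages_per_round is a
hypothetical helper; which divisor applies in a given situation
is decided in mm/vmscan.c and is not modelled here.

```python
def pages_per_round(tries_base, divisor):
    """Upper bound on pages freed in one round for a given divisor."""
    if divisor not in (4, 8):
        raise ValueError("the text above only mentions divisors 4 and 8")
    return tries_base // divisor


for divisor in (4, 8):
    print(divisor, pages_per_round(512, divisor))
# prints "4 128" and then "8 64"
```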
 170 =============================================================
 171
 172
 173
 174 laptop_mode:
 175 ------------
 176 Setting this to 1 switches the vm (and block layer) to laptop
 177 mode. Leaving it at 0 makes the kernel work like before. When
 178 in laptop mode, you also want to extend the intervals
 179 described in Documentation/laptop-mode.txt.
 180 See the laptop-mode.sh script for how to do that.
 181
 182 The default value is 0.
 183 =============================================================
 184
 185
 186
 187 max-readahead:
 188 --------------
 189 This tunable affects how early the Linux VFS will fetch the
 190 next block of a file into memory. File readahead values are
 191 determined on a per file basis in the VFS and are adjusted
 192 based on the behavior of the application accessing the file.
 193 Anytime the current position being read in a file plus the
 194 current read ahead value results in the file pointer pointing
 195 to the next block in the file, that block will be fetched
 196 from disk. By raising this value, the Linux kernel will allow
 197 the readahead value to grow larger, resulting in more blocks
 198 being prefetched from disk for applications which predictably
 199 access files in uniform linear fashion. This can result in
 200 performance improvements, but can also result in excess (and
 201 often unnecessary) memory usage. Lowering this value has the
 202 opposite effect. By forcing readaheads to be less aggressive,
 203 memory may be conserved at a potential performance impact.
 204
 205 The default value is 31.
 206 =============================================================
 207
 208
 209
 210 min-readahead:
 211 --------------
 212 Like max-readahead, min-readahead places a floor on the
 213 readahead value. Raising this number forces a file's readahead
 214 value to be unconditionally higher, which can bring about
 215 performance improvements, provided that all file access in
 216 the system is predictably linear from the start to the end of
 217 a file. This of course results in higher memory usage from
 218 the pagecache. Conversely, lowering this value allows the
 219 kernel to conserve pagecache memory, at a potential
 220 performance cost.
 221
 222 The default value is 3.
 223 =============================================================
 224
 225
 226
 227 max_map_count:
 228 --------------
 229 This file contains the maximum number of memory map areas a
 230 process may have. Memory map areas are used as a side-effect
 231 of calling malloc, directly by mmap and mprotect, and also
 232 when loading shared libraries.
 233
 234 While most applications need fewer than a thousand maps,
 235 certain programs, particularly malloc debuggers, may consume
 236 lots of them, e.g. up to one or two maps per allocation.
 237
 238 The default value is 65536.
 239 =============================================================
 240
 241
 242
 243 overcommit_memory:
 244 ------------------
 245 This value contains a flag to enable memory overcommitment.
 246 When this flag is 0, the kernel checks before each malloc()
 247 to see if there's enough memory left. If the flag is nonzero,
 248 the system pretends there's always enough memory.
 249
 250 This feature can be very useful because there are a lot of
 251 programs that malloc() huge amounts of memory "just-in-case"
 252 and don't use much of it. The default value is 0.
 253
 254 Look at: mm/mmap.c::vm_enough_memory() for more information.
 255 =============================================================
 256
 257
 258
 259 page-cluster:
 260 -------------
 261 The Linux VM subsystem avoids excessive disk seeks by reading
 262 multiple pages on a page fault. The number of pages it reads
 263 is dependent on the amount of memory in your machine.
 264
 265 The number of pages the kernel reads in at once is equal to
 266 2 ^ page-cluster. Values above 2 ^ 5 don't make much sense
 267 for swap because we only cluster swap data in 32-page groups.
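
The relation between page-cluster and the amount read per fault
can be sketched in Python as follows. The helper names are
hypothetical, and the 4096-byte page size is an assumption (it
is the usual value on i386 but differs on other architectures).

```python
def pages_per_fault(page_cluster):
    """Number of pages read in at once: 2 ^ page-cluster."""
    if page_cluster < 0:
        raise ValueError("page-cluster must not be negative")
    return 1 << page_cluster


def bytes_per_fault(page_cluster, page_size=4096):
    """Same quantity in bytes, for an assumed page size."""
    return pages_per_fault(page_cluster) * page_size


for value in (0, 3, 5):
    print(value, pages_per_fault(value), bytes_per_fault(value))
```

With page-cluster set to 5 this gives 32 pages, which matches the
32-page swap clustering limit mentioned above.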
 268 =============================================================
 269
 270
 271
 272 pagetable_cache:
 273 ----------------
 274 The kernel keeps a number of page tables in a per-processor
 275 cache (this helps a lot on SMP systems). The cache size for
 276 each processor will be between the low and the high value.
 277
 278 On a low-memory, single CPU system you can safely set these
 279 values to 0 so you don't waste the memory. On SMP systems it
 280 is used so that the system can do fast pagetable allocations
 281 without having to acquire the kernel memory lock.
 282
 283 For large systems, the settings are probably OK. For normal
 284 systems they won't hurt a bit. For small systems (<16MB ram)
 285 it might be advantageous to set both values to 0.
 286
 287 The default value is: 25 50.
 288 =============================================================
 289
 290
 291
 292 vm_anon_lru:
 293 ------------
 294 Selects whether to immediately insert anon pages in the lru.
 295 Immediately means as soon as they're allocated during the page
 296 faults. If this is set to 0, they're inserted only after the
 297 first swapout.
 298
 299 Having anon pages immediately inserted in the lru allows the
 300 VM to know better when it's worthwhile to start swapping
 301 anonymous ram; it will start to swap earlier and it should
 302 swap smoother and faster, but it will decrease scalability
 303 on >16-way machines by an order of magnitude. Big SMP/NUMA
 304 definitely can't take a hit on a global spinlock at
 305 every anon page allocation.
 306
 307 Low ram machines that swap all the time want to turn
 308 this on (i.e. set to 1).
 309
 310 The default value is 0.
 311 =============================================================
 312
 313
 314
 315 vm_cache_scan_ratio:
 316 --------------------
 317 Sets how much of the inactive LRU queue we scan in one go.
 318 A value of 6 for vm_cache_scan_ratio implies that we'll scan
 319 1/6 of the inactive lists during a normal aging round.
 320
 321 The default value is 6.
 322 =============================================================
 323
 324
 325
 326 vm_gfp_debug:
 327 ------------
 328 Dumps a stack trace when __alloc_pages fails. This will
 329 mostly happen during OOM conditions (hopefully ;)
 330
 331 The default value is 0.
 332 =============================================================
 333
 334
 335
 336 vm_lru_balance_ratio:
 337 ---------------------
 338 Controls the balance between active and inactive cache. The
 339 bigger vm_lru_balance_ratio is, the easier the active cache
 340 will grow, because we'll rotate the active list slowly. A
 341 value of 2 means we'll go towards a balance of 1/3 of the
 342 cache being inactive.
 343
 344 The default value is 2.
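
The "1/3 inactive" figure for a value of 2 is consistent with a
target of one inactive page for every vm_lru_balance_ratio active
pages. The Python sketch below generalises from that single
example. The formula 1 / (ratio + 1) is an assumption
extrapolated from the text above, not taken from the kernel
source, and the function name is hypothetical.

```python
from fractions import Fraction


def inactive_fraction(vm_lru_balance_ratio):
    """Assumed target share of the cache on the inactive list."""
    if vm_lru_balance_ratio < 1:
        raise ValueError("ratio must be at least 1")
    # Assumption: active : inactive tends towards ratio : 1.
    return Fraction(1, vm_lru_balance_ratio + 1)


for ratio in (1, 2, 4):
    print(ratio, inactive_fraction(ratio))
```

Under this assumption a larger ratio leaves a smaller share of
the cache inactive, which agrees with the statement that the
active cache grows more easily.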
 345 =============================================================
 346
 347
 348
 349 vm_mapped_ratio:
 350 ----------------
 351 Controls the pageout rate: the smaller the value, the
 352 earlier we'll start to page out.
 353
 354 The default value is 100.
 355 =============================================================
 356
 357
 358
 359 vm_passes:
 360 ----------
 361 Sets the number of vm passes before failing the memory
 362 balancing. Take into account 3 passes are needed for a
 363 flush/wait/free cycle and that we only scan
 364 1/vm_cache_scan_ratio of the inactive list at each pass.
 365
 366 The default value is 60.
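
The interaction between vm_passes and vm_cache_scan_ratio is
simple arithmetic, sketched here in Python. The function name
balancing_budget is hypothetical, and the calculation assumes
that every pass scans exactly 1/vm_cache_scan_ratio of the
inactive list, which is an idealisation of the description
above.

```python
from fractions import Fraction


def balancing_budget(vm_passes=60, vm_cache_scan_ratio=6,
                     passes_per_cycle=3):
    """Return (full cycles, share scanned per pass, total list lengths)."""
    cycles = vm_passes // passes_per_cycle      # flush/wait/free cycles
    per_pass = Fraction(1, vm_cache_scan_ratio)  # share of the list per pass
    total = vm_passes * per_pass                 # in inactive-list lengths
    return cycles, per_pass, total


cycles, per_pass, total = balancing_budget()
print(cycles, per_pass, total)  # 20 1/6 10
```

With the defaults this leaves room for 20 complete
flush/wait/free cycles before memory balancing is declared to
have failed.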
 367 =============================================================
 368
 369
 370
 371 vm_vfs_scan_ratio:
 372 ------------------
 373 Sets what proportion of the VFS queues we will scan in one go.
 374 A value of 6 for vm_vfs_scan_ratio implies that 1/6th of the
 375 unused-inode, dentry and dquot caches will be freed during a
 376 normal aging round.
 377 Big fileservers (NFS, SMB etc.) probably want to set this
 378 value to 3 or 2.
 379
 380 The default value is 6.
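
As a rough illustration of the fileserver advice, the Python
sketch below compares how many unused objects one aging round
would free at the default and at the suggested lower values. The
cache size of 12000 unused inodes is an invented example figure,
and the function name is hypothetical.

```python
def objects_freed_per_round(unused_objects, vm_vfs_scan_ratio=6):
    """Objects freed in one aging round: 1/vm_vfs_scan_ratio of the cache."""
    if vm_vfs_scan_ratio < 1:
        raise ValueError("ratio must be at least 1")
    return unused_objects // vm_vfs_scan_ratio


unused_inodes = 12000  # invented figure, for illustration only
for ratio in (6, 3, 2):
    print(ratio, objects_freed_per_round(unused_inodes, ratio))
# prints "6 2000", "3 4000" and "2 6000"
```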
 381 =============================================================