Documentation/DocBook/kernel-locking.tmpl

   1 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V3.1//EN"[]>
   2
   3 <book id="LKLockingGuide">
   4  <bookinfo>
   5   <title>Unreliable Guide To Locking</title>
   6
   7   <authorgroup>
   8    <author>
   9     <firstname>Paul</firstname>
  10     <othername>Rusty</othername>
  11     <surname>Russell</surname>
  12     <affiliation>
  13      <address>
  14       <email>rusty@rustcorp.com.au</email>
  15      </address>
  16     </affiliation>
  17    </author>
  18   </authorgroup>
  19
  20   <copyright>
  21    <year>2000</year>
  22    <holder>Paul Russell</holder>
  23   </copyright>
  24
  25   <legalnotice>
  26    <para>
  27      This documentation is free software; you can redistribute
  28      it and/or modify it under the terms of the GNU General Public
  29      License as published by the Free Software Foundation; either
  30      version 2 of the License, or (at your option) any later
  31      version.
  32    </para>
  33
  34    <para>
  35      This program is distributed in the hope that it will be
  36      useful, but WITHOUT ANY WARRANTY; without even the implied
  37      warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
  38      See the GNU General Public License for more details.
  39    </para>
  40
  41    <para>
  42      You should have received a copy of the GNU General Public
  43      License along with this program; if not, write to the Free
  44      Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
  45      MA 02111-1307 USA
  46    </para>
  47
  48    <para>
  49      For more details see the file COPYING in the source
  50      distribution of Linux.
  51    </para>
  52   </legalnotice>
  53  </bookinfo>
  54
  55  <toc></toc>
  56   <chapter id="intro">
  57    <title>Introduction</title>
  58    <para>
  59      Welcome, to Rusty's Remarkably Unreliable Guide to Kernel
  60      Locking issues.  This document describes the locking systems in
  61      the Linux Kernel as we approach 2.4.
  62    </para>
  63    <para>
  64      It looks like <firstterm linkend="gloss-smp"><acronym>SMP</acronym>
  65      </firstterm> is here to stay; so everyone hacking on the kernel
  66      these days needs to know the fundamentals of concurrency and locking
  67      for SMP.
  68    </para>
  69
  70    <sect1 id="races">
  71     <title>The Problem With Concurrency</title>
  72     <para>
  73       (Skip this if you know what a Race Condition is).
  74     </para>
  75     <para>
  76       In a normal program, you can increment a counter like so:
  77     </para>
  78     <programlisting>
  79       very_important_count++;
  80     </programlisting>
  81
  82     <para>
  83       This is what they would expect to happen:
  84     </para>
  85
  86     <table>
  87      <title>Expected Results</title>
  88
  89      <tgroup cols=2 align=left>
  90
  91       <thead>
  92        <row>
  93         <entry>Instance 1</entry>
  94         <entry>Instance 2</entry>
  95        </row>
  96       </thead>
  97
  98       <tbody>
  99        <row>
 100         <entry>read very_important_count (5)</entry>
 101         <entry></entry>
 102        </row>
 103        <row>
 104         <entry>add 1 (6)</entry>
 105         <entry></entry>
 106        </row>
 107        <row>
 108         <entry>write very_important_count (6)</entry>
 109         <entry></entry>
 110        </row>
 111        <row>
 112         <entry></entry>
 113         <entry>read very_important_count (6)</entry>
 114        </row>
 115        <row>
 116         <entry></entry>
 117         <entry>add 1 (7)</entry>
 118        </row>
 119        <row>
 120         <entry></entry>
 121         <entry>write very_important_count (7)</entry>
 122        </row>
 123       </tbody>
 124
 125      </tgroup>
 126     </table>
 127
 128     <para>
 129      This is what might happen:
 130     </para>
 131
 132     <table>
 133      <title>Possible Results</title>
 134
 135      <tgroup cols=2 align=left>
 136       <thead>
 137        <row>
 138         <entry>Instance 1</entry>
 139         <entry>Instance 2</entry>
 140        </row>
 141       </thead>
 142
 143       <tbody>
 144        <row>
 145         <entry>read very_important_count (5)</entry>
 146         <entry></entry>
 147        </row>
 148        <row>
 149         <entry></entry>
 150         <entry>read very_important_count (5)</entry>
 151        </row>
 152        <row>
 153         <entry>add 1 (6)</entry>
 154         <entry></entry>
 155        </row>
 156        <row>
 157         <entry></entry>
 158         <entry>add 1 (6)</entry>
 159        </row>
 160        <row>
 161         <entry>write very_important_count (6)</entry>
 162         <entry></entry>
 163        </row>
 164        <row>
 165         <entry></entry>
 166         <entry>write very_important_count (6)</entry>
 167        </row>
 168       </tbody>
 169      </tgroup>
 170     </table>
 171
 172     <para>
 173       This overlap, where what actually happens depends on the
 174       relative timing of multiple tasks, is called a race condition.
 175       The piece of code containing the concurrency issue is called a
 176       critical region.  And especially since Linux starting running
 177       on SMP machines, they became one of the major issues in kernel
 178       design and implementation.
 179     </para>
 180     <para>
 181       The solution is to recognize when these simultaneous accesses
 182       occur, and use locks to make sure that only one instance can
 183       enter the critical region at any time.  There are many
 184       friendly primitives in the Linux kernel to help you do this.
 185       And then there are the unfriendly primitives, but I'll pretend
 186       they don't exist.
 187     </para>
 188    </sect1>
 189   </chapter>
 190
 191   <chapter id="locks">
 192    <title>Two Main Types of Kernel Locks: Spinlocks and Semaphores</title>
 193
 194    <para>
 195      There are two main types of kernel locks.  The fundamental type
 196      is the spinlock
 197      (<filename class=headerfile>include/asm/spinlock.h</filename>),
 198      which is a very simple single-holder lock: if you can't get the
 199      spinlock, you keep trying (spinning) until you can.  Spinlocks are
 200      very small and fast, and can be used anywhere.
 201    </para>
 202    <para>
 203      The second type is a semaphore
 204      (<filename class=headerfile>include/asm/semaphore.h</filename>): it
 205      can have more than one holder at any time (the number decided at
 206      initialization time), although it is most commonly used as a
 207      single-holder lock (a mutex).  If you can't get a semaphore,
 208      your task will put itself on the queue, and be woken up when the
 209      semaphore is released.  This means the CPU will do something
 210      else while you are waiting, but there are many cases when you
 211      simply can't sleep (see <xref linkend="sleeping-things">), and so
 212      have to use a spinlock instead.
 213    </para>
 214    <para>
 215      Neither type of lock is recursive: see
 216      <xref linkend="techniques-deadlocks">.
 217    </para>
 218
 219    <sect1 id="uniprocessor">
 220     <title>Locks and Uniprocessor Kernels</title>
 221
 222     <para>
 223       For kernels compiled without <symbol>CONFIG_SMP</symbol>, spinlocks
 224       do not exist at all.  This is an excellent design decision: when
 225       no-one else can run at the same time, there is no reason to
 226       have a lock at all.
 227     </para>
 228
 229     <para>
 230       You should always test your locking code with <symbol>CONFIG_SMP</symbol>
 231       enabled, even if you don't have an SMP test box, because it
 232       will still catch some (simple) kinds of deadlock.
 233     </para>
 234
 235     <para>
 236       Semaphores still exist, because they are required for
 237       synchronization between <firstterm linkend="gloss-usercontext">user
 238       contexts</firstterm>, as we will see below.
 239     </para>
 240    </sect1>
 241
 242    <sect1 id="rwlocks">
 243     <title>Read/Write Lock Variants</title>
 244
 245     <para>
 246       Both spinlocks and semaphores have read/write variants:
 247       <type>rwlock_t</type> and <structname>struct rw_semaphore</structname>.
 248       These divide users into two classes: the readers and the writers.  If
 249       you are only reading the data, you can get a read lock, but to write to
 250       the data you need the write lock.  Many people can hold a read lock,
 251       but a writer must be sole holder.
 252     </para>
 253
 254     <para>
 255       This means much smoother locking if your code divides up
 256       neatly along reader/writer lines.  All the discussions below
 257       also apply to read/write variants.
 258     </para>
 259    </sect1>
 260
 261     <sect1 id="usercontextlocking">
 262      <title>Locking Only In User Context</title>
 263
 264      <para>
 265        If you have a data structure which is only ever accessed from
 266        user context, then you can use a simple semaphore
 267        (<filename>linux/asm/semaphore.h</filename>) to protect it.  This
 268        is the most trivial case: you initialize the semaphore to the number
 269        of resources available (usually 1), and call
 270        <function>down_interruptible()</function> to grab the semaphore, and
 271        <function>up()</function> to release it.  There is also a
 272        <function>down()</function>, which should be avoided, because it
 273        will not return if a signal is received.
 274      </para>
 275
 276      <para>
 277        Example: <filename>linux/net/core/netfilter.c</filename> allows
 278        registration of new <function>setsockopt()</function> and
 279        <function>getsockopt()</function> calls, with
 280        <function>nf_register_sockopt()</function>.  Registration and
 281        de-registration are only done on module load and unload (and boot
 282        time, where there is no concurrency), and the list of registrations
 283        is only consulted for an unknown <function>setsockopt()</function>
 284        or <function>getsockopt()</function> system call.  The
 285        <varname>nf_sockopt_mutex</varname> is perfect to protect this,
 286        especially since the setsockopt and getsockopt calls may well
 287        sleep.
 288      </para>
 289    </sect1>
 290
 291    <sect1 id="lock-user-bh">
 292     <title>Locking Between User Context and BHs</title>
 293
 294     <para>
 295       If a <firstterm linkend="gloss-bh">bottom half</firstterm> shares
 296       data with user context, you have two problems.  Firstly, the current
 297       user context can be interrupted by a bottom half, and secondly, the
 298       critical region could be entered from another CPU.  This is where
 299       <function>spin_lock_bh()</function>
 300       (<filename class=headerfile>include/linux/spinlock.h</filename>) is
 301       used.  It disables bottom halves on that CPU, then grabs the lock.
 302       <function>spin_unlock_bh()</function> does the reverse.
 303     </para>
 304
 305     <para>
 306       This works perfectly for <firstterm linkend="gloss-up"><acronym>UP
 307       </acronym></firstterm> as well: the spin lock vanishes, and this macro
 308       simply becomes <function>local_bh_disable()</function>
 309       (<filename class=headerfile>include/asm/softirq.h</filename>), which
 310       protects you from the bottom half being run.
 311     </para>
 312    </sect1>
 313
 314    <sect1 id="lock-user-tasklet">
 315     <title>Locking Between User Context and Tasklets/Soft IRQs</title>
 316
 317     <para>
 318       This is exactly the same as above, because
 319       <function>local_bh_disable()</function> actually disables all
 320       softirqs and <firstterm linkend="gloss-tasklet">tasklets</firstterm>
 321       on that CPU as well.  It should really be
 322       called `local_softirq_disable()', but the name has been preserved
 323       for historical reasons.  Similarly,
 324       <function>spin_lock_bh()</function> would now be called
 325       spin_lock_softirq() in a perfect world.
 326     </para>
 327    </sect1>
 328
 329    <sect1 id="lock-bh">
 330     <title>Locking Between Bottom Halves</title>
 331
 332     <para>
 333       Sometimes a bottom half might want to share data with
 334       another bottom half (especially remember that timers are run
 335       off a bottom half).
 336     </para>
 337
 338     <sect2 id="lock-bh-same">
 339      <title>The Same BH</title>
 340
 341      <para>
 342        Since a bottom half is never run on two CPUs at once, you
 343        don't need to worry about your bottom half being run twice
 344        at once, even on SMP.
 345      </para>
 346     </sect2>
 347
 348     <sect2 id="lock-bh-different">
 349      <title>Different BHs</title>
 350
 351      <para>
 352        Since only one bottom half ever runs at a time once, you
 353        don't need to worry about race conditions with other bottom
 354        halves.  Beware that things might change under you, however,
 355        if someone changes your bottom half to a tasklet.  If you
 356        want to make your code future-proof, pretend you're already
 357        running from a tasklet (see below), and doing the extra
 358        locking.  Of course, if it's five years before that happens,
 359        you're gonna look like a damn fool.
 360      </para>
 361     </sect2>
 362    </sect1>
 363
 364    <sect1 id="lock-tasklets">
 365     <title>Locking Between Tasklets</title>
 366
 367     <para>
 368       Sometimes a tasklet might want to share data with another
 369       tasklet, or a bottom half.
 370     </para>
 371
 372     <sect2 id="lock-tasklets-same">
 373      <title>The Same Tasklet</title>
 374      <para>
 375        Since a tasklet is never run on two CPUs at once, you don't
 376        need to worry about your tasklet being reentrant (running
 377        twice at once), even on SMP.
 378      </para>
 379     </sect2>
 380
 381     <sect2 id="lock-tasklets-different">
 382      <title>Different Tasklets</title>
 383      <para>
 384        If another tasklet (or bottom half, such as a timer) wants
 385        to share data with your tasklet, you will both need to use
 386        <function>spin_lock()</function> and
 387        <function>spin_unlock()</function> calls.
 388        <function>spin_lock_bh()</function> is
 389        unnecessary here, as you are already in a tasklet, and
 390        none will be run on the same CPU.
 391      </para>
 392     </sect2>
 393    </sect1>
 394
 395    <sect1 id="lock-softirqs">
 396     <title>Locking Between Softirqs</title>
 397
 398     <para>
 399       Often a <firstterm linkend="gloss-softirq">softirq</firstterm> might
 400       want to share data with itself, a tasklet, or a bottom half.
 401     </para>
 402
 403     <sect2 id="lock-softirqs-same">
 404      <title>The Same Softirq</title>
 405
 406      <para>
 407        The same softirq can run on the other CPUs: you can use a
 408        per-CPU array (see <xref linkend="per-cpu">) for better
 409        performance.  If you're going so far as to use a softirq,
 410        you probably care about scalable performance enough
 411        to justify the extra complexity.
 412      </para>
 413
 414      <para>
 415        You'll need to use <function>spin_lock()</function> and
 416        <function>spin_unlock()</function> for shared data.
 417      </para>
 418     </sect2>
 419
 420     <sect2 id="lock-softirqs-different">
 421      <title>Different Softirqs</title>
 422
 423      <para>
 424        You'll need to use <function>spin_lock()</function> and
 425        <function>spin_unlock()</function> for shared data, whether it
 426        be a timer (which can be running on a different CPU), bottom half,
 427        tasklet or the same or another softirq.
 428      </para>
 429     </sect2>
 430    </sect1>
 431   </chapter>
 432
 433   <chapter id="hardirq-context">
 434    <title>Hard IRQ Context</title>
 435
 436    <para>
 437      Hardware interrupts usually communicate with a bottom half,
 438      tasklet or softirq.  Frequently this involves putting work in a
 439      queue, which the BH/softirq will take out.
 440    </para>
 441
 442    <sect1 id="hardirq-softirq">
 443     <title>Locking Between Hard IRQ and Softirqs/Tasklets/BHs</title>
 444
 445     <para>
 446       If a hardware irq handler shares data with a softirq, you have
 447       two concerns.  Firstly, the softirq processing can be
 448       interrupted by a hardware interrupt, and secondly, the
 449       critical region could be entered by a hardware interrupt on
 450       another CPU.  This is where <function>spin_lock_irq()</function> is
 451       used.  It is defined to disable interrupts on that cpu, then grab
 452       the lock. <function>spin_unlock_irq()</function> does the reverse.
 453     </para>
 454
 455     <para>
 456       This works perfectly for UP as well: the spin lock vanishes,
 457       and this macro simply becomes <function>local_irq_disable()</function>
 458       (<filename class=headerfile>include/asm/smp.h</filename>), which
 459       protects you from the softirq/tasklet/BH being run.
 460     </para>
 461
 462     <para>
 463       <function>spin_lock_irqsave()</function>
 464       (<filename>include/linux/spinlock.h</filename>) is a variant
 465       which saves whether interrupts were on or off in a flags word,
 466       which is passed to <function>spin_lock_irqrestore()</function>.  This
 467       means that the same code can be used inside an hard irq handler (where
 468       interrupts are already off) and in softirqs (where the irq
 469       disabling is required).
 470     </para>
 471    </sect1>
 472   </chapter>
 473
 474   <chapter id="common-techniques">
 475    <title>Common Techniques</title>
 476
 477    <para>
 478      This section lists some common dilemmas and the standard
 479      solutions used in the Linux kernel code.  If you use these,
 480      people will find your code simpler to understand.
 481    </para>
 482
 483    <para>
 484      If I could give you one piece of advice: never sleep with anyone
 485      crazier than yourself.  But if I had to give you advice on
 486      locking: <emphasis>keep it simple</emphasis>.
 487    </para>
 488
 489    <para>
 490      Lock data, not code.
 491    </para>
 492
 493    <para>
 494      Be reluctant to introduce new locks.
 495    </para>
 496
 497    <para>
 498      Strangely enough, this is the exact reverse of my advice when
 499      you <emphasis>have</emphasis> slept with someone crazier than yourself.
 500    </para>
 501
 502    <sect1 id="techniques-no-writers">
 503     <title>No Writers in Interrupt Context</title>
 504
 505     <para>
 506       There is a fairly common case where an interrupt handler needs
 507       access to a critical region, but does not need write access.
 508       In this case, you do not need to use
 509       <function>read_lock_irq()</function>, but only
 510       <function>read_lock()</function> everywhere (since if an interrupt
 511       occurs, the irq handler will only try to grab a read lock, which
 512       won't deadlock).  You will still need to use
 513       <function>write_lock_irq()</function>.
 514     </para>
 515
 516     <para>
 517       Similar logic applies to locking between softirqs/tasklets/BHs
 518       which never need a write lock, and user context:
 519       <function>read_lock()</function> and
 520       <function>write_lock_bh()</function>.
 521     </para>
 522    </sect1>
 523
 524    <sect1 id="techniques-deadlocks">
 525     <title>Deadlock: Simple and Advanced</title>
 526
 527     <para>
 528       There is a coding bug where a piece of code tries to grab a
 529       spinlock twice: it will spin forever, waiting for the lock to
 530       be released (spinlocks, rwlocks and semaphores are not
 531       recursive in Linux).  This is trivial to diagnose: not a
 532       stay-up-five-nights-talk-to-fluffy-code-bunnies kind of
 533       problem.
 534     </para>
 535
 536     <para>
 537       For a slightly more complex case, imagine you have a region
 538       shared by a bottom half and user context.  If you use a
 539       <function>spin_lock()</function> call to protect it, it is
 540       possible that the user context will be interrupted by the bottom
 541       half while it holds the lock, and the bottom half will then spin
 542       forever trying to get the same lock.
 543     </para>
 544
 545     <para>
 546       Both of these are called deadlock, and as shown above, it can
 547       occur even with a single CPU (although not on UP compiles,
 548       since spinlocks vanish on kernel compiles with
 549       <symbol>CONFIG_SMP</symbol>=n. You'll still get data corruption
 550       in the second example).
 551     </para>
 552
 553     <para>
 554       This complete lockup is easy to diagnose: on SMP boxes the
 555       watchdog timer or compiling with <symbol>DEBUG_SPINLOCKS</symbol> set
 556       (<filename>include/linux/spinlock.h</filename>) will show this up
 557       immediately when it happens.
 558     </para>
 559
 560     <para>
 561       A more complex problem is the so-called `deadly embrace',
 562       involving two or more locks.  Say you have a hash table: each
 563       entry in the table is a spinlock, and a chain of hashed
 564       objects.  Inside a softirq handler, you sometimes want to
 565       alter an object from one place in the hash to another: you
 566       grab the spinlock of the old hash chain and the spinlock of
 567       the new hash chain, and delete the object from the old one,
 568       and insert it in the new one.
 569     </para>
 570
 571     <para>
 572       There are two problems here.  First, if your code ever
 573       tries to move the object to the same chain, it will deadlock
 574       with itself as it tries to lock it twice.  Secondly, if the
 575       same softirq on another CPU is trying to move another object
 576       in the reverse direction, the following could happen:
 577     </para>
 578
 579     <table>
 580      <title>Consequences</title>
 581
 582      <tgroup cols=2 align=left>
 583
 584       <thead>
 585        <row>
 586         <entry>CPU 1</entry>
 587         <entry>CPU 2</entry>
 588        </row>
 589       </thead>
 590
 591       <tbody>
 592        <row>
 593         <entry>Grab lock A -&gt; OK</entry>
 594         <entry>Grab lock B -&gt; OK</entry>
 595        </row>
 596        <row>
 597         <entry>Grab lock B -&gt; spin</entry>
 598         <entry>Grab lock A -&gt; spin</entry>
 599        </row>
 600       </tbody>
 601      </tgroup>
 602     </table>
 603
 604     <para>
 605       The two CPUs will spin forever, waiting for the other to give up
 606       their lock.  It will look, smell, and feel like a crash.
 607     </para>
 608
 609     <sect2 id="techs-deadlock-prevent">
 610      <title>Preventing Deadlock</title>
 611
 612      <para>
 613        Textbooks will tell you that if you always lock in the same
 614        order, you will never get this kind of deadlock.  Practice
 615        will tell you that this approach doesn't scale: when I
 616        create a new lock, I don't understand enough of the kernel
 617        to figure out where in the 5000 lock hierarchy it will fit.
 618      </para>
 619
 620      <para>
 621        The best locks are encapsulated: they never get exposed in
 622        headers, and are never held around calls to non-trivial
 623        functions outside the same file.  You can read through this
 624        code and see that it will never deadlock, because it never
 625        tries to grab another lock while it has that one.  People
 626        using your code don't even need to know you are using a
 627        lock.
 628      </para>
 629
 630      <para>
 631        A classic problem here is when you provide callbacks or
 632        hooks: if you call these with the lock held, you risk simple
 633        deadlock, or a deadly embrace (who knows what the callback
 634        will do?).  Remember, the other programmers are out to get
 635        you, so don't do this.
 636      </para>
 637     </sect2>
 638
 639     <sect2 id="techs-deadlock-overprevent">
 640      <title>Overzealous Prevention Of Deadlocks</title>
 641
 642      <para>
 643        Deadlocks are problematic, but not as bad as data
 644        corruption.  Code which grabs a read lock, searches a list,
 645        fails to find what it wants, drops the read lock, grabs a
 646        write lock and inserts the object has a race condition.
 647      </para>
 648
 649      <para>
 650        If you don't see why, please stay the fuck away from my code.
 651      </para>
 652     </sect2>
 653    </sect1>
 654
 655    <sect1 id="per-cpu">
 656     <title>Per-CPU Data</title>
 657
 658     <para>
 659       A great technique for avoiding locking which is used fairly
 660       widely is to duplicate information for each CPU.  For example,
 661       if you wanted to keep a count of a common condition, you could
 662       use a spin lock and a single counter.  Nice and simple.
 663     </para>
 664
 665     <para>
 666       If that was too slow [it's probably not], you could instead
 667       use a counter for each CPU [don't], then none of them need an
 668       exclusive lock [you're wasting your time here].  To make sure
 669       the CPUs don't have to synchronize caches all the time, align
 670       the counters to cache boundaries by appending
 671       `__cacheline_aligned' to the declaration
 672       (<filename class=headerfile>include/linux/cache.h</filename>).
 673       [Can't you think of anything better to do?]
 674     </para>
 675
 676     <para>
 677       They will need a read lock to access their own counters,
 678       however.  That way you can use a write lock to grant exclusive
 679       access to all of them at once, to tally them up.
 680     </para>
 681    </sect1>
 682
 683    <sect1 id="brlock">
 684     <title>Big Reader Locks</title>
 685
 686     <para>
 687       A classic example of per-CPU information is Ingo's `big
 688       reader' locks
 689       (<filename class=headerfile>linux/include/brlock.h</filename>).  These
 690       use the Per-CPU Data techniques described above to create a lock which
 691       is very fast to get a read lock, but agonizingly slow for a write
 692       lock.
 693     </para>
 694
 695     <para>
 696       Fortunately, there are a limited number of these locks
 697       available: you have to go through a strict interview process
 698       to get one.
 699     </para>
 700    </sect1>
 701
 702    <sect1 id="lock-avoidance-rw">
 703     <title>Avoiding Locks: Read And Write Ordering</title>
 704
 705     <para>
 706       Sometimes it is possible to avoid locking.  Consider the
 707       following case from the 2.2 firewall code, which inserted an
 708       element into a single linked list in user context:
 709     </para>
 710
 711     <programlisting>
 712         new-&gt;next = i-&gt;next;
 713         i-&gt;next = new;
 714     </programlisting>
 715
 716     <para>
 717       Here the author (Alan Cox, who knows what he's doing) assumes
 718       that the pointer assignments are atomic.  This is important,
 719       because networking packets would traverse this list on bottom
 720       halves without a lock.  Depending on their exact timing, they
 721       would either see the new element in the list with a valid
 722       <structfield>next</structfield> pointer, or it would not be in the
 723       list yet.  A lock is still required against other CPUs inserting
 724       or deleting from the list, of course.
 725     </para>
 726
 727     <para>
 728       Of course, the writes <emphasis>must</emphasis> be in this
 729       order, otherwise the new element appears in the list with an
 730       invalid <structfield>next</structfield> pointer, and any other
 731       CPU iterating at the wrong time will jump through it into garbage.
 732       Because modern CPUs reorder, Alan's code actually read as follows:
 733     </para>
 734
 735     <programlisting>
 736         new-&gt;next = i-&gt;next;
 737         wmb();
 738         i-&gt;next = new;
 739     </programlisting>
 740
 741     <para>
 742       The <function>wmb()</function> is a write memory barrier
 743       (<filename class=headerfile>include/asm/system.h</filename>): neither
 744       the compiler nor the CPU will allow any writes to memory after the
 745       <function>wmb()</function> to be visible to other hardware
 746       before any of the writes before the <function>wmb()</function>.
 747     </para>
 748
 749     <para>
 750       As i386 does not do write reordering, this bug would never
 751       show up on that platform.  On other SMP platforms, however, it
 752       will.
 753     </para>
 754
 755     <para>
 756       There is also <function>rmb()</function> for read ordering: to ensure
 757       any previous variable reads occur before following reads.  The simple
 758       <function>mb()</function> macro combines both
 759       <function>rmb()</function> and <function>wmb()</function>.
 760     </para>
 761
 762     <para>
 763       Some atomic operations are defined to act as a memory barrier
 764       (ie. as per the <function>mb()</function> macro, but if in
 765       doubt, be explicit.
 766       <!-- Rusty Russell 2 May 2001, 2.4.4 -->
 767       Also,
 768       spinlock operations act as partial barriers: operations after
 769       gaining a spinlock will never be moved to precede the
 770       <function>spin_lock()</function> call, and operations before
 771       releasing a spinlock will never be moved after the
 772       <function>spin_unlock()</function> call.
 773       <!-- Manfred Spraul <manfreds@colorfullife.com>
 774            24 May 2000 2.3.99-pre9 -->
 775     </para>
 776    </sect1>
 777
 778    <sect1 id="lock-avoidance-atomic-ops">
 779     <title>Avoiding Locks: Atomic Operations</title>
 780
 781     <para>
 782       There are a number of atomic operations defined in
 783       <filename class=headerfile>include/asm/atomic.h</filename>: these
 784       are guaranteed to be seen atomically from all CPUs in the system, thus
 785       avoiding races. If your shared data consists of a single counter, say,
 786       these operations might be simpler than using spinlocks (although for
 787       anything non-trivial using spinlocks is clearer).
 788     </para>
 789
 790     <para>
 791       Note that the atomic operations do in general not act as memory
 792       barriers. Instead you can insert a memory barrier before or
 793       after <function>atomic_inc()</function> or
 794       <function>atomic_dec()</function> by inserting
 795       <function>smp_mb__before_atomic_inc()</function>,
 796       <function>smp_mb__after_atomic_inc()</function>,
 797       <function>smp_mb__before_atomic_dec()</function> or
 798       <function>smp_mb__after_atomic_dec()</function>
 799       respectively. The advantage of using those macros instead of
 800       <function>smp_mb()</function> is, that they are cheaper on some
 801       platforms.
 802       <!-- Sebastian Wilhelmi <seppi@seppi.de> 2002-03-04 -->
 803     </para>
 804    </sect1>
 805
 806    <sect1 id="ref-counts">
 807     <title>Protecting A Collection of Objects: Reference Counts</title>
 808
 809     <para>
 810       Locking a collection of objects is fairly easy: you get a
 811       single spinlock, and you make sure you grab it before
 812       searching, adding or deleting an object.
 813     </para>
 814
 815     <para>
 816       The purpose of this lock is not to protect the individual
 817       objects: you might have a separate lock inside each one for
 818       that.  It is to protect the <emphasis>data structure
 819       containing the objects</emphasis> from race conditions.  Often
 820       the same lock is used to protect the contents of all the
 821       objects as well, for simplicity, but they are inherently
 822       orthogonal (and many other big words designed to confuse).
 823     </para>
 824
 825     <para>
 826       Changing this to a read-write lock will often help markedly if
 827       reads are far more common that writes.  If not, there is
 828       another approach you can use to reduce the time the lock is
 829       held: reference counts.
 830     </para>
 831
 832     <para>
 833       In this approach, an object has an owner, who sets the
 834       reference count to one.  Whenever you get a pointer to the
 835       object, you increment the reference count (a `get' operation).
 836       Whenever you relinquish a pointer, you decrement the reference
 837       count (a `put' operation).  When the owner wants to destroy
 838       it, they mark it dead, and do a put.
 839     </para>
 840
 841     <para>
 842       Whoever drops the reference count to zero (usually implemented
 843       with <function>atomic_dec_and_test()</function>) actually cleans
 844       up and frees the object.
 845     </para>
 846
 847     <para>
 848       This means that you are guaranteed that the object won't
 849       vanish underneath you, even though you no longer have a lock
 850       for the collection.
 851     </para>
 852
 853     <para>
 854       Here's some skeleton code:
 855     </para>
 856
 857     <programlisting>
 858         void create_foo(struct foo *x)
 859         {
 860                 atomic_set(&amp;x-&gt;use, 1);
 861                 spin_lock_bh(&amp;list_lock);
 862                 ... insert in list ...
 863                 spin_unlock_bh(&amp;list_lock);
 864         }
 865
 866         struct foo *get_foo(int desc)
 867         {
 868                 struct foo *ret;
 869
 870                 spin_lock_bh(&amp;list_lock);
 871                 ... find in list ...
 872                 if (ret) atomic_inc(&amp;ret-&gt;use);
 873                 spin_unlock_bh(&amp;list_lock);
 874
 875                 return ret;
 876         }
 877
 878         void put_foo(struct foo *x)
 879         {
 880                 if (atomic_dec_and_test(&amp;x-&gt;use))
 881                         kfree(foo);
 882         }
 883
 884         void destroy_foo(struct foo *x)
 885         {
 886                 spin_lock_bh(&amp;list_lock);
 887                 ... remove from list ...
 888                 spin_unlock_bh(&amp;list_lock);
 889
 890                 put_foo(x);
 891         }
 892     </programlisting>
 893
 894     <sect2 id="helpful-macros">
 895      <title>Macros To Help You</title>
 896      <para>
 897        There are a set of debugging macros tucked inside
 898        <filename class=headerfile>include/linux/netfilter_ipv4/lockhelp.h</filename>
 899        and <filename class=headerfile>listhelp.h</filename>: these are very
 900        useful for ensuring that locks are held in the right places to protect
 901        infrastructure.
 902      </para>
 903     </sect2>
 904    </sect1>
 905
 906    <sect1 id="sleeping-things">
 907     <title>Things Which Sleep</title>
 908
 909     <para>
 910       You can never call the following routines while holding a
 911       spinlock, as they may sleep.  This also means you need to be in
 912       user context.
 913     </para>
 914
 915     <itemizedlist>
 916      <listitem>
 917       <para>
 918         Accesses to
 919         <firstterm linkend="gloss-userspace">userspace</firstterm>:
 920       </para>
 921       <itemizedlist>
 922        <listitem>
 923         <para>
 924           <function>copy_from_user()</function>
 925         </para>
 926        </listitem>
 927        <listitem>
 928         <para>
 929           <function>copy_to_user()</function>
 930         </para>
 931        </listitem>
 932        <listitem>
 933         <para>
 934           <function>get_user()</function>
 935         </para>
 936        </listitem>
 937        <listitem>
 938         <para>
 939           <function> put_user()</function>
 940         </para>
 941        </listitem>
 942       </itemizedlist>
 943      </listitem>
 944
 945      <listitem>
 946       <para>
 947         <function>kmalloc(GFP_KERNEL)</function>
 948       </para>
 949      </listitem>
 950
 951      <listitem>
 952       <para>
 953       <function>down_interruptible()</function> and
 954       <function>down()</function>
 955       </para>
 956       <para>
 957        There is a <function>down_trylock()</function> which can be
 958        used inside interrupt context, as it will not sleep.
 959        <function>up()</function> will also never sleep.
 960       </para>
 961      </listitem>
 962     </itemizedlist>
 963
 964     <para>
 965      <function>printk()</function> can be called in
 966      <emphasis>any</emphasis> context, interestingly enough.
 967     </para>
 968    </sect1>
 969
 970    <sect1 id="sparc">
 971     <title>The Fucked Up Sparc</title>
 972
 973     <para>
 974       Alan Cox says <quote>the irq disable/enable is in the register
 975       window on a sparc</quote>.  Andi Kleen says <quote>when you do
 976       restore_flags in a different function you mess up all the
 977       register windows</quote>.
 978     </para>
 979
 980     <para>
 981       So never pass the flags word set by
 982       <function>spin_lock_irqsave()</function> and brethren to another
 983       function (unless it's declared <type>inline</type>.  Usually no-one
 984       does this, but now you've been warned.  Dave Miller can never do
 985       anything in a straightforward manner (I can say that, because I have
 986       pictures of him and a certain PowerPC maintainer in a compromising
 987       position).
 988     </para>
 989    </sect1>
 990
 991    <sect1 id="racing-timers">
 992     <title>Racing Timers: A Kernel Pastime</title>
 993
 994     <para>
 995       Timers can produce their own special problems with races.
 996       Consider a collection of objects (list, hash, etc) where each
 997       object has a timer which is due to destroy it.
 998     </para>
 999
1000     <para>
1001       If you want to destroy the entire collection (say on module
1002       removal), you might do the following:
1003     </para>
1004
1005     <programlisting>
1006         /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
1007            HUNGARIAN NOTATION */
1008         spin_lock_bh(&amp;list_lock);
1009
1010         while (list) {
1011                 struct foo *next = list-&gt;next;
1012                 del_timer(&amp;list-&gt;timer);
1013                 kfree(list);
1014                 list = next;
1015         }
1016
1017         spin_unlock_bh(&amp;list_lock);
1018     </programlisting>
1019
1020     <para>
1021       Sooner or later, this will crash on SMP, because a timer can
1022       have just gone off before the <function>spin_lock_bh()</function>,
1023       and it will only get the lock after we
1024       <function>spin_unlock_bh()</function>, and then try to free
1025       the element (which has already been freed!).
1026     </para>
1027
1028     <para>
1029       This can be avoided by checking the result of
1030       <function>del_timer()</function>: if it returns
1031       <returnvalue>1</returnvalue>, the timer has been deleted.
1032       If <returnvalue>0</returnvalue>, it means (in this
1033       case) that it is currently running, so we can do:
1034     </para>
1035
1036     <programlisting>
1037         retry:
1038                 spin_lock_bh(&amp;list_lock);
1039
1040                 while (list) {
1041                         struct foo *next = list-&gt;next;
1042                         if (!del_timer(&amp;list-&gt;timer)) {
1043                                 /* Give timer a chance to delete this */
1044                                 spin_unlock_bh(&amp;list_lock);
1045                                 goto retry;
1046                         }
1047                         kfree(list);
1048                         list = next;
1049                 }
1050
1051                 spin_unlock_bh(&amp;list_lock);
1052     </programlisting>
1053
1054     <para>
1055       Another common problem is deleting timers which restart
1056       themselves (by calling <function>add_timer()</function> at the end
1057       of their timer function).  Because this is a fairly common case
1058       which is prone to races, you should use <function>del_timer_sync()</function>
1059       (<filename class=headerfile>include/linux/timer.h</filename>)
1060       to handle this case.  It returns the number of times the timer
1061       had to be deleted before we finally stopped it from adding itself back
1062       in.
1063     </para>
1064    </sect1>
1065   </chapter>
1066
1067   <chapter id="references">
1068    <title>Further reading</title>
1069
1070    <itemizedlist>
1071     <listitem>
1072      <para>
1073        <filename>Documentation/spinlocks.txt</filename>:
1074        Linus Torvalds' spinlocking tutorial in the kernel sources.
1075      </para>
1076     </listitem>
1077
1078     <listitem>
1079      <para>
1080        Unix Systems for Modern Architectures: Symmetric
1081        Multiprocessing and Caching for Kernel Programmers:
1082      </para>
1083
1084      <para>
1085        Curt Schimmel's very good introduction to kernel level
1086        locking (not written for Linux, but nearly everything
1087        applies).  The book is expensive, but really worth every
1088        penny to understand SMP locking. [ISBN: 0201633388]
1089      </para>
1090     </listitem>
1091    </itemizedlist>
1092   </chapter>
1093
1094   <chapter id="thanks">
1095     <title>Thanks</title>
1096
1097     <para>
1098       Thanks to Telsa Gwynne for DocBooking, neatening and adding
1099       style.
1100     </para>
1101
1102     <para>
1103       Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul
1104       Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul and Tim
1105       Waugh for proofreading, correcting, flaming, commenting.
1106     </para>
1107
1108     <para>
1109       Thanks to the cabal for having no influence on this document.
1110     </para>
1111   </chapter>
1112
1113   <glossary id="glossary">
1114    <title>Glossary</title>
1115
1116    <glossentry id="gloss-bh">
1117     <glossterm>bh</glossterm>
1118      <glossdef>
1119       <para>
1120         Bottom Half: for historical reasons, functions with
1121         `_bh' in them often now refer to any software interrupt, e.g.
1122         <function>spin_lock_bh()</function> blocks any software interrupt
1123         on the current CPU.  Bottom halves are deprecated, and will
1124         eventually be replaced by tasklets.  Only one bottom half will be
1125         running at any time.
1126      </para>
1127     </glossdef>
1128    </glossentry>
1129
1130    <glossentry id="gloss-hwinterrupt">
1131     <glossterm>Hardware Interrupt / Hardware IRQ</glossterm>
1132     <glossdef>
1133      <para>
1134        Hardware interrupt request.  <function>in_irq()</function> returns
1135        <returnvalue>true</returnvalue> in a hardware interrupt handler (it
1136        also returns true when interrupts are blocked).
1137      </para>
1138     </glossdef>
1139    </glossentry>
1140
1141    <glossentry id="gloss-interruptcontext">
1142     <glossterm>Interrupt Context</glossterm>
1143     <glossdef>
1144      <para>
1145        Not user context: processing a hardware irq or software irq.
1146        Indicated by the <function>in_interrupt()</function> macro
1147        returning <returnvalue>true</returnvalue> (although it also
1148        returns true when interrupts or BHs are blocked).
1149      </para>
1150     </glossdef>
1151    </glossentry>
1152
1153    <glossentry id="gloss-smp">
1154     <glossterm><acronym>SMP</acronym></glossterm>
1155     <glossdef>
1156      <para>
1157        Symmetric Multi-Processor: kernels compiled for multiple-CPU
1158        machines.  (CONFIG_SMP=y).
1159      </para>
1160     </glossdef>
1161    </glossentry>
1162
1163    <glossentry id="gloss-softirq">
1164     <glossterm>softirq</glossterm>
1165     <glossdef>
1166      <para>
1167        Strictly speaking, one of up to 32 enumerated software
1168        interrupts which can run on multiple CPUs at once.
1169        Sometimes used to refer to tasklets and bottom halves as
1170        well (ie. all software interrupts).
1171      </para>
1172     </glossdef>
1173    </glossentry>
1174
1175    <glossentry id="gloss-swinterrupt">
1176     <glossterm>Software Interrupt / Software IRQ</glossterm>
1177     <glossdef>
1178      <para>
1179        Software interrupt handler.  <function>in_irq()</function> returns
1180        <returnvalue>false</returnvalue>; <function>in_softirq()</function>
1181        returns <returnvalue>true</returnvalue>.  Tasklets, softirqs and
1182        bottom halves all fall into the category of `software interrupts'.
1183      </para>
1184     </glossdef>
1185    </glossentry>
1186
1187    <glossentry id="gloss-tasklet">
1188     <glossterm>tasklet</glossterm>
1189     <glossdef>
1190      <para>
1191        A dynamically-registrable software interrupt,
1192        which is guaranteed to only run on one CPU at a time.
1193      </para>
1194     </glossdef>
1195    </glossentry>
1196
1197    <glossentry id="gloss-up">
1198     <glossterm><acronym>UP</acronym></glossterm>
1199     <glossdef>
1200      <para>
1201        Uni-Processor: Non-SMP.  (CONFIG_SMP=n).
1202      </para>
1203     </glossdef>
1204    </glossentry>
1205
1206    <glossentry id="gloss-usercontext">
1207     <glossterm>User Context</glossterm>
1208     <glossdef>
1209      <para>
1210        The kernel executing on behalf of a particular
1211        process or kernel thread (given by the <function>current()</function>
1212        macro.)  Not to be confused with userspace.  Can be interrupted by
1213        software  or hardware interrupts.
1214      </para>
1215     </glossdef>
1216    </glossentry>
1217
1218    <glossentry id="gloss-userspace">
1219     <glossterm>Userspace</glossterm>
1220     <glossdef>
1221      <para>
1222        A process executing its own code outside the kernel.
1223      </para>
1224     </glossdef>
1225    </glossentry>
1226
1227   </glossary>
1228 </book>
1229