cpu-hotplug: replace lock_cpu_hotplug() with get_online_cpus()
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 58408dd..4e17beb 100644 (file)
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -24,7 +24,7 @@ Contents:
  (*) Explicit kernel barriers.
 
      - Compiler barrier.
-     - The CPU memory barriers.
+     - CPU memory barriers.
      - MMIO write barrier.
 
  (*) Implicit kernel memory barriers.
@@ -265,7 +265,7 @@
 Memory barriers are such interventions.  They impose a perceived partial
 ordering over the memory operations on either side of the barrier.
 
 Such enforcement is important because the CPUs and other devices in a system
-can use a variety of tricks to improve performance - including reordering,
+can use a variety of tricks to improve performance, including reordering,
 deferral and combination of memory operations; speculative loads; speculative
 branch prediction and various types of caching.  Memory barriers are used to
 override or suppress these tricks, allowing the code to sanely control the
@@ -457,7 +457,7 @@ sequence, Q must be either &A or &B, and that:
 	(Q == &A) implies (D == 1)
 	(Q == &B) implies (D == 4)
 
-But! CPU 2's perception of P may be updated _before_ its perception of B, thus
+But!  CPU 2's perception of P may be updated _before_ its perception of B, thus
 leading to the following situation:
 
 	(Q == &B) and (D == 2) ????
@@ -573,7 +573,7 @@
 Basically, the read barrier always has to be there, even though it can be of
 the "weaker" type.
 
 [!] Note that the stores before the write barrier would normally be expected to
-match the loads after the read barrier or data dependency barrier, and vice
+match the loads after the read barrier or the data dependency barrier, and vice
 versa:
 
 	CPU 1                               CPU 2
@@ -588,7 +588,7 @@ versa:
 EXAMPLES OF MEMORY BARRIER SEQUENCES
 ------------------------------------
 
-Firstly, write barriers act as a partial orderings on store operations.
+Firstly, write barriers act as partial orderings on store operations.
 Consider the following sequence of events:
 
 	CPU 1
@@ -608,15 +608,15 @@ STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
 	+-------+       :      :
 	|       |       +------+
 	|       |------>| C=3  |     }     /\
-	|       |  :    +------+     }-----  \  -----> Events perceptible
-	|       |  :    | A=1  |     }        \/ to rest of system
+	|       |  :    +------+     }-----  \  -----> Events perceptible to
+	|       |  :    | A=1  |     }        \/ the rest of the system
 	|       |  :    +------+     }
 	| CPU 1 |  :    | B=2  |     }
 	|       |       +------+     }
 	|       |  wwwwwwwwwwwwwwww  }   <--- At this point the write barrier
 	|       |       +------+     }        requires all stores prior to the
 	|       |  :    | E=5  |     }        barrier to be committed before
-	|       |  :    +------+     }        further stores may be take place.
+	|       |  :    +------+     }        further stores may take place
 	|       |------>| D=4  |     }
 	|       |       +------+
 	+-------+       :      :
@@ -626,7 +626,7 @@ STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
 	                                        V
 
-Secondly, data dependency barriers act as a partial orderings on data-dependent
+Secondly, data dependency barriers act as partial orderings on data-dependent
 loads.  Consider the following sequence of events:
 
 	CPU 1			CPU 2
@@ -975,7 +975,7 @@
 compiler from moving the memory accesses either side of it to the other side:
 
 	barrier();
 
-This a general barrier - lesser varieties of compiler barrier do not exist.
+This is a general barrier - lesser varieties of compiler barrier do not exist.
 
 The compiler barrier has no direct effect on the CPU, which may then reorder
 things however it wishes.
@@ -997,7 +997,7 @@ The Linux kernel has eight basic CPU memory barriers:
 All CPU memory barriers unconditionally imply compiler barriers.
 
 SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
-systems because it is assumed that a CPU will be appear to be self-consistent,
+systems because it is assumed that a CPU will appear to be self-consistent,
 and will order overlapping accesses correctly with respect to itself.
 
 [!] Note that SMP memory barriers _must_ be used to control the ordering of
@@ -1146,9 +1146,9 @@ for each construct.  These operations all imply certain barriers:
 Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
 equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
 
-[!] Note: one of the consequence of LOCKs and UNLOCKs being only one-way
-    barriers is that the effects instructions outside of a critical section may
-    seep into the inside of the critical section.
+[!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
+    barriers is that the effects of instructions outside of a critical section
+    may seep into the inside of the critical section.
 
 A LOCK followed by an UNLOCK may not be assumed to be full memory barrier
 because it is possible for an access preceding the LOCK to happen after the
@@ -1239,7 +1239,7 @@ three CPUs; then should the following sequence of events occur:
 	UNLOCK M			UNLOCK Q
 	*D = d;				*H = h;
 
-Then there is no guarantee as to what order CPU #3 will see the accesses to *A
+Then there is no guarantee as to what order CPU 3 will see the accesses to *A
 through *H occur in, other than the constraints imposed by the separate locks
 on the separate CPUs.  It might, for example, see:
@@ -1269,12 +1269,12 @@ However, if the following occurs:
 	UNLOCK M	[2]
 					*H = h;
 
-CPU #3 might see:
+CPU 3 might see:
 
 	*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
 		LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
 
-But assuming CPU #1 gets the lock first, it won't see any of:
+But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
 
 	*B, *C, *D, *F, *G or *H preceding LOCK M [1]
 	*A, *B or *C following UNLOCK M [1]
@@ -1327,12 +1327,12 @@ spinlock, for example:
 	mmiowb();
 	spin_unlock(Q);
 
-this will ensure that the two stores issued on CPU #1 appear at the PCI bridge
-before either of the stores issued on CPU #2.
+this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
+before either of the stores issued on CPU 2.
 
-Furthermore, following a store by a load to the same device obviates the need
-for an mmiowb(), because the load forces the store to complete before the load
+Furthermore, following a store by a load from the same device obviates the need
+for the mmiowb(), because the load forces the store to complete before the load
 is performed:
 
 	CPU 1				CPU 2
@@ -1363,7 +1363,7 @@ circumstances in which reordering definitely _could_ be a problem:
 (*) Atomic operations.
 
- (*) Accessing devices (I/O).
+ (*) Accessing devices.
 
 (*) Interrupts.
@@ -1399,7 +1399,7 @@ To wake up a particular waiter, the up_read() or up_write() functions have to:
 (1) read the next pointer from this waiter's record to know as to where the
     next waiter record is;
 
- (4) read the pointer to the waiter's task structure;
+ (2) read the pointer to the waiter's task structure;
 
 (3) clear the task pointer to tell the waiter it has been given the semaphore;
@@ -1407,7 +1407,7 @@ To wake up a particular waiter, the up_read() or up_write() functions have to:
 (5) release the reference held on the waiter's task struct.
 
-In otherwords, it has to perform this sequence of events:
+In other words, it has to perform this sequence of events:
 
 	LOAD waiter->list.next;
 	LOAD waiter->task;
@@ -1479,7 +1479,8 @@ kernel.
 Any atomic operation that modifies some state in memory and returns information
 about the state (old or new) implies an SMP-conditional general memory barrier
-(smp_mb()) on each side of the actual operation.  These include:
+(smp_mb()) on each side of the actual operation (with the exception of
+explicit lock operations, described later).  These include:
 
 	xchg();
 	cmpxchg();
@@ -1502,7 +1503,7 @@
 operations and adjusting reference counters towards object destruction, and as
 such the implicit memory barrier effects are necessary.
 
-The following operation are potential problems as they do _not_ imply memory
+The following operations are potential problems as they do _not_ imply memory
 barriers, but might be used for implementing such things as UNLOCK-class
 operations:
@@ -1517,7 +1518,7 @@ With these the appropriate explicit memory barrier should be used if necessary
 The following also do _not_ imply memory barriers, and so may require explicit
 memory barriers under some circumstances (smp_mb__before_atomic_dec() for
-instance)):
+instance):
 
 	atomic_add();
 	atomic_sub();
@@ -1536,10 +1537,19 @@
 If they're used for constructing a lock of some description, then they probably
 do need memory barriers as a lock primitive generally has to do things in a
 specific order.
-
 Basically, each usage case has to be carefully considered as to whether memory
 barriers are needed or not.
 
+The following operations are special locking primitives:
+
+	test_and_set_bit_lock();
+	clear_bit_unlock();
+	__clear_bit_unlock();
+
+These implement LOCK-class and UNLOCK-class operations. These should be used in
+preference to other operations when implementing locking primitives, because
+their implementations can be optimised on many architectures.
+
 [!] Note that special memory barrier primitives are available for these
 situations because on some CPUs the atomic instructions used imply full memory
 barriers, and so barrier instructions are superfluous in conjunction with them,
@@ -1641,8 +1651,8 @@ functions:
     indeed have special I/O space access cycles and instructions, but many
     CPUs don't have such a concept.
 
-    The PCI bus, amongst others, defines an I/O space concept - which on such
-    CPUs as i386 and x86_64 cpus readily maps to the CPU's concept of I/O
+    The PCI bus, amongst others, defines an I/O space concept which - on such
+    CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
     space.  However, it may also be mapped as a virtual I/O space in the CPU's
     memory map, particularly on those CPUs that don't support alternate I/O
     spaces.
@@ -1664,7 +1674,7 @@ functions:
     i386 architecture machines, for example, this is controlled by way of the
     MTRR registers.
 
-    Ordinarily, these will be guaranteed to be fully ordered and uncombined,,
+    Ordinarily, these will be guaranteed to be fully ordered and uncombined,
     provided they're not accessing a prefetchable device.
 
     However, intermediary hardware (such as a PCI bridge) may indulge in
@@ -1689,7 +1699,7 @@ functions:
 (*) ioreadX(), iowriteX()
 
-    These will perform as appropriate for the type of access they're actually
+    These will perform appropriately for the type of access they're actually
     doing, be it inX()/outX() or readX()/writeX().
@@ -1705,7 +1715,7 @@ of arch-specific code.
 This means that it must be considered that the CPU will execute its instruction
 stream in any order it feels like - or even in parallel - provided that if an
-instruction in the stream depends on the an earlier instruction, then that
+instruction in the stream depends on an earlier instruction, then that
 earlier instruction must be sufficiently complete[*] before the later
 instruction may proceed; in other words: provided that the appearance of
 causality is maintained.
@@ -1795,8 +1805,8 @@
 eventually become visible on all CPUs, there's no guarantee that they will
 become apparent in the same order on those other CPUs.
 
-Consider dealing with a system that has pair of CPUs (1 & 2), each of which has
-a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):
+Consider dealing with a system that has a pair of CPUs (1 & 2), each of which
+has a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):
 
 	            :              :
 	            :          +--------+
@@ -1835,7 +1845,7 @@ Imagine the system has the following properties:
 (*) the coherency queue is not flushed by normal loads to lines already
     present in the cache, even though the contents of the queue may
-    potentially effect those loads.
+    potentially affect those loads.
 
 Imagine, then, that two writes are made on the first CPU, with a write barrier
 between them to guarantee that they will appear to reach that CPU's caches in
@@ -1845,7 +1855,7 @@ the requisite order:
 	=============== =============== =======================================
 		u == 0, v == 1 and p == &u, q == &u
 	v = 2;
-	smp_wmb();			Make sure change to v visible before
+	smp_wmb();			Make sure change to v is visible before
 					 change to p
 	<A:modify v=2>			v is now in cache A exclusively
 	p = &v;
@@ -1853,7 +1863,7 @@ the requisite order:
 The write memory barrier forces the other CPUs in the system to perceive that
 the local CPU's caches have apparently been updated in the correct order.  But
-now imagine that the second CPU that wants to read those values:
+now imagine that the second CPU wants to read those values:
 
 	CPU 1			CPU 2		COMMENT
 	=============== =============== =======================================
@@ -1861,7 +1871,7 @@ now imagine that the second CPU that wants to read those values:
 	q = p;
 	x = *q;
 
-The above pair of reads may then fail to happen in expected order, as the
+The above pair of reads may then fail to happen in the expected order, as the
 cacheline holding p may get updated in one of the second CPU's caches whilst
 the update to the cacheline holding v is delayed in the other of the second
 CPU's caches by some other cache event:
@@ -1916,7 +1926,7 @@ access depends on a read, not all do, so it may not be relied on.
 Other CPUs may also have split caches, but must coordinate between the various
 cachelets for normal memory accesses.  The semantics of the Alpha removes the
-need for coordination in absence of memory barriers.
+need for coordination in the absence of memory barriers.
 
 
 CACHE COHERENCY VS DMA
@@ -1931,10 +1941,10 @@ invalidate them as well).
 In addition, the data DMA'd to RAM by a device may be overwritten by dirty
 cache lines being written back to RAM from a CPU's cache after the device has
-installed its own data, or cache lines simply present in a CPUs cache may
-simply obscure the fact that RAM has been updated, until at such time as the
-cacheline is discarded from the CPU's cache and reloaded. To deal with this,
-the appropriate part of the kernel must invalidate the overlapping bits of the
+installed its own data, or cache lines present in the CPU's cache may simply
+obscure the fact that RAM has been updated, until at such time as the cacheline
+is discarded from the CPU's cache and reloaded. To deal with this, the
+appropriate part of the kernel must invalidate the overlapping bits of the
 cache on each CPU.
 
 See Documentation/cachetlb.txt for more information on cache management.
@@ -1944,7 +1954,7 @@ CACHE COHERENCY VS MMIO
 -----------------------
 
 Memory mapped I/O usually takes place through memory locations that are part of
-a window in the CPU's memory space that have different properties assigned than
+a window in the CPU's memory space that has different properties assigned than
 the usual RAM directed window.
 
 Amongst these properties is usually the fact that such accesses bypass the
@@ -1960,7 +1970,7 @@ THE THINGS CPUS GET UP TO
 =========================
 
 A programmer might take it for granted that the CPU will perform memory
-operations in exactly the order specified, so that if a CPU is, for example,
+operations in exactly the order specified, so that if the CPU is, for example,
 given the following piece of code to execute:
 
 	a = *A;
@@ -1969,7 +1979,7 @@ given the following piece of code to execute:
 	d = *D;
 	*E = e;
 
-They would then expect that the CPU will complete the memory operation for each
+they would then expect that the CPU will complete the memory operation for each
 instruction before moving on to the next one, leading to a definite sequence of
 operations as seen by external observers in the system:
@@ -1986,8 +1996,8 @@ assumption doesn't hold because:
 (*) loads may be done speculatively, and the result discarded should it prove
     to have been unnecessary;
 
- (*) loads may be done speculatively, leading to the result having being
-     fetched at the wrong time in the expected sequence of events;
+ (*) loads may be done speculatively, leading to the result having been fetched
+     at the wrong time in the expected sequence of events;
 
 (*) the order of the memory accesses may be rearranged to promote better use
     of the CPU buses and caches;
@@ -2069,12 +2079,12 @@ AND THEN THERE'S THE ALPHA
 The DEC Alpha CPU is one of the most relaxed CPUs there is.  Not only that,
 some versions of the Alpha CPU have a split data cache, permitting them to have
-two semantically related cache lines updating at separate times.  This is where
+two semantically-related cache lines updated at separate times.  This is where
 the data dependency barrier really becomes necessary as this synchronises both
 caches with the memory coherence system, thus making it seem like pointer
 changes vs new data occur in the right order.
 
-The Alpha defines the Linux's kernel's memory barrier model.
+The Alpha defines the Linux kernel's memory barrier model.
 
 See the subsection on "Cache Coherency" above.