GEOM(4) FreeBSD Kernel Interfaces Manual GEOM(4)
NAME
GEOM - modular disk I/O request transformation framework
SYNOPSIS
options GEOM_BDE
options GEOM_CACHE
options GEOM_CONCAT
options GEOM_ELI
options GEOM_GATE
options GEOM_JOURNAL
options GEOM_LABEL
options GEOM_LINUX_LVM
options GEOM_MAP
options GEOM_MIRROR
options GEOM_MOUNTVER
options GEOM_MULTIPATH
options GEOM_NOP
options GEOM_PART_APM
options GEOM_PART_BSD
options GEOM_PART_BSD64
options GEOM_PART_EBR
options GEOM_PART_EBR_COMPAT
options GEOM_PART_GPT
options GEOM_PART_LDM
options GEOM_PART_MBR
options GEOM_RAID
options GEOM_RAID3
options GEOM_SHSEC
options GEOM_STRIPE
options GEOM_UZIP
options GEOM_VIRSTOR
options GEOM_ZERO
DESCRIPTION
The GEOM framework provides an infrastructure in which "classes" can
perform transformations on disk I/O requests on their path from the upper
kernel to the device drivers and back.
Transformations in a GEOM context range from the simple geometric
displacement performed in typical disk partitioning modules, through RAID
algorithms and device multipath resolution, to full-blown cryptographic
protection of the stored data.
Compared to traditional "volume management", GEOM differs from most and
in some cases all previous implementations in the following ways:
o GEOM is extensible. It is trivially simple to write a new class of
transformation and it will not be given stepchild treatment. If
someone for some reason wanted to mount IBM MVS diskpacks, a class
recognizing and configuring their VTOC information would be a trivial
matter.
o GEOM is topologically agnostic. Most volume management
implementations have very strict notions of how classes can fit
together, very often one fixed hierarchy is provided, for instance,
subdisk - plex - volume.
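Fixed hierarchies are bad because they make it impossible to express the
intent efficiently. In the fixed hierarchy above, it is not possible to
mirror two physical disks and then partition the mirror into subdisks.
Instead, one is forced to make subdisks on the physical volumes and to
mirror these two and two, resulting in a much more complex configuration.
GEOM, on the other hand, does not care in which order things are done;
the only restriction is that cycles in the graph are not allowed.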
TERMINOLOGY AND TOPOLOGY
GEOM is quite object-oriented and consequently the terminology borrows a
lot of context and semantics from the OO vocabulary:
A "class", represented by the data structure g_class, implements one
particular kind of transformation. Typical examples are MBR disk
partition, BSD disklabel, and RAID5 classes.
An instance of a class is called a "geom" and represented by the data
structure g_geom. In a typical i386 FreeBSD system, there will be one
geom of class MBR for each disk.
A "provider", represented by the data structure g_provider, is the front
gate at which a geom offers service. A provider is "a disk-like thing
which appears in /dev" - a logical disk in other words. All providers
have three main properties: "name", "sectorsize" and "size".
A "consumer" is the backdoor through which a geom connects to another
geom provider and through which I/O requests are sent.
The topological relationships between these entities are as follows:
o A class has zero or more geom instances.
o A geom has exactly one class it is derived from.
o A geom has zero or more consumers.
o A geom has zero or more providers.
o A consumer can be attached to zero or one providers.
o A provider can have zero or more consumers attached.
All geoms have a rank-number assigned, which is used to detect and
prevent loops in the acyclic directed graph. This rank number is
assigned as follows:
1. A geom with no attached consumers has rank=1.
2. A geom with attached consumers has a rank one higher than the
highest rank of the geoms of the providers its consumers are
attached to.
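The two rank rules can be sketched as a small model. The Python below is
an illustration only, not kernel code: the classes Geom, Provider and
Consumer and the rank() method are hypothetical stand-ins for g_geom,
g_provider and g_consumer, and the geom names are made up for the
example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Provider:
    name: str
    geom: "Geom"  # the geom that offers this provider


@dataclass
class Consumer:
    provider: Optional[Provider] = None  # attached to zero or one provider


@dataclass
class Geom:
    name: str
    consumers: List[Consumer] = field(default_factory=list)

    def rank(self) -> int:
        # Ranks of the geoms underneath every attached consumer.
        below = [c.provider.geom.rank() for c in self.consumers if c.provider]
        # Rule 1: no attached consumers means rank 1.
        if not below:
            return 1
        # Rule 2: one higher than the highest rank underneath.
        return max(below) + 1


# A disk geom, an MBR geom on top of it, and a BSD geom on top of a slice.
disk = Geom("da0")
da0 = Provider("da0", disk)
mbr = Geom("da0-mbr", [Consumer(da0)])
da0s1 = Provider("da0s1", mbr)
bsd = Geom("da0s1-bsd", [Consumer(da0s1)])
```

Because a geom always ranks higher than everything it consumes from, an
attach that would make a geom depend on itself cannot be given a
consistent rank, which is how the rule exposes loops.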
SPECIAL TOPOLOGICAL MANEUVERS
In addition to the straightforward attach, which attaches a consumer to a
provider, and detach, which breaks the bond, a number of special
topological maneuvers exist to facilitate configuration and to improve
the overall flexibility.
TASTING is a process that happens whenever a new class or new provider is
created, and it provides the class a chance to automatically configure an
instance on providers which it recognizes as its own. A typical example
is the MBR disk-partition class which will look for the MBR table in the
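first sector and, if found and validated, will instantiate a geom to
multiplex according to the contents of the MBR.
A new class will be offered to all existing providers in turn and a new
provider will be offered to all classes in turn. Exactly what a class
does to recognize if it should accept the offered provider is not defined
by GEOM, but the sensible set of options are: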
o Examine specific data structures on the disk.
o Examine properties like "sectorsize" or "mediasize" for the provider.
o Examine the rank number of the provider's geom.
o Examine the method name of the provider's geom.
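As an illustration of the first two options, the sketch below tastes a
provider for an MBR. It is a simplified model in Python, not the probe
used by any GEOM class: the function taste_mbr() and its parameters are
hypothetical, and the only on-disk check it makes is the standard 0x55
0xAA boot signature in the last two bytes of a 512-byte first sector. A
real class would validate the partition table as well.

```python
MBR_SIGNATURE_OFFSET = 510  # last two bytes of a 512-byte boot sector
MBR_SIGNATURE = b"\x55\xaa"


def taste_mbr(sectorsize: int, first_sector: bytes) -> bool:
    """Return True if the provider looks like it carries an MBR."""
    # Property check: an MBR needs at least one 512-byte sector.
    if sectorsize < 512 or len(first_sector) < 512:
        return False
    # On-disk data check: the boot signature must be present.
    sig = first_sector[MBR_SIGNATURE_OFFSET:MBR_SIGNATURE_OFFSET + 2]
    return sig == MBR_SIGNATURE


blank = bytes(512)                       # all zeroes, no signature
with_mbr = bytes(510) + MBR_SIGNATURE    # empty table, valid signature
```

Here taste_mbr(512, with_mbr) accepts the provider, while
taste_mbr(512, blank) declines it.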
ORPHANIZATION is the process by which a provider is removed while it
potentially is still being used.
When a geom orphans a provider, all future I/O requests will "bounce" on
the provider with an error code set by the geom. Any consumers attached
to the provider will receive notification about the orphanization when
the event loop gets around to it, and they can take appropriate action at
that time.
A geom which came into being as a result of a normal taste operation
should self-destruct unless it has a way to keep functioning whilst
lacking the orphaned provider. Geoms like disk slicers should therefore
self-destruct whereas RAID5 or mirror geoms will be able to continue as
long as they do not lose quorum.
When a provider is orphaned, this does not necessarily result in any
immediate change in the topology: any attached consumers are still
attached, any opened paths are still open, any outstanding I/O requests
are still outstanding.
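The behaviour described above can be modelled in a few lines. This is a
sketch in Python under stated assumptions, not the kernel
implementation: Provider, Consumer, orphan_provider(), start_io() and
run_events() are hypothetical names, the deque stands in for the GEOM
event loop, and ENXIO is merely an example error code.

```python
import errno
from collections import deque


class Provider:
    def __init__(self, name):
        self.name = name
        self.error = 0        # non-zero once the provider is orphaned
        self.consumers = []   # attached consumers stay attached


class Consumer:
    def __init__(self, name):
        self.name = name
        self.notified = False

    def orphan(self):
        # A real geom would decide here whether to self-destruct.
        self.notified = True


events = deque()  # stand-in for the GEOM event loop


def orphan_provider(provider, error):
    # Future I/O bounces with this error from now on.
    provider.error = error
    # Notification is queued, not delivered synchronously.
    for consumer in provider.consumers:
        events.append(consumer.orphan)


def start_io(provider):
    # Returns 0 on success, or the error the request bounces with.
    return provider.error


def run_events():
    while events:
        events.popleft()()


da0 = Provider("da0")
mbr = Consumer("mbr")
da0.consumers.append(mbr)
orphan_provider(da0, errno.ENXIO)
```

Immediately after orphan_provider() returns, I/O already bounces with
the chosen error, but the consumer has not yet been told and is still
attached; it learns of the event only once run_events() is called.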
The typical scenario is:
o A device driver detects a disk has departed and orphans the
provider for it.
o The geoms on top of the disk receive the orphanization event
and orphan all their providers in turn. Providers which are
not attached to will typically self-destruct right away. This
process continues in a quasi-recursive fashion until all
relevant pieces of the tree have heard the bad news.
o Eventually the buck stops when it reaches geom_dev at the top
of the stack.
o Geom_dev will call destroy_dev(9) to stop any more requests
from coming in. It will sleep until any and all outstanding
I/O requests have been returned. It will explicitly close
(i.e., zero the access counts), a change which will propagate
all the way down through the mesh. It will then detach and
destroy its geom.
o The geom whose provider is now detached will destroy the
provider, detach and destroy its consumer and destroy its geom.
o This process percolates all the way down through the mesh,
until the cleanup is complete.
While this approach seems byzantine, it does provide the maximum
flexibility and robustness in handling disappearing devices.
The one absolutely crucial detail to be aware of is that if the device
driver does not return all I/O requests, the tree will not unravel.
SPOILING is a special case of orphanization used to protect against stale
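metadata. It is probably easiest to understand spoiling by going through
an example.
Imagine a disk, da0, on top of which an MBR geom provides da0s1 and da0s2,
and on top of da0s1 a BSD geom provides da0s1a through da0s1e, and that
both the MBR and BSD geoms have autoconfigured based on data structures
on the disk media. Now imagine the case where da0 is opened for writing
and those data structures are modified or overwritten: now the geoms
would be operating on stale metadata unless some notification system can
inform them otherwise.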
To avoid this situation, when the open of da0 for write happens, all
attached consumers are told about this and geoms like MBR and BSD will
self-destruct as a result. When da0 is closed, it will be offered for
tasting again and, if the data structures for MBR and BSD are still
there, new geoms will instantiate themselves anew.
Now for the fine print:
If any of the paths through the MBR or BSD module were open, they would
have opened downwards with an exclusive bit thus rendering it impossible
to open da0 for writing in that case. Conversely, the requested
exclusive bit would render it impossible to open a path through the MBR
geom while da0 is open for writing.
From this it also follows that changing the size of open geoms can only
be done with their cooperation.
Finally: the spoiling only happens when the write count goes from zero to
non-zero and the retasting happens only when the write count goes from
non-zero to zero.
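The edge-triggered nature of spoiling and retasting can be sketched as
follows. The Python model below is illustrative only: the Provider
class, its open_for_write() and close_write() methods and the event log
are hypothetical and do not correspond to kernel interfaces.

```python
class Provider:
    def __init__(self, name):
        self.name = name
        self.write_count = 0
        self.log = []  # records "spoil" and "retaste" events in order

    def open_for_write(self):
        self.write_count += 1
        if self.write_count == 1:      # zero -> non-zero
            self.log.append("spoil")

    def close_write(self):
        if self.write_count == 0:
            raise ValueError("provider is not open for writing")
        self.write_count -= 1
        if self.write_count == 0:      # non-zero -> zero
            self.log.append("retaste")


da0 = Provider("da0")
da0.open_for_write()   # first writer: consumers are spoiled
da0.open_for_write()   # second writer: nothing happens
da0.close_write()      # one writer remains: nothing happens
da0.close_write()      # last writer gone: provider is retasted
```

Two overlapping writers therefore cause exactly one spoil and one
retaste, not two of each.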
CONFIGURE is the process where the administrator issues instructions for
a particular class to instantiate itself. There are multiple ways to
express intent in this case - a particular provider may be specified with
a level of override forcing, for instance, a BSD disklabel module to
attach to a provider which was not found palatable during the TASTE
operation.
Finally, I/O is the reason we even do this: it concerns itself with
sending I/O requests through the graph.
I/O REQUESTS, represented by struct bio, originate at a consumer, are
scheduled on its attached provider and, when processed, are returned to
the consumer. It is important to realize that the struct bio which
enters through the provider of a particular geom does not "come out on
the other side". Even simple transformations like MBR and BSD will clone
the struct bio, modify the clone, and schedule the clone on their own
consumer. Note that cloning the struct bio does not involve cloning the
actual data area specified in the I/O request.
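A minimal sketch of this cloning step, assuming a simple partition-style
geom that only shifts offsets, is given below. It is a Python model, not
the kernel code: the Bio dataclass and clone_for_slice() are
hypothetical, and the slice start of 63 sectors is just an example
value. The kernel interface for cloning requests is described in
g_bio(9).

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Bio:
    cmd: str                        # "read", "write", "delete" or "getattr"
    offset: int                     # byte offset on the provider
    length: int                     # number of bytes
    data: bytearray                 # data area, shared between clones
    parent: Optional["Bio"] = None  # request this one was cloned from


def clone_for_slice(bio: Bio, slice_start: int) -> Bio:
    # replace() makes a shallow copy: the clone gets its own offset and
    # parent link, but `data` still refers to the same buffer.
    return replace(bio, offset=bio.offset + slice_start, parent=bio)


buf = bytearray(512)
upper = Bio("read", offset=0, length=512, data=buf)
# The slice begins 63 sectors into the underlying disk (example value).
lower = clone_for_slice(upper, slice_start=63 * 512)
```

The original request is left untouched and the clone is what would be
scheduled on the geom's own consumer; both refer to the same data area.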
In total, four different I/O requests exist in GEOM: read, write, delete,
and "get attribute".
Read and write are self-explanatory.
Delete indicates that a certain range of data is no longer used and that
it can be erased or freed as the underlying technology supports.
Technologies like flash adaptation layers can arrange to erase the
relevant blocks before they are reassigned, and cryptographic devices may
want to fill random bits into the range to reduce the amount of data
available for attack.
It is important to recognize that a delete indication is not a request
and consequently there is no guarantee that the data actually will be
erased or made unavailable unless guaranteed by specific geoms in the
graph. If "secure delete" semantics are required, a geom should be
pushed which converts delete indications into (a sequence of) write
requests.
(Stay tuned while the author rests his brain and fingers: more to come.)
DIAGNOSTICS
Several flags are provided for tracing GEOM operations and unlocking
protection mechanisms via the kern.geom.debugflags sysctl. All of these
flags are off by default, and great care should be taken in turning them
on.
0x01 (G_T_TOPOLOGY)
Provide tracing of topology change events.
0x02 (G_T_BIO)
Provide tracing of buffer I/O requests.
0x04 (G_T_ACCESS)
Provide tracing of access check controls.
0x08 (unused)
0x10 (allow foot shooting)
Allow writing to Rank 1 providers. This would, for example,
allow the super-user to overwrite the MBR on the root disk or
write random sectors elsewhere to a mounted disk. The
implications are obvious.
0x40 (G_F_DISKIOCTL)
This is unused at this time.
0x80 (G_F_CTLDUMP)
Dump contents of gctl requests.
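Since the value is a bitmask, several flags can be combined by adding
them together. The helper below is a small Python sketch for decoding a
kern.geom.debugflags value into the names listed above; the function
decode_debugflags() and the table DEBUG_FLAGS are hypothetical and exist
only for this illustration.

```python
# Bit values and names as listed in this section (0x08 is unused).
DEBUG_FLAGS = {
    0x01: "G_T_TOPOLOGY",
    0x02: "G_T_BIO",
    0x04: "G_T_ACCESS",
    0x10: "allow foot shooting",
    0x40: "G_F_DISKIOCTL",
    0x80: "G_F_CTLDUMP",
}


def decode_debugflags(value: int) -> list:
    """Return the names of all known flags set in value, lowest bit first."""
    return [name for bit, name in sorted(DEBUG_FLAGS.items()) if value & bit]
```

For example, decode_debugflags(0x11) reports that both topology tracing
and the foot-shooting override are enabled.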
SEE ALSO
libgeom(3), geom(8), DECLARE_GEOM_CLASS(9), disk(9), g_access(9),
g_attach(9), g_bio(9), g_consumer(9), g_data(9), g_event(9), g_geom(9),
g_provider(9), g_provider_by_name(9)
HISTORY
This software was initially developed for the FreeBSD Project by
Poul-Henning Kamp and NAI Labs, the Security Research Division of Network
Associates, Inc. under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"),
as part of the DARPA CHATS research program.
The following obsolete GEOM components were removed in FreeBSD 13.0:
o GEOM_BSD,
o GEOM_FOX,
o GEOM_MBR,
o GEOM_SUNLABEL, and
o GEOM_VOL.
Use
o GEOM_PART_BSD,
o GEOM_MULTIPATH,
o GEOM_PART_MBR, and
o GEOM_LABEL
options, respectively, instead.
AUTHORS
Poul-Henning Kamp <phk@FreeBSD.org>