UBI - Unsorted Block Images
Table of contents
- Big red note
- Overview
- Source code
- Mailing list
- User-space tools
- Saving erase counters
- Scalability issues
- Flash space overhead
- LEB un-map operation
- Volume update operation
- Atomic LEB change operation
- Volume auto-resize
- Marking eraseblocks as bad
- More documentation
- How to send a bugreport?
Big red note
People are often confused and treat UBI as a block device emulation layer (also known as FTL - flash translation layer). But this is not true - UBI is not an FTL.
UBI was not designed for flash devices like MMC/SD/mini-SD/micro-SD cards, USB sticks, CompactFlash, and so on. UBI was designed for bare flashes which may be found in embedded systems. Please, do not be confused.
Overview
UBI (Latin: "where?") stands for "Unsorted Block Images". It is a volume management system for flash devices which manages multiple logical volumes on a single physical flash device and spreads the I/O load (i.e., wear-leveling) across the whole flash chip.
In a sense, UBI may be compared to the Logical Volume Manager (LVM). Whereas LVM maps logical sectors to physical sectors, UBI maps logical eraseblocks to physical eraseblocks. But besides the mapping, UBI implements global wear-leveling and I/O error handling.
An UBI volume is a set of consecutive logical eraseblocks. Each logical eraseblock may be mapped to any physical eraseblock. This mapping is managed by UBI, it is hidden from users and it is the base mechanism to provide global wear-leveling (along with per-physical eraseblock erase counters and the ability to transparently move data from more worn-out physical eraseblocks to the less worn-out ones).
UBI volume size is specified when the volume is created and may later be changed (volumes are dynamically re-sizable). UBI supports dynamic volumes and static volumes. Static volumes are read-only and their contents are protected by CRC checksums, while dynamic volumes are read-write and the upper layer (e.g., a file-system) is responsible for data integrity.
UBI is aware of bad eraseblocks (e.g., NAND flash may have them) and frees the upper layer from any bad block handling. UBI has a pool of reserved physical eraseblocks, and when a physical eraseblock becomes bad, it transparently substitutes it by a good physical eraseblock. UBI moves good data from the newly appeared bad physical eraseblock to the good one as well. The result is that users of UBI volumes do not notice I/O errors as UBI takes care of them.
NAND flashes may have bit-flips which occur on read and write operations. Bit-flips are corrected by ECC checksums, but they may accumulate over time and cause data loss. UBI handles this by moving data from physical eraseblocks with bit-flips to other physical eraseblocks, thus doing active scrubbing. This is done transparently in background and is hidden from upper layers.
Here is a short list of the main UBI features:
- UBI provides volumes which may be dynamically created, removed, or re-sized;
- UBI implements wear-leveling across the whole flash device (i.e., you may continuously write/erase only the first logical eraseblock of an UBI volume, but UBI will spread this to all physical eraseblocks of the flash chip);
- UBI transparently handles bad physical eraseblocks;
- UBI minimizes the chances of losing data by means of "scrubbing".
Here is a comparison of MTD partitions and UBI volumes. UBI volumes are rather similar to MTD partitions because:
- both consist of eraseblocks - logical eraseblocks in case of UBI volumes, and physical eraseblocks in case of static partitions;
- both support three basic operations - read, write, erase.
But UBI volumes have the following advantages over traditional MTD partitions:
- there are no eraseblock wear-leveling constraints in case of UBI volumes, so users do not have to care about this at all, which means the upper-level software may be simpler;
- there are no bad eraseblocks in case of UBI volumes, which also leads to simpler upper-level software;
- UBI handles bit-flips;
- UBI also provides an atomic logical eraseblock change operation which makes it possible to change the contents of a logical eraseblock without losing data if an unclean reboot happens during the operation; this might be very useful for the upper-level software (e.g., for a file-system);
- UBI has an un-map operation, which just un-maps a logical eraseblock from the physical eraseblock, schedules the physical eraseblock for erasure and returns; this is very quick and frees upper-level software from implementing its own mechanisms to defer erasures (e.g., JFFS2 has to implement such mechanisms).
So, existing software may still work on top of UBI volumes, while new software may benefit from the UBI features and let UBI solve many issues which the flash technology imposes.
Source code
UBI is in the main-line Linux kernel starting from version 2.6.22. But it is recommended to use the latest UBI which can be found in the UBI git tree:
git://git.infradead.org/~dedekind/ubi-2.6.git
The UBI git tree is usually based on top of the latest release of the Linux kernel and it should not be too difficult to fetch all UBI patches from the UBI git tree and to apply them to your tree. In fact, this work has already been done in UBIFS back-ports (see here). The back-port trees contain all UBI patches from the main-line. Just pick the UBI patches and apply them to your tree.
Mailing list
You are welcome to send feed-back, bug-reports, patches, etc to the MTD mailing list.
Saving erase counters
When working with UBI, it is important to realize that UBI stores erase
counters on the flash media. Namely, each physical eraseblock has a so-called
erase counter header which stores the number of times this physical eraseblock
has been erased. And of course, it is important not to lose the erase counters,
which means the tools you use to erase the flash and flash UBI images have to
be UBI-aware. For example, you may use the ubiformat utility (see
here), which handles this correctly.
Some details for flasher programs
This section provides information about what the program which flashes UBI images has to do. Please, skip this if you are not going to implement a flasher.
The following is a rough list of things the flasher program has to do when flashing UBI images:
- first of all, the flasher has to scan the flash to collect the erase counters and calculate the average erase counter value in case one of the eraseblocks has a corrupted EC header;
- if the intention is to just erase a physical eraseblock, then after the erasure the flasher has to form a valid EC header, increment the old value of the erase counter and write the EC header back to the flash;
- if the intention is to write some data to the eraseblock, which should be part of a valid UBI image, the flasher also has to form a valid EC header with an incremented erase counter value, erase the physical eraseblock, and write the data with the formed header;
- if one or a few physical eraseblocks do not have an EC header or the EC header is corrupted, the flasher should use the average erase counter value.
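The erase counter bookkeeping in the list above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `EC_UNKNOWN` marker and both function names are hypothetical, and the choice to use the plain average (without incrementing it) for eraseblocks with no valid EC header is one reading of the last list item.

```c
#include <stddef.h>

/* Hypothetical marker for a PEB whose EC header is missing or corrupted. */
#define EC_UNKNOWN (-1LL)

/*
 * Average the erase counters of all PEBs that have a valid EC header.
 * Returns 0 if no valid counter was found (e.g., a completely blank flash).
 */
long long mean_erase_counter(const long long *ec, size_t peb_count)
{
    long long sum = 0;
    size_t valid = 0;

    for (size_t i = 0; i < peb_count; i++) {
        if (ec[i] != EC_UNKNOWN) {
            sum += ec[i];
            valid++;
        }
    }
    return valid ? sum / (long long)valid : 0;
}

/*
 * Erase counter to store in the EC header written right after an erasure:
 * the old value plus one, or the average if the old value is unknown.
 */
long long next_erase_counter(long long old_ec, long long mean_ec)
{
    if (old_ec == EC_UNKNOWN)
        return mean_ec;
    return old_ec + 1;
}
```

A flasher would call `mean_erase_counter()` once after the scan and `next_erase_counter()` for every eraseblock it erases.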
Besides preserving the erase counter, there is one more important
thing which should be done - the flasher should drop 0xFF bytes at
the end of eraseblocks. Well, this is actually only relevant for some NAND
flashes, but if your flasher does not do this, you may face very unpleasant
problems which might be difficult to debug later. And this is relevant to JFFS2
images, not just UBI images. In other words, it is not UBI-specific.
Let's take an UBIFS image which is wrapped to an UBI image which has to be
flashed (say, ubi.img) to NAND flash. Suppose one of the physical
eraseblocks in the ubi.img image file contains just an erase
counter and the rest are 0xFF bytes. What happens if the flasher
writes these 0xFF bytes is that the ECC bytes are written to the OOB
area of the corresponding NAND pages. Depending on the algorithm, the resulting
ECC may be 0xFF or something else. Later, when UBIFS is mounted it
treats that space as empty and tries to write data there, but fails (in the
best case, or you find the data corrupted later and have a hard time figuring out
what is going wrong). So basically what happens is that NAND pages are
programmed twice - first by the flasher (only 0xFF bytes) and
then by UBIFS. For some NAND flashes this might work fine, for some
not.
By the way, we experienced the same problem with JFFS2. The JFFS2 images
generated by the mkfs.jffs2 program were padded to the physical
eraseblock size and were later flashed to our NAND. The flasher did not bother
to drop NAND pages containing only 0xFF bytes and just flashed
them. Later, when JFFS2 was mounted, it tried to append more data to that last
physical eraseblock which led to various ECC errors later on. It took a while
to recognize the problem.
Thus, what the flasher has to do is to drop all the empty NAND pages from the end of the physical eraseblock buffer before writing it. Below is example code from UBI which is doing this.
/**
 * ubi_calc_data_len - calculate how much real data is stored in a buffer.
 * @ubi: UBI device description object
 * @buf: a buffer with the contents of the physical eraseblock
 * @length: the buffer length
 *
 * This function calculates how much "real data" is stored in @buf and returns
 * the length. Continuous 0xFF bytes at the end of the buffer are not
 * considered as "real data".
 */
int ubi_calc_data_len(const struct ubi_device *ubi, const void *buf,
                      int length)
{
        int i;

        /* Find the last byte which is not 0xFF */
        for (i = length - 1; i >= 0; i--)
                if (((const uint8_t *)buf)[i] != 0xFF)
                        break;

        /* The resulting length must be aligned to the minimum flash I/O size */
        length = ALIGN(i + 1, ubi->min_io_size);
        return length;
}
This function is called before writing buf to a physical
eraseblock. The purpose of this function is to drop 0xFFs from
the end and prevent the situation described above.
ubi->min_io_size is the minimal input/output unit size which is
NAND page size for NAND flashes.
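For a flasher that runs outside the kernel, the same idea can be written without the UBI device object. The sketch below is an assumption-laden restatement, not code from UBI: the function name is hypothetical, and the rounding is done with plain division so that the I/O unit size does not have to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many bytes of buf actually need to be written: trailing 0xFF
 * bytes are dropped and the result is rounded up to a multiple of
 * min_io_size (the NAND page size for NAND flashes).
 */
size_t trimmed_write_len(const uint8_t *buf, size_t len, size_t min_io_size)
{
    /* Skip the trailing 0xFF bytes */
    while (len > 0 && buf[len - 1] == 0xFF)
        len--;

    /* Round up to the minimum I/O unit */
    return (len + min_io_size - 1) / min_io_size * min_io_size;
}
```

For an eraseblock buffer that holds only a 64-byte header followed by 0xFF bytes, this yields exactly one NAND page, so the remaining pages are never programmed by the flasher.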
User-space tools
UBI user-space tools are available from the
git://git.infradead.org/mtd-utils.git
repository (ubi-utils/new-utils sub-directory). One should
download the source code and compile it. Please, download the latest
version of the tools:
git clone git://git.infradead.org/mtd-utils.git
The repository contains the following UBI tools:
- ubinfo - provides information about UBI installed in the system, about all UBI devices and volumes;
- ubiattach - a tool to attach MTD devices (which describe raw flash) to UBI, which creates an UBI device sitting on top of the MTD device; this is an alternative method to specifying MTD devices on module load or in kernel boot command line;
- ubidetach - a tool to detach MTD devices from UBI devices; in other words, this tool does the opposite to what ubiattach does;
- ubimkvol - a tool to create UBI volumes on UBI devices;
- ubirmvol - a tool to remove UBI volumes from UBI devices;
- ubiupdatevol - a tool to update UBI volumes, which means to write new volume contents; this tool uses the UBI volume update feature which leaves the volume in "corrupted" state if it is interrupted; additionally, this tool may be used to wipe out UBI volumes;
- ubicrc32 - calculate CRC32 checksum of a file with the same initial seed as UBI would use;
- ubinize - a tool to generate UBI images;
- ubiformat - a tool to format empty flash, to erase flash preserving erase counters, and to flash UBI images to MTD devices.
All UBI tools support "-h" option and print sufficient usage information.
Note, ubiattach and ubidetach won't work
if the kernel version is less than 2.6.25, because the corresponding UBI feature
did not exist in older kernels. However, it is possible to back-port UBI
patches from the UBI git tree.
Also note, there are old UBI tools which might be useful (e.g., unubi).
Scalability issues
Unfortunately, UBI scales linearly in terms of flash size. UBI initialization time linearly depends on the number of physical eraseblocks on the flash. This means that the larger the flash, the more time it takes for UBI to initialize (i.e., to attach the MTD device). The initialization time depends on the flash I/O speed and (slightly) on the CPU speed, because:
- UBI scans flash when it is attaching an MTD device - it reads the erase counter (EC) and the volume ID (VID) headers from every single physical eraseblock of the MTD device; the headers are quite small (64 bytes each), so this means reading 128 bytes from each PEB on NOR flash or one or two NAND pages in case of NAND flash (this depends on whether the flash supports sub-page writes or not); this is anyway much less than JFFS2 needs to read when it mounts MTD devices, so UBI attaches MTD devices many times faster than JFFS2 would mount a file system on the same MTD device;
- UBI calculates the CRC32 checksum of each EC and VID header, which consumes CPU, although this is usually minor compared to the I/O overhead.
Here are some figures:
- a 256MiB OneNAND flash found in Nokia N800 devices attaches in less than 1 second; the flash does support sub-pages so UBI has to read only the first 2KiB NAND page of each PEB while scanning;
- a 1GiB NAND flash found in OLPC devices attaches in about 2 seconds; the flash is an SLC and supports sub-pages, but the Cafe controller which is used in the laptop does not allow sub-page writes, so UBI has to read two 2KiB NAND pages from each PEB.
Unfortunately we do not have more data and the reader is welcome to send it to us via the MTD mailing list.
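To get a feeling for the amount of I/O behind such figures, the scan cost can be estimated from the rules above. The function below is a back-of-the-envelope sketch; its name is hypothetical, and treating any flash with a minimum I/O unit smaller than 64 bytes as "NOR-like" is a simplifying assumption.

```c
#include <stdint.h>

/*
 * Rough number of bytes UBI reads while attaching an MTD device:
 * 128 bytes per PEB on NOR-like flash, one NAND page per PEB if sub-page
 * writes are usable, two NAND pages per PEB otherwise.
 */
uint64_t attach_read_bytes(uint64_t flash_size, uint32_t peb_size,
                           uint32_t page_size, int subpage_writes)
{
    uint64_t peb_count = flash_size / peb_size;
    uint32_t per_peb;

    if (page_size < 64)
        per_peb = 128;                      /* two 64-byte headers */
    else
        per_peb = subpage_writes ? page_size : 2 * page_size;

    return peb_count * per_peb;
}
```

Assuming 128KiB eraseblocks, a 256MiB flash with usable sub-pages works out to 4MiB of reads, and a 1GiB flash without them to 32MiB.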
Implementation details
In general, UBI needs three tables for operation:
- volume table which contains per-volume information, like volume size, type, etc;
- eraseblock association (EBA) table which contains the logical-to-physical eraseblock mapping information; for example, when reading an LEB, UBI first looks up the table to find the corresponding PEB number, then reads from this PEB;
- erase counters (EC) table which contains the erase counter value for each physical eraseblock; the UBI wear-leveling sub-system uses this table when it needs to find, for example, a highly worn-out PEB.
The volume table is maintained on flash. It changes only when UBI volumes are created, deleted and re-sized, which are rare and not time-critical operations, and UBI can afford a slow and simple method of the volume table management.
The EBA and EC tables are changed every time an LEB is mapped to a PEB or a PEB is erased, which happens quite often and means that the table management methods would have to be fast and efficient if the table was maintained on flash. And this would inevitably involve journaling, journal replay, journal commit, etc. UBI could be logarithmically scalable if it maintained the latter 2 tables on the flash media, but it does not do this.
One of the UBI requirements was simplicity of on-flash format, because the original UBI designers had to read UBI volumes from the boot-loader and they had very tough constraints on the boot-loader code size. It was basically impossible to add complex journal scanning and replay code to the boot-loader.
UBI does not maintain the EBA and EC tables on flash, but instead, it builds them in RAM each time it attaches an MTD device. UBI keeps and maintains erase counter and LEB mapping of each physical eraseblock in the physical eraseblock itself. This means, that:
- the erase counter of a PEB is stored at the beginning of the PEB in the EC header; when a PEB is erased, UBI increases its erase counter and writes the EC header just after the erasure;
- the LEB-to-PEB mapping information is stored at the VID header which is placed after the EC header; the VID header is written only when the PEB gets mapped to an LEB, which happens only when an LEB which was previously erased (un-mapped) is written to for the first time; this explains why UBI has to write the EC and VID headers separately, which requires 2 NAND pages in case of NAND flash (unless the flash allows to write 2 or more times to the same NAND page which is referred to as sub-page write).
So, UBI has to scan the flash and read the EC and VID header from each PEB in order to build in-RAM EC and EBA tables.
The drawbacks of this design are poor scalability and relatively high overhead on NAND flashes (e.g., the overhead is 1.5%-3% of flash space in case of a NAND flash with 2KiB NAND page and 128KiB eraseblock). The advantages are a simple binary format and robustness, as the result of simplicity.
Nonetheless, it is always possible to create UBI2 which would maintain the tables in separate flash areas. UBI2 would not be compatible with UBI because of completely different on-flash format, but the user interfaces would stay the same, which would guarantee compatibility of all the software built on top of UBI.
Flash space overhead
UBI uses some amount of flash space for its own purposes, thus reducing the amount of flash space available for UBI users. Namely:
- 2 PEBs are used to store the volume table;
- 1 PEB is reserved for wear-leveling purposes;
- 1 PEB is reserved for the atomic LEB change operation;
- some percent of PEBs is reserved for bad PEB handling if the flash may have bad PEBs; this is applicable for NAND flash, but not for NOR flash; the percentage is configurable and is 1% by default;
- UBI stores the erase counter (EC) and volume ID (VID) headers at the beginning of each PEB; the amount of bytes used for these purposes depends on the flash type and is explained below.
Let's introduce symbols:
- P - total number of physical eraseblocks on the MTD device;
- SP - physical eraseblock size;
- SL - logical eraseblock size;
- B - number of PEBs reserved for bad PEB handling; it is 1% of P for NAND by default, and 0 for NOR and other flash types which do not admit of bad PEBs;
- O - the overhead related to storing EC and VID headers in bytes, i.e. O = SP - SL.
The UBI overhead is (B + 4) * SP + O * (P - B - 4), i.e., this amount of bytes will not be accessible for users. O is different for different flashes:
- in case of NOR flash which has 1 byte minimum input/output unit, O is 128 bytes (each UBI header takes 64 bytes);
- in case of NAND flash which does not allow sub-pages (e.g., MLC NAND), O is 2 NAND pages, i.e. 4KiB in case of 2KiB NAND page and 1KiB in case of 512-byte NAND page;
- if the NAND flash, NAND flash controller, and NAND flash driver support sub-page writes (e.g., SLC NANDs and most of their Linux drivers, OneNAND), UBI optimizes its on-flash layout and puts the EC and VID headers to the same NAND page, but different sub-pages; in this case O is only one NAND page;
- for other flashes with different min. I/O unit size, the overhead should be 2 min. I/O units if the min. I/O unit size is greater than or equal to 64, and 128 aligned to the min. I/O unit size if the min. I/O unit size is less than 64.
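The formula and the rules for O can be checked with a small calculation. The sketch below uses the symbols introduced above; the function names are hypothetical, and rounding 1% of P down to a whole number of PEBs in the example is an assumption.

```c
#include <stdint.h>

/* Round x up to a multiple of a (a > 0). */
static uint64_t round_up(uint64_t x, uint64_t a)
{
    return (x + a - 1) / a * a;
}

/* Per-PEB header overhead O in bytes, following the rules listed above. */
uint64_t header_overhead(uint32_t min_io, int subpage_writes)
{
    if (min_io < 64)
        return round_up(128, min_io);   /* both headers, aligned */
    if (subpage_writes)
        return min_io;                  /* EC and VID share one page */
    return 2ULL * min_io;               /* one I/O unit per header */
}

/* Total UBI overhead in bytes: (B + 4) * SP + O * (P - B - 4). */
uint64_t ubi_overhead(uint64_t P, uint64_t SP, uint64_t B, uint64_t O)
{
    return (B + 4) * SP + O * (P - B - 4);
}
```

For example, a NAND flash with P = 2048, SP = 128KiB, 2KiB pages, no sub-page writes and B = 20 gives O = 4096 and an overhead of 11436032 bytes, which is a little over 4% of the 256MiB chip.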
LEB un-map operation
The LEB un-map operation is available via the ubi_leb_unmap()
call of the UBI kernel API. The operation is not available via the
user-space interfaces. The LEB un-map operation:
- first un-maps the LEB from the corresponding PEB;
- then schedules the PEB for erasure and returns; it does not wait for the erasure of the PEB to be finished; the PEB is instead erased in the context of the UBI background thread.
UBI returns all 0xFF bytes when an un-mapped LEB is read, so
the un-map operation is very similar to the erase operation (a very fast erase
operation). But there is a difference UBI programmers have to be well aware
of.
Suppose you un-map LEB L which is mapped to PEB P. Since P is not synchronously erased, but just scheduled for erasure, there might be "surprises" in case of unclean reboots: if the reboot happens before P has been physically erased, L will be mapped to P again when UBI attaches the MTD device after the unclean reboot. Indeed, UBI will scan the MTD device and find P which refers to L, and it will add this mapping information to the EBA table.
But once you write any data to L, it gets mapped to a new PEB, and the old contents are gone forever, because even in case of an unclean reboot UBI would pick the newer mapping for L.
You may use the ubi_leb_map() call which maps the LEB to an
empty PEB, so the LEB would always contain only 0xFF bytes, even in case of
an unclean reboot. But do not use this unless it is really needed, because this
puts additional overhead on the UBI wear-leveling sub-system, compared to
an un-mapped LEB. Indeed, if an LEB is un-mapped, there is no PEB which
contains LEB's data, and the wear-leveling sub-system does not have to move any
data to maintain wear-leveling. Conversely, if the LEB is mapped to a PEB,
there is one more PEB for the wear-leveling sub-system to care about, and one
more LEB to re-map to another PEB if the erase counter of the current PEB
becomes too low (then the LEB is re-mapped to a PEB with higher erase counter
and the old PEB is used for other operations).
Implementation details
This section describes how UBI distinguishes between older and newer versions of an LEB in case of an unclean reboot. Suppose we un-map LEB L which is mapped to PEB P1, which means UBI schedules P1 for erasure. Then we write some data to L, which means that UBI finds another PEB P2, maps L to P2, and writes the data to P2. If an unclean reboot happens before P1 is physically erased, but after the write operation, we end up with 2 PEBs (P1 and P2) mapped to the same LEB L.
To handle situations like this, UBI maintains a global 64-bit sequence number variable. The sequence number variable is increased each time a PEB is mapped to an LEB and its value is stored in the VID header of the PEB. So each VID header has a unique sequence number, and the larger the sequence number, the "younger" the VID header. When UBI attaches an MTD device, it initializes the global sequence number variable to the highest value found in existing VID headers plus one.
In the above situation, UBI just selects a PEB with higher sequence number (P2) and drops the PEB with lower sequence number (P1).
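The selection rule can be illustrated with a simplified model of the attach-time scan. This is a sketch, not UBI code: the structure and function names are hypothetical, only a single volume is modeled, and header CRC checks are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define LEB_UNMAPPED (-1)

/* Hypothetical summary of one scanned PEB that has a valid VID header. */
struct scanned_peb {
    int pnum;        /* physical eraseblock number */
    int lnum;        /* logical eraseblock it claims to hold */
    uint64_t sqnum;  /* sequence number from its VID header */
};

/*
 * Fill eba[lnum] with the PEB holding the newest copy of each LEB and
 * return the value the global sequence counter should start from
 * (highest sequence number seen plus one, or 0 if nothing was found).
 */
uint64_t build_eba(const struct scanned_peb *pebs, size_t n,
                   int *eba, uint64_t *eba_sqnum, size_t leb_count)
{
    uint64_t max_sqnum = 0;

    for (size_t l = 0; l < leb_count; l++)
        eba[l] = LEB_UNMAPPED;

    for (size_t i = 0; i < n; i++) {
        const struct scanned_peb *p = &pebs[i];

        if (p->sqnum > max_sqnum)
            max_sqnum = p->sqnum;
        if (p->lnum < 0 || (size_t)p->lnum >= leb_count)
            continue;
        /* Keep this PEB if the LEB is new or this copy is younger */
        if (eba[p->lnum] == LEB_UNMAPPED || p->sqnum > eba_sqnum[p->lnum]) {
            eba[p->lnum] = p->pnum;
            eba_sqnum[p->lnum] = p->sqnum;
        }
    }
    return n ? max_sqnum + 1 : 0;
}
```

In this model the losing PEB (the one with the lower sequence number) is simply not recorded in the table; a real implementation would also schedule it for erasure.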
Note, the situation is more difficult if an unclean reboot happens when UBI moves the contents of one PEB to another for wear-leveling purposes, or when it happens during the atomic LEB change operation. In this case it is not enough to just pick the newer PEB, it is also necessary to make sure all the data were written, not just part of them.
Volume update operation
Unlike raw MTD devices, UBI devices support the volume update operation
which may be useful to implement software updates in end-user devices. The
operation replaces the contents of the whole UBI volume with new contents. Of
course, one could do this with raw MTD devices by means of just erasing the
device and putting the new image on it. But the advantage of the UBI volume
update operation is that if it gets interrupted, the volume goes into
"corrupted" state and further I/O on the volume ends up with an
EBADF error. And the only way to get the volume back to the normal
state is to start a new volume update operation and to finish it.
The volume update operation makes it possible to detect interrupted updates and to re-start them with the help of, for example, a "mirror" volume which would have the same contents or by showing a dialog window which would inform the end user about the problem and request flashing. In contrast, it is difficult to detect interrupted updates in case of raw MTD devices.
The volume update operation is available only via the user-space UBI
interface and it is not available via the UBI kernel API. To update a
volume, you first have to call the UBI_IOCVOLUP ioctl of the
corresponding volume character device and to pass a pointer to a 64-bit value
containing the length of the new volume contents in bytes. Then this amount of
bytes has to be written to the volume character device. Once the last byte has
been sent to the character device, the update operation is finished.
Schematically, the sequence is:
fd = open("/dev/my_volume", O_RDWR);
ioctl(fd, UBI_IOCVOLUP, &image_size); /* image_size is a 64-bit byte count */
write(fd, buf, image_size);
close(fd);
See include/mtd/ubi-user.h for more details. Bear in mind, the
old contents of the volume is not preserved in case of an interrupted update.
Also, it is not necessary to write all new data at one go. It is OK to call
the write() function an arbitrary number of times and pass an arbitrary
amount of data each time. The operation will be finished only after all the
data have been written. If the last write operation contains more bytes than
UBI expects, the extra data is just ignored.
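Because the data may be passed in any number of pieces, a flasher can use an ordinary "write everything" loop after issuing the ioctl. The helper below is a generic POSIX sketch with a hypothetical name; it contains nothing UBI-specific and assumes the file descriptor has already been prepared as described above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Write exactly len bytes to fd, continuing after short writes and
 * retrying when a signal interrupts the call.
 * Returns 0 on success and -1 on failure.
 */
int write_all(int fd, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0 && errno == EINTR)
            continue;           /* interrupted, try again */
        if (n <= 0)
            return -1;          /* real error or no progress */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

If such a loop fails half way through, the update stays unfinished and the volume remains in the "corrupted" state until a new update completes.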
Special case of volume update is what we call "volume truncation", which
may be done by specifying zero length of the new contents. In this case the
volume is just wiped out and will contain all 0xFF bytes.
Note, the /sys/class/ubi/ubiX_Y/corrupted sysfs file reflects
the "corrupted" state of the volume: it contains ASCII "0\n" if the volume is OK
and is not corrupted and "1\n" if it is corrupted (because a volume update had
started but not finished).
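A user-space program can check this flag by reading the file. The function below is a minimal sketch with a hypothetical name; the caller supplies the path of the sysfs file for the volume in question.

```c
#include <stdio.h>

/*
 * Read a "corrupted" sysfs file.
 * Returns 1 if it contains "1\n", 0 if it contains "0\n", and -1 if the
 * file cannot be read or has unexpected contents.
 */
int read_corrupted_flag(const char *path)
{
    char line[8];
    int result = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (fgets(line, sizeof(line), f)) {
        if (line[0] == '0' && line[1] == '\n')
            result = 0;
        else if (line[0] == '1' && line[1] == '\n')
            result = 1;
    }
    fclose(f);
    return result;
}
```

For instance, for volume 0 of UBI device 0 the path would be expected to look like /sys/class/ubi/ubi0_0/corrupted.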
Technically, it is possible to implement an "atomic" volume update operation, which would mean that the contents of the volume would stay unchanged in case of interrupted updates. But this would require having as much free space as the size of the volume to be updated. This is not currently implemented.
Implementation details
The volume update is implemented with the help of a so-called "update marker". Once
the user has issued the UBI_IOCVOLUP ioctl, UBI sets the update
marker flag for the volume in the corresponding record of the UBI volume table.
Then the volume is wiped out and UBI waits for the user to pass the data.
Once all the data has arrived and has been written to the flash, the update marker
is cleared. But in case of an interruption (e.g., unclean reboot, crash of the
update application, etc.), the update marker is not cleared and the volume is
treated as "corrupted". Only a new successful update operation may clear the
update marker.
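The life cycle of the update marker can be modeled with a tiny state machine. This is a toy model with hypothetical names that only tracks byte counts; clearing the marker immediately for a zero-length update is an assumption made here to represent volume truncation.

```c
#include <stdint.h>

/* Toy model of a volume during an update. */
struct vol_update_model {
    int update_marker;   /* 1 from the start of an update until it completes */
    uint64_t expected;   /* bytes announced when the update was started */
    uint64_t received;   /* bytes accepted so far */
};

/* Model of starting an update: set the marker and remember the length. */
void model_start_update(struct vol_update_model *v, uint64_t bytes)
{
    v->update_marker = 1;
    v->expected = bytes;
    v->received = 0;
    if (bytes == 0)
        v->update_marker = 0;   /* truncation: nothing left to wait for */
}

/* Model of one write() call; returns how many bytes were actually used. */
uint64_t model_write(struct vol_update_model *v, uint64_t len)
{
    uint64_t room = v->expected - v->received;
    uint64_t used = len < room ? len : room;   /* surplus data is ignored */

    v->received += used;
    if (v->update_marker && v->received == v->expected)
        v->update_marker = 0;   /* all data arrived: update finished */
    return used;
}

/* The volume counts as "corrupted" while the marker is set. */
int model_is_corrupted(const struct vol_update_model *v)
{
    return v->update_marker;
}
```

If the model is abandoned between `model_start_update()` and the last `model_write()`, the marker stays set, which corresponds to an interrupted update.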
Atomic LEB change operation
UBI also has an atomic LEB change operation which means that the contents of the LEB stays unchanged if the operation gets interrupted. In other words, the result of the operation is that the LEB either has the old contents or the new contents.
The operation is available via the ubi_leb_change() kernel API
call. The user-space interface for this operation does not exist in the
main-line kernel so far, but it was recently implemented and may be found in the
UBI git tree. It should be
available in the main-line kernels starting from version 2.6.25.
The user-space atomic LEB change operation is run via the
UBI_IOCEBCH ioctl command. One has to pass a pointer to a properly
filled request object of struct ubi_leb_change_req type. The
object stores the LEB number to change and the length of the new contents. Then
the user has to write the specified amount of bytes to the volume character
device. Notice some similarity to the user-space interface of the volume update
operation. Schematically, the sequence is:
struct ubi_leb_change_req req;

req.lnum = lnum_to_change;
req.len = data_len;
/* data persistency (may also be UBI_SHORTTERM and UBI_UNKNOWN) */
req.dtype = UBI_LONGTERM;

fd = open("/dev/my_volume", O_RDWR);
ioctl(fd, UBI_IOCEBCH, &req);
write(fd, data_buf, data_len);
close(fd);
If for some reason the user does not write the declared amount of bytes and closes the file, the operation is canceled and the old contents of the LEB is preserved.
Similarly to the volume update operation, it does not matter how many times
the write() function is called and how much data it passes to the
UBI volume each time. The atomic LEB change operation finishes once the last
data byte arrives.
The atomic LEB change operation might be very useful for file-systems, for example UBIFS uses this operation as a last resort when it commits the file-system index. This operation may also be exploited to create an FTL layer on top of UBI (see here for the description of the idea).
Keep in mind that the atomic LEB change operation calculates the CRC32 checksum of the new data, so it has some overhead compared to the LEB erase plus LEB write sequence. The volume update operation does not calculate data CRC, so it is faster to update the volume than to atomically change all its eraseblocks. Keep this additional overhead in mind and do not use the operation if the atomicity is not really needed.
Implementation details
Suppose UBI has to change a logical eraseblock L which is mapped to a
physical eraseblock P1. First of all, UBI always has one free
PEB reserved for the atomic LEB change operation, let it be
P2. Before the operation, P1 stores the
contents of the LEB L and P2 is free (it contains only
the EC header and 0xFF bytes). The new data is written to
P2, not to P1, so should anything go wrong,
the old contents of the LEB is always there.
When the operation finishes, UBI un-maps L from P1, maps it to P2, and schedules P1 for erasure. If the operation is interrupted, L stays mapped to P1 and P2 is scheduled for erasure.
If an unclean reboot happens half way through the atomic LEB change operation, it is obvious that UBI has to preserve the L -> P1 mapping and erase P2 when it attaches the MTD device next time. But if the unclean reboot happens just after the atomic LEB change operation finishes, but before P1 is physically erased, it is obvious that UBI has to preserve the L -> P2 mapping and erase P1.
To resolve situations like that, UBI calculates the CRC checksum of the new contents of the LEB before it is written to flash, and stores it in the VID header (together with the data length). When UBI finds 2 PEBs P1 and P2 mapped to the same LEB L during the initialization, it selects the one with the higher sequence number (P2) only if the data CRC is correct (which means that all the data has been written to the flash media), otherwise it selects the PEB with the lower sequence number (P1). Of course, UBI has to read the LEB contents in order to check the CRC checksum.
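The decision described above can be sketched as follows. This is an illustration only: the structure and function names are hypothetical, and the checksum is a plain textbook CRC-32 whose seed and final inversion are not claimed to match what UBI stores on flash.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), for illustration. */
uint32_t crc32_simple(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical view of one of two PEBs that claim the same LEB. */
struct leb_copy {
    int pnum;              /* physical eraseblock number */
    uint64_t sqnum;        /* sequence number from the VID header */
    uint32_t data_crc;     /* data CRC stored in the VID header */
    size_t data_len;       /* data length stored in the VID header */
    const uint8_t *data;   /* contents read back from the PEB */
};

/* Return the number of the PEB to keep; the other one is to be erased. */
int pick_copy(const struct leb_copy *a, const struct leb_copy *b)
{
    const struct leb_copy *newer = a->sqnum > b->sqnum ? a : b;
    const struct leb_copy *older = newer == a ? b : a;

    /* The newer copy wins only if all of its data made it to flash */
    if (crc32_simple(newer->data, newer->data_len) == newer->data_crc)
        return newer->pnum;
    return older->pnum;
}
```

With a complete newer copy the function returns P2; if the stored CRC does not match what is read back, it falls back to P1.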
Volume auto-resize
It is well-known that NAND chips have some amount of physical eraseblocks marked as bad by the manufacturer. The bad PEBs are distributed randomly and their number is different, although manufacturers usually guarantee that the first few physical eraseblocks are not bad and the total amount of bad PEBs does not exceed a certain number. For example, a new 256MiB Samsung OneNAND chip is guaranteed to have no more than 40 bad 128KiB PEBs (but of course, more physical eraseblocks will become bad over time). This is about 2% of the whole flash size.
When it is needed to create an UBI image which will be flashed to the end user devices in the production line, you have to define exact sizes of all volumes (the sizes are stored in the UBI volume table). But this is difficult to do because the total usable flash size may vary depending on the amount of initially bad PEBs.
One obvious way to solve the problem is to assume the worst case, when all chips would have the maximum amount of bad PEBs. But in practice, most of the chips will have only a few bad PEBs, which is far less than the maximum. In general, this is fine and will increase reliability, because UBI anyway uses all PEBs of the device. On the other hand, UBI anyway reserves some amount of physical eraseblocks for bad PEB handling, which is 1% of PEBs by default. So in case of the above-mentioned OneNAND chip the result would be that 1% of PEBs would be reserved by UBI, and 0-2% would be available for new volumes (they would be seen as available LEBs by UBI users).
But there is an alternative approach - one of the volumes may be marked as auto-resize, which means that its size is enlarged when UBI is run for the first time. After the volume size is adjusted, UBI removes the auto-resize mark and the volume is not resized anymore. The auto-resize flag is stored in the volume table and only one volume may be marked as auto-resize. For example, if there is a volume which is intended to have the root file-system, it may be reasonable to mark it as auto-resize.
In the example with the OneNAND chip, if one of the UBI volumes is marked as auto-resize, it will be enlarged by 0-2% on the first UBI boot, but 1% of PEBs will anyway be reserved for bad PEB handling.
Note, the auto-resize feature was added very recently and it is not in the main-line kernel yet, but it should appear in version 2.6.25. Use UBI git tree to find the implementation of the feature.
Marking eraseblocks as bad
This section is relevant for NAND flashes and other flashes which admit of bad eraseblocks. UBI marks physical eraseblocks as bad on 2 occasions:
- eraseblock write operation failed, in which case UBI moves the data from this PEB to some other PEB (data recovery) and schedules this PEB for torturing;
- erase operation failed with an EIO error, in which case the eraseblock is marked as bad straight away.
The torturing is done in background in order to detect whether the physical eraseblock is bad or not. The write failure may have happened for many reasons, including bugs in the driver, or in the upper level stuff like the file system (e.g., the FS mistakenly writes many times to the same NAND page). During eraseblock torturing UBI does the following:
- erase the eraseblock;
- read it back and make sure it contains only 0xFF bytes;
- write test pattern bytes;
- read the eraseblock back and check the pattern;
- and so on for several patterns (0xA5, 0x5A, 0x00).
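The steps above can be tried out on a toy in-memory eraseblock. Everything in this sketch is hypothetical (the simulated eraseblock, its size, the "stuck bit" fault and the function names); it only mirrors the erase, verify, write, verify loop and is not the code of torture_peb().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_PEB_SIZE 64

/* Toy eraseblock; stuck_at >= 0 selects a cell whose bit 0 never programs. */
struct sim_peb {
    uint8_t cells[SIM_PEB_SIZE];
    int stuck_at;
};

static void sim_erase(struct sim_peb *p)
{
    memset(p->cells, 0xFF, sizeof(p->cells));
}

static void sim_write(struct sim_peb *p, uint8_t pattern)
{
    for (size_t i = 0; i < SIM_PEB_SIZE; i++)
        p->cells[i] &= pattern;          /* programming only clears bits */
    if (p->stuck_at >= 0)
        p->cells[p->stuck_at] |= 0x01;   /* the faulty bit stays set */
}

static int sim_all_equal(const struct sim_peb *p, uint8_t value)
{
    for (size_t i = 0; i < SIM_PEB_SIZE; i++)
        if (p->cells[i] != value)
            return 0;
    return 1;
}

/* Returns 0 if the eraseblock survives all patterns, -1 otherwise. */
int torture_sketch(struct sim_peb *p)
{
    static const uint8_t patterns[] = { 0xA5, 0x5A, 0x00 };

    for (size_t i = 0; i < sizeof(patterns); i++) {
        sim_erase(p);
        if (!sim_all_equal(p, 0xFF))         /* must read back as erased */
            return -1;
        sim_write(p, patterns[i]);
        if (!sim_all_equal(p, patterns[i]))  /* must read back the pattern */
            return -1;
    }
    return 0;
}
```

The faulty cell in this model passes the 0xA5 pattern (bit 0 is set there anyway) and is caught by 0x5A, which shows why more than one pattern is used.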
The eraseblock is not marked as bad if it survives the torture test. Note, a
bit-flip during the torture test is treated as a good reason to mark the
eraseblock bad as well. Please, refer to the torture_peb() function
for detailed information.
More documentation
Unfortunately, there are no thorough and strict UBI documents. But there is an old UBI design document which has some out-of-date information, but is still useful: ubidesign.pdf.
There is also a PowerPoint UBI presentation available:
ubi.ppt. Note, this document has to be looked at
in Windows, because it contains a lot of animation and Open Office cannot
properly show it. Use slide show (F5 key) when you look, because
otherwise the animation is not shown.
Much useful information may be found in the FAQ section.
And of course just reading the UBI interface C header files, which contain
quite a few commentaries may help: include/mtd/ubi-user.h
contains the user-space interface definition (namely, it defines UBI ioctl
commands and the involved data structures),
include/linux/mtd/ubi.h defines the kernel API and the
drivers/mtd/ubi/kapi.c file contains comments for each kernel API
function (just above the body of the function).
How to send an UBI bugreport?
Before sending a bug report:
- make sure you have compiled kernel symbols in (CONFIG_KALLSYMS_ALL=y in .config);
- enable UBI debugging (CONFIG_MTD_UBI_DEBUG=y in .config); enable just UBI debugging in general, not UBI debugging messages (unless you are an experienced kernel hacker and know what you do).
Please, attach all the bug-related messages including the UBI messages from
the kernel ring buffer, which may be collected using the dmesg
utility or using minicom with serial console capturing. And of
course, it is wise to describe how the problem can be reproduced. The bugreport
should be sent to the MTD mailing list.