UBI - Unsorted Block Images

Table of contents

  1. Big red note
  2. Overview
  3. Source code
  4. Mailing list
  5. User-space tools
  6. Saving erase counters
  7. Scalability issues
  8. Flash space overhead
  9. LEB un-map operation
  10. Volume update operation
  11. Atomic LEB change operation
  12. Volume auto-resize
  13. Marking eraseblocks as bad
  14. More documentation
  15. How to send an UBI bugreport?

Big red note

People are often confused and treat UBI as a block device emulation layer (also known as FTL - flash translation layer). But this is not true - UBI is not an FTL.

UBI was not designed for flash devices like MMC/SD/mini-SD/micro-SD cards, USB sticks, CompactFlash, and so on. UBI was designed for bare flashes which may be found in embedded systems. Please, do not be confused.

Overview

UBI (Latin: "where?") stands for "Unsorted Block Images". It is a volume management system for flash devices which manages multiple logical volumes on a single physical flash device and spreads the I/O load (i.e., wear-leveling) across the whole flash chip.

In a sense, UBI may be compared to the Logical Volume Manager (LVM). Whereas LVM maps logical sectors to physical sectors, UBI maps logical eraseblocks to physical eraseblocks. But besides the mapping, UBI implements global wear-leveling and I/O error handling.

An UBI volume is a set of consecutive logical eraseblocks. Each logical eraseblock may be mapped to any physical eraseblock. This mapping is managed by UBI and hidden from users, and it is the base mechanism that provides global wear-leveling (along with per-physical-eraseblock erase counters and the ability to transparently move data from more worn-out physical eraseblocks to less worn-out ones).
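The mapping idea can be pictured with a toy table. This is only a sketch to illustrate the concept: the names toy_volume, toy_remap and toy_lookup are made up for this example and have nothing to do with the real UBI data structures.

```c
#include <stddef.h>

#define TOY_UNMAPPED (-1)

/* Toy mapping table for one volume: index = LEB number, value = PEB number. */
struct toy_volume {
    int *leb_to_peb;   /* TOY_UNMAPPED if the LEB has no PEB */
    int  leb_count;
};

/* Point every LEB of the volume at "no PEB". */
void toy_volume_init(struct toy_volume *vol, int *table, int leb_count)
{
    vol->leb_to_peb = table;
    vol->leb_count = leb_count;
    for (int i = 0; i < leb_count; i++)
        table[i] = TOY_UNMAPPED;
}

/* Map (or re-map) a LEB to a PEB; returns the previously mapped PEB. */
int toy_remap(struct toy_volume *vol, int lnum, int pnum)
{
    int old = vol->leb_to_peb[lnum];

    vol->leb_to_peb[lnum] = pnum;
    return old;
}

/* Return the PEB currently backing a LEB, or TOY_UNMAPPED. */
int toy_lookup(const struct toy_volume *vol, int lnum)
{
    return vol->leb_to_peb[lnum];
}
```

The user of the volume only ever sees LEB numbers, so the layer underneath is free to call toy_remap() at any time, which is what makes transparent wear-leveling possible.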

UBI volume size is specified when the volume is created and may later be changed (volumes are dynamically re-sizable). UBI supports dynamic volumes and static volumes. Static volumes are read-only and their contents are protected by CRC checksums, while dynamic volumes are read-write and the upper layer (e.g., a file-system) is responsible for data integrity.

UBI is aware of bad eraseblocks (e.g., NAND flash may have them) and frees the upper layer from any bad block handling. UBI has a pool of reserved physical eraseblocks, and when a physical eraseblock becomes bad, it transparently substitutes it with a good physical eraseblock. UBI also moves good data from the newly appeared bad physical eraseblock to the good one. The result is that users of UBI volumes do not notice I/O errors as UBI takes care of them.

NAND flashes may have bit-flips which occur on read and write operations. Bit-flips are corrected by ECC checksums, but they may accumulate over time and cause data loss. UBI handles this by moving data from physical eraseblocks with bit-flips to other physical eraseblocks, thus doing active scrubbing. This is done transparently in background and is hidden from upper layers.

Here is a short list of the main UBI features:

Here is a comparison of MTD partitions and UBI volumes. UBI volumes are rather similar to MTD partitions because:

But UBI volumes have the following advantages over traditional MTD partitions:

So, existing software may still work on top of UBI volumes, while new software may benefit from the UBI features and let UBI solve many issues which the flash technology imposes.

Source code

UBI is in the main-line Linux kernel starting from version 2.6.22. But it is recommended to use the latest UBI which can be found in the UBI git tree:

git://git.infradead.org/~dedekind/ubi-2.6.git

The UBI git tree is usually based on top of the latest release of the Linux kernel and it should not be too difficult to fetch all UBI patches from the UBI git tree and to apply them to your tree. In fact, this work has already been done in UBIFS back-ports (see here). The back-port trees contain all UBI patches from the main-line. Just pick the UBI patches and apply them to your tree.

Mailing list

You are welcome to send feed-back, bug-reports, patches, etc to the MTD mailing list.

Saving erase counters

When working with UBI, it is important to realize that UBI stores erase counters on the flash media. Namely, each physical eraseblock has a so-called erase counter header which stores the number of times this physical eraseblock has been erased. And of course, it is important not to lose the erase counters, which means the tools you use to erase the flash and to flash UBI images have to be UBI-aware. For example, you may use the ubiformat utility (see here), which does this correctly.

Some details for flasher programs

This section provides information about what the program which flashes UBI images has to do. Please, skip this if you are not going to implement a flasher.

The following is a rough list of things the flasher program has to do when flashing UBI images.

Besides preserving the erase counter, there is one more important thing which should be done - the flasher should drop 0xFF bytes at the end of eraseblocks. Well, this is actually only relevant for some NAND flashes, but if your flasher does not do this, you may face very unpleasant problems which might be difficult to debug later. And this is relevant to JFFS2 images, not just UBI images. In other words, it is not UBI-specific.

Let's take an UBIFS image which is wrapped into an UBI image (say, ubi.img) which has to be flashed to NAND flash. Suppose one of the physical eraseblocks in the ubi.img image file contains just an erase counter and the rest are 0xFF bytes. If the flasher writes these 0xFF bytes, the ECC bytes are written to the OOB area of the corresponding NAND pages. Depending on the algorithm, the resulting ECC may be 0xFF or something else. Later, when UBIFS is mounted, it treats that space as empty and tries to write data there, but fails (in the best case; otherwise you find the data corrupted later and have a hard time figuring out what is going wrong). So basically what happens is that NAND pages are programmed twice - first by the flasher (only 0xFF bytes) and then by UBIFS. For some NAND flashes this might work fine, for some not.

By the way, we experienced the same problem with JFFS2. The JFFS2 images generated by the mkfs.jffs2 program were padded to the physical eraseblock size and were later flashed to our NAND. The flasher did not bother to drop NAND pages containing only 0xFF bytes and just flashed them. Later, when JFFS2 was mounted, it tried to append more data to that last physical eraseblock, which led to various ECC errors later on. It took a while to recognize the problem.

Thus, what the flasher has to do is to drop all the empty NAND pages from the end of the physical eraseblock buffer before writing it. Below is example code from UBI which is doing this.

/**
 * ubi_calc_data_len - calculate how much real data is stored in a buffer.
 * @ubi: UBI device description object
 * @buf: a buffer with the contents of the physical eraseblock
 * @length: the buffer length
 *
 * This function calculates how much "real data" is stored in @buf and returns
 * the length. Continuous 0xFF bytes at the end of the buffer are not
 * considered as "real data".
 */
int ubi_calc_data_len(const struct ubi_device *ubi, const void *buf,
                      int length)
{
        int i;

        for (i = length - 1; i >= 0; i--)
                if (((const uint8_t *)buf)[i] != 0xFF)
                        break;

        /* The resulting length must be aligned to the minimum flash I/O size */
        length = ALIGN(i + 1, ubi->min_io_size);
        return length;
}

This function is called before writing buf to a physical eraseblock. The purpose of this function is to drop 0xFF bytes from the end and prevent the situation described above. ubi->min_io_size is the minimal input/output unit size, which is the NAND page size for NAND flashes.
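A flasher program runs in user space and does not have struct ubi_device or the kernel ALIGN() macro at hand. The following is a minimal stand-alone sketch of the same idea; the function name trimmed_write_len and the way min_io_size is passed in are assumptions made for this example, not part of UBI.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many bytes of buf should be written to flash: everything up to
 * the last non-0xFF byte, rounded up to a multiple of min_io_size.
 * min_io_size must be non-zero and len is expected to be a multiple of it.
 */
size_t trimmed_write_len(const uint8_t *buf, size_t len, size_t min_io_size)
{
    /* Skip the trailing run of 0xFF bytes. */
    while (len > 0 && buf[len - 1] == 0xFF)
        len--;

    /* Round up to the next min_io_size boundary. */
    return (len + min_io_size - 1) / min_io_size * min_io_size;
}
```

For a physical eraseblock buffer that contains only 0xFF bytes the function returns 0, i.e., nothing has to be written at all for that part of the buffer.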

User-space tools

UBI user-space tools are available from the git://git.infradead.org/mtd-utils.git repository (ubi-utils/new-utils sub-directory). One should download the source code and compile it. Please, download the latest version of the tools:

git clone git://git.infradead.org/mtd-utils.git

The repository contains the following UBI tools:

All UBI tools support "-h" option and print sufficient usage information.

Note, ubiattach and ubidetach won't work if the kernel version is older than 2.6.25, because the corresponding UBI feature did not exist in older kernels. However, it is possible to back-port UBI patches from the UBI git tree.

Also note, there are old UBI tools which might be useful (e.g., unubi).

Scalability issues

Unfortunately, UBI scales linearly in terms of flash size: UBI initialization time depends linearly on the number of physical eraseblocks on the flash. This means that the larger the flash, the more time it takes for UBI to initialize (i.e., to attach the MTD device). The initialization time depends on the flash I/O speed and (slightly) on the CPU speed, because:

Here are some figures:

Unfortunately we do not have more data and the reader is welcome to send it to us via the MTD mailing list.

Implementation details

In general, UBI needs three tables for operation:

The volume table is maintained on flash. It changes only when UBI volumes are created, deleted and re-sized, which are rare and not time-critical operations, and UBI can afford a slow and simple method of the volume table management.

The EBA and EC tables are changed every time an LEB is mapped to a PEB or a PEB is erased, which happens quite often and means that the table management methods would have to be fast and efficient if the tables were maintained on flash. And this would inevitably involve journaling, journal replay, journal commit, etc. UBI could be logarithmically scalable if it maintained the latter 2 tables on the flash media, but it does not do this.

One of the UBI requirements was simplicity of on-flash format, because the original UBI designers had to read UBI volumes from the boot-loader and they had very tough constraints on the boot-loader code size. It was basically impossible to add complex journal scanning and replay code to the boot-loader.

UBI does not maintain the EBA and EC tables on flash, but instead, it builds them in RAM each time it attaches an MTD device. UBI keeps and maintains erase counter and LEB mapping of each physical eraseblock in the physical eraseblock itself. This means, that:

So, UBI has to scan the flash and read the EC and VID header from each PEB in order to build in-RAM EC and EBA tables.

The drawbacks of this design are poor scalability and relatively high overhead on NAND flashes (e.g., the overhead is 1.5%-3% of flash space in case of a NAND flash with 2KiB NAND page and 128KiB eraseblock). The advantages are a simple binary format and robustness, as the result of simplicity.

Nonetheless, it is always possible to create UBI2 which would maintain the tables in separate flash areas. UBI2 would not be compatible with UBI because of completely different on-flash format, but the user interfaces would stay the same, which would guarantee compatibility of all the software built on top of UBI.

Flash space overhead

UBI uses some amount of flash space for its own purposes, thus reducing the amount of flash space available for UBI users. Namely:

Let's introduce symbols:

The UBI overhead is (B + 4) * SP + O * (P - B - 4) bytes, i.e., this many bytes will not be accessible to users. O is different for different flashes:
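To get a feeling for the numbers, the formula can be evaluated directly. Note the assumptions in this sketch: the symbol meanings used below (P is the number of PEBs, SP the PEB size in bytes, B the number of PEBs reserved for bad PEB handling, O the per-PEB header overhead in bytes) are our reading of the formula, and the function name is made up for this example.

```c
#include <stdint.h>

/*
 * Evaluate (B + 4) * SP + O * (P - B - 4).
 * Assumed meanings: P = number of PEBs, SP = PEB size in bytes,
 * B = PEBs reserved for bad PEB handling, O = per-PEB header overhead in bytes.
 */
uint64_t ubi_overhead_bytes(uint64_t P, uint64_t SP, uint64_t B, uint64_t O)
{
    return (B + 4) * SP + O * (P - B - 4);
}
```

For example, with the assumed values P = 2048, SP = 128KiB, B = 20 and O = 4096 (two 2KiB NAND pages), the result is 24 * 131072 + 4096 * 2024 = 11436032 bytes.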

LEB un-map operation

The LEB un-map operation is available via the ubi_leb_unmap() call of the UBI kernel API. The operation is not available via the user-space interfaces. The LEB un-map operation:

UBI returns all 0xFF bytes when an un-mapped LEB is read, so the un-map operation is very similar to the erase operation (a very fast erase operation). But there is a difference UBI programmers have to be well aware of.

Suppose you un-map LEB L which is mapped to PEB P. Since P is not synchronously erased, but just scheduled for erasure, there might be "surprises" in case of unclean reboots: if the reboot happens before P has been physically erased, L will be mapped to P again when UBI attaches the MTD device after the unclean reboot. Indeed, UBI will scan the MTD device and find P which refers to L, and it will add this mapping information to the EBA table.

But once you write any data to L, it gets mapped to a new PEB, and the old contents are gone forever, because even in case of an unclean reboot UBI would pick the newer mapping for L.

You may use the ubi_leb_map() call which maps the LEB to an empty PEB, so the LEB would always contain only 0xFF bytes, even in case of an unclean reboot. But do not use this unless it is really needed, because this puts additional overhead on the UBI wear-leveling sub-system, compared to an un-mapped LEB. Indeed, if an LEB is un-mapped, there is no PEB which contains the LEB's data, and the wear-leveling sub-system does not have to move any data to maintain wear-leveling. Conversely, if the LEB is mapped to a PEB, there is one more PEB for the wear-leveling sub-system to care about, and one more LEB to re-map to another PEB if the erase counter of the current PEB becomes too low (then the LEB is re-mapped to a PEB with a higher erase counter and the old PEB is used for other operations).

Implementation details

This section describes how UBI distinguishes between older and newer versions of an LEB in case of an unclean reboot. Suppose we un-map LEB L which is mapped to PEB P1, which means UBI schedules P1 for erasure. Then we write some data to L, which means that UBI finds another PEB P2, maps L to P2, and writes the data to P2. If an unclean reboot happens before P1 is physically erased, but after the write operation, we end up with 2 PEBs (P1 and P2) mapped to the same LEB L.

To handle situations like this, UBI maintains a global 64-bit sequence number variable. The sequence number variable is increased each time a PEB is mapped to an LEB and its value is stored in the VID header of the PEB. So each VID header has a unique sequence number, and the larger the sequence number, the "younger" the VID header. When UBI attaches an MTD device, it initializes the global sequence number variable to the highest value found in existing VID headers plus one.

In the above situation, UBI just selects a PEB with higher sequence number (P2) and drops the PEB with lower sequence number (P1).
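The selection rule can be sketched with a toy scan routine. This is an illustration only: struct toy_vid and toy_scan are hypothetical names, the structure carries just the fields needed for the example, and the real scanning code deals with much more than this.

```c
#include <stdint.h>

#define TOY_NO_PEB (-1)

/* What a scan would learn from one PEB (a toy stand-in for the VID header). */
struct toy_vid {
    int      in_use;   /* 0 if the PEB holds no LEB data */
    int      lnum;     /* LEB this PEB claims to hold */
    uint64_t sqnum;    /* sequence number stored when the mapping was made */
};

/*
 * Build a LEB -> PEB table from per-PEB headers. When two PEBs claim the
 * same LEB, the one with the larger sequence number wins.
 * Returns the next sequence number to use (highest seen plus one),
 * or 0 if no PEB is in use.
 */
uint64_t toy_scan(const struct toy_vid *pebs, int peb_count,
                  int *eba, int leb_count)
{
    uint64_t max_sqnum = 0;
    int seen = 0;

    for (int l = 0; l < leb_count; l++)
        eba[l] = TOY_NO_PEB;

    for (int p = 0; p < peb_count; p++) {
        const struct toy_vid *v = &pebs[p];

        if (!v->in_use || v->lnum < 0 || v->lnum >= leb_count)
            continue;
        if (!seen || v->sqnum > max_sqnum)
            max_sqnum = v->sqnum;
        seen = 1;
        /* Keep the PEB with the larger sequence number for this LEB. */
        if (eba[v->lnum] == TOY_NO_PEB || v->sqnum > pebs[eba[v->lnum]].sqnum)
            eba[v->lnum] = p;
    }
    return seen ? max_sqnum + 1 : 0;
}
```

If P1 carries sequence number 7 and P2 carries 9 for the same LEB, the table ends up pointing at P2 and the next sequence number to hand out is 10.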

Note, the situation is more difficult if an unclean reboot happens when UBI moves the contents of one PEB to another for wear-leveling purposes, or when it happens during the atomic LEB change operation. In this case it is not enough to just pick the newer PEB, it is also necessary to make sure all the data was written, not just part of it.

Volume update operation

Unlike raw MTD devices, UBI devices support the volume update operation which may be useful to implement software updates in end-user devices. The operation replaces the contents of the whole UBI volume with new contents. Of course, one could do this with raw MTD devices by means of just erasing the device and putting the new image on it. But the advantage of the UBI volume update operation is that if it gets interrupted, the volume goes into the "corrupted" state and further I/O on the volume ends up with an EBADF error. And the only way to get the volume back to the normal state is to start a new volume update operation and to finish it.

The volume update operation makes it possible to detect interrupted updates and to re-start them with the help of, for example, a "mirror" volume which would have the same contents, or by showing a dialog window which would inform the end user about the problem and request re-flashing. In contrast, it is difficult to detect interrupted updates in case of raw MTD devices.

The volume update operation is available only via the user-space UBI interface and it is not available via the UBI kernel API. To update a volume, you first have to call the UBI_IOCVOLUP ioctl of the corresponding volume character device and to pass a pointer to a 64-bit value containing the length of the new volume contents in bytes. Then this number of bytes has to be written to the volume character device. Once the last byte has been sent to the character device, the update operation is finished. Schematically, the sequence is:

fd = open("/dev/my_volume", O_RDWR);
ioctl(fd, UBI_IOCVOLUP, &image_size);
write(fd, buf, image_size);
close(fd);

See include/mtd/ubi-user.h for more details. Bear in mind, the old contents of the volume are not preserved in case of an interrupted update. Also, it is not necessary to write all the new data in one go. It is OK to call the write() function an arbitrary number of times and pass an arbitrary amount of data each time. The operation will be finished only after all the data has been written. If the last write operation contains more bytes than UBI expects, the extra data is just ignored.
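Since the data may be passed in several write() calls, an update program typically needs a loop which copes with short writes. Below is a minimal sketch of such a loop using plain POSIX calls only; the helper name write_all is made up for this example, and in a real updater it would be called on the volume character device after the UBI_IOCVOLUP ioctl.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write exactly len bytes to fd, retrying on short writes and EINTR.
 * Returns 0 on success and -1 on error (errno is set by write()).
 */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal, try again */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

If write_all() fails in the middle of an update, the safe assumption is that the update is unfinished and has to be started again from the beginning.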

A special case of volume update is what we call "volume truncation", which may be done by specifying zero length of the new contents. In this case the volume is just wiped out and will contain all 0xFF bytes.

Note, the /sys/class/ubi/ubiX_X/corrupted sysfs file reflects the "corrupted" state of the volume: it contains ASCII "0\n" if the volume is OK and is not corrupted, and "1\n" if it is corrupted (because a volume update had started but not finished).
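A program may check this file before using the volume. The following is a small sketch of how that could look; the function names parse_corrupted and read_corrupted are made up for this example, and the caller is expected to supply the actual sysfs path of the volume.

```c
#include <stdio.h>

/*
 * Interpret the text of a "corrupted" sysfs file.
 * Returns 0 for "0\n", 1 for "1\n", and -1 for anything else.
 */
int parse_corrupted(const char *text)
{
    if (text[0] == '0' && text[1] == '\n' && text[2] == '\0')
        return 0;
    if (text[0] == '1' && text[1] == '\n' && text[2] == '\0')
        return 1;
    return -1;
}

/* Read the file at path and return the parsed flag, or -1 on any error. */
int read_corrupted(const char *path)
{
    char buf[8] = "";
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f))
        buf[0] = '\0';
    fclose(f);
    return parse_corrupted(buf);
}
```

A return value of 1 means the last update was interrupted and a new volume update has to be run to completion before the volume can be used again.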

Technically, it is possible to implement an "atomic" volume update operation, which would mean that the contents of the volume would stay unchanged in case of interrupted updates. But this would require as much free space as the size of the volume to be updated. This is not currently implemented.

Implementation details

The volume update is implemented with the help of a so-called "update marker". Once the user has issued the UBI_IOCVOLUP ioctl, UBI sets the update marker flag for the volume in the corresponding record of the UBI volume table. Then the volume is wiped out and UBI waits for the user to pass the data. Once all the data has arrived and has been written to the flash, the update marker is cleared. But in case of an interruption (e.g., unclean reboot, crash of the update application, etc.), the update marker is not cleared and the volume is treated as "corrupted". Only a new successful update operation may clear the update marker.
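The life cycle of the update marker can be modeled with a few lines of code. This is a toy model only: struct toy_vol and the toy_update_* functions are hypothetical, nothing is stored on flash here, and treating a zero-length update as immediately finished is an assumption of this sketch.

```c
#include <stdint.h>

struct toy_vol {
    int     upd_marker;   /* 1 while an update is unfinished */
    int64_t expected;     /* bytes announced when the update started */
    int64_t received;     /* bytes written so far */
};

/* Start an update: set the marker and remember the announced length. */
void toy_update_start(struct toy_vol *v, int64_t bytes)
{
    v->upd_marker = 1;
    v->expected = bytes;
    v->received = 0;
    if (bytes == 0)
        v->upd_marker = 0;   /* assumption: zero length finishes at once */
}

/* Account for a write; the marker is cleared once all bytes have arrived. */
void toy_update_write(struct toy_vol *v, int64_t count)
{
    if (!v->upd_marker)
        return;
    if (count > v->expected - v->received)
        count = v->expected - v->received;   /* extra bytes are ignored */
    v->received += count;
    if (v->received == v->expected)
        v->upd_marker = 0;
}

/* A volume whose marker is still set counts as corrupted. */
int toy_is_corrupted(const struct toy_vol *v)
{
    return v->upd_marker;
}
```

If the "device" is restarted between toy_update_start() and the final toy_update_write(), the marker is still set, which is exactly the condition that makes the volume "corrupted".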

Atomic LEB change operation

UBI also has an atomic LEB change operation which means that the contents of the LEB stays unchanged if the operation gets interrupted. In other words, the result of the operation is that the LEB either has the old contents or the new contents.

The operation is available via the ubi_leb_change() kernel API call. The user-space interface for this operation does not exist in the main-line kernel so far, but it was recently implemented and may be found in the UBI git tree. It should be available in the main-line kernels starting from version 2.6.25.

The user-space atomic LEB change operation is run via the UBI_IOCEBCH ioctl command. One has to pass a pointer to a properly filled request object of struct ubi_leb_change_req type. The object stores the LEB number to change and the length of the new contents. Then the user has to write the specified amount of bytes to the volume character device. Notice some similarity to the user-space interface of the volume update operation. Schematically, the sequence is:

struct ubi_leb_change_req req;

req.lnum = lnum_to_change;
req.len = data_len;
req.dtype = UBI_LONGTERM;  /* data persistency (may also be UBI_SHORTTERM
                              and UBI_UNKNOWN) */
fd = open("/dev/my_volume", O_RDWR);
ioctl(fd, UBI_IOCEBCH, &req);
write(fd, data_buf, data_len);
close(fd);

If for some reason the user does not write the declared amount of bytes and closes the file, the operation is canceled and the old contents of the LEB is preserved.

Similarly to the volume update operation, it does not matter how many times the write() function is called and how much data it passes to the UBI volume each time. The atomic LEB change operation finishes once the last data byte arrives.

The atomic LEB change operation might be very useful for file-systems, for example UBIFS uses this operation as a last resort when it commits the file-system index. This operation may also be exploited to create an FTL layer on top of UBI (see here for the description of the idea).

Keep in mind that the atomic LEB change operation calculates the CRC32 checksum of the new data, so it has some overhead compared to the LEB erase plus LEB write sequence. The volume update operation does not calculate data CRC, so it is faster to update the volume than to atomically change all its eraseblocks. Keep this additional overhead in mind and do not use the operation if atomicity is not really needed.

Implementation details

Suppose UBI has to change a logical eraseblock L which is mapped to a physical eraseblock P1. First of all, UBI always has one free PEB reserved for the atomic LEB change operation, let it be P2. Before the operation, P1 stores the contents of the LEB L and P2 is free (it contains only the EC header and 0xFF bytes). The new data is written to P2, not to P1, so should anything go wrong, the old contents of the LEB are always there.

When the operation finishes, UBI un-maps L from P1, maps it to P2, and schedules P1 for erasure. If the operation is interrupted, L stays mapped to P1 and P2 is scheduled for erasure.

If an unclean reboot happens half way through the atomic LEB change operation, it is obvious that UBI has to preserve the L -> P1 mapping and erase P2 when it attaches the MTD device next time. But if the unclean reboot happens just after the atomic LEB change operation finishes, but before P1 is physically erased, it is obvious that UBI has to preserve the L -> P2 mapping and erase P1.

To resolve situations like that, UBI calculates the CRC checksum of the new contents of the LEB before it is written to flash, and stores it in the VID header (together with the data length). When UBI finds 2 PEBs P1 and P2 mapped to the same LEB L during the initialization, it selects the one with the higher sequence number (P2) only if the data CRC is correct (which means that all the data has been written to the flash media), otherwise it selects the PEB with the lower sequence number (P1). Of course, UBI has to read the LEB contents in order to check the CRC checksum.
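The rule can be illustrated with a small sketch. Everything here is made up for the example: struct toy_peb holds only the fields needed to show the idea, and toy_crc32 is the common zlib-style CRC-32 used as a stand-in, which is not necessarily configured the same way as the checksum UBI stores on flash.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), zlib-style. */
uint32_t toy_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Toy view of a PEB: header fields plus the data actually found on flash. */
struct toy_peb {
    uint64_t       sqnum;
    uint32_t       data_crc;   /* CRC recorded before the data was written */
    size_t         data_len;   /* length recorded with the CRC */
    const uint8_t *data;       /* what is really on the media */
};

/*
 * Of two PEBs claiming the same LEB, return the newer one only if its data
 * matches the recorded CRC; otherwise fall back to the older one.
 */
const struct toy_peb *toy_pick(const struct toy_peb *a, const struct toy_peb *b)
{
    const struct toy_peb *newer = a->sqnum > b->sqnum ? a : b;
    const struct toy_peb *older = newer == a ? b : a;

    if (toy_crc32(newer->data, newer->data_len) == newer->data_crc)
        return newer;
    return older;
}
```

A PEB whose data was only partially written still contains 0xFF bytes where the rest of the data should be, so its checksum does not match and the older PEB is kept.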

Volume auto-resize

It is well-known that NAND chips have some amount of physical eraseblocks marked as bad by the manufacturer. The bad PEBs are distributed randomly and their number is different, although manufacturers usually guarantee that the first few physical eraseblocks are not bad and the total amount of bad PEBs does not exceed a certain number. For example, a new 256MiB Samsung OneNAND chip is guaranteed to have no more than 40 bad 128KiB PEBs (but of course, more physical eraseblocks will become bad over time). This is about 2% of the whole flash size.

When you need to create an UBI image which will be flashed to the end user devices on the production line, you have to define the exact sizes of all volumes (the sizes are stored in the UBI volume table). But this is difficult to do because the total usable flash size may vary depending on the amount of initially bad PEBs.

One obvious way to solve the problem is to assume the worst case, when all chips would have the maximum amount of bad PEBs. But in practice, most of the chips will have only a few bad PEBs, which is far less than the maximum. In general, this is fine and it increases reliability, because UBI uses all PEBs of the device anyway. On the other hand, UBI reserves some amount of physical eraseblocks for bad PEB handling anyway, which is 1% of PEBs by default. So in case of the above mentioned OneNAND chip the result would be that 1% of PEBs would be reserved by UBI, and 0-2% would be available for new volumes (they would be seen as available LEBs for UBI users).

But there is an alternative approach - one of the volumes may be marked as auto-resize, which means that its size is enlarged when UBI is run for the first time. After the volume size is adjusted, UBI removes the auto-resize mark and the volume is not resized anymore. The auto-resize flag is stored in the volume table and only one volume may be marked as auto-resize. For example, if there is a volume which is intended to hold the root file-system, it may be reasonable to mark it as auto-resize.

In the example with the OneNAND chip, if one of the UBI volumes is marked as auto-resize, it will be enlarged by 0-2% on the first UBI boot, but 1% of PEBs will anyway be reserved for bad PEB handling.
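The 0-2% range follows from simple arithmetic: a 256MiB chip with 128KiB PEBs has 2048 PEBs, and 40 of them is a little under 2%. Below is a rough sketch of that calculation; the function names are made up, and the model is simplified in that it only counts the difference between the worst-case and the actual number of bad PEBs.

```c
/*
 * PEBs an auto-resize volume could gain when the image was sized for
 * worst_case_bad bad PEBs but the chip only has actual_bad of them.
 */
unsigned autoresize_gain_pebs(unsigned worst_case_bad, unsigned actual_bad)
{
    return actual_bad < worst_case_bad ? worst_case_bad - actual_bad : 0;
}

/* The same gain expressed in hundredths of a percent of the whole chip. */
unsigned gain_in_basis_points(unsigned gain_pebs, unsigned total_pebs)
{
    return gain_pebs * 10000u / total_pebs;
}
```

For a chip with no bad PEBs at all the gain is 40 PEBs, i.e., 195 hundredths of a percent (about 1.95%) of the 2048 PEBs; for a chip which already has 40 bad PEBs the gain is zero.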

Note, the auto-resize feature was added very recently and it is not in the main-line kernel yet, but it should appear in version 2.6.25. Use UBI git tree to find the implementation of the feature.

Marking eraseblocks as bad

This section is relevant for NAND flashes and other flashes which admit of bad eraseblocks. UBI marks physical eraseblocks as bad on 2 occasions:

  1. eraseblock write operation failed, in which case UBI moves the data from this PEB to some other PEB (data recovery) and schedules this PEB for torturing;
  2. erase operation failed with EIO error, in which case the eraseblock is marked as bad straight away.

The torturing is done in background with the purpose to detect whether the physical eraseblock is bad or not. The write failure may have happened because of many reasons, including bugs in the driver, or in the upper level stuff like file system (e.g., the FS mistakenly writes many times to the same NAND page). During eraseblock torturing UBI does the following:

The eraseblock is not marked as bad if it survives the torture test. Note, a bit-flip during the torture test is treated as a good reason to mark the eraseblock bad as well. Please, refer to the torture_peb() function for detailed information.

More documentation

Unfortunately, there are no thorough and strict UBI documents. But there is an old UBI design document which has some out-of-date information, but is still useful: ubidesign.pdf.

There is also a PowerPoint UBI presentation available: ubi.ppt. Note, this document has to be looked at in Windows, because it contains a lot of animation and Open Office cannot properly show it. Use slide show (F5 key) when you look, because otherwise the animation is not shown.

Much useful information may be found in the FAQ section.

And of course just reading the UBI interface C header files, which contain quite a few comments, may help: include/mtd/ubi-user.h contains the user-space interface definition (namely, it defines UBI ioctl commands and the involved data structures), include/linux/mtd/ubi.h defines the kernel API and the drivers/mtd/ubi/kapi.c file contains comments for each kernel API function (just above the body of the function).

How to send an UBI bugreport?

Before sending a bug report:

Please, attach all the bug-related messages including the UBI messages from the kernel ring buffer, which may be collected using the dmesg utility or using minicom with serial console capturing. And of course, it is wise to describe how the problem can be reproduced. The bugreport should be sent to the MTD mailing list.

Last updated: 21 Apr 2008, dedekind