Symbian OS Internals/07. Memory Models


by Andrew Thoelke

A memory is what is left when something happens and does not completely unhappen
Edward de Bono


The kernel is responsible for two key resources on a device: the CPU and the memory. In Chapter 6, Interrupts and Exceptions, I described how the kernel shares the CPU between execution threads and interrupts.

In this chapter I will examine the high-level memory services provided by EKA2, and the way that the kernel interacts with the physical memory in the device to provide them. To isolate the kernel from different memory hardware designs, this interaction is encapsulated in a distinct architectural unit that we call the memory model. As I describe the different memory models provided with EKA2 you will find out how they use the memory address space (the memory map) and their contribution to overall system behavior.


The memory model

At the application level - and to a large extent when writing kernel-side software - the main use of memory is for allocation from the free store using operator new or malloc. However, there are some more fundamental memory services that are used to provide the foundation from which such memory allocators can be built.

The kernel has the following responsibilities related to memory management:

  1. Management of the physical memory resources: RAM, MMU and caches
  2. Allocation of virtual and physical memory
  3. Per-process address space management
  4. Process isolation and kernel memory protection
  5. The memory aspects of the software loader.

As well as providing these essential services, we wanted to ensure that the design of the memory model does not impose hard or low limits on the operating system. In particular:

  • The number of processes should be limited by physical resources rather than the memory model, and should certainly exceed 64
  • Each process should have a large dedicated address space of 1-2 GB
  • The amount of executable code that can be loaded by a process should be limited only by available ROM/RAM.

We found that the provision of efficient services to carry out these responsibilities is dependent on the memory architecture in the hardware. In particular, a design that is fast and small for some hardware may prove to be too slow or require too much memory if used on another. As one of the aims of EKA2 was to be readily portable to new hardware, including new MMU and memory architectures, we took all of the code that implements the different memory designs out of the generic kernel and provided a common interface. The resulting block of code we call the memory model. This is itself layered, as I have already briefly described in Chapter 1, Introducing EKA2. I will repeat the key parts of the illustration I gave there as Figure 7.1.

Figure 7.1 Memory model layering

At the highest level, we have to distinguish between a native implementation of EKA2 and an emulated one. In the former case, EKA2 is the OS that owns the CPU, the memory and all the peripherals, and the system boots from a ROM image. In the latter case another host OS provides basic services to EKA2, including memory allocation and software loading. This layer is referred to as the platform.

As I mentioned in Chapter 1, Introducing EKA2, there are several ways to design an MMU and cache. We want to provide the best use of memory and performance for Symbian OS and so the different hardware architectures result in different memory model designs. The basic choices are as follows:

  • No MMU - Direct memory model
  • Virtually tagged cache - Moving memory model
  • Physically tagged cache - Multiple memory model
  • Emulator - Emulator memory model

I describe these different memory models in detail later in the chapter. Even for identical memory architectures, different CPUs have different ways of controlling the MMU and cache and the final layer in the memory model, the CPU layer, supplies the specific code to control the memory in individual CPUs.

MMUs and caches

MMU

Before describing how EKA2 uses the RAM in the device to provide the memory services to the operating system and applications, it is worth explaining how the hardware presents the memory to the software.

EKA2 is a 32-bit operating system, which means that it assumes that all memory addresses can be represented in a 32-bit register. This limits the amount of simultaneously addressable memory to 4 GB. In practice there is far less physical memory than this, typically between 16 MB and 32 MB in the mobile phones available at the time of writing.

One of the important aspects of nearly all Symbian devices is that they are open - they allow the user to install third-party native applications and services. This is very different from a mobile handset based on an embedded OS, and is very significant for the way the OS must manage memory. It has several consequences:

  1. In an embedded OS, one can determine the maximum memory requirement of each component. Then, at compilation time, one can allocate exactly the memory that is needed to each component. This means that the exact amount of RAM needed is known when building the product. Static allocation of this kind is not viable with an open platform
  2. There are certain types of application that ideally would use all available memory to provide maximum benefit to the user - for example, web browsers encountering complex web sites. Providing each such application with dedicated RAM would prove very expensive, particularly considering that most of this memory would be unused most of the time
  3. The built-in software can be tested as thoroughly as required by the device manufacturer. However, third-party software added later can threaten the stability and integrity of the device. A poorly written or malicious program can be harmful if this software is allowed to directly interfere with the memory of the OS.

These issues make it important to use a piece of hardware found in higher-end devices: a memory management unit (MMU). This is responsible for the memory interface between the CPU and the memory hardware, which is typically a memory controller and one or more memory chips.

The rest of this section explores the key features of an MMU and how EKA2 makes use of them.

Virtual addresses and address translation

One of the key services of an MMU is an abstraction between what the software considers to be a memory address and the real physical address of the RAM. The former is called the virtual address in this context and the latter the physical address.

This disconnection between the address in the software and the hardware address provides the mechanism to resolve the first two issues associated with an open OS. In particular, the OS can allocate a large range of virtual addresses for an application but only allocate the physical memory as and when the application requires it. Allocation of virtual addresses is often referred to as reserving, allocation of physical memory as committing.
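Symbian OS exposes exactly this split through the chunk API that I describe later in this chapter. The following is a minimal user-side sketch, assuming the standard RChunk class and arbitrary sizes; error handling is reduced to a leave:

#include <e32std.h>

void ReserveThenCommitL()
    {
    RChunk chunk;
    // Reserve 1 MB of virtual address space; commit only the first 4 KB.
    User::LeaveIfError(chunk.CreateLocal(0x1000, 0x100000));
    // Physical RAM is consumed only as further pages are committed:
    User::LeaveIfError(chunk.Adjust(0x80000)); // grow committed region to 512 KB
    chunk.Close(); // releases the committed RAM and the reserved addresses
    }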

The MMU and OS must maintain a mapping from virtual addresses to physical addresses in a form that allows the MMU to efficiently translate from the virtual to physical address whenever a memory access occurs.

The most common structure used to hold this map is called a multi-level page directory, and the hardware supported by Symbian OS specifically supports two-level page directories. Some high-end CPUs now use three or more levels in the MMU, particularly 64-bit processors that support virtual address ranges with more than 32 bits. Figure 7.2 shows what a multi-level page directory might look like.

Figure 7.2 A multi-level page directory

The first level in a two-level directory is commonly referred to as the page directory. Conceptually there is just one of these directories used to do the mapping, although this is not always literally true - and we'll examine that in the next section. The directory is just a table of references to the items in the second level of the directory.

In the second level there are page tables. Typically there will be tens to hundreds of these objects in the mapping. Each page table itself contains a table of references to individual pieces of memory.

The memory itself has to be divided up into a collection of memory pages or frames. MMUs often support a small range of different page sizes: EKA2 prefers to use page sizes of 4 KB and 1 MB, but may also make use of others if available.

Perhaps the best way to understand how a virtual address is translated into a physical address through this structure would be to work through an example. To illustrate how address translation works, I shall concentrate on how an ARM MMU translates an address that refers to memory in a 4 KB page - this translation process is also called page table walking.

Let's suppose that a program has a string, Hello world, at address 0x87654321 and issues an instruction to read the first character of this string. Figure 7.3 illustrates the work done by the MMU to find the memory page containing the string.

Figure 7.3 Algorithm for translating virtual addresses

Currently, ARM MMUs have a page directory which contains 2¹² entries, each of which is 4 bytes - making the page directory 16 KB in size. Each entry is therefore responsible for 2²⁰ bytes of address space, that is, 1 MB. When this entry refers to a page table containing 4 KB pages, the page table will therefore have 2⁸ entries. Again, each entry is 4 bytes, making the page table 1 KB in size.

First, the address provided by the program is broken up into three pieces. These provide three separate indexes into the different levels of the mapping.

Next the MMU locates the address of the page directory by reading its Translation Table Base Register (TTBR). The topmost 12 bits of the virtual address, in this case 0x876, are used as an index into the page directory.

The MMU reads the entry from the page directory and determines that it refers to a page table for 4 KB pages.

Then, using the next 8 bits of the address, 0x54, as an offset into the page table, the MMU can read the page table entry. This now provides the physical address of the memory page.

The final 12 bits of the address are now combined with the page address to create the physical address of the string and this is used to fulfill the read request.

It is worth noting that the addresses in the TTBR, page directory entries and page table entries are all physical addresses. Otherwise, the MMU would have to use page table walking to do page table walking and the algorithm would never terminate!
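The walk can be modeled in a few lines of code. This is purely an illustration of the arithmetic - the MMU performs these steps in hardware, the descriptor layouts are those of an ARMv5 coarse page table, and ReadTTBR() is a hypothetical stand-in for reading the hardware register:

typedef unsigned long TUint32;
extern TUint32 ReadTTBR(); // hypothetical: returns the page directory address

// Software model of the ARMv5 two-level walk for a 4 KB ("small") page.
TUint32 Translate(TUint32 aVirtual)                  // e.g. 0x87654321
    {
    const TUint32* pageDirectory = (const TUint32*)ReadTTBR();
    TUint32 pdIndex = aVirtual >> 20;                // top 12 bits: 0x876
    TUint32 ptIndex = (aVirtual >> 12) & 0xFF;       // next 8 bits: 0x54
    TUint32 offset  = aVirtual & 0xFFF;              // final 12 bits: 0x321
    TUint32 pde = pageDirectory[pdIndex];            // selects a page table
    const TUint32* pageTable = (const TUint32*)(pde & ~0x3FFu); // 1 KB aligned
    TUint32 pte = pageTable[ptIndex];                // selects a memory page
    return (pte & ~0xFFFu) | offset;                 // physical address of the data
    }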

Translation look-aside buffers

The algorithm described for translating addresses is quite simple, but if you consider that it requires two additional external memory accesses for each load or store operation it is easy to see that it is slow and inefficient. To overcome this problem, MMUs provide a cache of the most recent successful address translations, and this is called the Translation Look-aside Buffer (TLB). This often stores the most recent 32 or 64 pages that were accessed, and allows the MMU to very quickly translate virtual addresses that correspond to one of those pages.

As with all caches, the kernel must manage the TLB to ensure that when the kernel makes changes to the underlying page tables it also modifies or discards any affected entries in the TLB.

TLBs are so effective that some MMUs do not provide the page-table walking algorithm in hardware at all. Instead they raise a CPU exception and expect that a software routine provided by the OS will look up the translation and set a TLB entry (if successful) before resuming the memory access that caused the exception. Although EKA2 could support this type of MMU, a reasonable implementation would require that the MMU provides special support for the software walking routine. For example, if the MMU reserved a region of the virtual address space to be directly mapped to physical memory without using the TLB, this would allow the table walking algorithm to read the page directory and page tables without incurring additional TLB miss exceptions.

Virtual address spaces

Earlier, I said that there is only one page directory in the mapping. This is true at any given time as the MMU has only one TTBR register which points to the base of the page directory. However, we can write to the TTBR and tell the MMU to use a different page directory when translating virtual addresses. This is one of the techniques that allows the same virtual address to map onto different physical addresses at different times.

Why would we want to do this?

The format of the executable code in Symbian OS is the basis for one of the reasons - in particular the way in which code refers to data. Symbian uses a relocated code format, in which the code has the actual (virtual) address of the data object. This is in contrast to relocatable code in which data references are all made relative to some external reference, usually a reserved register. It is almost amusing to note that only relocated code requires a set of relocation data in the executable file format so that the OS loader can correctly adjust all of the direct references within the code.

Consider an application, TERCET.EXE for instance, that has a global variable, lasterror, used to record the last error encountered in the program. Once this program has been loaded, linked and relocated there will be several memory blocks used for the program, and within them a direct reference from the program code to the address that the OS has decided to use for the lasterror variable (see Figure 7.4).

Figure 7.4 Memory used to run TERCET.EXE

This seems fine, we have a memory block allocated at virtual address 0xF0000000 for the program code, and another allocated at virtual address 0x00500000 for the program data, and in particular for the lasterror variable. There will be others for the program execution stack and the dynamic memory pool, or heap; however, unlike lasterror these do not have direct references from the program code.

Now suppose that the OS needs to run a second instance of TERCET.EXE at the same time as the first. One of the definitions of a process in the OS is an independent memory address space. So as a separate process, the second copy of TERCET.EXE must have its own thread and execution stack, its own heap and its own copy of the global variables.

One way to achieve this would be to make a second copy of the program code and relocate this for different code and data addresses to the first instance (see Figure 7.5). Notice that this requires the code to be duplicated so that the second instance refers to a different location for the lasterror variable. Symbian OS doesn't do this for two reasons. Firstly, duplicating the code uses more RAM - which is already in short supply. Secondly, and more crucially, built-in software is usually executed in place (XIP) from Flash memory and so it has already been relocated for just one code and data address. And worse - we have discarded the relocation data to save space in the Flash memory, so we cannot make a copy of the code and relocate it for a new address.

Figure 7.5 Running TERCET.EXE twice, duplicating the code

So, in Symbian OS both instances of TERCET.EXE will share the same code memory - but this also implies that the address for lasterror is the same in both processes, 0x00500840 (see Figure 7.6).

Figure 7.6 Running TERCET.EXE twice, sharing the code

We still need the two instances of TERCET.EXE to have separate memory blocks for their variables, so that when an instance of the process is running it finds its own variable mapped to address 0x00500840. So we need a way for the same virtual address to translate to two different physical addresses, depending on which process is currently running.

The solution is for each process in the OS to have its own mapping from virtual to physical addresses, and this mapping is called the process memory context. As I described in Chapter 3, Threads, Processes and Libraries, when we schedule a new thread to run, part of the work that has to be done is to determine if the new thread runs in the same process as the old one. When this is not the case, the memory context has to be changed to ensure that the correct mapping is used for virtual addresses in the new thread and process.

How this is achieved in Symbian OS depends on the type of MMU, and I describe this later in the chapter when I look at the different memory models.

Memory protection

One of the issues that an open OS must address is how to protect the operating system from software which is flawed or even malicious. If all software has direct access to the device memory, then it is not possible to limit the adverse effects that new software might have on a device.

We have already seen that the MMU provides an indirect mapping between the virtual address used by the software and the physical address of the memory provided by the OS. For each of the pages mapped by the MMU, we can supply attributes that describe an access policy for that memory. When used correctly and consistently by an OS this is a very powerful feature:

  • We can protect the kernel data from direct and indirect attacks from user-mode programs
  • We can protect the hardware that uses memory mapped I/O from being accessed directly by user-mode programs
  • We can allow a process to read and write its own memory, but deny it access to that of any other process
  • We can ensure that loaded software cannot be modified after loading by marking it as read-only
  • When this is supported by the MMU, we can ensure that general heap and stack memory cannot be executed as program code, defending against many buffer over-run type attacks
  • We can provide memory that can be shared by just some of the running processes.

Figure 7.7 illustrates these concepts by showing which memory should be made accessible to a thread when running in user or supervisor modes. The memory used by the kernel and two user programs, A and B, is shown where A and B share some code and some data. The left-hand images show memory accessible to a thread in program A in both user and kernel mode - note that the kernel memory is inaccessible to user-mode software. The top right image shows that program B cannot access memory used by program A except for memory that has been shared between these programs. The final image with program C, whose own memory is not shown, shows that this program has no access to any of the memory used by programs A and B. This demonstrates the ideal situation and, as I will describe later, the different memory models sometimes provide less restricted access than is shown here for certain situations. Of course, any such relaxation is made very carefully to preserve the value of providing the memory protection in the first place.

Figure 7.7 Memory accessible to a thread in user and kernel modes

Page faults

The MMU allows us to map all of a device's RAM, 16 MB say, into a much larger 4 GB virtual address space. Clearly many of the virtual addresses cannot map onto physical memory. What happens if we try to access one of these?

When walking through the page tables to translate an address, the MMU may find an entry that is marked as empty, or not present (in the page directory or a page table). When this occurs, the MMU raises a CPU prefetch or data abort exception, depending on whether the memory access was trying to read code or data.

Something very similar will occur if the MMU detects that the CPU is not permitted to access the page because it does not currently satisfy the access policy for the page.

In EKA2, this will usually result in a user-side thread terminating with KERN-EXEC 3 (unhandled exception) or the OS rebooting in the case of a kernel thread. I covered this in more detail in Chapter 6, Interrupts and Exceptions.

Operating systems designed for personal computers all use page faults and the MMU mapping to achieve another goal: demand paging. This is a scheme in which the operating system can effectively pretend that it has more physical memory than is really available. It does this by saving to disk memory pages that have not been used recently, and allowing another program to use the physical memory (for now). The memory mapping is adjusted to record that the old page is now saved to disk and is not present in memory. When this page is accessed once more a page fault occurs, and a special fault handler determines that the contents of the page are on disk, and arranges for it to be loaded back into spare physical memory before restarting the program that faulted.

EKA2 does not support demand paging.

Cache

The second key element of the hardware memory sub-system is the cache. This is very fast (1- or 2-cycle) memory that sits right next to the CPU. The data in the most recently accessed memory is contained here, substantially reducing the number of external memory accesses and therefore improving performance and efficiency.

In Chapter 2, Hardware for Symbian OS, I have discussed caches in some detail.

The memory model interface

The memory model is a distinct architectural block in the EKA2 kernel. As a result the rest of the kernel can be almost entirely independent of the chosen memory architecture and hardware support. To provide that encapsulation, the memory model defines a standard API to which all memory model implementations must conform.

The basic API is in the two classes P and M defined in kern_priv.h. P denotes the API exposed by the platform layer in the EKA2 software layer diagram, and M denotes the API exposed by the model layer in the same diagram:

class P
    {
public:
    static TInt InitSystemTime();
    static void CreateVariant();
    static void StartExtensions();
    static void KernelInfo(TProcessCreateInfo& aInfo, TAny*& aStack, TAny*& aHeap);
    static void NormalizeExecutableFileName(TDes& aFileName);
    static void SetSuperPageSignature();
    static TBool CheckSuperPageSignature();
    static DProcess* NewProcess();
    };

class M
    {
public:
    static void Init1();
    static void Init2();
    static TInt InitSvHeapChunk(DChunk* aChunk, TInt aSize);
    static TInt InitSvStackChunk();
    static TBool IsRomAddress(const TAny* aPtr);
    static TInt PageSizeInBytes();
    static void SetupCacheFlushPtr(TInt aCache, SCacheInfo& c);
    static void FsRegisterThread();
    static DCodeSeg* NewCodeSeg(TCodeSegCreateInfo& aInfo);
    };

This appears to be a very small API indeed, but it does hide a few secrets. All but four of the functions are related to startup. Invoking the startup functions both initializes the memory model within the kernel and configures the kernel for the memory model. In particular:

M::Init1() During this initialization phase the process context switch callback is registered with the scheduler. This callback will be used for all address space changes triggered by a context switch.
M::SetupCacheFlushPtr() Provides the memory address to be used by the cache manager when flushing the caches.

The two most interesting functions here are P::NewProcess() and M::NewCodeSeg(). These are not expected to return plain DProcess and DCodeSeg objects, but rather classes derived from them. We had a brief look at DProcess in Chapter 3, Threads, Processes and Libraries, but what you should note here is that it has a number of virtual members - and among them are further factory functions, DProcess::NewChunk() and DProcess::NewThread(), designed to return memory model-specific classes derived from DChunk and DThread.

It is these four classes - DProcess, DThread, DChunk and DCodeSeg - that provide the main API between the generic layers of the kernel and the memory model.
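A sketch of how a memory model plugs in via these factory functions follows. The DMemModel names match the convention used in the EKA2 sources, but the bodies and simplified signatures are illustrative only, with the model-specific members elided:

// Each memory model derives its own concrete classes (members elided).
class DMemModelChunk : public DChunk
    {
    // ... implements DoCreate(), Adjust(), Commit() etc. for this model ...
    };
class DMemModelThread : public DThread
    {
    // ... model-specific stack and context handling ...
    };

class DMemModelProcess : public DProcess
    {
public:
    // Factory functions returning the model-specific derived classes:
    virtual DChunk* NewChunk() { return new DMemModelChunk; }
    virtual DThread* NewThread() { return new DMemModelThread; }
    // ... plus the model-specific parts of address space management ...
    };

// The generic kernel calls this and sees only the DProcess interface.
DProcess* P::NewProcess()
    {
    return new DMemModelProcess;
    }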

DChunk

In Symbian OS, the chunk is the fundamental means by which the operating system allocates memory and makes it available to code outside of the memory model.

A chunk is a contiguous range of addressable (reserved) memory of which a subset will contain accessible (committed) memory. On systems without an MMU, the addresses are physical addresses, and the entire chunk is accessible.

On systems with an MMU, Symbian OS provides three fundamental types of chunk, depending on which subsets of the address range contain committed memory.

  1. NORMAL. These chunks have a committed region consisting of a single contiguous range beginning at the chunk base address with a size that is a multiple of the MMU page size
  2. DOUBLE ENDED. These chunks have a committed region consisting of a single contiguous range with arbitrary lower and upper endpoints within the reserved region, subject to the condition that both the lower and upper endpoints must be a multiple of the MMU page size
  3. DISCONNECTED. These have a committed region consisting of an arbitrary set of MMU pages within the reserved region - that is, each page-sized address range within the reserved region that begins on a page boundary may be committed independently.

Although it is obvious that a normal chunk is just a special case of a double-ended chunk, and both of these are special cases of a disconnected chunk, we decided to separate the types because the specialized forms occur frequently and we can implement them more efficiently than the general purpose disconnected chunk. Figure 7.8 shows the different types of chunks and the common terminology used to describe their attributes.
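From user-side code, the three types correspond to different RChunk creation functions. A brief sketch, with arbitrary sizes and cleanup on error elided:

#include <e32std.h>

void CreateChunkTypesL()
    {
    RChunk normal, doubleEnded, disconnected;
    // NORMAL: the committed region always starts at the chunk base.
    User::LeaveIfError(normal.CreateLocal(0x1000, 0x100000)); // 4 KB of 1 MB
    // DOUBLE ENDED: the committed region lies between bottom and top.
    User::LeaveIfError(doubleEnded.CreateDoubleEndedLocal(0x2000, 0x5000, 0x100000));
    // DISCONNECTED: individual pages may be committed anywhere in the range.
    User::LeaveIfError(disconnected.CreateDisconnectedLocal(0, 0, 0x100000));
    normal.Close();
    doubleEnded.Close();
    disconnected.Close();
    }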

As with other types of kernel resource, you can create chunks that are local, or private, to the creating process or chunks that are global. Local chunks cannot be mapped into any other process and thus the operating system uses them for any memory that does not need to be shared. Conversely, you can map a global chunk into one or more other processes. A process can discover and map global chunks that are named, whereas the only way for a process to access an unnamed global chunk is for it to receive a handle to the chunk from a process that already has one.

Figure 7.8 Fundamental chunk types

The operating system uses chunks for different purposes, and this information is also specified when creating a chunk. The memory model uses this information to determine where in the virtual address space to allocate a chunk, which access permissions must be applied and how to map the chunk into the kernel or user process memory context. The kernel uses the TChunkType enumeration to describe the purpose of the chunk to the memory model, and the following table explains the different types:

Value Description
EKernelData There is a single chunk of this type used to manage the global data for all XIP kernel-mode software, the initial (null) thread stack and the dynamic kernel heap. The virtual address for this chunk depends on the memory model, but is set during ROM construction and extracted from the ROM header at runtime. It is used to calculate the runtime data addresses for relocating the XIP code in ROM.
EKernelStack There is a single chunk of this type, used to allocate all kernel-mode thread stacks. The difference from EKernelData is that the address range for this chunk is reserved dynamically during boot.
EKernelCode There is at most a single chunk of this type, used to allocate memory for all non-XIP kernel-mode code, such as device drivers loaded from disk. It differs from the previous type by requiring execute permissions and I-cache management.
EUserCode The kernel uses these chunks to allocate memory or page mappings for non-XIP user-mode code. The memory model determines how these chunks are used and how code is allocated in them.
ERamDrive This chunk contains the RAM drive, if present. The virtual address of the RAM drive is defined by the memory model - this allows the contents to be recovered after a software reboot.
EUserData General purpose chunks for user-mode processes. The kernel uses these chunks for program variables, stacks and heaps. May be private to a process or shared with one or more other processes.
EDllData This chunk allocates memory for writable static variables in user DLLs. The virtual address for this chunk must be fixed by the memory model, as it is used to calculate the runtime data address for XIP code in the ROM. Non-XIP DLLs have their data addresses allocated at load time. Each user process that links to or loads a DLL that has writable static data will have one of these chunks.
EUserSelfModCode This is a special type of user-mode chunk that is allowed to contain executable code. For example, a JIT compiler in a Java runtime would use one for the compiled code sequences. This type of chunk differs from EUserData in the access permissions and also the cache management behavior.
ESharedKernelSingle / ESharedKernelMultiple / ESharedIo The kernel provides these shared chunk types for memory that needs to be shared between device drivers and user-mode programs. Unlike other user-mode accessible chunks, these can only have the mapping adjusted by kernel software, which makes them suitable for direct access by hardware devices.
ESharedKernelMirror Some memory models map shared chunks into the kernel memory context using an independent mapping - in this case, this chunk owns the additional mapping.

Here is the DChunk class:

class DChunk : public DObject
    {
public:
    enum TChunkAttributes
        {
        ENormal         = 0x00,
        EDoubleEnded    = 0x01,
        EDisconnected   = 0x02,
        EConstructed    = 0x04,
        EMemoryNotOwned = 0x08
        };

    enum TCommitType
        {
        ECommitDiscontiguous         = 0,
        ECommitContiguous            = 1,
        ECommitPhysicalMask          = 2,
        ECommitDiscontiguousPhysical = ECommitDiscontiguous | ECommitPhysicalMask,
        ECommitContiguousPhysical    = ECommitContiguous | ECommitPhysicalMask
        };

    DChunk();
    ~DChunk();
    TInt Create(SChunkCreateInfo& aInfo);
    inline TInt Size() const { return iSize; }
    inline TInt MaxSize() const { return iMaxSize; }
    inline TUint8* Base() const { return iBase; }
    inline TInt Bottom() const { return iStartPos; }
    inline TInt Top() const { return iStartPos + iSize; }
    inline DProcess* OwningProcess() const { return iOwningProcess; }

public:
    virtual TInt AddToProcess(DProcess* aProcess);
    virtual TInt DoCreate(SChunkCreateInfo& aInfo)=0;
    virtual TInt Adjust(TInt aNewSize)=0;
    virtual TInt AdjustDoubleEnded(TInt aBottom, TInt aTop)=0;
    virtual TInt CheckAccess()=0;
    virtual TInt Commit(TInt aOffset, TInt aSize,
                        TCommitType aCommitType=DChunk::ECommitDiscontiguous,
                        TUint32* aExtraArg=0)=0;
    virtual TInt Allocate(TInt aSize, TInt aGuard=0, TInt aAlign=0)=0;
    virtual TInt Decommit(TInt aOffset, TInt aSize)=0;
    virtual TInt Address(TInt aOffset, TInt aSize, TLinAddr& aKernelAddress)=0;
    virtual TInt PhysicalAddress(TInt aOffset, TInt aSize,
                                 TLinAddr& aKernelAddress, TUint32& aPhysicalAddress,
                                 TUint32* aPhysicalPageList=NULL)=0;

public:
    DProcess* iOwningProcess;
    TInt iSize;
    TInt iMaxSize;
    TUint8* iBase;
    TInt iAttributes;
    TInt iStartPos;
    TUint iControllingOwner;
    TUint iRestrictions;
    TUint iMapAttr;
    TDfc* iDestroyedDfc;
    TChunkType iChunkType;
    };

In the following table, I describe the meanings of some of DChunk's key member data:

Summary of fields in DChunk:

Field Description
iOwningProcess If the chunk is only ever mapped into a single process, this is the process control block for the process that created and owns this chunk. Otherwise this is NULL.
iSize Size of committed memory in the chunk. Note that this does not include the gaps in a disconnected chunk.
iMaxSize The reserved size of the chunk's address region. The size actually reserved may be larger than the requested maximum size, depending on the MMU.
iBase The virtual address of the first reserved byte in the chunk. This may change over time depending on which user-mode process is currently running, and may also be specific to a memory context that is not the current one - so dereferencing this value directly may not yield the expected results!
iAttributes A set of flags indicating certain properties of the chunk. Some are generic - for example, double ended, disconnected, memory not owned. Some are memory model specific - for example, fixed access (protected by domain), fixed address, code (on moving model) and address allocation and mapping type flags on the multiple model.
iStartPos The offset of the first committed byte in a double-ended chunk. Not used for other chunk types.
iControllingOwner The process ID of the process that set restrictions on the chunk.
iRestrictions Set of flags that control which operations may be carried out on the chunk. For example, this is used to prevent shared chunks from being adjusted by user-mode software. Shared chunks are described in Section 7.5.3.2.
iMapAttr Flags to control how the chunk is mapped into memory. Only used for shared chunks.
iDestroyedDfc A DFC that is invoked once the chunk is fully destroyed. Chunk destruction is asynchronous and depends on all references to the chunk being released - this enables the device that owns the memory mapped by the chunk to know when the mapping has been removed.
iChunkType The type of use that the chunk is put to. This is one of the TChunkType values already described.

We specify the chunk API entirely in terms of offsets from the base address. This is because the base address of a chunk is a virtual address, and thus may change depending on the memory context - in particular, different processes may have a different base address for the same chunk, or the kernel may find a chunk at a different base address than user code does. The precise circumstances under which the base address changes depend on the memory model.

We create a chunk with a specified maximum size, which determines the maximum size of the address range it covers; a chunk may never grow beyond this size. The memory model reserves a suitable region of virtual address space for the chunk which is at least as large as the maximum size, though it may be larger, depending on the particular MMU of the device.

The memory model provides chunk adjust functions which allow the committed region within the chunk to be changed in accordance with the chunk type:

Adjust() Set the end of the committed region of a normal chunk. This will commit or release pages of memory as required to achieve the new size.
AdjustDoubleEnded() Move one or both of the ends of the committed region of a double-ended chunk.
Commit() Commit the pages containing the region specified. If any of the pages are already committed this will fail - so it is advisable to always specify page-aligned offsets.
Decommit() Release the pages containing the region specified. This will ignore pages that are not committed without reporting an error.
Allocate() Allocate and commit a region of the size requested. Optionally allocate a preceding guard region (which is not committed) and request larger than page-size alignment.
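These kernel-side operations surface in the user-side RChunk class under near-identical names. A hedged sketch of adjusting each chunk type, given handles created as in the earlier example (offsets are page-aligned, as the API expects; most return codes are ignored for brevity):

void AdjustChunks(RChunk& aNormal, RChunk& aDoubleEnded, RChunk& aDisconnected)
    {
    // NORMAL: only the end of the committed region moves.
    aNormal.Adjust(0x20000);                      // committed region becomes 128 KB
    // DOUBLE ENDED: move one or both ends of the committed region.
    aDoubleEnded.AdjustDoubleEnded(0x4000, 0x8000);
    // DISCONNECTED: commit, allocate and release pages independently.
    aDisconnected.Commit(0x3000, 0x1000);         // commit a single 4 KB page
    TInt offset = aDisconnected.Allocate(0x2000); // commit 8 KB wherever it fits
    if (offset >= 0)                              // Allocate() returns the offset
        aDisconnected.Decommit(offset, 0x2000);   // release those pages again
    }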

DCodeSeg

A code segment is responsible for the loaded contents of an executable image file, either an EXE or a DLL. This will be the relocated code and read-only data, as well as the relocated initial state of writable data, if it is present in the executable. We store this initial writable data in memory to avoid the need to re-read and relocate this from the executable file when initializing additional copies of the writable data section for the executable. As Symbian OS does not encourage the use of writable static data, this does not result in any significant waste of memory.

The code segment may own the memory for the code in different ways, depending on the memory model. In some cases, a single disconnected chunk manages all the memory for code segments, and in others the code segment manages the pages of memory directly.

As an optimization, code that is part of an XIP ROM does not usually have a code segment object to represent it in the kernel, unless it is directly referenced by a DProcess or DLibrary object.

Here are the other main responsibilities of code segments:

  • Recording important information for the code segment, such as the code and data location, size and run address; the table of exception handlers; the code entry point; the directory of exports and more
  • Maintaining the record of dependencies it has on other code segments. These are the dependencies that result from importing functions from other DLLs. Note that these dependencies can be circular, because mutual dependency between DLLs is legal. These dependencies are used to determine which code segments should be mapped in or out of a process as a result of a DLL being loaded or unloaded, or to determine when a code segment is unused and can be destroyed entirely
  • Mapping the code segment into and out of a process address context when DLLs are loaded and unloaded. How, or if, this happens depends on how the memory model chooses to allocate and map code segments.

Chapter 10, The Loader, provides a thorough description of how code segments are used to manage executable code.

It is worth noting that EKA1 does not have a DCodeSeg object. In EKA1, the responsibilities of DCodeSeg were partly in the DLibrary class and partly in DChunk. This arrangement suited the processor architectures and ROM available at the time it was designed. The complete redesign of this area for EKA2 was driven by the desire to exploit the very different memory model for ARMv6, and a need for a far more scalable design to manage hundreds of executables loaded from non-XIP Flash. EKA2 still has a DLibrary object, but it purely provides the kernel side of the user-mode RLibrary interface to dynamic code.

DProcess

Within the OS, a process is a container of one or more threads (see Chapter 3, Threads, Processes and Libraries) and an instantiation of an executable image file (see Chapter 10, The Loader). However, we have already seen that it is also the owner of a distinct, protected memory context. This means that it is concerned with both owning the memory belonging to the process (or being shared by the process), and also with owning the mapping of that memory into its virtual address space.

The memory model must maintain enough information with its process objects to be able to manage the process address context. This context is used by the memory model in the following situations:

  • Process context switch. The previous context and protection must be removed from the MMU and the new one established. Changing the virtual to physical address map usually requires modifying one or more MMU registers, and may require invalidation of now-incorrect TLB entries and cache data due to changing the virtual to physical mapping
  • Process termination. The memory model must be able to release all memory resources that the process owned or shared and return them to the system. Failure to do this would obviously result in slow exhaustion of the system memory and eventual reboot
  • Inter-process communication - data transfers between processes. When a thread needs to read or write memory belonging to another process - and this includes a kernel thread wishing to read or write user-mode memory - the memory model must be able to locate and map that memory to transfer the data.

DThread

Although the memory model is concerned with managing the process address space, a number of the operations that depend on the implementation of the memory model are logically carried out on threads, and so these operations are presented as members of the DThread class:

AllocateSupervisorStack() / FreeSupervisorStack() / AllocateUserStack() / FreeUserStack() The management of supervisor and user-mode thread stacks is memory model dependent. MMU-enabled memory models typically allocate all thread stacks for a process in a single disconnected user-mode chunk, with uncommitted guard pages between each stack to catch stack overflow. Similarly, the kernel allocates all kernel-mode thread stacks in a single kernel chunk with guard pages between them.
ReadDesHeader() / RawRead() / RawWrite() These support the kernel reading from and writing to another thread's user memory. These methods will carry out any necessary checks to ensure that the specified remote memory is part of the thread's user memory address space. They can also check that the local memory buffer is within the executing thread's user memory context. This functionality is exposed to drivers via the Kern::ThreadDesRead() and Kern::ThreadRawRead() set of APIs, which in addition will trap any exceptions caused by unmapped addresses. The user-mode client/server RMessagePtr2 APIs in turn use these for transferring data buffers between a client and server.
ExcIpcHandler() This provides the exception handler used in conjunction with the exception trap (see my discussion of XTRAP in Chapter 6, Interrupts and Exceptions) as part of the inter-process copying I mentioned earlier. This enables an exception caused by providing a faulty remote address to be treated as an error response, but one caused by a faulty local address to be treated as a programming error, that is, a panic.
RequestComplete() This is the kernel side of the Symbian OS programming patterns that use TRequestStatus, User::WaitForRequest() and active objects. Requests are always completed through this function, which writes the 32-bit status word into the target (requesting) thread's memory. As this is the basis for all inter-thread communication, performance is paramount, and so the memory model usually implements this operation as a special case of writing to another thread's memory space.
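For example, a device driver might combine two of these services along the following lines. This is a hedged sketch - the function and the buffer layout are invented for illustration - but Kern::ThreadRawRead() and Kern::RequestComplete() are the real driver-facing entry points:

#include <kernel.h>

// Runs kernel-side: read a small buffer from the client's address space,
// then complete the client's asynchronous request.
void ServiceClientRequest(DThread* aClient, TRequestStatus* aStatus,
                          const TAny* aClientBuffer)
    {
    TUint8 localCopy[64];
    // Verifies that the remote range is valid user memory in aClient's
    // context and copies it safely into the local buffer.
    TInt r = Kern::ThreadRawRead(aClient, aClientBuffer,
                                 localCopy, sizeof(localCopy));
    // ... act on localCopy here ...
    // Writes the completion code into the client's TRequestStatus and
    // signals its request semaphore.
    Kern::RequestComplete(aClient, aStatus, r);
    }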

The memory models

Up to this point we have looked at a number of the problems faced when reconciling the need for Symbian OS to be open to third-party software, and yet robust against badly written or malicious programs. However, I have so far avoided providing the precise details of how these issues are resolved in EKA2. The reason for this is that the best solution depends on the design of the memory management hardware - and there are two very different designs of the level 1 memory sub-system employed in ARM processors. This has led to two different memory model implementations for devices running Symbian OS on ARM processors.

We developed the first implementation on EKA1, in the early days of Symbian OS, for version 3 of the ARM Architecture (ARMv3). The reason this is known as the moving memory model will become apparent as I explain its motivation and design. We developed and optimized this first implementation gradually, all the way through to version 5 of the ARM Architecture (ARMv5) and use it on both EKA1 and EKA2.

For version 6 of their architecture (ARMv6), ARM made a radical departure from their previous MMU and cache designs. At this point it made sense for Symbian OS to replace the moving model with a new design, the multiple memory model, which would make the most of the ARMv6 features and provide enhanced performance, reliability and robustness for Symbian OS.

For both of these memory models, we will look at the hardware architecture, describe how this is utilized to provide the memory model services and present the memory map, which depicts the way that the virtual address space is allocated to the OS.

For completeness, I will also briefly describe the two other memory models provided with EKA2: the direct memory model, enabling EKA2 to run without the use of an MMU, and the emulator memory model, which we use to provide a memory environment on the emulator that matches hardware as closely as possible.

The moving model

We developed this memory model specifically for ARM processors up to and including those supporting ARMv5 architecture.

Hardware

The ARM Architecture Reference Manual (by Dave Seal, Addison-Wesley Professional) provides a detailed description of the memory sub-system in ARMv5. Here I will describe those features that have a significant impact on the memory model design.

Virtual address mapping

In ARMv5, the top-level page directory has 4096 entries, each of which is 4 bytes, making the directory 16 KB in size. Many operating systems that provide individual process address spaces use the simple technique of allocating a different page directory for each process - and then the context switch between processes is a straightforward change to the MMU's base register (the TTBR). However, on devices with limited RAM, we considered allocating 16 KB per process excessive and so we needed an alternative scheme for managing multiple address spaces.

Protection

ARMv5 provides two systems for protecting memory from unwanted accesses.

The first of these systems is the page table permissions: each page of memory that is mapped has bits to specify what kind of access is allowed from both user and supervisor modes. For example, a page can be marked as read-only to all modes, or no-access in user modes but read/write for supervisor modes. Obviously, memory that is not referenced in the current address map cannot be accessed either.

The second protection system is called domains. ARMv5 supports up to 16 domains. Each entry in the page directory contains a field to specify in which domain this address range lies. Thus, every mapped page lives in exactly one domain. The MMU has a register that controls the current access to each domain, with three settings: access to the domain is not allowed and always generates a fault, access to the domain is always allowed and page table permissions are ignored, or access to the domain is policed by the page table permissions. Using domains allows large changes to the memory map and effective access permissions to be made by small changes to the page directory entries and to the domain access control register (DACR).

Caches

The ARMv5 cache design uses a virtually indexed and virtually tagged cache - this means that the virtual address is used to look up the set of cache lines that may contain the data being requested, and also to identify the exact cache cell that contains the data. The benefits are that no address translation is required if the data is in the cache, theoretically reducing power requirements. In practice, the MMU must still check the TLB to determine the access permissions for the memory.

However, as I discussed earlier, in a system that is managing multiple address spaces we expect the same virtual address to sometimes refer to two different physical addresses (depending on which process is current). This form of multiple mapping is sometimes referred to as a homonym - the same virtual address may mean more than one thing. There are also situations where we might wish to use two different virtual addresses to refer to the same physical memory, for example when sharing memory between processes or with a peripheral. This other form of multiple mapping is called a synonym - different virtual addresses have the same meaning.

Figure 7.9 illustrates the problem of homonyms in ARMv5. Only one of the data items can be cached for the virtual address at any point in time because the MMU uses the virtual address to identify that item in the cache. We can only support the use of multiple overlapping address spaces by removing the virtual address and data from the cache during a context switch between the processes, ensuring that any updates are copied back to main memory. Otherwise the second process will only ever see (and access) the first process's memory.

Figure 7.9 Homonyms in ARMv5

In addition, the TLB cannot contain both of the mappings, and so the memory model also invalidates the TLB during a process context switch.

As a result, a context switch that changes the virtual memory map impacts both performance and power consumption.

The problem of synonyms on such hardware is illustrated in Figure 7.10. This is slightly more complex, as the different virtual addresses will both appear in the cache in different places. This can result in confusing effects, because writing through one address may not be visible if read back through the other. This can only be solved by ensuring that the memory model does not map the same physical memory with two virtual addresses at the same time, and that if the virtual address needs to be changed then the cache data must be flushed.

Figure 7.10 Synonyms in ARMv5

As with almost all new high-specification CPUs, the code and data caches are separated - this is sometimes referred to as a Harvard cache. (In Chapter 17, Real Time, I discuss the performance implications of different cache types.) Aside from general benefits that the Harvard cache is known to provide, the moving memory model specifically uses it to ensure that the instruction cache does not need to be managed on a context switch.

Memory model concept

The moving memory model uses a single page directory for the whole OS, and provides multiple overlapped process address spaces by moving blocks of memory (changing their virtual address) during a context switch. This is how the memory model derives its name.

Simple arithmetic shows that each page directory entry maps 1 MB of address space. Changing the domain specified in the entry provides easy control of the access policy for this memory range. The memory model can move this address range, whilst simultaneously changing the access permissions by writing a new entry in the page directory and resetting the old entry (two 32-bit writes).

For example, suppose we have a page table that maps a set of pages, each with user no-access, supervisor read/write permissions. Now we create a page directory entry in the second position in a page directory, allocate it to domain 0 and set the DACR to ignore permissions for this domain. We can now access the pages using the address range 0x00100000-0x001fffff with full access from both user and supervisor modes as the permission bits are being ignored. On a context switch we remove this page directory entry and create a new one in the seventh position, this time setting the domain to 1 (with the DACR set to check-permissions for domain 1). After clearing the TLB entry for the old address range we can no longer use address 0x00100000 to access the memory. However, we can now use 0x00600000, but only from supervisor mode as the permission bits are now being checked. Figure 7.11 shows the effect of making these simple changes to the page directory.
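In an ARM page directory entry the domain occupies bits 8 to 5, and the DACR holds a two-bit field per domain: 00 means no access, 01 client (check page permissions) and 11 manager (ignore page permissions). A sketch of the steps just described, where pageDirectory, SetDACR() and FlushTLBEntry() are hypothetical stand-ins for the real hardware accesses:

typedef unsigned long TUint32;
extern TUint32* pageDirectory;            // hypothetical: the single page directory
extern void SetDACR(TUint32 aValue);      // hypothetical register accessors
extern void FlushTLBEntry(TUint32 aAddr);

void MoveMappingOnContextSwitch()
    {
    // Move the 1 MB mapping from entry 1 (0x00100000) to entry 6 (0x00600000),
    // reassigning it from domain 0 to domain 1: just two 32-bit writes.
    TUint32 pde = pageDirectory[1];
    pde = (pde & ~(0xFu << 5)) | (1u << 5); // domain field is bits [8:5]
    pageDirectory[6] = pde;                 // map at the new address
    pageDirectory[1] = 0;                   // unmap the old address
    // DACR (normally configured once): domain 0 manager, domain 1 client.
    SetDACR((3u << 0) | (1u << 2));
    FlushTLBEntry(0x00100000);              // the old translation is now stale
    }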

This is the essential idea that we use to provide each process with identical virtual address spaces, but distinct and protected memory pages. During a context switch, we first move the old process's memory out of the common execution address, making it inaccessible to user mode at the same time, and then we move the new process's memory to the common execution address and make it accessible.

This is also one of the motivations behind the concept and implementation of the chunk, described in Section 7.3.1, which is the unit of moving memory within the higher layers of this memory model.

Unfortunately, as with many good ideas, this one is not without its drawbacks. If you remember, I earlier described the problem that can be caused by mapping memory at different virtual memory addresses, even when spread out in time - and that the solution is to flush the cache. This means that all modified data is copied back to main memory, and all cached data is discarded and must be reloaded from main memory when required. As a result, a process context switch with this memory model is dominated by the time spent flushing the cache, and is typically 100 times slower than a thread context switch (within the same process). There is little hope that future processors and memory will make cache flushing faster, as any performance gained is lost to flushing ever larger caches.

Figure 7.11 Remapping memory by modifying the page directory

The moving memory model employs some of the other ARMv5 features, such as domains and split caches, to reduce the requirement for cache flushing. However, it cannot be entirely removed and still constitutes a measurable proportion of the execution time for Symbian OS.

It is interesting to note that ARMv5 provides an alternative to multiple page directories or moving page tables - the Fast Context Switch Extensions. In this mode, the MMU translates the virtual address before doing regular address translation using the page tables, and can eliminate the expensive cache flush on a context switch. In this mode, the MMU will replace the highest 7 bits of the virtual address with the value in the FCSE PID register, if these bits were all zero. This means that virtual addresses in the range 0x00000000 to 0x02000000 will be mapped to some other 32 MB range before the page tables are walked. On a process context switch all that is needed is to change the FCSE PID. Although popular with other open operating systems using ARMv5, this limits the system to 127 processes (the number of distinct, non-zero FCSE PID values) and each process to a virtual address space of 32 MB including code. The need for the kernel to use some of the memory map for other purposes can reduce these limits significantly. These limitations were not acceptable for Symbian OS.
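The address modification that the FCSE applies is trivial, which is what makes it fast. Modeled in code (illustrative only), the MMU does the following before walking the page tables:

typedef unsigned long TUint32;

// FCSE: relocate the low 32 MB of the address space by the current PID.
TUint32 FcseTranslate(TUint32 aVirtual, TUint32 aFcsePid /* 0..127 */)
    {
    if (aVirtual < 0x02000000u)             // top 7 bits all zero?
        return aVirtual | (aFcsePid << 25); // substitute the FCSE PID
    return aVirtual;                        // other addresses pass unchanged
    }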

Design

As I have already described, the moving memory model maintains a single page directory for the whole OS. The rest of this section provides a high-level view of the moving memory model design.

Address spaces

Allocated memory is always owned by a single page table, and the page table will be owned by a single chunk. Thus a chunk is always responsible for a whole number of megabytes of virtual address space, and its base address is always aligned on a megabyte boundary.

Each chunk is always present in exactly one place in the memory map, and so all of the page tables that it owns will be referenced from consecutive page directory entries. One consequence of this is that there can never be more than 4096 distinct page tables at any one time.

The previous rule is not directly obvious from the requirements. Memory that is not accessible to the currently executing process does not always need to be in the memory map. However, much of Symbian OS execution involves inter-process activity and the implementations of the client/server system and thread I/O requests rely on having access to the memory of a non-current process. If we ensure that this memory is directly accessible to the kernel, we can simplify these algorithms considerably.

By default, the data chunks for a process are moving chunks, and these have two address ranges allocated for them. The first is the data section address (or run address), which is the virtual address used by the process that creates the chunk; this range is as large as the maximum size for the chunk. Reserving the full range is necessary because the virtual address of the chunk is guaranteed never to change as the chunk grows or shrinks. When a moving chunk is mapped into a second process, the memory model does not guarantee that the virtual address in the second process matches that in the first one. Thus the data section address is specific to each process that has access to the chunk.

The second address is the kernel section address (or home address) which is the virtual address occupied by the chunk when it is both inaccessible to the currently running process and the current process is not fixed - see the following optimizations section for an explanation of fixed processes. Page directory entries are only reserved in the kernel section for the currently committed memory in the chunk. If additional page tables are added to the chunk later, a new kernel section address will be allocated - this is not a problem as the kernel section address is only ever used transiently for inter-process memory accesses.

The memory model manages the chunks that are accessible to each process by maintaining for each process an address ordered list of all data chunks that are mapped by the process. Each entry on this list also contains the data section address for that chunk in the process. The chunk itself knows about its kernel section address, and whether it is currently mapped in the kernel section, or if it is mapped in the data section.

Protection

Using the memory moving technique shown in Figure 7.11, two domains are used to provide protection between the currently running process and the memory that should be inaccessible to the process, such as kernel memory or that belonging to other processes. Although it might be more obvious for the memory model to just use page permissions to achieve this, modifying the page permissions during a context switch would require changing every entry of the affected page tables - the scheme using domains only requires that the memory model modifies a handful of page directory entries.

Most chunks use page permissions that deny access from user mode, but allow read/write access from supervisor modes. Chunks that are not accessible to the current user process are allocated to domain 1, while those that are accessible to the current user process are allocated to domain 0. The domain access control register is set to allow all access to domain 0 (ignoring the permission bits), but makes the MMU check permissions for access to domain 1. This has the desired effect of allowing a process to access its own memory from user mode (chunks in the data section), but other memory is inaccessible except from supervisor modes. Some chunks have slightly different permissions to improve the robustness of Symbian OS:

  • Once loaded, all chunks containing code are marked as read-only, to prevent inadvertent or malicious modification of software
  • The mappings for the RAM drive are allocated to domain 3. This domain is set to no-access by default, preventing even faulty kernel code from damaging the disk contents. The RAM disk media driver is granted access to this domain temporarily when modifying the disk contents.

Figure 7.12 illustrates the effective access control provided by the moving memory model, compared with the ideal presented earlier in the chapter. Note that the only compromise for user-mode software is the visibility of program code that has not been explicitly loaded by the program. However, this memory model does make all memory directly accessible from kernel mode. Kernel-mode software must already take care to ensure that user processes cannot read or corrupt kernel memory through the executive interface, so extending that care to guard against incorrect access to another process does not add any significant complexity to the OS.

Figure 7.12 Memory accessible to a thread in the moving memory model
Optimizations

Every time an operation requires the moving of at least one chunk, the memory model must flush the relevant cache and TLB - therefore the memory model design attempts to reduce the number of chunks that need to be moved.

  • A global chunk is used to allocate code segments. Thus code executes from the same address in all processes. Additionally, code loaded by one process is visible to the entire OS - although this is a compromise for system robustness, it avoids a very expensive operation to adjust the access permissions for all RAM-loaded code, and flush TLBs. Together this ensures that the memory model never needs to flush the I-cache on a context switch, significantly improving system performance
  • Some chunks are fixed in memory, and their virtual address never changes. In these cases, we use domains to control access to the chunk by changing the DACR for the processes that are allowed access. This can reduce the number of chunks that need to be moved on a context switch
  • Important and heavily used server processes can be marked as fixed processes. Instead of allocating the data chunks for these processes in the normal data section, the memory model allocates them in the kernel section and they are never moved. The memory model allocates an MMU domain, if possible, to provide protection for the process memory. The result is that a context switch to or from a fixed process does not require a D-cache flush and may even preserve the data TLB. One consequence of using this feature is that we can only ever run a single instance of a fixed process, but this is quite a reasonable constraint for most of the server processes in the OS. Typical processes that we mark as fixed are the file server, comms server, window server, font/bitmap server and database server. When this attribute is used effectively in a device, it makes a notable improvement to overall performance.
Memory map

Figures 7.13 and 7.14 show how the virtual address space is divided in the moving memory model. These diagrams are not to scale, and very large regions have been shortened - otherwise there would only be three or four visible regions!

Algorithms

In trying to understand how this memory model works it is useful to walk through a couple of typical operations to see how they are implemented.

Process context switch

The memory model provides the thread scheduler with a callback that should be used whenever an address space switch is required. I will describe what happens when the scheduler invokes that callback.

Switching the user-mode address space in the moving memory model is a complex operation, and can require a significant period of time - often more than 100 microseconds. To reduce the impact of this slow operation on the real-time behavior of EKA2, the address space switch is carried out with preemption enabled.

Figure 7.13 Full memory map for moving memory model

Figure 7.14 Memory management detail for moving memory model

The user-mode address space is a shared data object in the kernel, as more than one thread may wish to access the user-mode memory of a different process, for example during IPC or device driver data transfers. Therefore, changing and using the user-mode address space must be protected by a mutex of some form - the moving memory model uses the system lock for this. This decision has a significant impact on kernel-side software, and the memory model in particular - the system lock must be held whenever another process's user-mode memory is being accessed to ensure a consistent view of user-mode memory.

The context switch is such a long operation that holding the system lock for the entire duration would have an impact on the real time behavior of the OS, as kernel threads also need to acquire this lock to transfer data to and from user-mode memory. We tackle this problem by regularly checking during the context switch to see if another thread is waiting on the system lock. If this is the case, the context switch is abandoned and the waiting thread is allowed to run. This leaves the user-mode address space in a semi-consistent state: kernel software can locate and manipulate any user-mode chunk as required, but when the user-mode thread is scheduled again, more work will have to be done to complete the address space switch.

The fixed process optimization described in the previous section relies on the memory model keeping track of several processes. It keeps a record of the following processes:

Variable Description
TheCurrentProcess This is a generic kernel value: the process that owns the currently scheduled thread.
TheCurrentVMProcess This is the user-mode process that last ran. It owns the user-mode memory map, and its memory is accessible.
TheCurrentDataSectionProcess This is the user-mode process that has at least one moving chunk in the common address range - the data section.
TheCompleteDataSectionProcess This is the user-mode process that has all of its moving chunks in the data section.

Some of these values may be NULL as a result of an abandoned context switch, or termination of the process.

The algorithm used by the process context switch is as follows (a code sketch follows the steps):

  1. If the new process is fixed, then skip to step 6
  2. If the new process is not TheCompleteDataSectionProcess then flush the data cache as at least one chunk will have to be moved
  3. If a process other than the new one occupies the data section then move all of its chunks to the home section and protect them
  4. If a process other than the new one was the last user process then protect all of its chunks
  5. Move the new process's chunks to the data section (if not already present) and unprotect them. Go to step 8
  6. [Fixed process] Protect the chunks of TheCurrentVMProcess
  7. Unprotect the chunks of the new process
  8. Flush the TLB if any chunks were moved or permissions changed.
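
Expressed as a code sketch, the switch looks something like this - the helper names here are invented for clarity and are not the real memory model functions:

// Illustrative sketch of the moving model address space switch.
// TheCompleteDataSectionProcess etc. are the variables described above;
// the helpers are hypothetical names for the operations in steps 2-8.
void SwitchAddressSpace(DProcess* aNew)
    {
    if (!aNew->IsFixed())                                       // step 1
        {
        if (aNew != TheCompleteDataSectionProcess)
            FlushDataCache();                                   // step 2
        if (TheCurrentDataSectionProcess && TheCurrentDataSectionProcess != aNew)
            MoveChunksHomeAndProtect(TheCurrentDataSectionProcess);  // step 3
        if (TheCurrentVMProcess && TheCurrentVMProcess != aNew)
            ProtectChunks(TheCurrentVMProcess);                 // step 4
        MoveChunksToDataSectionAndUnprotect(aNew);              // step 5
        }
    else
        {
        if (TheCurrentVMProcess)
            ProtectChunks(TheCurrentVMProcess);                 // step 6
        UnprotectChunks(aNew);                                  // step 7: via the DACR
        }
    if (ChunksMovedOrPermissionsChanged())
        FlushTlb();                                             // step 8
    }
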
Thread request complete

This is the signaling mechanism at the heart of all inter-thread communications between user-mode programs and device drivers or servers. The part related to the memory model is the completion of the request status, which is a 32-bit value in the requesting thread's user memory. The signaling thread passes the address, and the value to write there, to the <tt style="font-family:monospace;">DThread::RequestComplete()</tt> method, which is always called with the system lock held.

In the moving memory model, this is a fairly simple operation because all of the user-mode memory is visible in the memory map, either in the data section or in the home section. This function looks up the provided address in the chunks belonging to the process, and writes the value at the location where that memory is currently mapped.

The multiple model

This memory model was developed primarily to support - and exploit - the new MMU developed for ARMv6. However, it is more generally applicable than the moving memory model and can also be used with MMUs found on other popular processors such as Intel x86 and Renesas SuperH.

Hardware

As with the ARMv5 memory architecture, I refer you to the ARM Architecture Reference Manual for the full details of the level 1 memory sub-system on ARMv6.

Virtual address mapping

As with ARMv5, the top-level page directory still contains 4096 entries. However, in contrast with ARMv5, the page directory on ARMv6 can be split into two pieces. Writing to an MMU control register, TTBCR, sets the size of the first piece of the directory to contain the first 32, 64, ..., 2048 or 4096 page directory entries, with the remainder being located in the second page directory. To support this, the MMU now has two TTBR registers, TTBR0 and TTBR1. The MMU also has an 8-bit address space identifier (ASID) register. If this is updated to contain a unique value for each process, and the memory is marked as being process-specific, then TLB entries created from this mapping will include the ASID. As a result, we do not need to remove these TLB entries on a context switch - the new process has a different ASID and so will not match the old process's TLB entries.

Protection

Although ARMv6 still supports the concept of domains, this feature is now deprecated on the assumption that operating systems will opt to use the more powerful features of the new MMU. However, ARM have enhanced the page table permissions by the addition of a never-execute bit. When set, this prevents the page from being used for instruction fetches. Used appropriately, this can prevent stack and heap memory being used to execute code, which in turn makes it significantly harder to create effective security exploits such as buffer over-run attacks.

Caches

The cache in ARMv6 has also been through a complete overhaul, and a virtually indexed, physically tagged cache replaces the virtually indexed, virtually tagged cache in ARMv5. The cache is indexed using the virtual address, which enables the evaluation of the set of cache lines that could contain the data to run in parallel with the address translation process (hopefully in the TLB). Once the physical address is available, this is used to identify the exact location of the data in cache, if present. The result of using a physically tagged cache is very significant - the problems associated with multiple mappings are effectively removed. When the same virtual address maps to different physical addresses (a homonym) the cache can still store both of these simultaneously because the tags for the cache entries contain distinct physical addresses (see Figure 7.15). Also, two virtual addresses that map to the same physical address (a synonym) will both resolve to the same entry in the cache due to the physical tag and so the coherency problem is also eliminated. This rather nice result is not quite the whole picture - the use of the virtual address as the index to the cache adds another twist for synonyms which I will describe more fully later.

Memory model concept

The features of the ARMv6 MMU enable a number of the drawbacks of the moving memory model to be eliminated without compromising on the device constraints or OS requirements.

The split page directory of ARMv6 allows us to revisit the common idea of having one page directory for each process. This time, instead of requiring 16 KB for each process, we can choose to have just a part of the overall page directory specific to each process and the rest can be used for global and kernel memory. EKA2 always uses the top half (2 GB) for the kernel and global mappings, and the bottom half for per-process mapping. This reduces the per-process overhead to a more acceptable 8 KB, but retains up to 2 GB of virtual address space for each process.

For devices with smaller amounts of RAM (<32 MB) we go further and map only the bottom 1 GB for each process, reducing the overhead to just 4 KB per process. The model takes its name from this use of multiple page directories.

The multiple memory model makes use of ASIDs to resolve the problem of mapping the same virtual address to different physical addresses, while the physically tagged cache ensures that multiple mappings of virtual or physical addresses can be correctly resolved without needing to flush data out of the cache. Figure 7.15 shows how these features allow the TLB and cache to contain multiple process memory contexts simultaneously, even when the processes map the same virtual address.

Figure 7.15 Homonyms in ARMv6

When compared with the moving memory model, this design:

  • Still provides up to 2 GB of per-process virtual address space
  • Requires moderate additional memory overhead for each process (4 or 8 KB)
  • Has no requirement to flush the caches or TLBs on a context switch
  • Does not make loaded program code globally visible
  • Marks memory that holds data so that it cannot be executed as code.

The performance improvement that comes as a result of eliminating the cache flush on context switch is the most significant benefit of this memory model. It also ensures that this is a better memory model for the future, as we will see continuing increases in cache size and CPU to memory performance ratio.

The last two points in the previous list improve the robustness of the OS as a whole, but also increase the protection provided for platform security, which you can read more about in Chapter 8, Platform Security.

Revisiting the synonym problem

Although the multiple memory model is an improvement on the moving memory model, it is not without its own complexities. The most awkward issue is related to the solution for the synonym problem - providing a second or alias virtual address for the same physical address. The problem stems from the use of the virtual address as the initial index into the cache to select the small set of lines from which to determine an exact match using the physical address. Figure 7.16 primarily illustrates the ideal situation with a synonym mapping - where the cache resolves both virtual addresses to the same cache line and data.

However, the cache indexing is done using the lower bits of the virtual address. For obvious reasons, the bottom 12 bits of the virtual address and physical address are always identical (when using 4 KB pages). What could happen if the cache uses 13 bits for the index?

Suppose that the page at physical address 0x00010000 was mapped by two virtual addresses: 0x10000000 and 0x20001000. Then we write to the memory at 0x10000230, which results in an entry in the cache in the index set for 0x230 (low 13 bits) with the physical tag 0x00010230. If we now try to read the address 0x20001230 (which according to our mapping is the same memory), this will look up entries in the cache index set for 0x1230 and not find the previous entry. As a result the cache will end up containing two entries which refer to the original physical address. The dotted entry in the cache in Figure 7.16 illustrates this effect. This is the very problem we thought we had eliminated.

If the cache is small enough or the index sets within the cache large enough (commonly known as the cache associativity), then no more than 12 bits are used for the virtual index. In this case, the problem does not arise as there is a unique set within the cache for every physical address. If 13 or more bits of the virtual address are used for the cache index, then there can be multiple index sets in which a physical address may be found - which one depends on the virtual address used to map it. The one or more bits of virtual address that select which of these sets are said to determine the color of the page.

Figure 7.16 Synonyms in ARMv6

The solution adopted by EKA2 for this problem is to ensure that all virtual to physical mappings share the same color - that is, all of the virtual addresses used to map a given physical page must have the same values for the bits that determine the color of the page. Thus every cache lookup using any of these virtual addresses will resolve to the same entry in the cache.
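
A small standalone sketch makes the color arithmetic concrete. It assumes a cache that takes its index from the low 13 bits of the virtual address with 4 KB pages, so bit 12 is the color bit; the addresses are the ones from the example above:

#include <cstdint>
#include <cstdio>

// With 4 KB pages the low 12 bits of virtual and physical address match;
// a 13-bit cache index therefore takes one extra bit (bit 12) from the
// virtual address - that bit is the page "color".
const uint32_t KIndexMask = 0x1FFF;   // low 13 bits form the cache index
const uint32_t KColorMask = 0x1000;   // bit 12: the color bit

int main()
    {
    uint32_t va1 = 0x10000230;  // first mapping of physical 0x00010230
    uint32_t va2 = 0x20001230;  // synonym mapping of the same byte

    printf("index1=0x%04x index2=0x%04x\n",
           va1 & KIndexMask, va2 & KIndexMask);   // 0x0230 vs 0x1230: different sets!

    // EKA2's rule: a second mapping is only legal if the colors agree.
    bool sameColor = ((va1 ^ va2) & KColorMask) == 0;
    printf("same color: %s\n", sameColor ? "yes" : "no");  // "no" - disallowed
    }
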

Design

In some respects, the design of the multiple memory model is more straightforward, as there is never the need to work out where some memory happens to be at a given moment in time. If you know which process has the memory and you have the virtual address, it is just a matter of inspecting the process's page directory to locate the memory - remembering, of course, that the addresses in the page directory are physical addresses and translation to a virtual address is required to inspect the page table.

In this model, the concept of a chunk is less fundamental to the overall design. The design does not require such an object to exist - but as the main interface between the memory model and the kernel is in terms of chunks due to their significance for the moving memory model, they still form an integral part of this memory model.

Address spaces

The kernel process owns the global page directory, which is referenced by TTBR1. All of the pages mapped by this page directory are marked as global, which means that the MMU will create global TLB entries that can be used by any process.

The memory model allocates an ASID for each user-mode process. ARMv6 only supports 256 distinct ASIDs and thus limits the OS to running at most 256 concurrent processes. This is considered to be sufficient! This also provides a limit for the number of per-process, or local, page directories - so these are conveniently allocated in a simple array. Memory is only committed for a local page directory when the ASID is in use by a process. When a process is running, TTBR0 is set to the local page directory for the process.

Depending on its type, the memory model will allocate a chunk in the global page directory or in the local one. Examples of memory that is allocated in the global directory:

  • The global directory maps the XIP ROM as all processes must see this code
  • All processes share the locale data so this is allocated in the global directory
  • Any thread that is running in supervisor mode should have access to kernel data, so this is allocated in the global directory.

Examples of memory that is allocated in the local directory:

  • Stack and heap chunks that are private to the process
  • Shared chunks that may also be opened by other processes
  • Program code that is loaded into RAM.

The last two of these examples are memory that the operating system can map into more than one process. Unlike the moving memory model, however, chunks that can be shared between user processes always have the same base address in all processes. The multiple memory model achieves this by using a single address allocator for all memory that can be shared. This also ensures that shared memory does not suffer from the coloring problem, as the virtual address is common to all processes.

In the moving memory model, the DProcess objects maintain a collection of the chunks that they currently have access to. This is also necessary to ensure that on a context switch the chunk is made accessible to the program, as well as to allow address lookup when the process is not in context. In the multiple model, this collection still exists, but only provides a means to track the number of times a given chunk has been opened within the process, so that it can be removed from the memory map only after the last reference is closed. The process's local page directory maps the chunk to provide access when the program is running, and to provide lookup for the memory model when the process is not in context.

The model also keeps an inverse mapping from a shared chunk to the processes that have opened it, so that the memory model can reflect adjustments to the chunk size in all affected page directories.

Protection

Providing process memory protection with the multiple model is simpler than with the moving model, which required domains to make it efficient. Multiple page directories provide most of the protection: memory that is private to a process is not present in the memory map when another process is running. The use of ASIDs and the physically tagged cache ensures that all cache data and mappings are only applied to the owning process. Thus, unlike the moving memory model, the multiple model applies full access permissions to memory mapped by the local page directory.

The model applies supervisor-only permissions to kernel data mapped by the global page directory, so that only supervisor modes can access it. The model also sets the never-execute permission on all data memory, such as stacks and heaps. This prevents buffer-over-run attacks being used to launch malicious code in the device.

Non-XIP user-mode program code is now mapped in the local page directory rather than globally. This allows the memory model to restrict the visibility of such code to just the processes that have explicitly loaded it. The result is that memory access matches the ideal situation described in Section 7.2.1.3.

Memory map

Figures 7.17 and 7.18 show how the multiple memory model divides virtual address space. I have depicted the case in which the local page directory is 8 KB in size. Again, these diagrams are not to scale.

A final word on chunks

Some might suggest that the chunk is a very high-level interface to provide the primary means of describing and controlling the memory allocated and mapped by a process, and that a simpler, lower-level interface would provide flexibility with less complexity. The development of the disconnected chunk illustrates the need for increasing flexibility and support for alternative allocation strategies.

Figure 7.17 Full memory map for the multiple memory model

Within the multiple memory model, the handling of program code that is loaded into memory makes some attempt to escape from the notion that all memory belongs to a chunk.

However, because the moving memory model depends on the use of chunks to describe all of its memory, chunks will continue to be the primary means of describing the memory mapped by a process - and the abstract interface between generic kernel software and the memory model - for as long as Symbian OS supports ARMv5. Of course, even when no longer demanded by the underlying memory hardware, the chunk will always form part of the user-mode interface for memory management.

Algorithms

I will describe the same operations for the multiple memory model as I did for the moving model to illustrate the design.

Figure 7.18 Memory management detail for the multiple memory model
Process context switch

The design of ARMv6 ensures that the address space switch is now a simple operation. It is fast enough that it can be executed with pre-emption disabled, making a process switch only marginally slower than a simple thread switch. The process context switch involves modifying two MMU registers:

  • TTBR0 is set to the page directory for the new process
  • CONTEXTID is set to the ASID for the new process.

The only extra work occurs if the new process contains user-mode self-modifying code chunks, and was not the last such process to run, in which case this function invalidates the dynamic branch prediction table before returning.
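
In code-sketch form (the register helpers and accessors below are invented names standing in for the CP15 writes and memory model state):

// Illustrative sketch only - not the actual kernel source.
void SwitchUserAddressSpace(DProcess* aNew)
    {
    SetTTBR0(aNew->LocalPageDirPhysical());   // local page directory for aNew
    SetContextID(aNew->Asid());               // new TLB entries are tagged with this ASID
    if (aNew->HasSelfModifyingCodeChunks() && aNew != LastSelfModProcess)
        InvalidateBranchPredictors();         // the rare extra case described above
    }
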

Thread request complete

In contrast, this is now a more complex operation than the equivalent in the moving memory model. This is because the memory to which we need to write is not visible in the current address space. This function can afford to use a different, faster, technique for writing into another address space when compared with a general IPC data copy, because it doesn't need to simultaneously map both the signalling and requesting process memory. Instead, the current nanokernel thread changes its address space, effectively executing briefly within the memory context of the target thread. The memory model manages this sleight of hand by changing the TTBR0 and CONTEXTID registers to the values for the target thread with interrupts disabled. At the same time, it updates the current thread's iAddressSpace member to ensure that the right memory context is restored if the next operation is preempted. Now that the current thread has jumped into the target process address space, it can just write the result code before restoring the MMU state to return to the original address context. Some care must be taken when writing to the request status to catch the use of an invalid memory address. The system lock is held and so RequestComplete() traps any exception and then processes the failure once the address space has been restored.
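
A sketch of the technique - not the actual kernel source, and with the register helpers and member names invented - looks like this:

// Complete a request status in another process's address space by
// temporarily "becoming" that process.
void CompleteInTargetSpace(DThread* aTarget, TRequestStatus* aStatus, TInt aReason)
    {
    DProcess* target = aTarget->iOwningProcess;
    TInt irq = NKern::DisableAllInterrupts();
    SetTTBR0(target->LocalPageDirPhysical());   // invented helper: CP15 write
    SetContextID(target->Asid());               // invented helper: CP15 write
    NKern::CurrentThread()->iAddressSpace = target;  // restored correctly if preempted
    NKern::RestoreInterrupts(irq);
    *aStatus = aReason;   // a plain store - we are now "inside" the target
    // ...switch back to the original address space in the same way; the
    // store itself is wrapped in an exception trap, as described above
    }
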

The direct model

This memory model disables the MMU and the OS assumes a direct mapping from virtual address to physical address. Although this enables Symbian OS to run on hardware that has no MMU, Symbian OS does not support this option in products as the lack of an MMU presents too many limitations for the OS as a whole:

  • The manufacturer must divide the physical memory at build time between all the running processes in the OS, as memory chunks cannot be grown and shrunk without an MMU. This makes it difficult to support a variety of different memory-hungry use cases in a single device without supplying an excessive amount of RAM
  • There is no memory protection between different user-mode processes or between user and kernel software - making the system significantly less robust. It would certainly be unwise to consider allowing such a device to support installation of additional software after production.

However, there are times when it is useful to be able to run part of Symbian OS - in particular the kernel and file server - with the MMU disabled, such as when porting EKA2 to a new CPU or a new CPU family.

Such porting tasks are easier if the MMU is initially disabled to stabilize the essential parts of the board support package without debugging new memory hardware at the same time. Once EKA2 is running on the hardware, the porting team can enable the MMU and tackle any memory related problems independently.

The emulator model

As one might expect, we developed this memory model specifically to support the emulator hosted by the Windows operating system. To achieve the objectives set for the emulator regarding development and demonstration, we made some compromises regarding true emulation of the behavior of the hardware memory models.

It is here in the memory model that we find the most significant differences between target and emulator kernels.

The emulator does not run on the bare metal of the PC hardware, but is hosted as a process within the Windows operating system. As a result, the low-level memory support in the emulator memory model uses standard Windows APIs for basic memory allocation.

Virtual address mapping

The emulator runs as a single Win32 process, with the consequence that it only has a 2 GB virtual address range for all memory allocation. Compare this with a real device, where each application within the OS typically has approximately 1 GB of virtual address space for its own use.

To provide the programming model of the chunk, the emulator uses the low-level VirtualAlloc() Windows API, which can reserve, commit and release pages of the process address space. This also enables an emulation of the page-wise allocation of RAM to a chunk, and allows some approximation to be made of the amount of RAM being used by the OS at any time. However, the emulator does not allocate all memory in this way.
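
For illustration (this is a sketch of the pattern, not the emulator's actual code), VirtualAlloc() supports chunk-like semantics by reserving the maximum size up front and committing pages on demand:

#include <windows.h>

// Sketch of an emulated chunk: reserve address space once, commit later.
class EmulatedChunk
    {
public:
    bool Create(SIZE_T aMaxSize)
        {   // reserve address space only - no RAM is committed yet
        iBase = VirtualAlloc(NULL, aMaxSize, MEM_RESERVE, PAGE_NOACCESS);
        return iBase != NULL;
        }
    bool Commit(SIZE_T aOffset, SIZE_T aSize)
        {   // commit pages within the reservation, like RChunk::Commit()
        return VirtualAlloc((char*)iBase + aOffset, aSize,
                            MEM_COMMIT, PAGE_READWRITE) != NULL;
        }
    void Decommit(SIZE_T aOffset, SIZE_T aSize)
        {   // return the pages to Windows but keep the reservation
        VirtualFree((char*)iBase + aOffset, aSize, MEM_DECOMMIT);
        }
private:
    void* iBase;
    };
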

The emulator utilizes the Windows DLL format and the Windows loader - LoadLibrary() and friends - to enable standard Windows IDEs to be used for debugging of Symbian OS code in the emulator. As a result, Windows allocates and manages the memory used for code segments and the static data associated with DLLs and EXEs.

The emulator uses native Windows threads to provide Symbian OS threads, again enabling standard development tools to debug multithreaded Symbian code. This results in Windows allocating and managing the software execution stack for the thread. As is typical for Windows threads, these stacks grow dynamically and can become very large - unlike the fixed size, fully committed stacks on target hardware.

Protection

The emulator runs within a single Windows process and thus within a single Windows address space. All memory committed to the emulator is accessible by any Symbian OS process within the emulator. As a result, the emulator provides no memory protection between Symbian OS processes, or between Symbian user and kernel memory.

Technically, it would be possible to use another Windows API, VirtualProtect(), which allows a program to change the access permissions of a region of committed memory - for example, to temporarily make some memory inaccessible. The emulator could use this function to allow the current emulated Symbian OS process to access only its own memory chunks, providing some level of memory isolation between Symbian OS processes within the emulator. However, this would result in a poor multi-threaded debug experience, as the memory for much of the OS would be unreadable by the debugger.

Programmer APIs

In Sections 7.2, MMUs and caches, and 7.3, The memory model interface, we looked at the very fundamental blocks of memory: the page and the objects used in the interface between the generic kernel and the memory model. Symbian OS provides a number of higher-level memory concepts and objects to provide user-mode and kernel-mode programmers with the right level of abstraction and control when allocating and using memory:

  • The chunk forms the basic API for almost all memory allocation and ownership both inside the kernel and within user-mode processes
  • One of the main consumers of chunks is the RHeap allocator class, which provides a free store allocator on top of a chunk. There are versions for both user- and kernel-side software. The standard C++ and C allocation functions use this allocator by default
  • Kernel-mode software also has lower-level APIs designed for allocating memory, which are suitable for direct device or DMA access. These include physically contiguous RAM, shared I/O buffers and shared chunks.

Chunks

In Section 7.3.1, we looked at the principles of chunks, and how the memory model provides support for them. In this section we look at the programming interface for chunks.

Outside of the kernel executable, EKERN.EXE, kernel-mode software only uses chunks directly for allocation when creating shared chunks, and I will discuss these later in this section. The user-mode API for chunks is the RChunk class:

class RChunk : public RHandleBase
    {
public:
    enum TRestrictions
        {
        EPreventAdjust = 0x01
        };

public:
    inline TInt Open(...);
    IMPORT_C TInt CreateLocal(...);
    IMPORT_C TInt CreateLocalCode(...);
    IMPORT_C TInt CreateGlobal(...);
    IMPORT_C TInt CreateDoubleEndedLocal(...);
    IMPORT_C TInt CreateDoubleEndedGlobal(...);
    IMPORT_C TInt CreateDisconnectedLocal(...);
    IMPORT_C TInt CreateDisconnectedGlobal(...);
    IMPORT_C TInt Create(...);
    IMPORT_C TInt SetRestrictions(TUint aFlags);
    IMPORT_C TInt OpenGlobal(...);
    IMPORT_C TInt Open(RMessagePtr2,...);
    IMPORT_C TInt Open(TInt);
    IMPORT_C TInt Adjust(TInt aNewSize) const;
    IMPORT_C TInt AdjustDoubleEnded(TInt aBottom, TInt aTop) const;
    IMPORT_C TInt Commit(TInt anOffset, TInt aSize) const;
    IMPORT_C TInt Allocate(TInt aSize) const;
    IMPORT_C TInt Decommit(TInt anOffset, TInt aSize) const;
    IMPORT_C TUint8* Base() const;
    IMPORT_C TInt Size() const;
    IMPORT_C TInt Bottom() const;
    IMPORT_C TInt Top() const;
    IMPORT_C TInt MaxSize() const;
    inline TBool IsReadable() const;
    inline TBool IsWritable() const;
    };

This follows the standard handle pattern found with all kernel resources. It is a fairly simple API with approximately half of the members being ways to initialize the handle, either as a result of creating a new chunk or by gaining access to an already existing one. The different versions are used to create the different types of chunk and specify the visibility of the chunk. The other half of the class members either provide access to chunk attributes such as the base address (within the calling process address space), or provide the user-mode API to the various chunk adjust methods as already described in Section 7.3.1.
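
For example, a typical use of a local chunk - create it with a small committed size and a larger reserved maximum, then grow it later (error handling abbreviated):

RChunk chunk;
// Commit 4 KB now; reserve 1 MB of address space for future growth
TInt r = chunk.CreateLocal(0x1000, 0x100000);
if (r == KErrNone)
    {
    TUint8* p = chunk.Base();       // base address in this process
    p[0] = 42;                      // use the committed memory
    r = chunk.Adjust(0x2000);       // grow the committed size to 8 KB
    chunk.Close();                  // release the handle
    }
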

Aside from the use of global chunks to share memory between processes, programmers only rarely use a chunk directly to allocate memory. More often they utilize them as the underlying memory management for some form of allocator.

Free store allocators and heaps

An allocator is an object that services requests to acquire and release memory for a program. Behind every call to the new and delete operators in C++, or to the malloc() and free() functions in C, is an allocator. This object is concerned with taking the memory provided by the OS, usually in multi-page sized blocks, and dividing it up so that an application can use smaller pieces of it in an efficient manner.

Allocator APIs

The essential allocation support required for C++ and C is very similar, and in particular an allocator that supports standard C programs is good enough to implement support for C++. The key services of an allocator are just these three functions:

malloc()
operator new()
    Allocate and return a block of memory of at least the requested size in bytes, or return NULL if the request cannot be satisfied. The allocator must ensure that it meets the alignment requirements of all object types. For example, the ABI for the ARM Architecture requires 8-byte alignment.

free()
operator delete()
    Release a block of memory previously allocated using malloc() or realloc(). Following this call, the memory block must not be used by the program again.

realloc()
    Grow or shrink a memory block previously allocated using malloc() or realloc(), preserving its contents, and return the newly reallocated block. Note that this could trivially be implemented using malloc(), memcpy() and free(), but some allocation schemes may be able to satisfy this request in place, avoiding the potentially expensive memory copy.

The last of these functions is clearly optional, and has no parallel in the C++ allocation operators. Of course, C++ also allows the programmer to provide specialized allocation services for a class by over-riding the default implementation of operator new - perhaps to improve performance, meet a strict alignment constraint, or to use a specific type or part of the physical memory.

This simple API does not describe the behavior of a free store allocator with multiple threads. The language standards do not define the behavior in this situation because there would be a performance penalty to using a thread-safe allocator in a single-threaded program. Thus the question of thread-safety is left to the implementation to determine. I will come back to how Symbian OS tackles this problem a little later.

Allocator strategies

If we examine the basic problem of dividing up pages of RAM into different sized pieces, we find that there are several different techniques for structuring and dividing the memory for allocation, and different algorithms for selecting exactly which portion of memory will be used to satisfy a particular allocation request.

Different allocation techniques have different ways of organizing their memory and acquiring and releasing it from the operating system. In Symbian OS, an allocator is most likely going to use a chunk to provide the lower-level allocation, and will pick the type of chunk that best fits the allocation strategy and usage pattern. Here are some examples:

  • Many free store allocators - that is, those supporting operator new() in C++ or malloc() and free() in C - assume that the storage is a single contiguous address range, and that requests for additional pages of memory extend the current committed memory at the top. We can implement these using a standard chunk. The standard heap allocator in Symbian OS is one such allocator
  • Some memory managers for non-native programming systems, such as Java, implement a handle/body system for objects - and effectively require two dynamically re-sizable contiguous memory regions. We can manage these two memory regions in a double-ended chunk, with one region growing upwards and the other downwards
  • More advanced allocators may not require fully contiguous memory regions, and may also be able to release pages of memory back to the OS when no longer used by the program. This may result in better overall memory use in the OS. We use a disconnected chunk to support these.

Why should we bother with so many possible choices of data structure and algorithm for allocators? The simple answer is that there is no ideal allocator. All allocator designs will favor some attributes over others. For example, some provide fast, real-time allocation but have a high memory overhead; others have a minimal memory overhead, but have poor worst-case performance. Different applications may need different allocators to meet their requirements.

Allocators in Symbian OS

We realized that Symbian OS has to achieve two aims with the allocator that it provides:

  1. A good, general purpose allocator provided by default for all programs
  2. The ability to customize or replace the default allocator for applications that have special requirements.

EKA1 met the first of these needs with the RHeap allocator class. EKA2 provides the same choice of default allocator, but now also meets the second need by providing an abstract allocator class. This is the definition of MAllocator in e32cmn.h:

class MAllocator
    {
public:
    virtual TAny* Alloc(TInt)=0;
    virtual void Free(TAny*)=0;
    virtual TAny* ReAlloc(TAny*, TInt, TInt=0)=0;
    virtual TInt AllocLen(const TAny*) const =0;
    virtual TInt Compress()=0;
    virtual void Reset()=0;
    virtual TInt AllocSize(TInt&) const =0;
    virtual TInt Available(TInt&) const =0;
    virtual TInt DebugFunction(TInt, TAny*, TAny*)=0;
    };

The first three members are the basic allocator API that I described earlier. The OS expects several other services from the allocator, as I describe in the following table:

Alloc() Basic allocation function, foundation for malloc() and similar allocator functions.
Free() Basic de-allocation function, basis for free(), etc.
ReAlloc() Reallocation function, basis for realloc(). There is an optional third parameter, to control allocator behavior in certain situations. This enables an allocator to provide compatibility with programs that may incorrectly assume that all allocators behave like the original <tt style="font-family:monospace;">RHeap::ReAlloc()</tt> function.
AllocLen() Return the allocated length for the memory block. This is always at least as much as the memory requested, but is sometimes significantly larger.
Compress() Release any unused pages of memory back to the OS, if possible. This function is deprecated, but retained for EKA1 compatibility. Allocators for EKA2 are expected to do this automatically as a side effect of Free() rather than wait for an explicit request.
Reset() Release all allocated memory - effectively equivalent to Free() on all allocated blocks.
AllocSize() Returns the number of blocks and the number of bytes currently allocated in this allocator.
Available() Returns the number of bytes in this allocator that are unused and the largest allocation that would succeed without requesting more pages of memory from the OS.
DebugFunction() Provide support for additional diagnostics, instrumentation and forced failure of the allocator, typically implemented only in a debug build of Symbian OS.

In practice, however, a concrete allocator will derive from the RAllocator class. This is the class that defines the full behavior expected by the free store API in Symbian OS. It provides commonly used additional functionality to the allocator, such as support for calling User::Leave() on allocation failure, rather than returning NULL. It also defines the forced failure support expected by Symbian OS. Here is the RAllocator class as defined in e32cmn.h:

class RAllocator : public MAllocator
    {
public:
    enum TAllocFail
        {
        ERandom,
        ETrueRandom,
        ENone,
        EFailNext,
        EReset
        };
    enum TDbgHeapType { EUser, EKernel };
    enum TAllocDebugOp { ECount, EMarkStart, EMarkEnd, ECheck, ESetFail, ECopyDebugInfo };
    enum TReAllocMode
        {
        ENeverMove=1,
        EAllowMoveOnShrink=2
        };
    enum TFlags { ESingleThreaded=1, EFixedSize=2 };
    enum { EMaxHandles=32 };

public:
    inline RAllocator();
    TInt Open();
    void Close();
    TAny* AllocZ(TInt);
    TAny* AllocZL(TInt);
    TAny* AllocL(TInt);
    TAny* AllocLC(TInt);
    void FreeZ(TAny*&);
    TAny* ReAllocL(TAny*, TInt, TInt=0);
    TInt Count() const;
    TInt Count(TInt&) const;
    void Check() const;
    void __DbgMarkStart();
    TUint32 __DbgMarkEnd(TInt);
    TInt __DbgMarkCheck(TBool, TInt, const TDesC8&, TInt);
    void __DbgMarkCheck(TBool, TInt, const TUint8*, TInt);
    void __DbgSetAllocFail(TAllocFail, TInt);

protected:
    virtual void DoClose();

protected:
    TInt iAccessCount;
    TInt iHandleCount;
    TInt* iHandles;
    TUint32 iFlags;
    TInt iCellCount;
    TInt iTotalAllocSize;
    };

We are still a step or two away from the APIs that programmers typically use to allocate memory. Symbian OS implements the standard C and C++ allocation functions using static members of the User class:

  • malloc() and operator new() are implemented using User::Alloc()
  • free() and operator delete() are implemented using User::Free()
  • realloc() is implemented using User::ReAlloc()

These User functions need to identify an allocator object to pass on the requests. The User::Allocator() function provides this service, returning a reference to the RAllocator object that is designated as the calling thread's current allocator.

The User class provides more functions related to manipulating and accessing the current allocator. Here is the relevant part of this class API:

class User : public UserHeap
    {
public:
    static TInt AllocLen(const TAny*);
    static TAny* Alloc(TInt);
    static TAny* AllocL(TInt);
    static TAny* AllocLC(TInt);
    static TAny* AllocZ(TInt);
    static TAny* AllocZL(TInt);
    static TInt AllocSize(TInt&);
    static TInt Available(TInt&);
    static TInt CountAllocCells();
    static TInt CountAllocCells(TInt&);
    static void Free(TAny*);
    static void FreeZ(TAny*&);
    static TAny* ReAlloc(TAny*, TInt, TInt);
    static TAny* ReAllocL(TAny*, TInt, TInt);
    static RAllocator& Allocator();
    static RAllocator* SwitchAllocator(RAllocator*);
    };

We can see the almost one-to-one correspondence of this API with the API provided by RAllocator. The User class implements all of these functions in the same way: get the current allocator object and invoke the corresponding member function.
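
In other words, one would expect implementations along these lines (a sketch, not the actual EUSER source):

TAny* User::Alloc(TInt aSize)
    {
    return Allocator().Alloc(aSize);    // forward to the current thread allocator
    }

void User::Free(TAny* aPtr)
    {
    Allocator().Free(aPtr);             // likewise
    }
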

It is possible to replace the current allocator with an alternative one using the User::SwitchAllocator() function, which returns the previous thread allocator object. There are several reasons that this may be desirable, for example:

  • Replacing the default allocator provided by the OS with one that uses a different allocation strategy better suited to the application
  • Adding an adaptor to the allocator to provide additional instrumentation or debugging facilities. In this case, the new allocator will continue to use the previous allocator for the actual memory allocation, but can intercept the allocation and de-allocation requests to do additional processing - the sketch below shows the shape of such an adaptor.
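
Here is a sketch of such an adaptor (hypothetical code): every MAllocator member simply forwards to the wrapped allocator, with the interesting requests counted on the way through.

// Hypothetical instrumentation adaptor around an existing allocator.
class RCountingAllocator : public RAllocator
    {
public:
    RCountingAllocator(RAllocator& aReal) : iReal(aReal), iAllocCalls(0) {}
    virtual TAny* Alloc(TInt aSize)
        { ++iAllocCalls; return iReal.Alloc(aSize); }
    virtual void Free(TAny* aPtr)
        { iReal.Free(aPtr); }
    virtual TAny* ReAlloc(TAny* aPtr, TInt aSize, TInt aMode=0)
        { return iReal.ReAlloc(aPtr, aSize, aMode); }
    virtual TInt AllocLen(const TAny* aPtr) const
        { return iReal.AllocLen(aPtr); }
    virtual TInt Compress()
        { return iReal.Compress(); }
    virtual void Reset()
        { iReal.Reset(); }
    virtual TInt AllocSize(TInt& aTotal) const
        { return iReal.AllocSize(aTotal); }
    virtual TInt Available(TInt& aBiggest) const
        { return iReal.Available(aBiggest); }
    virtual TInt DebugFunction(TInt aFunc, TAny* a1, TAny* a2)
        { return iReal.DebugFunction(aFunc, a1, a2); }
private:
    RAllocator& iReal;
    TInt iAllocCalls;
    };

Installation is then just a matter of wrapping the current allocator and switching:

RCountingAllocator counter(User::Allocator());
RAllocator* previous = User::SwitchAllocator(&counter);
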

RHeap - the default allocator

Symbian OS provides a single allocator implementation, RHeap, providing a low memory overhead and generally good performance. The same approach is used for both the user-mode free store, and the kernel free store. One can describe this allocator as a first fit, address ordered, free list allocator. It is a simple data structure, and the allocation and de-allocation algorithms are fairly straightforward.

RHeap supports different usage models:

  • Using preallocated memory to provide a fixed size heap, or using a chunk to provide a dynamically sized heap
  • Single-threaded or multi-threaded with light-weight locks
  • Selectable cell alignment.

A dynamic RHeap uses a normal chunk, and so has a single region of committed memory. Within that region, there will be both allocated and free blocks. Each block is preceded by a 32-bit word which describes the length of the block. The allocator does not need to track the allocated blocks, as it is the program's responsibility to do this and later free them. The allocator does need to keep track of all the free blocks: it does this by linking them into a list - the free list. The allocator uses the space within the free block (the first word) to maintain this list.

Free blocks that are neighbors in memory are coalesced into a single free block, so at any time the heap consists of a repeated pattern of one or more allocated blocks followed by a single free block. The free list is a singly linked queue maintained in address order - this enables the de-allocation algorithm to easily identify if the block being released is a direct neighbor of a block that is already free.

The allocation algorithm searches the free list from the start until it finds a block large enough to satisfy the request. The allocator then splits the free block into the requested allocated block and the remaining free space, which is kept on the free list. Sometimes the block is only just large enough for the request (or the remaining space is too small to keep on the free list), in which case the whole block is returned to the caller. If there is no free block large enough, the allocator tries to extend the chunk to create a larger free block at the end of the heap to satisfy the request.

The de-allocation algorithm searches the free list to find the last free block before the block being released and first one after it. If the block being released is a neighbor of either or both of these free blocks they are combined, otherwise the released block is just added into the list between these two free ones.
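
To make the shape of the allocation algorithm concrete, here is a much-simplified first-fit sketch. It is illustrative only - RHeap itself also handles alignment, growing the chunk, minimum cell sizes and locking:

// Simplified first-fit, address-ordered free list - not the real RHeap code.
struct SCell
    {
    TInt len;       // total length of this free cell, including the header word
    SCell* next;    // next free cell, in address order
    };

TAny* SimpleAlloc(SCell*& aFreeList, TInt aSize)
    {
    SCell** prev = &aFreeList;
    for (SCell* c = aFreeList; c; prev = &c->next, c = c->next)
        {
        if (c->len >= aSize + (TInt)sizeof(TInt))
            {
            TInt remainder = c->len - aSize - sizeof(TInt);
            if (remainder >= (TInt)sizeof(SCell))
                {   // split: keep the tail of the cell on the free list
                SCell* tail = (SCell*)((TUint8*)c + sizeof(TInt) + aSize);
                tail->len = remainder;
                tail->next = c->next;
                *prev = tail;
                c->len = aSize + sizeof(TInt);
                }
            else
                *prev = c->next;    // whole cell consumed by this request
            return (TUint8*)c + sizeof(TInt);   // skip the length word
            }
        }
    return NULL;    // a real heap would try to grow the chunk here
    }
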

These algorithms are simple, and so performance is generally fast. However, because both algorithms must search a list of arbitrary length, the performance cannot be described as real-time. This is no worse than the memory model allocation for adjusting chunks, which also does not have real-time behavior.

One drawback with this data structure is that large free blocks that lie inside the heap memory are not released back to the OS - the allocator can only release free memory that lies at the very end of the heap. However, the data structure has a very low memory overhead in general - approximately 4 bytes per allocated cell - though alignment requirements for modern compilers increase this to approximately 8 bytes.

So RHeap, despite its limitations, provides an excellent general purpose allocator for almost all applications within the OS. When better execution or memory performance is required, you can create custom allocators for individual applications.

Shared memory

In many cases, when an application must pass data across some memory context boundary, such as between two processes or between user and kernel contexts, it is most convenient to copy the data. This can be done in a controlled manner that ensures the data being transferred belongs to the sending memory context - and errors are reported correctly rather than causing the wrong program to terminate. However, when the amount of data to be transferred is large, or lower delays in transfer are required, it is more useful to be able to transfer the memory itself rather than copy the data. Some examples of such use cases would be streaming multiple channels of high bit-rate audio data to a software mixer or downloading large files over USB.

For any situation in which we need to share memory between two user-mode processes, we could use one or more global chunks to achieve this. Chunks can also be accessed by kernel-mode software or even directly by DMA. There is a problem with this approach, however.

The chunks that I have described so far have the property that memory is dynamically committed and released from the chunk at the request of user-mode software. For example, the kernel grows the heap chunk to satisfy a large allocation request, or releases some stack pages in response to thread termination. So you can see that it is possible that a page currently being accessed by a kernel thread or DMA might be unmapped by another thread - probably resulting in a system crash. The case of unmapping memory during DMA is particularly difficult to diagnose because DMA works with the physical address and will continue accessing the physical memory: the defect may only be discovered after the memory model reassigns the memory to another process and then suffers from random memory corruption.

To support the sharing of memory between hardware, kernel threads and user programs, we need different types of memory object.

Shared I/O buffers

The simplest of these objects is the shared I/O buffer. Kernel software, such as a device driver, can allocate a shared I/O buffer with a fixed size, and may subsequently map and unmap the buffer from user process address space.

The major limitation with these buffers is that they cannot be mapped into more than one user-mode process at the same time. These are supported in EKA1, but have been superseded in EKA2 by the more powerful shared chunk. As a result of this, we deprecate use of shared I/O buffers with EKA2.

Shared chunks

A shared chunk is a more complex, though more capable, shared memory object and can be used in almost all memory sharing scenarios. It is very much like the global, disconnected chunk that I described in Section 7.3.1, but with one distinct difference: memory can only be committed and released by kernel code and not by user code.

A shared chunk is likely to be the answer if you are solving a problem with some of the following demands:

  • The memory must be created and controlled by kernel-mode software
  • The memory must be safe for use by ISRs and DMA
  • The memory can be mapped into multiple user processes at the same time
  • The memory can be mapped into multiple user processes in sequence
  • The memory object can be transferred by user-side code between processes or to another device driver.

A device driver can map a shared chunk into multiple user processes, either in sequence or simultaneously. In addition, the driver can provide a user program with an RChunk handle to the chunk. This allows the user program to transfer the chunk to other processes and even hand it to other device drivers without the support of the device driver that created it originally.

See Chapter 13, Peripheral Support, for a fuller description of how shared chunks can be used by device drivers.

Global and anonymous chunks

As I have already mentioned, global chunks provide the most flexible way of sharing memory between user-mode processes. An anonymous chunk is a global chunk with no name; this restricts the discovery and access of the chunk from other processes. However, as noted earlier, such chunks are of limited value for sharing memory between kernel and user software.

A global chunk is likely to be the solution if you are solving a problem with some of the following demands:

  • The memory must be created and controlled by user-mode software
  • The memory is not accessed directly from DMA/ISR
  • The memory can be mapped into one or more user processes at the same time.

I will again point out that opening a shared chunk in two processes at the same time does not always guarantee that they will share the same address for the data. In fact, closing a chunk and re-opening it at a later point within a single program may result in a different base address being returned!

Publish and subscribe

There are, of course, other reasons for wanting to share memory, such as having some data that is global to the whole OS. In this case it is the universal access to the data and not the reduction in copying overhead that drives the desire for sharing.

Once again, a global chunk might serve this purpose on some occasions. But if the quantity of data is small, if it is not possible to retain the chunk handle or data address between accesses, or if some control is required for access to the data, then another approach is needed.

Publish and subscribe may be the answer, as one way to look at this service is as a set of global variables that both user- and kernel-mode software can access by ID. The service also provides access control for each value, based around the platform security architecture, and some real-time guarantees. See Chapter 4, Inter-thread Communication, for a detailed description of this service.
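
As a flavor of the API (a sketch - the category UID and key below are invented, and Chapter 4 covers the real details, including security policies):

#include <e32property.h>

const TUid KMyCategory = { 0x10001234 };    // hypothetical category UID
const TUint KMyKey = 1;                     // hypothetical key within the category

// Define and publish an integer value...
TInt r = RProperty::Define(KMyCategory, KMyKey, RProperty::EInt);
if (r == KErrNone || r == KErrAlreadyExists)
    r = RProperty::Set(KMyCategory, KMyKey, 42);

// ...which user- or kernel-side code can later read by ID.
TInt value;
r = RProperty::Get(KMyCategory, KMyKey, value);
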

Memory allocation

The following tables compare how the various memory models reserve and allocate the memory used for different purposes within the OS.

Kernel memory

Table 7.1 Kernel memory

User memory

Table 7.2 User memory

Low memory

On modern desktop computers, we certainly notice when the system runs out of memory: everything begins to slow down, the hard disk starts to get very busy and we get warning messages about running low on memory. But we aren't generally informed that there is not enough memory to carry out a request, as we would have been on a desktop system 10 years ago - instead the system will struggle on, even if it becomes unusable.

This change in behavior is mainly because of demand paging and virtual memory systems - the OS has the ability to save to disk a copy of memory that has not been accessed recently and then copy it back in to main memory again the next time a program tries to use the memory. This way, the system can appear to have far more physical memory than it really has. One side effect is that the hard limit of memory capacity has become a softer restriction - very rarely will an application find that a memory allocation request fails.

Handling allocation failure

As I said earlier, Symbian OS does not support demand paging and has small amounts of physical memory when compared with desktop devices. This combination means that all kernel, system and application software must expect that all memory allocation requests will fail from time to time. The result is that all software for Symbian OS must be written carefully to ensure that Out of Memory (OOM) errors are handled correctly and as gracefully as possible.

As well as correctly handling a failure to allocate memory, a server or application must also manage all of its allocated memory. Long running services (such as the kernel) must be able to free memory that was acquired for a resource when a program releases that resource - the alternative is the slow leakage of memory over time, eventually resulting in memory exhaustion and system failure.

For user-side code, the TRAP and Leave mechanism and the cleanup stack provide much of the support required to manage memory allocation and recovery on failure. These services are covered extensively in books such as Symbian OS C++ for Mobile Phones: Professional Development on Constrained Devices by Richard Harrison (Symbian Press).

Within the EKA2 kernel, there are no mechanisms such as TRAP, Leave and the cleanup stack. This contrasts with EKA1, in which we used the TRAP mechanism inside the kernel. Our experience shows that the use of TRAP, Leave and the cleanup stack make user-side code simpler, more readable and often more compact. However, this experience does not carry over to the implementation of EKA2 - the presence of fine-grained synchronization and possibility of preemption at almost all points in the code often requires more complex error detection and recovery code. Additionally, optimizations to accelerate important operations or to reduce context switch thrashing remove the symmetry that is desirable for using a cleanup stack push/pop protocol.

So instead of providing an equivalent to TRAP, the kernel provides a number of supporting services that help ensure that threads executing in kernel mode do not leak memory, even during long running kernel services when it is quite possible that the thread may be terminated.

Thread critical sections

These bear little relation to the user-side synchronization primitive of the same name, RCriticalSection. Rather, these are code regions during which the thread cannot be unilaterally suspended or terminated - the thread will only act on suspend or exit requests once it leaves the critical section. The kernel uses these extensively to ensure that when a thread is modifying a shared data structure in the kernel, the modifications will run to completion rather than the thread stopping part way through. Holding a fast mutex places a thread in an implicit critical section, as the scheduler depends on the fact that such a thread cannot block or otherwise be removed from the ready list.
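
The kernel-side pattern looks like this sketch (DExampleObject and RegisterForCleanup() are hypothetical, used only to show the shape):

NKern::ThreadEnterCS();             // suspend/kill requests are deferred from here...
DExampleObject* obj = new DExampleObject;
if (obj)
    RegisterForCleanup(obj);        // hypothetical: make the object reachable so it
                                    // is no longer owned only by our call stack
NKern::ThreadLeaveCS();             // ...until here, when it is safe to act on them
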

Exception trapping

When inside a critical section, it is illegal for a thread to take any action that would result in the kernel terminating it - such as panicking it (due to invalid user arguments) or terminating it because it took an exception. The latter scenario can occur if a kernel service must copy data from memory supplied by the user-mode program, but the memory pointer provided is invalid. This makes the copying of user-mode data difficult, particularly when the thread needs to hold the system lock at the same time (which is an implicit thread critical section). EKA2 provides an exception handling and trapping system, XTRAP, which behaves in a similar way to the user-side TRAP/Leave, except that it catches hardware exceptions such as those generated by a faulty memory access. The kernel most frequently uses XTRAP to safely copy user-mode memory while inside a thread critical section. If an error is trapped, the thread can then safely exit the critical section before reporting the failure.
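As a sketch of this, the fragment below traps any exception raised while copying a word from a user-supplied address; userPtr is assumed to come from the user-mode client, and XT_DEFAULT names the default exception handler:

```cpp
// Copy user memory inside a thread critical section, trapping the
// exception raised if userPtr is invalid. Error handling is illustrative.
TUint32 localCopy;
TInt r;
XTRAP(r, XT_DEFAULT, kumemget32(&localCopy, userPtr, sizeof(localCopy)));
NKern::ThreadLeaveCS();         // leave the critical section first...
if (r != KErrNone)
    return KErrBadDescriptor;   // ...then report the faulty pointer
```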

Transient objects

Occasionally a thread needs to allocate a temporary object during part of a kernel executive call. As the only reference owning this object is in the thread's registers and call stack, the thread would have to enter a critical section to prevent a memory leak if the thread were terminated. However, thread critical sections make error handling more complex, as they require the use of exception trapping and the deferral of error reporting until the critical section is released. We provide some help here: each DThread in the kernel has two members that can hold a DObject on which the thread has a temporary reference, and one that can hold a temporary heap cell. If non-null, iTempObj and iExtTempObj are closed and iTempAlloc is deleted during thread exit processing. Kernel code can use these members to own such temporary objects during an executive call, enabling the thread to release the critical section earlier.
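A minimal sketch of the idiom, assuming the DThread members described above: the heap cell is parked in iTempAlloc so that thread-exit processing frees it even if the thread is killed outside the critical section.

```cpp
// Allocate a temporary buffer and hand ownership to thread-exit processing
// so the thread can leave its critical section without risking a leak.
NKern::ThreadEnterCS();
TAny* buf = Kern::Alloc(256);
Kern::CurrentThread().iTempAlloc = buf;   // exit handling now owns buf
NKern::ThreadLeaveCS();                   // safe even if the thread is killed

// ... long-running work using buf ...

NKern::ThreadEnterCS();
Kern::CurrentThread().iTempAlloc = NULL;  // reclaim ownership
Kern::Free(buf);
NKern::ThreadLeaveCS();
```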

System memory management

It is quite possible to write a single application that manages its own memory carefully, handles OOM scenarios and can adjust its behavior when less memory is available. However, a single application cannot easily determine whether it should release some non-critical memory (for example, a cache) so that another application can run.

To address this, the kernel provides some support to the system as a whole, enabling the implementation of system-wide memory management policies, typically within a component in the UI.

The memory model keeps track of the number of unused pages of memory. When this count falls below a certain threshold, the kernel completes any RChangeNotifier subscriptions with EChangesFreeMemory. When the amount of free memory increases, the kernel signals the notifiers again. In addition, should any RAM allocation request fail due to insufficient memory, the kernel signals the notifiers with EChangesOutOfMemory.

The EUSER function UserSvr::SetMemoryThresholds() sets two values that control when the memory model should signal the notifiers.
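A sketch of how a system-level manager might use these services; the threshold values are illustrative and assumed to be in bytes, and error handling is trimmed:

```cpp
// Set the low/good free-memory thresholds, then watch RChangeNotifier for
// low-memory and out-of-memory events. Threshold units are assumed.
UserSvr::SetMemoryThresholds(128*1024, 256*1024);   // low, good (illustrative)

RChangeNotifier notifier;
notifier.Create();
TRequestStatus status;
notifier.Logon(status);
User::WaitForRequest(status);   // first completion reports all change flags
FOREVER
    {
    notifier.Logon(status);
    User::WaitForRequest(status);
    if (status.Int() & EChangesFreeMemory)
        {
        // Free memory crossed a threshold: ask applications to trim caches
        }
    if (status.Int() & EChangesOutOfMemory)
        {
        // An allocation failed: demand that idle applications exit
        }
    }
```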

Typically, the UI component responsible for managing the system memory policy will set the thresholds and then monitor the notifier for indications that free memory is getting low or has been exhausted. When this occurs, the system might employ various strategies for freeing some currently used memory:

  • The manager can request that applications reduce their memory consumption. For example, a browser could reduce the size of the RAM cache it is using, or a Java virtual machine could garbage collect and compact its memory space
  • The manager can request (or demand) that applications and OS services that have not been used recently save any data and then exit. This is quite acceptable on a phone, where the idea of running many applications at once is still one that users associate with computers rather than phones.

The mechanism by which such requests arrive at an application is presently specific to the UI, if such requests are used at all.

In some respects, you can envisage this system-level memory manager as an application-level garbage collector. In the future, it may well be that the algorithms used to select which applications should be asked to release memory or exit will borrow ideas from the well-established problem domain of garbage-collecting memory allocators.

Summary

In this chapter I have talked about the way in which the memory model makes use of the physical memory, the cache and the MMU to provide the memory services required by both Symbian OS kernel-side and user-mode programs.

I also showed how the MMU is used to provide memory protection between processes. In the next chapter, I will talk about how we build on this basis to provide a secure operating system.

© 2010 Symbian Foundation Limited. This document is licensed under the Creative Commons Attribution-Share Alike 2.0 license. See http://creativecommons.org/licenses/by-sa/2.0/legalcode for the full terms of the license.
Note that this content was originally hosted on the Symbian Foundation developer wiki.
