diff --git a/10_DMA_memory/README.md b/10_DMA_memory/README.md
index 9e6c48ae..d9b9bdf9 100644
--- a/10_DMA_memory/README.md
+++ b/10_DMA_memory/README.md
@@ -1,13 +1,245 @@
 # Tutorial 10 - DMA Memory
 
-Coming soon!
+There's a secret I haven't told you! A certain part of our code doesn't work
+anymore since the [virtual memory](../0D_virtual_memory) tutorial. There is a
+regression that manifests in the `Videocore Mailbox` driver: it works only
+until **paging and caching** are switched on. Afterwards, the `call()` method
+will fail. Why is that?
 
-This lesson will teach about:
-- Simple bump memory allocators and non-cacheable memory.
-- Using MiniUart for early boot messages and dynamically switching to the PL011
-  Uart later (which now needs the memory allocator that theoretically could fail
-  - which the MiniUart could then print).
+The reason is that in our code, the RPi's processor shares a `DRAM buffer`
+with the `Videocore` device. In other words, the concept of **shared memory**
+is used. Let's recall a simplified version of the protocol:
+
+1. The RPi `CPU` checks the `Videocore`'s `STATUS` MMIO register to see if a
+   message can be written.
+2. If so, the `CPU` writes the address of the `DRAM buffer` in which the
+   actual message is stored into the `Videocore`'s `WRITE` MMIO register.
+3. The `CPU` checks the `STATUS` and `READ` MMIO registers to see if the
+   `Videocore` has answered.
+4. If so, the `CPU` checks the first `u32` word of the earlier provided
+   `DRAM buffer` to see if the response is valid. The `Videocore` puts its
+   answer into the same buffer in which the original request was stored;
+   this is what is commonly called a `DMA` transaction.
+
+At step **4**, things break. The reason is that code and **page tables** were
+set up in a way that the `DRAM buffer` used for message exchange between the
+`CPU` and the `Videocore` is attributed as _cacheable_.
+
+So when the `CPU` writes to the buffer, the contents might not be written back
+to `DRAM` in time before the notification of a new message is signaled to the
+`Videocore` via the `WRITE` MMIO register (which is correctly attributed as
+device memory in the page tables and hence not cached).
+
+Even if the contents did land in `DRAM` in time, the `Videocore`'s answer,
+which overwrites the same buffer, would not be reflected in the `CPU`'s cache,
+since there is no coherency mechanism in place between the two. The RPi `CPU`
+would read back the same values it put into the buffer itself when setting up
+the message, and not the `DRAM` content that contains the answer.
+
+![DMA block diagram](../doc/dma_0.png)
+
+The regression has not manifested yet because the Mailbox is only used before
+paging and caching are switched on, and never afterwards. However, now is a
+good time to fix this.
+
+## An Allocator for DMA Memory
+
+The first step is to introduce a region of _non-cacheable DRAM_ in the
+`KERNEL_VIRTUAL_LAYOUT` in `memory.rs`:
+
+```rust
+Descriptor {
+    name: "DMA heap pool",
+    virtual_range: || RangeInclusive::new(map::virt::DMA_HEAP_START, map::virt::DMA_HEAP_END),
+    translation: Translation::Identity,
+    attribute_fields: AttributeFields {
+        mem_attributes: MemAttributes::NonCacheableDRAM,
+        acc_perms: AccessPermissions::ReadWrite,
+        execute_never: true,
+    },
+},
+```
+
+If you saw the inferior performance of non-cacheable DRAM compared to
+cacheable DRAM in the [cache performance tutorial](../0E_cache_performance)
+earlier and asked yourself why anybody would ever want this: exactly for the
+use-case at hand!
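+
+For reference, the two bounds used in the descriptor above could be plain
+constants in the kernel's memory map. The following is a minimal sketch of
+such a definition; the concrete addresses and the exact shape of the
+`map::virt` module are assumptions for illustration, only the names
+`DMA_HEAP_START` and `DMA_HEAP_END` are taken from the descriptor above:
+
+```rust
+// Hypothetical excerpt from memory.rs; the addresses below are placeholders,
+// not the tutorial's real values.
+pub mod map {
+    pub mod virt {
+        /// First byte of the non-cacheable DRAM pool handed out as DMA memory.
+        pub const DMA_HEAP_START: usize = 0x0020_0000;
+
+        /// Last byte of the DMA heap pool.
+        pub const DMA_HEAP_END: usize = 0x005F_FFFF;
+    }
+}
+```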
+
+Theoretically, some linker hacks could be used to ensure that the `Videocore`
+uses a buffer that is statically linked to the DMA heap pool once paging and
+caching are turned on. However, in real-world kernels, it is common to
+frequently map/allocate and unmap/free chunks of `DMA` memory at runtime, for
+example in device drivers for DMA-capable devices.
+
+Hence, let's introduce an `allocator`.
+
+### Bump Allocation
+
+As always in the tutorials, a simple implementation is used for getting
+started with the basic concepts of a topic, and upgrades are introduced when
+they are needed.
+
+When a request comes in, a `bump allocator` always returns the next suitably
+aligned region of its heap until it runs out of memory. What makes it really
+simple is that it provides no means of freeing memory again. When no more
+memory is left, the game is over.
+
+Conveniently enough, [Rust already provides memory allocation APIs](https://doc.rust-lang.org/alloc/alloc/index.html). There is an
+[Alloc](https://doc.rust-lang.org/alloc/alloc/trait.Alloc.html) and a
+[GlobalAlloc](https://doc.rust-lang.org/alloc/alloc/trait.GlobalAlloc.html)
+trait. The latter is intended for realizing a _default allocator_, meaning it
+would be the allocator used for any standard language constructs that
+automatically allocate something on the heap, for example a
+[Box](https://doc.rust-lang.org/alloc/boxed/index.html). There can only be one
+global allocator, so the tutorials will make use of it for cacheable DRAM
+later.
+
+Hence, for the DMA bump allocator,
+[Alloc](https://doc.rust-lang.org/alloc/alloc/trait.Alloc.html) will be used.
+What is also really nice is that for both traits, only the `alloc()` and
+`dealloc()` methods need to be implemented. Once this is done, you
+automatically get a bunch of additional default methods for free, e.g.
+`alloc_zeroed()`.
+
+Here is the implementation in `memory/bump_allocator.rs`:
+
+```rust
+pub struct BumpAllocator {
+    next: usize,
+    pool_end: usize,
+    name: &'static str,
+}
+
+unsafe impl Alloc for BumpAllocator {
+    unsafe fn alloc(&mut self, layout: Layout) -> Result<NonNull<u8>, AllocErr> {
+        let start = crate::memory::aligned_addr_unchecked(self.next, layout.align());
+        let end = start + layout.size();
+
+        if end <= self.pool_end {
+            self.next = end;
+
+            println!(
+                "[i] {}:\n      Allocated Addr {:#010X} Size {:#X}",
+                self.name,
+                start,
+                layout.size()
+            );
+
+            Ok(NonNull::new_unchecked(start as *mut u8))
+        } else {
+            Err(AllocErr)
+        }
+    }
+
+    // A bump allocator doesn't care
+    unsafe fn dealloc(&mut self, _ptr: NonNull<u8>, _layout: Layout) {}
+}
+```
+
+The `alloc()` method returns a pointer to memory. However, it is safer to
+operate with [slices](https://doc.rust-lang.org/alloc/slice/index.html), since
+they are intrinsically bounds-checked. Therefore, the `BumpAllocator` gets an
+additional method called `alloc_slice_zeroed()`, which wraps around the
+`alloc_zeroed()` provided by the `Alloc` trait and on success returns a
+`&'a mut [T]`.
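+
+The following is a minimal sketch of what such a wrapper might look like; the
+parameter names and the error type are assumptions, and the real code in
+`memory/bump_allocator.rs` may differ in details:
+
+```rust
+impl BumpAllocator {
+    /// Allocate a zeroed slice of `count` `T`'s, aligned to `alignment` bytes.
+    ///
+    /// A sketch only: it computes the byte size of the request, lets the
+    /// trait-provided `alloc_zeroed()` do the actual work, and reinterprets
+    /// the returned pointer as a slice.
+    pub fn alloc_slice_zeroed<'a, T>(
+        &mut self,
+        count: usize,
+        alignment: usize,
+    ) -> Result<&'a mut [T], ()> {
+        let size = core::mem::size_of::<T>() * count;
+        let layout = Layout::from_size_align(size, alignment).map_err(|_| ())?;
+
+        let ptr = unsafe { self.alloc_zeroed(layout).map_err(|_| ())? };
+
+        Ok(unsafe { core::slice::from_raw_parts_mut(ptr.as_ptr() as *mut T, count) })
+    }
+}
+```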
+
+### Global Instance
+
+A global instance of the allocator is needed, and since its methods demand
+_mutable references_ to `self`, it is wrapped into a `NullLock`, which was
+introduced in the [last tutorial](../0F_globals_synchronization_println):
+
+```rust
+/// The global allocator for DMA-able memory. That is, memory which is tagged
+/// non-cacheable in the page tables.
+static DMA_ALLOCATOR: sync::NullLock<memory::BumpAllocator> =
+    sync::NullLock::new(memory::BumpAllocator::new(
+        memory::map::virt::DMA_HEAP_START as usize,
+        memory::map::virt::DMA_HEAP_END as usize,
+        "Global DMA Allocator",
+    ));
+```
+
+## Videocore Driver
+
+The `Videocore` driver has to be changed to use the allocator during
+instantiation, which, in contrast to earlier, can now fail:
+
+```rust
+let ret = crate::DMA_ALLOCATOR.lock(|d| d.alloc_slice_zeroed(MBOX_SIZE, MBOX_ALIGNMENT));
+
+if ret.is_err() {
+    return Err(());
+}
+```
+
+## Reorg of the Kernel Init
+
+Since the `Videocore` now depends on the `DMA Allocator`, its initialization
+must now happen _after_ the `MMU init`, which switches on **paging and
+caching**. This, in turn, means that the `PL011 UART`, which is used for
+printing and needs the `Videocore` for its setup, has to shift its init as
+well. So there is a lot of shuffling happening.
+
+In summary, the new init procedure would be:
+
+1. GPIO
+2. MMU
+3. Videocore
+4. PL011 UART
+
+That is a bit unfortunate, because if anything goes wrong at `MMU` or
+`Videocore` init, we cannot print any fault info on the console. For this
+reason, the `MiniUart` from the earlier tutorials is revived, because it only
+needs the `GPIO` driver to set itself up. So here is the revamped init:
+
+1. GPIO
+2. MiniUart
+3. MMU
+4. Videocore
+5. PL011 UART
+
+Using this procedure, the `MiniUart` can report faults for any of the
+subsequent stages like `MMU` or `Videocore` init. If all is successful and the
+more capable `PL011 UART` comes online, we can let it conveniently replace the
+`MiniUart` through the `CONSOLE.replace_with()` scheme introduced in the
+[last tutorial](../0F_globals_synchronization_println).
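+
+As a rough sketch, the revamped order could look like the following. All type
+names and function signatures in this snippet are hypothetical and only serve
+to illustrate the ordering and the fault-reporting path; the real code in
+`main.rs` differs:
+
+```rust
+// Hypothetical init sequence (names are placeholders, not the real API).
+fn kernel_entry() -> ! {
+    // 1. GPIO comes first; it needs no other driver.
+    let gpio = GPIO::new();
+
+    // 2. The MiniUart only needs GPIO, so from here on faults can be printed.
+    let mini_uart = MiniUart::new(&gpio);
+    mini_uart.init();
+
+    // 3. Switch on paging and caching. Afterwards, the DMA heap pool is
+    //    mapped non-cacheable.
+    if unsafe { mmu_init() }.is_err() {
+        panic!("MMU init failed"); // Reported via the MiniUart.
+    }
+
+    // 4. The Videocore can now allocate its mailbox buffer from the DMA pool.
+    // 5. On success, bring up the PL011 UART and let it replace the MiniUart
+    //    via CONSOLE.replace_with().
+
+    loop {}
+}
+```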
+
+### Make it Fault
+
+If you feel curious and want to put all the theory into action, take a look at
+the code in `main.rs` for the DMA allocator instantiation and try the changes
+in the comments:
+
+```rust
+/// The global allocator for DMA-able memory. That is, memory which is tagged
+/// non-cacheable in the page tables.
+static DMA_ALLOCATOR: sync::NullLock<memory::BumpAllocator> =
+    sync::NullLock::new(memory::BumpAllocator::new(
+        memory::map::virt::DMA_HEAP_START as usize,
+        memory::map::virt::DMA_HEAP_END as usize,
+        "Global DMA Allocator",
+        // Try the following arguments instead to see the PL011 UART init
+        // fail. It will cause the allocator to use memory that is marked
+        // cacheable and therefore not DMA-safe. The communication with the
+        // Videocore will therefore fail.
+
+        // 0x00600000 as usize,
+        // 0x007FFFFF as usize,
+        // "Global Non-DMA Allocator",
+    ));
+```
+
+This might only work on the real HW and not in QEMU.
+
+## QEMU
+
+On the actual HW, it is possible to reprogram the same `GPIO` pins at runtime
+to use either the `MiniUart` or the `PL011`, and as a result the console
+output of both is sent through the same USB-serial dongle. This is transparent
+to the user.
+
+On QEMU, unfortunately, this multiplexing is not possible; two different
+virtual terminals must be used. As a result, you'll see that the QEMU output
+now looks a bit different and provides separate views for the two `UARTs`.
 
 ## Output
diff --git a/10_DMA_memory/src/main.rs b/10_DMA_memory/src/main.rs
index 4f4b2405..f5410979 100644
--- a/10_DMA_memory/src/main.rs
+++ b/10_DMA_memory/src/main.rs
@@ -49,10 +49,9 @@ static DMA_ALLOCATOR: sync::NullLock<memory::BumpAllocator> =
         memory::map::virt::DMA_HEAP_END as usize,
         "Global DMA Allocator",
         // Try the following arguments instead to see the PL011 UART init
-        // fail. It will cause the allocator to use memory that are marked
-        // cacheable and therefore not DMA-safe. The answer from the Videocore
-        // won't be received by the CPU because it reads an old cached value
-        // that resembles an error case instead.
+        // fail. It will cause the allocator to use memory that is marked
+        // cacheable and therefore not DMA-safe. The communication with the
+        // Videocore will therefore fail.
 
         // 0x00600000 as usize,
         // 0x007FFFFF as usize,
diff --git a/doc/dma_0.png b/doc/dma_0.png
new file mode 100644
index 00000000..65c8c619
Binary files /dev/null and b/doc/dma_0.png differ
diff --git a/doc/dma_0.svg b/doc/dma_0.svg
new file mode 100644
index 00000000..8ea0f46a
--- /dev/null
+++ b/doc/dma_0.svg
@@ -0,0 +1,402 @@
+[SVG markup elided: source of the DMA block diagram, showing the four CPU
+cores (core0-core3) behind a shared Cache, plus DRAM and the Videocore.]