Introduction

Welcome to my Kernel From Scratch (KFS) project documentation! This repository contains my implementation of the École 42 "Kernel From Scratch" curriculum: 10 progressive projects designed to explore and understand kernel architecture from the ground up. My implementation is written in C.

⚠️ This project is in early stage development. It's not yet ready for production use.

This Book

This book goes deeper into my technical approach and offers blog-style write-ups of each project, describing how I got to the point I am at now.

To find the book online, go to this page -> a page

Overview


Technical Requirements

  • Target Architecture: i386 (x86_32)
  • Build System: Custom Makefile required
  • Programming Language: Not restricted to any specific language
  • Compiler Flags:
    • -nostdlib: Prevents linking with standard library
    • -nodefaultlibs: Excludes default library functions
    • -fno-stack-protector: Disables stack protection mechanisms
  • Custom Linker: Implementation of a custom linker script is mandatory
  • No cheating! All code must be original; no copying + pasting allowed

Documentation Structure

Each project in this repository is documented with:

  1. Project Goals and Requirements
  2. Technical Approach and Implementation Details
  3. Challenges Encountered and Solutions
  4. Conclusions and Lessons Learned

Join me

Feel free to explore each project's documentation to understand how a kernel is built from scratch. The projects are organized sequentially, with each building upon the knowledge and components developed in previous sections.

KFS 01 - Grub, boot & screen

Introduction

Our first KFS project! I was quite nervous myself at first looking at this. My previous experience with kernels was limited to Linux from Scratch, which barely scratches the surface compared to KFS.

For this project, I chose the C programming language. I already have a few years of experience with C/C++ and Assembly, which will definitely be useful.

Goals

  • A kernel that can boot via Grub
  • An ASM bootable base
  • A basic Kernel Library
  • Show "42" on screen

Technical Approach & Implementation

My approach was quite straightforward for this project. Read, read & READ! I primarily started reading OSDev. They have a straightforward tutorial for OS booting in C/ASM. With ease, I was able to make a system bootable.

Philipp Oppermann's blog helped me a lot! The blog is focused on developing a Rust kernel, but it gave me more insight into how to set up a C language environment. His writing style is, to me, more approachable than OSDev's.

After that, I noticed Mr. Oppermann also has a tutorial on the VGA text buffer: how to set it up and print to it, which covers one of the requirements. After setting all that up, I just had to dot the i's and cross the t's.
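To give an idea of what the VGA setup boils down to, here is a minimal sketch of printing to the text buffer. The function name vga_write_at and the buffer-as-parameter design are my illustration, not the exact code from this project; on real hardware the buffer would be (volatile uint16_t *)0xB8000.

```c
#include <stdint.h>
#include <stddef.h>

/* VGA text mode: an 80x25 grid of 16-bit cells, normally at physical 0xB8000.
 * Low byte = ASCII character, high byte = colour attribute. */
#define VGA_WIDTH 80
#define VGA_COLOR_LIGHT_GREY 0x07

/* Write a string into a VGA-style buffer at (row, col). Taking the buffer
 * as a parameter keeps the sketch testable off hardware. */
static void vga_write_at(volatile uint16_t *buf, size_t row, size_t col,
                         const char *s) {
    for (size_t i = 0; s[i] != '\0'; i++) {
        uint16_t cell = (uint16_t)s[i] | ((uint16_t)VGA_COLOR_LIGHT_GREY << 8);
        buf[row * VGA_WIDTH + col + i] = cell;
    }
}
```

In the kernel, showing "42" is then a single call: vga_write_at((volatile uint16_t *)0xB8000, 0, 0, "42");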

I added a nix-shell. nix-shell creates an interactive shell based on a Nix expression. It reduces the "it works on my machine" problem, and it is also great if you often switch between devices. Nix as a language, on the other hand, is quite unintuitive. Its documentation is notoriously sluggish to navigate; because of that, I had a very hard time figuring out how to cross-compile GCC using nix-shell.

Challenges

The hardest challenge of this project was understanding the nix-shell, because the documentation of Nix is quite limited. It was just trying a lot of things until it worked.

Conclusion & Lesson Learned

In the end, it went much smoother than expected. There were plenty of tutorials and understandable documentation to get me through the first project.

I am still happy with my choice to use nix-shell. It will definitely avoid headaches in the future.

KFS 02 - GDT & Stack

Introduction

Let's proceed to the second KFS project. The first was doable and I felt confident doing the second one.

For this project, we had to implement a GDT (Global Descriptor Table). The GDT serves as a fundamental data structure in x86 architecture, playing a crucial role in memory management and protection. When our computer starts, it begins in real mode, a simple operating mode that provides direct access to memory and I/O devices. However, we need to switch to protected mode, which introduces memory protection, virtual memory, and privilege levels.

Think of protected mode as establishing different security clearance levels in a building. The GDT acts as the rulebook for this security system, defining the access rights and boundaries for each level. The x86 architecture provides four rings (0-3), where ring 0 is the most privileged (kernel space) and ring 3 is the least privileged (user space). Each ring has specific permissions and restrictions, all defined in our GDT.

The GDT is essential not just for security, but also for the basic operation of protected mode. Without a properly configured GDT, the CPU cannot execute protected mode code at all.

Goals

The project requires creating a GDT at 0x00000800 with entries for Kernel Data Space, Kernel Code Space, User Data Space, and User Code Space. Additionally, we need to add minimal PS/2 Keyboard Support and implement a basic shell with the commands reboot & gdt. The gdt command will print the GDT entries in a human-readable way.

Technical Approach & Implementation

My journey began with studying the OSDev documentation. The concepts were initially overwhelming. The terminology they used like segment descriptors, privilege levels, and descriptor flags felt like learning a new language. After watching several YouTube tutorials (here & here) about GDT implementation, things started to click.

I faced a choice: implement the GDT in Assembly or C. While Assembly would give more direct control, I chose C for its human-readability and my familiarity with the language. Here's how I structured the implementation:

The boot process begins in boot.asm, where we set up the multiboot flags and prepare the transition to protected mode. Once boot.asm calls kmain(), it in turn calls gdt_init(), which creates the GDT that our kernel will use.

extern void gdt_flush(uint32_t);

entry_t gdt_entries[5] __attribute__((section(".gdt")));
descriptor_pointer_t gdt_ptr;


void gdt_set_gate(uint32_t num, uint32_t base, uint32_t limit, uint8_t access,
                  uint8_t gran) {
    ...
}

void gdt_init() {
  gdt_ptr.limit = sizeof(entry_t) * 5 - 1;
  gdt_ptr.base = (uint32_t)&gdt_entries;

  gdt_set_gate(0, 0, 0, 0, 0);                // NULL Gate
  gdt_set_gate(1, 0, 0xFFFFFFFF, 0x9A, 0xCF); // Kernel Code Segment
  gdt_set_gate(2, 0, 0xFFFFFFFF, 0x92, 0xCF); // Kernel Data Segment
  gdt_set_gate(3, 0, 0xFFFFFFFF, 0xFA, 0xCF); // User Code Segment
  gdt_set_gate(4, 0, 0xFFFFFFFF, 0xF2, 0xCF); // User Data Segment

  gdt_flush((uint32_t)&gdt_ptr);
}

We call gdt_set_gate() which creates what are known as segment descriptors or gates. The function takes five parameters:

  • num: The index of the current Gate we are configuring
  • base: The starting address of the segment (0 for flat memory model)
  • limit: The maximum addressable unit (0xFFFFFFFF means use entire 4GB address space)
  • access: Defines segment privileges and type
  • granularity: Controls granularity and size

At the top of the code snippet you will see two variables. entry_t gdt_entries[5] __attribute__((section(".gdt"))); is where we place the entries; we will have 5 gates in total. The __attribute__((section(".gdt"))) is a compiler directive that instructs the linker to place this array in a special data section named .gdt. I then position this section at a specific memory address using the linker script.

descriptor_pointer_t gdt_ptr is what we hand to the lgdt instruction, which loads it into the CPU's GDTR register. The CPU needs it to know where the GDT lives and how large it is.
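For reference, here is a sketch of what entry_t, descriptor_pointer_t, and the elided gdt_set_gate() body can look like. The field layout follows the standard i386 segment descriptor format; treat this as an illustration rather than the project's exact code.

```c
#include <stdint.h>

/* One 8-byte GDT segment descriptor. Base and limit are scattered
 * across the entry for historical (80286) reasons. */
typedef struct {
    uint16_t limit_low;   // limit bits 0-15
    uint16_t base_low;    // base bits 0-15
    uint8_t  base_middle; // base bits 16-23
    uint8_t  access;      // present bit, privilege level, segment type
    uint8_t  granularity; // limit bits 16-19 plus flags (4K granularity, 32-bit)
    uint8_t  base_high;   // base bits 24-31
} __attribute__((packed)) entry_t;

/* The 6-byte operand loaded by the lgdt instruction. */
typedef struct {
    uint16_t limit; // size of the table in bytes, minus one
    uint32_t base;  // linear address of the first entry
} __attribute__((packed)) descriptor_pointer_t;

entry_t gdt_entries[5];

/* A possible body for gdt_set_gate(): scatter base and limit into the entry. */
void gdt_set_gate(uint32_t num, uint32_t base, uint32_t limit,
                  uint8_t access, uint8_t gran) {
    gdt_entries[num].base_low    = base & 0xFFFF;
    gdt_entries[num].base_middle = (base >> 16) & 0xFF;
    gdt_entries[num].base_high   = (base >> 24) & 0xFF;
    gdt_entries[num].limit_low   = limit & 0xFFFF;
    gdt_entries[num].granularity = (uint8_t)(((limit >> 16) & 0x0F) | (gran & 0xF0));
    gdt_entries[num].access      = access;
}
```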

After setting up the GDT, I implemented basic keyboard support. While my current polling approach isn't ideal (it continuously checks for keystrokes), it works for our basic shell. A proper implementation would use interrupts to handle keyboard events, but that's a topic for future projects. The VGA driver from KFS_01 was adapted to create a simple shell interface, allowing for the reboot and gdt commands.

The system still experienced triple faults initially. The cause of the initial triple faults was the linker script. Although I used the __attribute__ to place the GDT in a custom section, I hadn't yet told the linker where to place that .gdt section in the final memory layout, leading to the crash. I ensured our GDT was placed at the correct memory address. The ordering is crucial: BIOS boot code, then GDT, then the rest of our kernel.

  /* Start at 2MB */
  . = 2M;


  .gdt 0x800 : ALIGN(0x800)
  {
    *(.gdt)
  }

  /* The rest... */

Run objdump -h ferrite-c.elf | grep gdt in your terminal. It will show something like this:

0 .gdt          00000028  00000800  00000800  00000800  2**11

The VMA column shows 00000800, which means you did it! The GDT is placed at 0x00000800. Alternatively, you can check it from within your program:

void print_gdt(void) {
  descriptor_pointer_t gdtr;

  __asm__ volatile("sgdt %0" : "=m"(gdtr));

  printk("GDT Base Address: 0x%x\n", gdtr.base);
  printk("GDT Limit: 0x%x\n", gdtr.limit);
}

Challenges

The challenges were mostly understanding the GDT. I struggled to grasp its purpose and exact workings. It took me reading several articles and watching multiple videos to finally understand what it's meant to do.

I also had no real experience with the linker. Finding the source of the triple fault was particularly frustrating, and it took quite a while before I realized the linker might not be placing the GDT at the correct address.

Conclusion & Lesson Learned

I found that I needed to reread materials multiple times to fully grasp concepts. Fortunately, there was plenty of documentation available about the GDT and its implementation. Working with the GDT motivated me to document everything extensively, like these pages. I mainly do this to ensure I truly understand the functionality of each component I'm working with. And to help others who might follow this path, of course. ;)

KFS 04 - Interrupts

Introduction

Interrupts are a critical part of any operating system, serving as a key mechanism for handling everything from hardware input to critical CPU errors. They allow the OS to respond to events asynchronously, ensuring that unexpected exceptions don't become fatal system crashes and that devices like the keyboard can communicate with the CPU efficiently. This post details the process of implementing a complete interrupt handling system.

Goals

The project goals were as follows:

  • Create and Register an Interrupt Descriptor Table (IDT): The foundational step was to properly define, populate, and load an IDT so the CPU knows where to find our handlers.

  • Implement a Signal-Callback System: Design a kernel API that can associate interrupt signals with specific callback functions.

  • Develop an Interrupt Scheduling Interface: To keep the time spent in an actual interrupt minimal, the goal was to create a system to schedule the main workload of a handler to be run outside of the interrupt context.

  • Implement Pre-Panic Cleanup: For security and stability, create an interface to clean general-purpose registers before a system panic or halt.

  • Implement Stack Saving on Panic: To aid in debugging, the system needed to save the stack state before a panic, allowing for post-mortem analysis of what caused the error.

  • Implement Keyboard Handling: With the core interrupt system in place, the final goal was to use it to handle input from the keyboard.

Technical Approach & Implementation

My approach was as straightforward as it could be. I began by initializing the IDT, creating the necessary exception handlers, and testing them. Programming the IDT is quite abstract since much of the logic is already baked into the CPU, leaving little room for creative implementation.

I implemented the IDT as follows:

// idt.c

typedef struct {
    interrupt_handler_e type;
    union {
        // Some functions receive an error_code to help identify the error.
        interrupt_handler regular;
        interrupt_handler_with_error with_error_code;
    } handler;
} interrupt_handler_entry_t;

const interrupt_handler_entry_t INTERRUPT_HANDLERS[17] = {
    {REGULAR, .handler.regular = divide_by_zero_handler},
    ... // The other functions
};

interrupt_descriptor_t idt_entries[IDT_ENTRY_COUNT];
descriptor_pointer_t idt_ptr;

void idt_set_gate(uint32_t num, uint32_t handler) {
    ...
}

void idt_init(void) {
    for (int32_t i = 0; i < 17; i += 1) {
        uint32_t handler = 0;

        if (INTERRUPT_HANDLERS[i].type == REGULAR) {
            handler = (uint32_t)INTERRUPT_HANDLERS[i].handler.regular;
        } else if (INTERRUPT_HANDLERS[i].type == WITH_ERROR_CODE) {
            handler = (uint32_t)INTERRUPT_HANDLERS[i].handler.with_error_code;
        }

        // Pass the index & address of the function to set it correctly in the IDT.
        // Internally, each gate uses 0x08 (the kernel code segment selector)
        // and 0x8E (the flags for a 32-bit interrupt gate).
        idt_set_gate(i, handler);
    }

    // The limit is the size of the table in bytes, minus one.
    idt_ptr.limit = sizeof(interrupt_descriptor_t) * IDT_ENTRY_COUNT - 1;
    idt_ptr.base = (uint32_t)&idt_entries;

    // Load the IDT using the lidt assembly instruction.
    __asm__ __volatile__("lidt %0" : : "m"(idt_ptr));
}

// exceptions.c

__attribute__((target("general-regs-only"), interrupt)) void
divide_by_zero_handler(registers_t *regs) {
    // The 'cs' register is on the stack as part of the interrupt frame.
    // We can inspect it to see if the fault was in kernel-mode (CPL=0) or user-mode (CPL=3).
    // This requires a more complex reading of the interrupt frame.
    // For now, we assume a kernel fault is a panic.
    if ((regs->cs & 3) == 0) {
        panic(regs, "Division by Zero");
    }

    __asm__ volatile("cli; hlt");
}

The core of the logic revolves around two key variables: idt_entries and idt_ptr. The idt_entries array is the table itself, which will hold all 256 vectors. The idt_ptr is the structure we pass to the CPU, containing the base address and limit (size) of the table, so the processor knows exactly where to find it.

In the idt_init() function, we loop through our predefined exception handlers. While you could hardcode each idt_set_gate() call, a loop makes the code cleaner. This loop retrieves the memory address for each handler function and calls idt_set_gate() to correctly populate the entry in the idt_entries table.

The final step is the lidt assembly instruction. This tells the CPU to load our idt_ptr, making our new Interrupt Descriptor Table active. From this point on, the CPU will use our table to find the correct handler for any interrupt or exception that occurs.
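For completeness, here is a possible body for the elided idt_set_gate(), assuming the standard i386 gate descriptor layout with the 0x08 kernel code selector and the 0x8E interrupt-gate flags. This is a sketch, not the project's exact code.

```c
#include <stdint.h>

#define IDT_ENTRY_COUNT 256
#define KERNEL_CODE_SELECTOR 0x08 // kernel code segment in the GDT
#define IDT_INTERRUPT_GATE   0x8E // present, DPL 0, 32-bit interrupt gate

/* One 8-byte IDT entry; the handler's address is split low/high. */
typedef struct {
    uint16_t offset_low;  // handler address bits 0-15
    uint16_t selector;    // code segment selector in the GDT
    uint8_t  zero;        // reserved, always 0
    uint8_t  flags;       // gate type and attributes
    uint16_t offset_high; // handler address bits 16-31
} __attribute__((packed)) interrupt_descriptor_t;

interrupt_descriptor_t idt_entries[IDT_ENTRY_COUNT];

void idt_set_gate(uint32_t num, uint32_t handler) {
    idt_entries[num].offset_low  = handler & 0xFFFF;
    idt_entries[num].selector    = KERNEL_CODE_SELECTOR;
    idt_entries[num].zero        = 0;
    idt_entries[num].flags       = IDT_INTERRUPT_GATE;
    idt_entries[num].offset_high = (handler >> 16) & 0xFFFF;
}
```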

When an interrupt happens, the CPU needs to stop its current work and jump to the handler, but it must be able to return and resume its work later. The __attribute__((interrupt)) tells the compiler to automatically add the necessary assembly code to save the machine's state before your C code runs and restore it after. This is why interrupts should be as fast as possible; while a handler is running, the rest of the system is paused. For frequent events like keyboard presses, a common strategy is for the handler to do the bare minimum—like adding a key press to a queue—and letting a separate, lower-priority task scheduler process it later.

Once the IDT was set, I added a task scheduler. Inside an interrupt handler, I would add a task to the task_scheduler, which would add it to a queue. The main kernel loop then calls run_scheduled_tasks() to trigger the actual work of the interrupt. This is a great way to avoid staying too long in the interrupt itself. The shorter the interrupt, the faster and more responsive your kernel will be.
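A minimal sketch of such a scheduler, assuming a fixed-size queue of function pointers. The names schedule_task and run_scheduled_tasks come from the text; the ring-buffer implementation and the tick example task are my illustration.

```c
#include <stddef.h>

#define MAX_TASKS 32

typedef void (*task_fn)(void);

/* A fixed-size ring buffer of pending tasks. An interrupt handler enqueues;
 * the main kernel loop drains. A real kernel must guard these operations
 * against concurrent interrupt access, e.g. by briefly disabling interrupts. */
static task_fn task_queue[MAX_TASKS];
static size_t head = 0, tail = 0;

int schedule_task(task_fn fn) {
    size_t next = (tail + 1) % MAX_TASKS;
    if (next == head)
        return -1; // queue full: drop the task
    task_queue[tail] = fn;
    tail = next;
    return 0;
}

void run_scheduled_tasks(void) {
    while (head != tail) {
        task_fn fn = task_queue[head];
        head = (head + 1) % MAX_TASKS;
        fn(); // the heavy lifting happens here, outside interrupt context
    }
}

/* An example task, standing in for e.g. keyboard_input. */
static int ticks = 0;
static void tick(void) { ticks++; }
```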

The panic() function is designed to terminate everything gracefully when a fatal, unrecoverable error occurs. When panicking, it's important to not only print an error message but also to dump the register state for debugging and then clean them for security. Keep the printing to a minimum, since you cannot rely on the stability of any system services at this point.

After that was all set, I was able to set up the keyboard. This required communicating with the 8259 PIC (Programmable Interrupt Controller), which manages hardware interrupts. The keyboard sends a signal to the PIC, which then interrupts the CPU. I made use of my task scheduler to queue up keyboard presses to spend as little time as possible in the interrupt. It looks something like this:

// exception.c

__attribute__((target("general-regs-only"), interrupt)) void
keyboard_handler(registers_t *regs) {
    // Read the scancode from the keyboard's data port.
    int32_t scancode = inb(KEYBOARD_DATA_PORT);
    setscancode(scancode); // Store the scancode for processing.

    // Schedule the main keyboard logic to run outside the interrupt.
    schedule_task(keyboard_input);

    // Send End-of-Interrupt (EOI) signal to the PIC.
    pic_send_eoi(1);
}

// Kernel.c

while (true) {
    // In the main kernel loop, execute any tasks that have been scheduled.
    run_scheduled_tasks();

    // Halt the CPU until the next interrupt occurs to save power.
    __asm__ volatile("hlt");
}

To break it down, the keyboard_handler() first reads the scancode from the keyboard. It then schedules the real processing task and immediately sends an End-of-Interrupt (EOI) signal to the PIC, telling it we're done. Meanwhile, the main kernel while-loop continuously runs any scheduled tasks, ensuring that the heavy lifting happens outside of the critical interrupt context.

Challenges

Implementing interrupts involved coordinating numerous individual components: the GDT (Global Descriptor Table), IDT, PIC (Programmable Interrupt Controller), and the interrupts themselves. While resources like OSDev provided great checklists for setup, piecing together all the separate elements proved challenging. It helped a lot that I had already set up the GDT; it made the IDT setup much easier.

One unexpected hiccup was finding a source of truth for the behavior of each interrupt and determining which specific i386 interrupts were essential for our kernel. While an LLM offered some assistance, it couldn't match the detail and accuracy found in a specific MIT article.

We also encountered a minor hiccup with our LSP (Language Server Protocol), clangd. It reported an error with our interrupt logic, despite GCC, our compiler, successfully compiling and running the code without issues. The solution was to ignore the LSP warning and ensure gcc used the __attribute__((target("general-regs-only"), interrupt)) attributes. The general-regs-only attribute is a promise to the compiler that only general-purpose registers will be used, which can prevent certain headaches, though it doesn't eliminate all potential issues.

Conclusion & Lessons Learned

In the end, this assignment was very insightful. It is so cool to delve into components like the Real-Time Clock (RTC), which can provide system time, or to see how PS/2 keyboards use interrupts to communicate. Interrupts are truly an ingenious and fundamental part of any operating system. This deep dive has definitely made me interested in exploring system calls, which also heavily rely on interrupts.

To-Do List

There is lots to do, but I will try and keep track of the things that are needed to keep the kernel functional.

  • Implement a Slab Allocator (or similar object cache).
  • Ensure buddy allocator is functional.
  • Figure out On-Demand Paging for the User Space.
  • Create a User Space that can be allocated on Syscalls.
  • Write documentation on the logic of the Buddy Allocator.
  • Explain why the buddy allocator should manage physical addresses directly.

Architecture overview

This chapter covers the core design decisions for my kernel. I will walk you through the architectural choices I made and why I made them.

We will focus on a surface-level structure and strategy. If you want to dive into the code and see how it all works, you will find links to the implementation details throughout the pages.

Memory Allocation

⚠️ Heads-up! This section is a work-in-progress. I am making a lot of changes right now, but this should still give you a great overview of the general design.

Memory allocation is one of my favorite and most difficult topics. My current approach is heavily inspired by the Linux kernel.

Here is the boot-up sequence for getting the memory allocators running, in chronological order:

  1. Set up a temporary page directory.
  2. Place the kernel in the higher-half of memory.
  3. Parse the Multiboot header to find available memory regions.
  4. Initialize memblock, our simple bootstrap allocator.
  5. Initialize the buddy_allocator as the core page allocator.
  6. Create the final, permanent page directory.
  7. Set up kmalloc() to provide a clean interface for kernel allocations.
  8. Allocate!

Let's walk through these steps.


The Higher Half and Initial Paging

First things first, we need to get the kernel into a known, protected location. I place it in the higher-half of the virtual address space, starting at 0xC0000000. This is a common technique that separates the kernel's address space from the user's, preventing conflicts. Everything below this address is reserved for user-space programs.

To make this work, I create a temporary page directory that handles the initial virtual-to-physical address translation with a simple offset. These macros help us out:

#define KERNEL_BASE 0xC0000000
#define VIRT_TO_PHYS(addr) ((addr) - KERNEL_BASE)
#define PHYS_TO_VIRT(addr) ((addr) + KERNEL_BASE)

With this initial mapping in place, the bootloader can jump to our kmain() function, which now operates from the higher-half.

memblock: The Bootstrap Allocator

Before we can set up the buddy allocator, we need a way to... well, allocate memory for the allocator!

This is where memblock comes in. After parsing the Multiboot header to see how much RAM we have and which parts are usable, we initialize memblock. It is a very simple bump allocator:

  • It keeps a pointer to the start of a free memory region.
  • When you ask for memory, it returns the current pointer and "bumps" it forward by the amount you requested.

The major limitation is that there is no free()! Once memory is allocated with memblock, it is allocated for good. It is a tool for the earliest stages of boot-up, used only until the buddy allocator is ready.
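The whole of memblock can be sketched in a few lines. Here a static array stands in for the free region found via the Multiboot memory map, and memblock_alloc is an illustrative name rather than the project's exact API.

```c
#include <stdint.h>
#include <stddef.h>

#define MEMBLOCK_ALIGN 8

/* A bump allocator in miniature: a cursor that only ever moves forward. */
static uint8_t memblock_pool[4096];
static uintptr_t cursor = 0; // offset into the pool
static const uintptr_t pool_size = sizeof(memblock_pool);

void *memblock_alloc(size_t size) {
    // Round the cursor up to the alignment boundary.
    uintptr_t aligned = (cursor + MEMBLOCK_ALIGN - 1) & ~(uintptr_t)(MEMBLOCK_ALIGN - 1);
    if (aligned + size > pool_size)
        return NULL;         // out of early-boot memory
    cursor = aligned + size; // bump the cursor; there is no free()
    return memblock_pool + aligned;
}
```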

buddy_allocator: The Core Page Allocator

Now for the main event. The buddy allocator is the core of physical memory management. It is designed to handle requests for blocks of memory that are a power-of-two number of pages (e.g., 1 page, 2 pages, 4 pages, 16 pages).

Its main advantage is that it is very efficient at splitting large blocks and, more importantly, merging freed blocks back together to reduce external fragmentation.

The main drawback is internal fragmentation. If you need 5KB of memory, the buddy allocator will give you an 8KB block (the next power of two after 4KB), wasting 3KB. This is a problem a slab_allocator can solve later on. The buddy allocator's only job is to manage and hand out raw physical pages.
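Mapping a byte request onto an order is a small calculation: round up to whole pages, then find the smallest power of two that covers them. This helper is an illustrative sketch; the 5KB request from above lands at order 1, an 8KB block.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Smallest order whose block (2^order pages) fits `bytes`. */
static uint8_t size_to_order(size_t bytes) {
    size_t pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE; // ceil to whole pages
    uint8_t order = 0;
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```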

To keep track of all these blocks, the allocator is managed by a central struct that holds its entire state:

typedef struct {
  uintptr_t base;                         // Start address of the memory region we manage
  size_t size;                            // Total size of this region
  struct buddy_node *free_lists[MAX_ORDER + 1]; // Array of linked lists for free blocks
  uint8_t *map;                           // Bitmap to track the state of blocks
  size_t map_size;                        // The size of our bitmap
  uint8_t max_order;                      // The largest allocation size (2^max_order)
} buddy_t;

In simple terms, the free_lists array is the most important part. Each slot (or "order") points to a linked list of all available blocks of a specific size. For example, free_lists[0] might be for 4KB blocks, free_lists[1] for 8KB blocks, and so on. The map is a simple bitmap that helps the allocator quickly check if a block's "buddy" is also free.

Putting It All Together: kmalloc and Final Paging

With our buddy allocator ready, we can finally create the permanent page directory. Now that we know the exact memory layout, we can build a more robust mapping of our kernel's memory.

On older i386 processors that don't support the invlpg instruction, we have to flush the entire Translation Lookaside Buffer (TLB) every time we map a new page, which is quite inefficient!

Finally, we set up kmalloc(). This is the function that most of the kernel will actually use. It acts as a friendly frontend for the memory subsystem:

  1. A kernel component calls kmalloc() requesting a certain number of bytes.
  2. kmalloc() figures out how many pages it needs and asks the buddy allocator for them.
  3. The buddy allocator returns a physical address.
  4. kmalloc() finds a free virtual address and maps it to the physical address given by the buddy.
  5. It adds a small header to the allocation to keep track of its size and a magic number for debugging.
  6. It returns the new virtual address to the caller.

The header looks like this. The magic number helps detect memory corruption if it gets overwritten.

typedef struct block_header {
  size_t size;      // How big is this block?
  uint32_t magic;   // Is this a valid block?
} block_header_t;

And that's it! The kernel is now free to allocate memory.
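As an illustration of steps 5 and 6, here is a toy version of the header stamping and the magic check. kmalloc_stamp and kfree_check are hypothetical names, and a static page stands in for memory that would really come from the buddy allocator.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define KMALLOC_MAGIC 0xDEADBEEF

typedef struct block_header {
  size_t size;    // How big is this block?
  uint32_t magic; // Is this a valid block?
} block_header_t;

static uint8_t page[4096]; // stand-in for a page from the buddy allocator

/* Stamp a header in front of the payload and return the payload address. */
void *kmalloc_stamp(size_t size) {
    block_header_t *hdr = (block_header_t *)page;
    hdr->size = size;
    hdr->magic = KMALLOC_MAGIC;
    return page + sizeof(block_header_t); // the caller only sees the payload
}

/* On free, verify the magic: a scribbled-over header means corruption. */
int kfree_check(void *ptr) {
    block_header_t *hdr = (block_header_t *)((uint8_t *)ptr - sizeof(block_header_t));
    return hdr->magic == KMALLOC_MAGIC;
}
```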


WIP

  • Slab Allocator
  • User-space Allocations
  • Demand Paging

Sources

The Buddy Allocator

Explaining memory management can be tricky, so I've created this document to break down the algorithm: the Buddy Allocator.

What is the Buddy Allocator?

The Buddy Allocator is a memory management algorithm that takes charge of a large, contiguous region of physical memory. It acts like a librarian for your computer's RAM, handing out blocks of memory when requested and tidying them up when they're returned. As a core kernel component, it works directly with physical addresses, the actual hardware addresses of your RAM.

Why "Buddy"?

The algorithm gets its name from how it manages memory. It works on a simple principle: to get a small block of memory, you take a large block and split it in half. These two halves are now "buddies." This splitting process continues until a block of the perfect size is created.

When a block is freed, the allocator checks to see if its buddy is also free. If it is, they are instantly merged back into their original, larger block. This buddy system is a clever way to combat memory fragmentation, ensuring that large, contiguous blocks of memory are available when needed.

The Struct

Our Buddy Allocator has the following two structs:

#define PAGE_SIZE 4096 // The size of a single page, the smallest allocation unit.
#define MAX_ORDER 32   // The theoretical maximum order. The actual max is calculated at runtime.
#define MIN_ORDER 0    // The smallest order, representing a single page (2^0 pages).

// A node in a free list, stored within the free block itself.
typedef struct buddy_node {
  struct buddy_node *next;
} buddy_node_t;

typedef struct {
  uintptr_t base; // Starting physical address of the managed memory.
  size_t size;    // Total size of the managed memory in bytes.
  buddy_node_t *free_lists[MAX_ORDER + 1]; // Array of free lists, indexed by block order.
  uint8_t *map;      // Bitmap to track block states for efficient merging.
  size_t map_size;   // Size of the state bitmap in bytes.
  uint8_t max_order; // Actual max order, calculated from the total memory size.
} buddy_allocator_t;

My kernel defines one global buddy struct that manages all of the available physical memory.

The Math

The buddy allocator doesn't work with page counts directly. Instead, it uses a concept called order to manage block sizes. The order is simply the exponent in a power-of-two calculation.

The number of pages in a block is calculated as \(2^{\text{order}}\).

  • Order 0: \(2^0 = 1\) page
  • Order 1: \(2^1 = 2\) pages
  • Order 2: \(2^2 = 4\) pages
  • Order 3: \(2^3 = 8\) pages

To find the maximum order for a given amount of memory, we use the inverse of this formula: the base-2 logarithm.

$$\text{max_order} = \lfloor \log_2(\text{total_pages}) \rfloor$$

An example

Let's say we have 4 MiB of available memory and a page size of 4 KiB (4096 bytes).

  1. Calculate Total Pages: $$\text{total_pages} = \frac{4 \text{ MiB}}{4 \text{ KiB}} = \frac{4 \times 1024 \times 1024}{4096} = \frac{4,194,304}{4096} = 1024 \text{ pages}$$

  2. Calculate the Order: $$\text{max_order} = \lfloor \log_2(1024) \rfloor = 10$$

This tells us that our 4 MiB memory region can be managed as a single, top-level block of order 10.
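In code, the floor of the base-2 logarithm needs no libm; counting halvings is enough. A sketch (the function name is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* floor(log2(total_pages)): count how many times the page count can be
 * halved before reaching 1. */
static uint8_t compute_max_order(size_t total_pages) {
    uint8_t order = 0;
    while (total_pages > 1) {
        total_pages >>= 1;
        order++;
    }
    return order;
}
```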

On initialisation

Each variable in the struct needs to be initialised. The memory layout will be as follows:

Low Address
------------------------------------
| buddy_allocator_t struct         |
|----------------------------------|
| Bitmap                           |  <-- Metadata
|----------------------------------|
| (Optional padding for alignment) |
|----------------------------------|
| Managed Memory Region            |  <-- buddy_allocator_t->base points here
| ...                              |
| ...                              |
------------------------------------
High Address

As you can see, the struct and bitmap of the Buddy Allocator are located before the usable memory region that the buddy allocator will use. The start of this memory region is represented by base.

The Bitmap

At the heart of my buddy allocator is a simple but powerful data structure: the bitmap. While the free_lists tell us if a block of a certain size is available, the bitmap is important for managing the memory space efficiently, especially when freeing memory.

The bitmap's primary job is to track the state of memory blocks to make merging buddies incredibly fast. Every bit in the map corresponds to a single page of physical memory.

  • A 0 means the page is part of a larger, free block.
  • A 1 means the page is part of a block that is currently allocated or has been split into smaller chunks.
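Manipulating those per-page bits only takes a few one-liners (illustrative helpers, not necessarily the project's exact ones):

```c
#include <stdint.h>
#include <stddef.h>

/* One bit per page: set on allocate/split, clear on free, test when merging. */
static void bitmap_set(uint8_t *map, size_t page) {
    map[page / 8] |= (uint8_t)(1u << (page % 8));
}

static void bitmap_clear(uint8_t *map, size_t page) {
    map[page / 8] &= (uint8_t)~(1u << (page % 8));
}

static int bitmap_test(const uint8_t *map, size_t page) {
    return (map[page / 8] >> (page % 8)) & 1;
}
```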

Let's see it in action. Imagine we have 16 pages of memory. When nothing is allocated, our bitmap is all zeroes.

Initial State (All Free):

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Now, let's say a program requests a 4-page block (order 2). We allocate pages 4, 5, 6, and 7. The bitmap is updated to mark these pages as "in-use".

After Allocating 4 Pages:

Bitmap: [0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0]
Pages:   0 1 2 3 4 5 6 7 8 9 ...
                 ^-----^
                 Allocated

Merging

So, what happens when we free this 4-page block? To keep our memory from becoming fragmented, we want to merge the newly freed block with its "buddy" if the buddy is also free.

The "buddy" is the adjacent block of the same size. For our block at page 4 (which is of order 2), its buddy is the block starting at page 0.

To check if we can merge, the allocator performs these steps:

  1. Identify the Buddy: It calculates the buddy's starting address.
  2. Check the Bitmap: It looks at the bitmap to check the buddy's status. It only needs to check the first bit of the buddy block (the bit for page 0).
  3. Make a Decision:
    • If the buddy's bit is 1, the buddy is still in use. We can't merge. The freed block is simply added to the free_list.
    • If the buddy's bit is 0, the buddy is free. We can merge them into a single, larger 8-page block.

In our case, the bit for page 0 is 0, so we can merge. The allocator combines the block from pages 0-3 and our newly freed block from pages 4-7 into one large, 8-page block.

After Deallocation and Merging:

Bitmap: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

The bitmap allowed us to instantly know the state of the buddy without searching through complicated data structures. This process is repeated up the chain, allowing small blocks to merge back into the largest possible free chunks, keeping the memory ready for future allocations.
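
Step 1, identifying the buddy, is cheap because two buddies of the same order differ in exactly one address bit: the bit equal to the block's own size. XOR-ing that bit into the block's offset from base yields the buddy's address. A sketch (the function name and signature are my own, not the kernel's):

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

// Buddies of order `order` differ only in the bit equal to the block's
// size, so XOR-ing that bit into the offset from `base` finds the buddy.
static uintptr_t buddy_of(uintptr_t base, uintptr_t addr, uint32_t order) {
    uintptr_t block_size = (uintptr_t)PAGE_SIZE << order;
    return base + ((addr - base) ^ block_size);
}
```

For the example above (base at page 0, freed block at page 4, order 2), buddy_of returns the address of page 0, whose bitmap bit we then check.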

Programming the bitmap:

Programming the bitmap is straightforward. We get the start and end addresses of the available physical memory and calculate how many pages fit within that range. If 16 pages fit in the available physical memory, we need a bitmap of 16 bits (2 bytes).

  uintptr_t start_addr = (uintptr_t)get_next_free_addr();
  uintptr_t end_addr = (uintptr_t)get_heap_end_addr();

  size_t num_pages = (end_addr - start_addr) / PAGE_SIZE;
  size_t map_size_needed = CEIL_DIV(num_pages, 8);
  g_buddy.map_size = ALIGN(map_size_needed, sizeof(size_t));

  g_buddy.map = (uint8_t *)memblock(g_buddy.map_size);
  // Check if map exists
  memset(g_buddy.map, 0, g_buddy.map_size);
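
The snippet relies on two helper macros that aren't shown. Plausible definitions (my assumption, not necessarily the kernel's actual headers) would be:

```c
// Divide n by d, rounding up — e.g. how many bytes hold `n` bits.
#define CEIL_DIV(n, d) (((n) + (d) - 1) / (d))

// Round `n` up to a multiple of `align` (align must be a power of two).
#define ALIGN(n, align) (((n) + (align) - 1) & ~((align) - 1))
```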

The base & size

The base & size are the next fields we need to figure out. With the help of the bitmap we can find the starting address for base: it begins right after the bitmap. Ensure that base is aligned to your PAGE_SIZE. Then take end_addr and calculate how big the heap for the Buddy Allocator will be.

  g_buddy.base = ALIGN((uintptr_t)g_buddy.map + g_buddy.map_size, PAGE_SIZE);
  g_buddy.size = end_addr - g_buddy.base;
  // Check if size is smaller than PAGE_SIZE (MIN_ORDER).

With the bitmap, base, and size established, we have enough information to set up the rest of the buddy allocator.

The max order

The default max order on my kernel is 32 (4 294 967 296 bytes), which is impossible to reach, since we are working on an x86_32 machine.

We calculate the max_order with our current amount of memory with the following formula: $$O_{\text{max}} = \lfloor \log_2\left(\frac{S_{\text{total}}}{S_{\text{page}}}\right) \rfloor$$

If a requested allocation exceeds the max_order, it can be safely rejected.
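
One possible implementation of this formula counts how often the page count can be halved. The name floor_log2 matches the helper used later in the carving loop; compute_max_order is my own name for the wrapper:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

// floor(log2(n)) for n > 0, by counting how often n can be halved.
static uint32_t floor_log2(size_t n) {
    uint32_t order = 0;
    while (n >>= 1)
        order++;
    return order;
}

// O_max = floor(log2(S_total / S_page))
static uint32_t compute_max_order(size_t total_size) {
    return floor_log2(total_size / PAGE_SIZE);
}
```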

The free_list

The free_list is the most important part of the buddy allocator's allocation strategy. It's structured as an array where each index corresponds to a block order (size). Each element of the array is the head of a singly linked list containing all the free blocks of that specific size. Since the blocks are free, we cleverly use the memory of the block itself to store a pointer to the next free block, requiring no extra memory.
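
Embedding the next pointer inside the free block itself can be sketched like this. The type name buddy_node_t matches the carving code shown later; the push/pop helpers are my own illustration:

```c
#include <stddef.h>

// The first bytes of a free block double as its list node — no extra
// metadata is needed anywhere.
typedef struct buddy_node {
    struct buddy_node *next;
} buddy_node_t;

static void push_block(buddy_node_t **head, void *block) {
    buddy_node_t *node = (buddy_node_t *)block;
    node->next = *head;
    *head = node;
}

static void *pop_block(buddy_node_t **head) {
    buddy_node_t *node = *head;
    if (node)
        *head = node->next;
    return node;
}
```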

An example

Imagine our system has just started, and all the memory is in one giant, free block. If our maximum size is an order 9 block (\(2^9 = 512\) pages), our free_list looks like this:

Free List: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Order:      0  1  2  3  4  5  6  7  8  9

Now, a program asks for a tiny order 0 block (1 page). The allocator follows these steps:

  1. Check Order 0: It looks in free_list[0] and finds it empty.
  2. Find a Bigger Block: It searches up the orders until it finds the first non-empty list, which is at order 9.
  3. Split the Block: The order 9 block is too big, so the allocator splits it. This creates two "buddy" blocks of order 8. One is kept to continue the process, and the other is placed in the order 8 free list.
Free List: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
Order:      0  1  2  3  4  5  6  7  8  9
  4. Repeat: This process repeats. The order 8 block is split into two order 7s. One is kept, and the other is added to the free_list[7]. This continues all the way down until we finally create two order 0 blocks.

  5. Fulfill the Request: Once we have two order 0 blocks, the allocator gives one to the user and places the other on the free_list[0].

After fulfilling the request for one order 0 block, our free_list now has a free block available at every order except order 9.

Free List: [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
Order:      0  1  2  3  4  5  6  7  8  9

This splitting mechanism ensures that even if only large blocks are free, the allocator can efficiently generate a block of any required size.
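
The five steps above can be sketched as a search-then-split loop. This is a self-contained simulation under assumed names (node_t, a static free_lists array), not the kernel's exact code:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define MAX_ORDER 9

typedef struct node { struct node *next; } node_t;
static node_t *free_lists[MAX_ORDER + 1];

// Steps 1-5 from above: find the first big-enough block, then split it
// down to the requested order, parking each unused half on its free list.
static void *alloc_order(uint32_t order) {
    uint32_t o = order;
    while (o <= MAX_ORDER && free_lists[o] == NULL)
        o++; // steps 1-2: search upward for a non-empty list
    if (o > MAX_ORDER)
        return NULL; // no block large enough

    node_t *block = free_lists[o];
    free_lists[o] = block->next;

    while (o > order) { // steps 3-4: split until the size matches
        o--;
        node_t *buddy =
            (node_t *)((uintptr_t)block + ((uintptr_t)PAGE_SIZE << o));
        buddy->next = free_lists[o];
        free_lists[o] = buddy; // one half goes back on the list
    }
    return block; // step 5: hand the block to the caller
}
```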

Implementation

A key challenge in initializing the buddy allocator is that the total amount of available physical memory is rarely a perfect power of two. If we have 7 pages of memory, for example, we can't create a single 8-page block.

To handle this, the initialization code "carves" the available memory into the largest possible power-of-two blocks that will fit.

The Carving Loop

We run a loop that repeatedly carves off the largest possible block from the remaining space and adds it to the appropriate free list. Let's trace it with an example of 7 pages of available memory:

  1. First Iteration:

    • remaining is 7 pages (28 672 bytes)
    • The largest power-of-two block that fits is 4 pages (order 2).
    • We create a 4-page block and add it to free_list[2].
    • remaining is now \(7 - 4 = 3\) pages.
  2. Second Iteration:

    • remaining is 3 pages (12 288 bytes)
    • The largest power-of-two block that fits is 2 pages (order 1).
    • We create a 2-page block and add it to free_list[1].
    • remaining is now \(3 - 2 = 1\) page.
  3. Third Iteration:

    • remaining is 1 page (4096 bytes)
    • The largest power-of-two block that fits is 1 page (order 0).
    • We create a 1-page block and add it to free_list[0].
    • remaining is now \(1 - 1 = 0\) pages. The loop stops here.

After the loop, our free lists will contain one block of order 2, one of order 1, and one of order 0, perfectly accounting for our 7 pages of memory.

Here is the C code that implements this logic:

// First, ensure all free lists are empty.
for (int32_t i = 0; i <= MAX_ORDER; i++) {
    g_buddy.free_lists[i] = NULL;
}

size_t remaining = g_buddy.size;
uintptr_t current_addr = g_buddy.base;

// Loop until all available memory has been carved up.
while (remaining >= PAGE_SIZE) {
    // Find the largest order that fits in the remaining space.
    uint32_t order = floor_log2(remaining / PAGE_SIZE);

    // This check is a safeguard.
    if (order > g_buddy.max_order) {
        order = g_buddy.max_order;
    }

    // Add the new block to the head of the correct free list.
    buddy_node_t *block = (buddy_node_t *)current_addr;
    block->next = g_buddy.free_lists[order];
    g_buddy.free_lists[order] = block;

    // Subtract the carved block's size and advance our address.
    size_t block_size = PAGE_SIZE << order;
    current_addr += block_size;
    remaining -= block_size;
}

Conclusion

The Buddy Allocator, with its clever use of free_lists, a bitmap, and a power-of-two "order" system, stands as a simple yet powerful solution for managing physical memory. By treating adjacent memory blocks as "buddies" that can be split and merged, it efficiently provides memory of various sizes while effectively combating fragmentation.

Global Descriptor Table (GDT)

What is it

The GDT serves as a fundamental data structure in x86 architecture, playing a crucial role in memory management and protection. When our computer starts, it begins in real mode, a simple operating mode that provides direct access to memory and I/O devices. However, we need to switch to protected mode, which introduces memory protection, virtual memory, and privilege levels.

Think of protected mode as establishing different security clearance levels in a building. The GDT acts as the rulebook for this security system, defining the access rights and boundaries for each level. The x86 architecture provides four rings (0-3), where ring 0 is the most privileged (kernel space) and ring 3 is the least privileged (user space). Each ring has specific permissions and restrictions, all defined in our GDT.

The GDT is essential not just for security, but also for the basic operation of protected mode. Without a properly configured GDT, the CPU cannot execute protected mode code at all.

For more information go to OSDev

My Technical Approach

My approach was as follows: start in boot.asm and set up the multiboot header. Once everything is set up, boot.asm calls kmain(), which then calls gdt_init(). gdt_init() sets up the GDT's segment descriptors and builds the descriptor pointer struct that is passed to gdt_flush. gdt_flush loads the entries into the correct registers for the CPU to read.

Here are some snippets to give you a better idea:

extern void gdt_flush(uint32_t); // The Assembly function

entry_t gdt_entries[5] __attribute__((section(".gdt")));
descriptor_pointer_t gdt_ptr;


void gdt_set_gate(uint32_t num, uint32_t base, uint32_t limit, uint8_t access,
                  uint8_t gran) {
    ...
}

void gdt_init() {
  gdt_ptr.limit = sizeof(entry_t) * 5 - 1;
  gdt_ptr.base = (uint32_t)&gdt_entries;

  gdt_set_gate(0, 0, 0, 0, 0);                // NULL Gate
  gdt_set_gate(1, 0, 0xFFFFFFFF, 0x9A, 0xCF); // Kernel Code Segment
  gdt_set_gate(2, 0, 0xFFFFFFFF, 0x92, 0xCF); // Kernel Data Segment
  gdt_set_gate(3, 0, 0xFFFFFFFF, 0xFA, 0xCF); // User Code Segment
  gdt_set_gate(4, 0, 0xFFFFFFFF, 0xF2, 0xCF); // User Data Segment

  gdt_flush((uint32_t)&gdt_ptr);
}
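
The body of gdt_set_gate is elided above; what it has to do is slice base and limit into the scattered x86 descriptor layout. A sketch, assuming a packed entry_t struct (I pass the table as a parameter here to keep the example self-contained; the version above writes to the global gdt_entries):

```c
#include <stdint.h>

// The x86 segment descriptor scatters base and limit across the struct.
typedef struct {
    uint16_t limit_low;   // limit bits 0-15
    uint16_t base_low;    // base bits 0-15
    uint8_t  base_middle; // base bits 16-23
    uint8_t  access;      // present bit, DPL, segment type
    uint8_t  granularity; // flags (G, D/B) + limit bits 16-19
    uint8_t  base_high;   // base bits 24-31
} __attribute__((packed)) entry_t;

static void gdt_set_gate(entry_t *gdt, uint32_t num, uint32_t base,
                         uint32_t limit, uint8_t access, uint8_t gran) {
    gdt[num].base_low = base & 0xFFFF;
    gdt[num].base_middle = (base >> 16) & 0xFF;
    gdt[num].base_high = (base >> 24) & 0xFF;
    gdt[num].limit_low = limit & 0xFFFF;
    gdt[num].granularity = (uint8_t)(((limit >> 16) & 0x0F) | (gran & 0xF0));
    gdt[num].access = access;
}
```

The flat kernel code segment (base 0, limit 0xFFFFFFFF, access 0x9A, granularity 0xCF) packs to the well-known descriptor value 0x00CF9A000000FFFF.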

The assembly code in gdt_flush.asm that actually loads the GDT:

    gdt_flush:
        ; Get the address of the variable we passed to the function
        mov  eax, [esp + 4]
        lgdt [eax]

        ;   Set up segments
        mov eax, 0x10
        mov ds, ax
        mov es, ax
        mov fs, ax
        mov gs, ax
        mov ss, ax

        ;   Far jump to flush pipeline and load CS
        jmp 0x08:.flush

    .flush:
        ret

Considerations

I considered doing the full setup in assembly, but decided against it simply because C keeps the code more readable.

Linker

I force the linker to place the GDT before the rest of the code. You can work without this, but you will run into issues if you want to place the GDT at a specific address. If that is not necessary for you, you can skip this step. The specific placement of the GDT in memory can be important for some system designs, particularly when dealing with memory management and virtual memory setup. This was my approach:

  /* Start at 2MB */
  . = 2M;

  .gdt 0x800 : ALIGN(0x800)
  {
    *(.gdt)
  }

Memory Layout Considerations

When placing the GDT at a specific address, it's important to ensure that:

  1. The address is accessible during the transition to protected mode
  2. The address doesn't conflict with other important system structures
  3. The address is properly aligned for optimal performance

Interrupts and the IDT

The Interrupt Descriptor Table (IDT) is a data structure used by the x86 architecture to implement an interrupt vector table. The IDT is used by the processor to determine the correct response to interrupts and exceptions. ~ OSDev Wiki

The Interrupt Descriptor Table, or IDT, is a core component of an x86 operating system. Interrupts are a mechanism that allows hardware and software to get the CPU's attention without requiring it to constantly check their status. Instead of constantly polling, a device or a program can simply "interrupt" the CPU when it needs something. Good examples are a keyboard press or a timer indicating that a time slice has passed.

The IDT can hold up to 256 entries, called vectors. Each entry points to an Interrupt Service Routine (ISR), which is the code that runs when a specific interrupt occurs. These 256 interrupts can be broadly categorized into two types.

Exceptions (CPU-Generated Interrupts)

These are triggered directly by the CPU when it detects a problem during program execution. The first 32 vectors (0-31) are reserved for these exceptions. Not all 32 vectors are in use; only 17 were defined when the i386 came out.

A classic example is vector 0: Divide-by-Zero. If your code attempts to divide a number by zero, the CPU hardware itself will stop execution and trigger interrupt 0. If this fault happens while the kernel (ring 0) is executing, it's usually an unrecoverable error. The OS has no choice but to panic the system.

Hardware Interrupts (External Interrupts)

These are the interrupts generated by hardware devices external to the CPU. Your keyboard, mouse, and system timer (RTC) all use hardware interrupts to communicate.

On older i386 systems, these devices don't talk to the CPU directly. Instead, they are connected to a controller chip called the 8259 PIC (Programmable Interrupt Controller). The PIC manages the hardware requests (IRQs) and signals the CPU.

The Common CPU Exceptions

When you write code to handle interrupts, you're mostly concerned with the first 32 vectors, which are reserved for CPU exceptions. Some of these push an error code onto the stack to provide more information about the fault, while others do not. Here is a list of the most common exceptions you will need to handle.

Interrupt   Description                 Error Code
0           Divide error                No
1           Debug exceptions            No
2           Nonmaskable interrupt       No
3           Breakpoint                  No
4           Overflow                    No
5           Bounds check                No
6           Invalid opcode              No
7           Coprocessor not available   No
8           Double fault                Yes (always 0)
9           (reserved)                  -
10          Invalid TSS                 Yes
11          Segment not present         Yes
12          Stack exception             Yes
13          General protection fault    Yes
14          Page fault                  Yes
15          (reserved)                  -
16          Coprocessor error           No
17-31       (reserved)                  -

More information about each interrupt

Setup an IDT

Setting up the IDT is quite similar to setting up the GDT, with some minor differences. Each vector holds a function pointer to the function that should be called on a certain interrupt. Let's take a look at the Divide-by-Zero exception again:

// idt.c

typedef struct {
    interrupt_handler_e type;
    union {
        // Some functions receive an error_code to help identify the error.
        interrupt_handler regular;
        interrupt_handler_with_error with_error_code;
    } handler;
} interrupt_handler_entry_t;

const interrupt_handler_entry_t INTERRUPT_HANDLERS[17] = {
    {REGULAR, .handler.regular = divide_by_zero_handler},
    ... // The other functions
};

interrupt_descriptor_t idt_entries[IDT_ENTRY_COUNT];
descriptor_pointer_t idt_ptr;

void idt_set_gate(uint32_t num, uint32_t handler) {
    ...
}

void idt_init(void) {
    for (int32_t i = 0; i < 17; i += 1) {
        uint32_t handler = 0;

        if (INTERRUPT_HANDLERS[i].type == REGULAR) {
            handler = (uint32_t)INTERRUPT_HANDLERS[i].handler.regular;
        } else if (INTERRUPT_HANDLERS[i].type == WITH_ERROR_CODE) {
            handler = (uint32_t)INTERRUPT_HANDLERS[i].handler.with_error_code;
        }

        // Pass the index & address of the function to set it correctly in the IDT.
        idt_set_gate(i, handler);
    }

    // Describe the table to the CPU: its size (limit) and base address.
    // (Inside idt_set_gate, 0x08 is the kernel code segment selector and
    // 0x8E are the flags for a 32-bit interrupt gate.)
    idt_ptr.limit = sizeof(entry_t) * IDT_ENTRY_COUNT - 1;
    idt_ptr.base = (uint32_t)&idt_entries;

    // Load the IDT using the lidt assembly instruction.
    __asm__ __volatile__("lidt %0" : : "m"(idt_ptr));
}
// exceptions.c

__attribute__((target("general-regs-only"), interrupt)) void
divide_by_zero_handler(registers_t *regs) {
    // The 'cs' register is on the stack as part of the interrupt frame.
    // We can inspect it to see if the fault was in kernel-mode (CPL=0) or user-mode (CPL=3).
    // This requires a more complex reading of the interrupt frame.
    // For now, we assume a kernel fault is a panic.
    if ((regs->cs & 3) == 0) {
        panic(regs, "Division by Zero");
    }

    __asm__ volatile("cli; hlt");
}

The core of the logic revolves around two key variables: idt_entries and idt_ptr. The idt_entries array is the table itself, which will hold all 256 vectors. The idt_ptr is the structure we pass to the CPU, containing the base address and limit (size) of the table, so the processor knows exactly where to find it.

In the idt_init() function, we loop through our predefined exception handlers. While you could hardcode each idt_set_gate() call, a loop makes the code cleaner. This loop retrieves the memory address for each handler function and calls idt_set_gate() to correctly populate the entry in the idt_entries table.

The final step is the lidt assembly instruction. This tells the CPU to load our idt_ptr, making our new Interrupt Descriptor Table active. From this point on, the CPU will use our table to find the correct handler for any interrupt or exception that occurs.
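
idt_set_gate is elided above; it packs the handler address into the two offset halves of an interrupt gate descriptor, alongside the 0x08 selector and 0x8E flags mentioned in the code comments. A sketch under an assumed interrupt_descriptor_t layout (the table is a parameter here to keep the example self-contained):

```c
#include <stdint.h>

// One IDT entry: the handler address is split across the two offset fields.
typedef struct {
    uint16_t offset_low;  // handler address bits 0-15
    uint16_t selector;    // code segment to run the handler in
    uint8_t  zero;        // unused, must be 0
    uint8_t  type_attr;   // 0x8E = present, ring 0, 32-bit interrupt gate
    uint16_t offset_high; // handler address bits 16-31
} __attribute__((packed)) interrupt_descriptor_t;

static void idt_set_gate(interrupt_descriptor_t *idt, uint32_t num,
                         uint32_t handler) {
    idt[num].offset_low = handler & 0xFFFF;
    idt[num].selector = 0x08; // kernel code segment
    idt[num].zero = 0;
    idt[num].type_attr = 0x8E;
    idt[num].offset_high = (handler >> 16) & 0xFFFF;
}
```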

When an interrupt happens, the CPU needs to stop its current work and jump to the handler, but it must be able to return and resume its work later. The __attribute__((interrupt)) tells the compiler to automatically add the necessary assembly code to save the machine's state before your C code runs and restore it after. This is why interrupts should be as fast as possible; while a handler is running, the rest of the system is paused. For frequent events like keyboard presses, a common strategy is for the handler to do the bare minimum—like adding a key press to a queue—and letting a separate, lower-priority task scheduler process it later.

Here is a diagram of how the interrupt logic is handled: a flowchart going from the interrupt signal to the IDT, to the specific handler, and finally to a task scheduler.

System Calls via Software Interrupts

Have you ever wondered how system calls work? A system call is simply a software interrupt reserved for asking the kernel to perform a specific task. Software interrupts are the most common, and probably the most portable, way to implement system calls. Linux traditionally uses interrupt 0x80 for this purpose on x86.
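
What the 0x80 handler typically does is dispatch on a syscall number (passed in eax on x86 Linux) and write the result back for the caller. A purely hypothetical sketch of such a dispatcher; the struct, syscall numbers, and return values here are illustrative only:

```c
#include <stdint.h>

// Hypothetical register frame saved by the interrupt stub.
typedef struct {
    uint32_t eax; // syscall number in, return value out
    uint32_t ebx; // first argument
} syscall_regs_t;

enum { SYS_EXIT = 1, SYS_GETPID = 20 }; // Linux x86 numbers, for illustration

// Dispatcher called from the vector-0x80 stub: pick the kernel routine
// by number and write the result back into eax.
static void syscall_dispatch(syscall_regs_t *regs) {
    switch (regs->eax) {
    case SYS_GETPID:
        regs->eax = 42; // would return the current task's PID
        break;
    case SYS_EXIT:
        regs->eax = 0; // would terminate the task with code regs->ebx
        break;
    default:
        regs->eax = (uint32_t)-38; // -ENOSYS: unknown syscall number
        break;
    }
}
```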

Debugging

Debugging is essential for finding bugs and truly understanding what your program does. gdb is great for debugging kernels; it is an extremely powerful and useful tool.

Unlike regular programs where you can directly run gdb on the compiled binary, kernel debugging requires additional steps since we're running our code in a VM.

There are several debugging approaches:

  1. Create logfiles
  2. Print to the terminal (via QEMU)
  3. Use gdb

The first two methods are similar, with the main difference being the output destination - file vs terminal. Choose based on your preference.

GDB with QEMU

QEMU can be configured to wait for a GDB connection before executing any code, enabling debugging. Run make debug: a window will open and QEMU will wait for GDB to connect before starting execution.

Connect GDB from another terminal:

gdb
(gdb) target remote localhost:1234

Load debug symbols:

(gdb) symbol-file <kernel bin>

Example debugging session:

gdb
(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0x0000fff0 in ?? ()

(gdb) symbol-file kernel.b
Reading symbols from kernel.b...done.

(gdb) break kernel_main     # Add breakpoint to any kernel function
Breakpoint 1 at 0x101800: file kernel/kernel.c, line 12.

(gdb) continue # Qemu starts the kernel
Breakpoint 1, kernel_main (mdb=0x341e0, magic=0) at kernel/kernel.c:12

Reference

Building from Source

You'll need:

  • nix-shell for an isolated development environment
  • QEMU for testing
# Clone the repository
git clone https://github.com/xannyxs/ferrite-c
cd ferrite-c

# Initiate nix-shell
nix-shell shell.nix 

# Build the kernel
make

# Run in QEMU
make run

References

Official Documentation

OSDev and Community Resources

Learning Resources

Development Tools