5 December 2024

Chapter 6

by Arpon Sarker

The Elements of Computing Systems Chapter 6 - Assembler

Introduction

This is the first chapter in the software hierarchy as the hardware platform has been completed. The main complication with writing an assembler which converts assembly to machine code is managing symbolic references to memory addresses which is done with a symbol table. The language I used to accomplish this is C++.

Symbol Resolution

Symbols are either variables which are assigned memory addresses holding some value (the address doesn’t matter, what matters is if they correspond with each other) and labels which mark various locations in the programs.

In general-purpose computers using a shared address space, the symbol table will start at some address we choose such as 1024 and the translated code will be at address 0. This allows for a maximum of 1024 lines in our machine code program (this doesn’t happen in the Hack computer since we have RAM and ROM). Then we build a symbol table and translate the program into its symbol-less version (i=1 becomes M[1024]=1). Lastly, the program can now be translated into machine code directly. However, this assumes each command and variable is only mapped to one word, which is naive.

Specification

The assembler parses each assembly command into its underlying fields, translates each field into its equivalent binary code, and assembles the generated codes into a binary instruction, and assembles the generated codes into a binary instructions which is a one-to-one mapping (apart from pseudo-instructions).

Parser Module

The main function of the parser module is to break each assembly command into its underlying components (fields and symbols) such as returning the type of command (A_COMMAND, C_COMMAND, L_COMMAND), symbol (if a or l command), dest, comp, and jump (if c command).

Code Module

The code module takes the dest, comp, and jump strings from the parser module and converts them into binary codes.

SymbolTable Module

The SymbolTable will contain all the pre-defined symbols on initialisation and then is filled afterwards with ad hoc symbols present in the program. The pseudo-command $(XXX)$ which doesn’t actually produce machine code defines the symbol $XXX$ to refer to the instruction memory location holding the next command after itself in the program. Variables which are not pre-defined or as $(XXX)$ command is mapped to consecutive memory locations.

The SymbolTable contains the capability of adding entries, checking if it is contained, and getting the address associated with a symbol.

Assembly programs are allowed to use symbolic labels (not talking about variables which need to be defined first) before they are defined. This is done by using a two-pass assembler where the first pass builds the symbol table and generates no code and in the second pass, dthe assembler replaces each symbol with its corresponding meaning and generates the final binary code.

Perspective

We can extend the assembler using “constant arithmetic” which allows programmers to explicitly associate symbols with particular data addresses. For example, doing x+5 to add 5 to the memory address x (address not value) which gets us the 5th location offset. We can also add macro commands to make programming much easier. The original assemblers like the original computers were actually people and the first assembler was likely created using bootstrapping.

tags: compsci - computer_architecture - programming

Hacker theme

Hacker is a theme for GitHub Pages.