Assembly language

An assembly language is a low-level programming language for computers, microprocessors, microcontrollers, and other integrated circuits. It implements a symbolic representation of the binary machine codes and other constants needed to program a given CPU architecture. This representation is usually defined by the hardware manufacturer, and is based on mnemonics that symbolize processing steps (instructions), processor registers, memory locations, and other language features. An assembly language is thus specific to a certain physical (or virtual) computer architecture. This is in contrast to most high-level programming languages, which, ideally, are portable.
A utility program called an assembler is used to translate assembly language statements into the target computer's machine code. The assembler performs a more or less isomorphic translation (a one-to-one mapping) from mnemonic statements into machine instructions and data. This is in contrast with high-level languages, in which a single statement generally results in many machine instructions.
Many sophisticated assemblers offer additional mechanisms to facilitate program development, control the assembly process, and aid debugging. In particular, most modern assemblers include a macro facility (described below), and are called macro assemblers.

Assembler

Typically a modern assembler creates object code by translating assembly instruction mnemonics into opcodes, and by resolving symbolic names for memory locations and other entities The use of symbolic references is a key feature of assemblers, saving tedious calculations and manual address updates after program modifications. Most assemblers also include macro facilities for performing textual substitution—e.g., to generate common short sequences of instructions as inline, instead of called subroutines.
Assemblers are generally simpler to write than compilers for high-level languages, and have been available since the 1950s. Modern assemblers, especially for RISC based architectures, such as MIPS, Sun SPARC, and HP PA-RISC, as well as x86(-64), optimize instruction scheduling to exploit the CPU pipeline efficiently.

Number of passes

There are two types of assemblers based on how many passes through the source are needed to produce the executable program.

One-pass assemblers go through the source code once and assume that all symbols will be defined before any instruction that references them.
Two-pass assemblers create a table with all symbols and their values in the first pass, then use the table in a second pass to generate code. The assembler must at least be able to determine the length of each instruction on the first pass so that the addresses of symbols can be calculated.

The advantage of a one-pass assembler is speed, which is not as important as it once was with advances in computer speed and abilities. The advantage of the two-pass assembler is that symbols can be defined anywhere in program source code. This lets programs be defined in more logical and meaningful ways, making two-pass assembler programs easier to read and maintain

0110000 01100001

This binary computer code can be made more human-readable by expressing it in hexadecimal as follows

B0 61

Here, B0 means 'Move a copy of the following value into AL', and 61 is a hexadecimal representation of the value 01100001, which is 97 in decimal. Intel assembly language provides the mnemonic MOV (an abbreviation of move) for instructions such as this, so the machine code above can be written as follows in assembly language, complete with an explanatory comment if required, after the semicolon. This is much easier to read and to remember.

MOV AL, 61h       ; Load AL with 97 decimal (61 hex)

At one time, many assembly language mnemonics were three letter abbreviations, such as JMP for jump, INC for increment, etc. Modern processors have a much larger instruction set and many mnemonics are now longer, for example FPATAN for "floating point partial arctangent" and BOUND for "check array index against bounds". Many assembly language statements consist of an opcode mnemonic followed by a comma-separated list of data, arguments or parameters
The same mnemonic MOV refers to a family of related opcodes to do with loading, copying and moving data, whether these are immediate values, values in registers, or memory locations pointed to by values in registers. The opcode 10110000 (B0) copies an 8-bit value into the AL register, while 10110001 (B1) moves it into CL and 10110010 (B2) does so into DL. Assembly language examples for these follow.

MOV AL, 1h        ; Load AL with immediate value 1
MOV CL, 2h        ; Load CL with immediate value 2
MOV DL, 3h        ; Load DL with immediate value 3

The syntax of MOV can also be more complex as the following examples show

MOV EAX, [EBX]   ; Move the 4 bytes in memory at the address contained in EBX into EAX
MOV [ESI+EAX], CL ; Move the contents of CL into the byte at address ESI+EAX

In each case, the MOV mnemonic is translated directly into an opcode in the ranges 88-8E, A0-A3, B0-B8, C6 or C7 by an assembler, and the programmer does not have to know or remember which.
Transforming assembly language into machine code is the job of an assembler, and the reverse can at least partially be achieved by a disassembler. Unlike high-level languages, there is usually a one-to-one correspondence between simple assembly statements and machine language instructions. However, in some cases, an assembler may provide pseudoinstructions (essentially macros) which expand into several machine language instructions to provide commonly needed functionality. For example, for a machine that lacks a "branch if greater or equal" instruction, an assembler may provide a pseudoinstruction that expands to the machine's "set if less than" and "branch if zero (on the result of the set instruction)". Most full-featured assemblers also provide a rich macro language (discussed below) which is used by vendors and programmers to generate more complex code and data sequences.
Each computer architecture and processor architecture usually has its own machine language. On this level, each instruction is simple enough to be executed using a relatively small number of electronic circuits. Computers differ by the number and type of operations they support. For example, a new 64-bit machine would have different circuitry from a 32-bit machine. They may also have different sizes and numbers of registers, and different representations of data types in storage. While most general-purpose computers are able to carry out essentially the same functionality, the ways they do so differ; the corresponding assembly languages reflect these differences.
Multiple sets of mnemonics or assembly-language syntax may exist for a single instruction set, typically instantiated in different assembler programs. In these cases, the most popular one is usually that supplied by the manufacturer and used in its documentation.

Article of Blogs

Tuesday, January 25, 2011