Encoding Intel x86/IA-32 Assembler Instructions
April 28th, 2008 | Published in Assembler, Debug, History, Retro | 16 Comments
Although I planned to write about twice (perhaps once) a week, and so far have only 4 posts, I'm surprised by the number of readers this blog already has. Thanks a lot to everybody!
One of those readers, commenting on the post "Debugging hello, world", asked why the instruction jmp 114 translates into hexadecimal EB12. To answer this, we are going to turn to the "lovely" and venerable Intel Architecture Software Developer Manual (IASDM), Volume 2. This volume describes the instruction set of the Intel Architecture processor (x86/IA-32) and the opcode structure. I'll review some of the terms involved here:
- x86: It refers to the instruction set of the Intel-compatible CPU architectures (chips produced by Intel, AMD, VIA, and others) inaugurated by Intel's original 16-bit 8086 CPU. A decision which proved wise was to make each new instance of x86 processors almost fully backwards compatible.
- IA-32: It is Intel's 32-bit implementation of the x86 architecture; the name IA-32 distinguishes this implementation from the preceding 16-bit x86 processors. Note that when the 64-bit era arrived, Intel launched its Itanium processor, which discards compatibility with the IA-32 instruction set. That 64-bit architecture is referred to as IA-64, meaning "Intel Architecture, 64-bit", but even though the names are similar, IA-32 and IA-64 are very different architectures and instruction sets. AMD's response to Intel's 64-bit processors, however, uses an instruction set that is, in essence, composed of 64-bit extensions to IA-32, i.e., it's a superset of the x86 instruction set. That instruction set is referred to as AMD64 (initially, x86-64). Later, Intel cloned it under the name Intel 64. AMD's processors Athlon 64, Turion 64, Opteron, Sempron, etc., are based on AMD64.
- Opcode: An opcode (operation code) is the part of a machine language instruction (pure binary code) that specifies the operation to be performed. The other portion of the instruction is the operand, which is optional and represents the data to be operated on. In assembly language, mnemonics are used to represent the opcodes. Concretely, and according to the IASDM, a mnemonic is a reserved name for a class of instruction opcodes which have the same function. For example, in JMP 114, the mnemonic is JMP, and the operand is 114 (remember, 114 is hexadecimal, which is 276 in decimal).
Unlike in high-level languages, there is usually a one-to-one correspondence between basic assembly statements and the binary code of machine language instructions. Nevertheless, in some cases, an assembler may provide pseudo-instructions which expand into several machine language instructions to provide commonly needed functionality. Other pseudo-instructions produce no instruction at all, such as DB in
db 0d,0a,"hello, world!",0d,0a,"$"
which directly translates into the sequence of characters (in hexadecimal):
0D 0A 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0D 0A 24
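As a quick cross-check, the byte sequence above can be reproduced outside DEBUG. The following Python sketch is only an illustration: the helper name `db` is hypothetical (it is not part of any assembler), and it assumes plain ASCII text, mimicking how DB accepts a mix of numeric bytes and quoted strings.

```python
def db(*items):
    """Mimic DEBUG's DB: ints become single bytes, strings become ASCII bytes."""
    out = bytearray()
    for item in items:
        if isinstance(item, int):
            out.append(item)              # a numeric byte such as 0x0D
        else:
            out += item.encode("ascii")   # one byte per character
    return bytes(out)

data = db(0x0D, 0x0A, "hello, world!", 0x0D, 0x0A, "$")
print(" ".join(f"{b:02X}" for b in data))  # the hexadecimal dump
print(len(data))                           # number of bytes emitted
```

The length printed at the end is the number of bytes the jump at the start of the program has to skip.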
Therefore, the pseudo-instruction DB acts only as a data marker for the assembler. Now, for clarity, I'll repeat the code of "Debugging hello, world" here:
-a 100
CS:0100 jmp 114 ; Jump over the 18 bytes of the string
CS:0102 db 0d,0a,"hello, world!",0d,0a,"$"
CS:0114 mov ah,9 ; Print function
CS:0116 mov dx,102
CS:0119 int 21
CS:011B mov ah, 0 ; Terminate the program
CS:011D int 21
CS:011F
-g =100
Translating the second line is straightforward, as we have just seen. What about jmp 114? Well, we want to jump over the data (18 bytes, one byte per character in the string). The IASDM tells us (Appendix B) that the opcode for short unconditional jumps within the same segment is 11101011, which in hexadecimal is expressed as EB. We need to provide the operand to complete the instruction. The jmp itself occupies two bytes (addresses 100 and 101), so the next instruction would start at 102; to reach 114 we have to skip 114 - 102 = 12 (hexadecimal) bytes, that is, the 18 bytes of string data. That's why jmp 114 translates into EB12. Note that the operand for this jmp specifies an 8-bit displacement relative to the address of the following instruction, i.e., the operand is not an explicit address.
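The displacement arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not an assembler: the function name `encode_jmp_short` is made up for this example, and it only covers the two-byte short form (opcode EB followed by an 8-bit displacement).

```python
def encode_jmp_short(instr_addr, target_addr):
    """Encode a short JMP (opcode EB) located at instr_addr that jumps to target_addr."""
    next_addr = instr_addr + 2           # EB plus one displacement byte
    disp = target_addr - next_addr       # displacement is relative to the next instruction
    if not -128 <= disp <= 127:
        raise ValueError("target out of range for a short jump")
    return bytes([0xEB, disp & 0xFF])    # two's complement for backward jumps

code = encode_jmp_short(0x100, 0x114)
print(code.hex().upper())  # EB12
```

A backward jump works the same way: a jump from 100 to 100 gives a displacement of -2, stored as FE.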
Translation of the other instructions is straightforward, and again we only have to follow the IASDM. Let's analyze the encoding of mov ah,9 anyway. In this case we have an immediate operand (a constant, 9). For moving an immediate operand to a register, the encoding takes this form:
1011 w reg : immediate data
There, w is the operand-size bit. It specifies whether the data is byte-sized or full-sized (where full-sized is either 16 or 32 bits). As we are using an 8-bit operand, we set the bit to 0. For its part, reg is a 3-bit sequence identifying the destination register. Table B-3 of the IASDM dictates that if w = 0, then register AH is encoded as binary 100. Thus, the encoding of mov ah,9 is
10110100 00001001
which in hexadecimal is expressed as B409. The next instruction, mov dx,102, follows a similar approach:
1011 1 010 0000 0001 0000 0010
In this case, however, w is set to 1, as DX is a 16-bit register and the operand 102 requires more than one byte of storage. The 3-bit sequence for DX is 010. Needless to say, 0000 0001 0000 0010 is the binary representation of the hexadecimal value 102 (16 bits are required). Expressed in hexadecimal, we would have BA0102. However, the bytes of the operand have to be stored in reverse order (least significant byte first), and therefore the correct encoding for the instruction is BA0201.
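The two mov encodings can be reproduced with a small Python sketch of the form 1011 w reg : immediate data. The names `REG8`, `REG16` and `encode_mov_imm` are hypothetical helpers for illustration; the register codes beyond AH and DX come from the standard x86 register numbering rather than from the text above, and 16-bit mode is assumed, so "full-sized" means two bytes.

```python
# 3-bit register codes: REG8 applies when w = 0, REG16 when w = 1 (16-bit mode assumed).
REG8 = {"al": 0b000, "cl": 0b001, "dl": 0b010, "bl": 0b011,
        "ah": 0b100, "ch": 0b101, "dh": 0b110, "bh": 0b111}
REG16 = {"ax": 0b000, "cx": 0b001, "dx": 0b010, "bx": 0b011,
         "sp": 0b100, "bp": 0b101, "si": 0b110, "di": 0b111}

def encode_mov_imm(reg, value):
    """Encode MOV reg, imm using the form 1011 w reg : immediate data."""
    reg = reg.lower()
    if reg in REG8:
        w, code, size = 0, REG8[reg], 1
    elif reg in REG16:
        w, code, size = 1, REG16[reg], 2
    else:
        raise ValueError(f"unknown register: {reg}")
    opcode = (0b1011 << 4) | (w << 3) | code
    # The immediate is stored least significant byte first (little-endian).
    return bytes([opcode]) + value.to_bytes(size, "little")

print(encode_mov_imm("ah", 9).hex().upper())      # B409
print(encode_mov_imm("dx", 0x102).hex().upper())  # BA0201
```

Passing a value that does not fit the register size makes `to_bytes` raise an `OverflowError`, which is a reasonable stand-in for an assembler error in a sketch like this.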
Next, INT n (interrupt type n) is encoded as 1100 1101 : type. Therefore, int 21 is encoded as 1100 1101 0010 0001 (CD21 in hexadecimal). And the encoding of mov ah, 0 as B400 follows directly from our previous explanations. Finally, we can translate our little "hello, world!" into binary code directly:
-e 100 EB 12 0D 0A 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64
-e 110 21 0D 0A 24 B4 09 BA 02 01 CD 21 B4 00 CD 21 0D
-g =100
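To double-check the listing, the encoded pieces can be concatenated and formatted as DEBUG `e` commands. The Python sketch below is only a verification aid; `debug_e_lines` is a hypothetical helper name, and the byte values are simply the encodings derived in this article. The trailing 0D in the listing above is not part of any instruction, so the sketch leaves it out.

```python
program = b"".join([
    bytes([0xEB, 0x12]),          # jmp 114
    b"\r\nhello, world!\r\n$",    # db 0d,0a,"hello, world!",0d,0a,"$"
    bytes([0xB4, 0x09]),          # mov ah,9
    bytes([0xBA, 0x02, 0x01]),    # mov dx,102 (operand stored low byte first)
    bytes([0xCD, 0x21]),          # int 21
    bytes([0xB4, 0x00]),          # mov ah,0
    bytes([0xCD, 0x21]),          # int 21
])

def debug_e_lines(code, origin=0x100, width=16):
    """Format code as DEBUG 'e' commands, width bytes per line, starting at origin."""
    lines = []
    for off in range(0, len(code), width):
        chunk = code[off:off + width]
        hex_bytes = " ".join(f"{b:02X}" for b in chunk)
        lines.append(f"-e {origin + off:X} {hex_bytes}")
    return lines

for line in debug_e_lines(program):
    print(line)
```

The printed lines should match the two `e` commands above, apart from the final 0D.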
And that's all. I think that my explanations have been clear. But I'm always open to any suggestions and corrections. Thanks for reading.
Meta
2008 - August, 30th
Thanks to BaBax for pointing out an error in the encoding of db 0d,0a,"hello, world!",0d,0a,"$". I had inadvertently included the address of the character "!" in the encoding. Thanks, BaBax, for the correction.

April 28th, 2008 at 11:00 pm
I think that assembly is not my thing anymore… And in the remote chance I code in assembly again, I’d go for RISC architectures.
April 29th, 2008 at 1:58 am
Thanks for the post… I didn’t know the IA32 - IA64 thing. But now the translation from assembly to binary code seems pretty straightforward to me.
Heck, perhaps we could now code directly in binary!
April 29th, 2008 at 2:03 am
Wow, now tell us about coding in pure binary. Who needs C++, Python, Perl and such inefficient things?
April 29th, 2008 at 3:00 am
@Carlos_Vasquez: “And in the remote chance I code in assembly again”
Tell that to my hardware architecture professor!
April 29th, 2008 at 4:07 am
So far, so good. Your article is very clear, and now the doubts about JMP 114 translation should be out.
However, please, please, don’t forget to change the layout colors. Those ‘grayed’ assembly comments are killing my eyes.
April 29th, 2008 at 5:10 am
…it's been a while since I programmed in assembler, but the post brings back good memories… and the explanation is very well done
in practice, we can open a text file and write the characters in hexadecimal, change the extension to an executable one, and we should have a working program, with no assembler or compiler…
that said, I wouldn't spend any more time on MS-DEBUG, and I'd go for something much better like Gas (GNU as)…
April 29th, 2008 at 5:16 am
aaaggghhh… who needs assembly nowadays ?!
April 29th, 2008 at 5:20 am
Thanks everybody for dropping by, and thanks for your comments.
@El_Hombre_Que_Programaba: Although I'm not sure yet, I doubt the next 2 or 3 articles will be about assembler. And most of what is presented in the article is general in nature (not restricted to MS-DEBUG). The reference to MS-DEBUG comes from the previous post. That said, I will certainly use ‘GNU as’ in the future… all the more now that Gas has included support for the Intel syntax.
April 29th, 2008 at 5:31 am
Translation of "hello, world!" into Binary Code…
A concrete and clear explanation of the translation of a simple "hello, world!" program (in x86/IA-32 assembler) into machine code. For the nostalgic….
April 29th, 2008 at 7:33 am
@El_Hombre_Que_Programaba: How will you enter the final 0D in the text file?
May 3rd, 2008 at 7:23 pm
The final ‘Carriage return’ is really dispensable. The program ends after executing INT 21.
May 4th, 2008 at 8:11 am
[...] binary, executable file. Is that useful? Surely not. But it’s a healthy way to waste your time. As suggested by a reader, this can be achieved by writing the characters of the executable file, using a simple text editor [...]
May 11th, 2008 at 11:45 pm
[...] hello, world!. Next, we coded a hello, world! program by using the MS-DOS DEBUG program. Later, we encoded such program directly in hexadecimal (no need for DEBUG). And finally, we abused the MS-DOS ECHO command to create a binary, executable [...]
August 30th, 2008 at 9:16 pm
It’s ok! Without the knowledge of assembler mnemonic it is hard to understand what a “high-tech computer” is doing when it is booted.
Good explanation!
By the way, there is an error in the part
----------
“which directly translates into the sequence of characters (in hexadecimal):
0D 0A 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 110 21 0D 0A 24”
----------
The number 110 is the address of the byte 0x21 (the character “!”) and doesn’t belong to the string.
The 0x110 is out of range of standard ASCII and the whole string would look like “hello, worldÉ!” if you are using Extended ASCII Codes. Furthermore, with the 0x110 the whole string would be 19 characters long (as shown) and not 18.
August 30th, 2008 at 11:57 pm
You’re right, BaBax. I’m sorry for the mistake. I have corrected the encoding, and included a note recognizing your contribution.
Thank you very much.