A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language). The most common reason for wanting to transform source code is to create an executable program.
The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a lower-level language (e.g., assembly language or machine code). A program that translates from a low-level language to a higher-level one is a decompiler.
A compiler is likely to perform many or all of the following operations:
- lexical analysis
- preprocessing
- parsing
- semantic analysis
- code generation
- code optimization
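As an illustration of the first of these phases, lexical analysis, here is a minimal sketch of a lexer for arithmetic expressions (the token set is invented for this example, not taken from any particular language):

```python
import re

# Minimal lexer sketch: an illustrative token set for arithmetic expressions.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),            # integer literals
    ("IDENT",  r"[A-Za-z_]\w*"),   # identifiers
    ("OP",     r"[+\-*/=()]"),     # operators and parentheses
    ("SKIP",   r"\s+"),            # whitespace, discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Split a source string into a list of (kind, text) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    return tokens
```

For example, `tokenize("x = 2 + 40")` yields `[("IDENT", "x"), ("OP", "="), ("NUMBER", "2"), ("OP", "+"), ("NUMBER", "40")]`, which a parser would then consume.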
Compilers enabled the development of programs that are machine-independent. Before the development of FORTRAN (FORmula TRANslator), the first higher-level language, in the 1950s, machine-dependent assembly language was widely used. While assembly language produces more reusable and relocatable programs than machine code on the same architecture, it has to be modified or rewritten if the program is to be executed on a different hardware architecture.
With the advent of the high-level programming languages that followed FORTRAN, such as COBOL, C, and BASIC, programmers could write machine-independent source programs. A compiler translates a high-level source program into a target program in the machine language of the specific hardware. Once the target program is generated, the user can execute it.
A compiler bridges source programs in high-level languages with the underlying hardware. A compiler must:
- Recognize the legitimacy of programs.
- Generate correct and efficient code.
- Manage run-time organization.
- Format output according to assembler or linker conventions.
Structure of a compiler:
A compiler consists of three main parts: frontend, middle-end, and backend.
The frontend checks whether the program is correctly written in terms of the programming language's syntax and semantics: legal and illegal programs are recognized, and errors, if any, are reported in a useful way. Type checking is also performed by collecting type information. The frontend then generates an IR (intermediate representation) of the source code for the middle-end. This part of the compiler is well understood: much of it can be automated, and efficient algorithms exist, typically running in O(n) or O(n log n).
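As a sketch of the parsing, error-reporting, and IR-generation steps just described, here is a recursive-descent parser for a tiny invented grammar (the grammar and the tuple-based IR are assumptions made for illustration):

```python
# Frontend sketch: a recursive-descent parser that checks syntax and builds
# an IR tree of nested tuples. Grammar, invented for illustration:
#   expr -> term (('+' | '-') term)*   ;   term -> NUMBER
def parse_expr(tokens):
    """Parse a flat token list such as ['2', '+', '3'] into an IR tree."""
    def term(i):
        if i >= len(tokens) or not tokens[i].isdigit():
            raise SyntaxError(f"expected a number at position {i}")
        return ("num", int(tokens[i])), i + 1

    node, i = term(0)
    while i < len(tokens):
        op = tokens[i]
        if op not in ("+", "-"):
            raise SyntaxError(f"expected '+' or '-' at position {i}")
        right, i = term(i + 1)
        node = (op, node, right)   # build a left-associative tree
    return node
```

Here `parse_expr(["2", "+", "3", "-", "1"])` returns `("-", ("+", ("num", 2), ("num", 3)), ("num", 1))`, while malformed input raises a SyntaxError, corresponding to the frontend's error-reporting duty.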
The middle-end is where the optimizations for performance take place. Typical transformations for optimization are:
- Removal of useless or unreachable code.
- Discovering and propagating constant values, and relocating a computation to a less frequently executed place (e.g., out of a loop).
- Specializing a computation based on the context.
The middle-end generates IR for the backend that follows. Most optimization effort is focused on this part.
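Two of the transformations listed above, constant folding/propagation and removal of useless code, can be sketched on a toy three-address IR; the tuple encoding of instructions is an assumption made for this example:

```python
# Middle-end sketch: constant folding and dead-code elimination on a toy
# three-address IR of tuples such as ("add", dst, src1, src2).
def fold_constants(code):
    """Replace ("add", dst, c1, c2) with ("const", dst, c1 + c2) when both operands are literal ints."""
    out = []
    for instr in code:
        if instr[0] == "add" and isinstance(instr[2], int) and isinstance(instr[3], int):
            out.append(("const", instr[1], instr[2] + instr[3]))
        else:
            out.append(instr)
    return out

def remove_dead_code(code, live_out):
    """Drop instructions whose result is never used later (backward liveness scan)."""
    out, live = [], set(live_out)
    for instr in reversed(code):
        if instr[1] in live:               # result is needed: keep the instruction
            out.append(instr)
            live.discard(instr[1])
            live.update(op for op in instr[2:] if isinstance(op, str))
        # otherwise the result is unused, so the instruction is dead
    return out[::-1]
```

Given `ir = [("add", "t1", 2, 3), ("add", "t2", 10, 20), ("add", "r", "t1", "t1")]` with only `r` live at the end, folding followed by dead-code removal yields `[("const", "t1", 5), ("add", "r", "t1", "t1")]`: `t2` is computed but never used, so its definition is removed.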
The backend is responsible for translating the IR into the target assembly code. Target instructions are chosen for each IR instruction, and registers are allocated for variables. The backend exploits the hardware by figuring out how to keep parallel functional units busy, fill delay slots, and so on. Although many of the optimization problems here are NP-hard, heuristic techniques are well developed.
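The step of choosing target instructions for each IR instruction can be sketched as a pattern match; the mnemonics below belong to an invented target, not a real instruction set:

```python
# Backend sketch: naive one-to-one instruction selection from a toy
# three-address IR to an invented assembly syntax.
def select_instructions(ir):
    """Translate IR tuples into assembly-like strings, one pattern per opcode."""
    asm = []
    for op, dst, *args in ir:
        if op == "const":
            asm.append(f"MOV {dst}, #{args[0]}")
        elif op == "add":
            asm.append(f"ADD {dst}, {args[0]}, {args[1]}")
        else:
            raise ValueError(f"no instruction pattern for opcode {op!r}")
    return asm
```

For instance, `select_instructions([("const", "r1", 5), ("add", "r0", "r1", "r1")])` produces `["MOV r1, #5", "ADD r0, r1, r1"]`. A real backend would additionally run register allocation and instruction scheduling over this output.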
One-pass versus multi-pass compilers:
We can classify compilers by the number of passes they make: one-pass compilers and multi-pass compilers. One-pass compilers generally compile faster than multi-pass compilers.
The disadvantage of compiling in a single pass is that it is not possible to perform many of the sophisticated optimizations needed to generate high-quality code. It can be difficult to count exactly how many passes an optimizing compiler makes. For instance, different phases of optimization may analyze one expression many times but analyze another expression only once.
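One reason the pass count is hard to pin down can be sketched with a driver that re-runs its passes until nothing changes, so some expressions are rewritten several times while others are touched once (a toy model, not any real compiler's driver):

```python
# Sketch: a multi-pass driver that applies rewriting passes to a fixpoint.
def run_to_fixpoint(passes, program):
    """Apply each pass in turn, repeating until no pass changes the program."""
    changed = True
    while changed:
        changed = False
        for opt in passes:
            new_program = opt(program)
            if new_program != program:
                program, changed = new_program, True
    return program

def fold_once(expr):
    """Example pass: fold ('+', left, right) nodes whose operands are ints."""
    if isinstance(expr, tuple):
        left, right = fold_once(expr[1]), fold_once(expr[2])
        if isinstance(left, int) and isinstance(right, int):
            return left + right
        return (expr[0], left, right)
    return expr
```

Here `run_to_fixpoint([fold_once], ("+", ("+", 1, 2), ("+", 3, 4)))` reduces the tree to `10`; the driver then makes one more pass over the result just to confirm nothing changes, so the same program is analyzed more than once.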
While the typical multi-pass compiler outputs machine code from its final pass, there are several other types:
- A "source-to-source compiler" is a type of compiler that takes a high-level language as its input and outputs a high-level language. For example, an automatic parallelizing compiler will frequently take in a high-level language program as input and then transform the code and annotate it with parallel code annotations (e.g., OpenMP) or language constructs (e.g., Fortran's DOALL statements).
- A stage compiler, which compiles to the assembly language of a theoretical machine, as some Prolog implementations do. (This Prolog machine is also known as the Warren Abstract Machine, or WAM.) Bytecode compilers for Java, Python, and many other languages are also a subtype of this.
- A just-in-time (JIT) compiler, used by Smalltalk and Java systems, and also by Microsoft .NET's Common Intermediate Language (CIL). Applications are delivered in bytecode, which is compiled to native machine code just prior to execution.
Related Sciences, Topics and Techniques:
- Formal languages.
- Automata theory.
- Program semantics.
- Type theory.
Applications and Project Ideas:
- Programming Languages.
- Compiler correctness: the branch of software engineering that deals with trying to show that a compiler behaves according to its language specification. Techniques include developing the compiler using formal methods and applying rigorous testing (often called compiler validation) to an existing compiler.
- Compiler generator: A compiler-compiler or compiler generator is a tool that creates a parser, interpreter, or compiler from some form of formal description. The earliest and still most common form of compiler-compiler is a parser generator, whose input is a grammar (usually in BNF) of a programming language and whose generated output is the source code of a parser. The ideal compiler-compiler takes a description of a programming language and a target instruction set architecture, and automatically generates a usable compiler from them. In practice, the state of the art has yet to reach this degree of sophistication, and most compiler generators are not capable of handling semantic or target-architecture information.