August 13, 2010

Microprocessors

The processor, or CPU (Central Processing Unit), is the component of a computer that executes programs. Along with memory, it is one of the components that have existed since the first computers and that are present in all of them. A processor built on a single integrated circuit is a microprocessor.

Early processors were designed specifically for a computer of a given type. This costly method of designing processors for a single application led to the mass production of processors suited to one or more uses. This trend toward standardization began in the era of mainframes (discrete-transistor mainframes and minicomputers) and accelerated rapidly with the advent of integrated circuits. Integrated circuits have allowed increasingly complex CPUs. The miniaturization and standardization of CPUs have spread their use in modern life far beyond dedicated computing machines.

Microprocessors

The introduction of the microprocessor in the 1970s significantly marked the design and implementation of CPUs. Since the introduction of the first microprocessor (Intel 4004) in 1971 and the first widely used microprocessor (Intel 8080) in 1974, this class of CPUs has almost completely displaced all other methods of implementing a CPU. Mainframe and minicomputer manufacturers of the time launched their own integrated-circuit development programs to upgrade their older architectures, and subsequently produced instruction-set-compatible microprocessors ensuring backward compatibility with their older models. Previous generations of CPUs were built from many discrete components and low-density integrated circuits on one or more circuit boards. Microprocessors are built with a very small number of very highly integrated circuits (ULSI), usually just one. Because microprocessors are implemented on a single die, they are small, which means shorter switching times due to physical factors such as reduced gate parasitic capacitance. This has allowed synchronous microprocessors to raise their clock frequencies from a few tens of megahertz to several gigahertz. Moreover, as the ability to manufacture extremely small transistors on an integrated circuit has grown, the complexity and number of transistors in a single CPU have increased dramatically. This widely observed trend is described by Moore's law, which has so far proved fairly accurate in predicting the growing complexity of processors (and of any other integrated circuit).

Recent multicore processors include several cores in a single integrated circuit; their effectiveness depends greatly on the interconnect topology between the cores. New approaches, such as stacking the memory on top of the processor core (memory stacking), are being studied and should lead to further performance increases. Based on the trends of the last ten years, processor performance should reach the petaflop around 2010 for servers and around 2030 for PCs.

In June 2008, the military supercomputer IBM Roadrunner was the first to cross the symbolic threshold of one petaflop. In November 2008 it was followed by Cray's Jaguar supercomputer. As of April 2009, these were the only two supercomputers to have passed the petaflop mark.

While the complexity, size, construction, and general form of CPUs have changed considerably over the last sixty years, the basic design and functions have not changed much. Almost all common CPUs today can be described quite accurately as von Neumann stored-program machines. While Moore's law, mentioned above, has so far continued to hold, questions have arisen about the limits of transistor-based integrated-circuit technology. The miniaturization of electronic gates has gone so far that the effects of phenomena such as electromigration (the progressive degradation of metal interconnects, which reduces the reliability of integrated circuits) and leakage currents (whose importance grows as integrated circuits shrink, and which waste electrical energy), previously negligible, are becoming increasingly significant. These new issues are among the many factors leading researchers to investigate, on the one hand, new processing technologies such as quantum computing and the use of parallel computing, and on the other hand, other ways of using the classical von Neumann model.

Operation

Composition of a processor

The essential parts of a processor are: the arithmetic logic unit (ALU), which performs basic arithmetic operations and tests; the control unit or sequencer, which synchronizes the various components of the processor and, in particular, initializes the registers when the machine starts and handles interrupts; and the registers, small memories (a few bytes) fast enough that the ALU can manipulate their contents on every clock cycle. A number of registers are common to most processors:

program counter: this register contains the memory address of the instruction currently being executed;
accumulator: this register is used to store the data being processed by the ALU;
address register: it always contains the address of the next piece of information to be read by the ALU, either the continuation of the current instruction or the next instruction;
instruction register: it contains the instruction being processed;

status register: it is used to store the processor context, meaning that the various bits of this register are flags used to store information about the outcome of the last instruction executed;
stack pointers: these registers, whose number varies with the processor type, contain the address of the top of the stack (or stacks);
general-purpose registers: these registers are available for computations.
The processor also includes a clock, which synchronizes all its actions; it is present in synchronous processors and absent from asynchronous and self-synchronized processors. Finally, the input/output unit handles communication with the computer's memory or transmits the commands intended to drive its dedicated processors, allowing the processor to access the computer's peripherals.
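
As a rough illustration of the register set described above, the sketch below models it as a C structure. The names, widths, and layout are hypothetical and chosen only for readability; they do not correspond to any particular processor.

```c
#include <stdint.h>

/* A hypothetical 16-bit processor's register set, for illustration only. */
typedef struct {
    uint16_t pc;      /* program counter: address of the instruction in progress */
    uint16_t acc;     /* accumulator: data currently being processed by the ALU  */
    uint16_t mar;     /* address register: next location to be read              */
    uint16_t ir;      /* instruction register: the instruction being processed   */
    uint16_t status;  /* status register: flag bits (zero, carry, overflow, ...) */
    uint16_t sp;      /* stack pointer: address of the top of the stack          */
    uint16_t gpr[8];  /* general-purpose registers available for calculations    */
} Registers;
```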

Current processors also include more complex elements: several ALUs, which allow several instructions to be processed simultaneously (superscalar architectures, in particular, provide parallel access to the ALUs, each ALU being able to execute an instruction independently of the others); a pipelined architecture, which splits the processing to be performed into successive stages in time, a technique that comes from the world of supercomputing; a branch prediction unit, which lets the processor anticipate a jump in the flow of a program so that it does not have to wait for the final value of the jump address, which helps keep the pipeline full; a floating-point unit (FPU), which accelerates computations on real numbers encoded in floating point; and the cache, which speeds up processing by reducing memory access times. These buffers are much faster than RAM but slower than the CPU. The instruction cache receives the next instructions to be executed, while the data cache handles data. Sometimes a single unified cache is used for both code and data. Several levels of cache can coexist; they are usually called L1, L2, or L3. In advanced processors, dedicated units are assigned to predicting, statistically and/or speculatively, future accesses to main memory.

A processor is characterized by: the width of its internal data registers (8, 16, 32, 64, or 128 bits); its clock rate, in MHz (megahertz) or GHz (gigahertz); the number of computing cores; its instruction set (ISA, Instruction Set Architecture), which depends on its family (CISC, RISC, etc.); its process node, expressed in nm (nanometers); and its microarchitecture.

But what characterizes a processor is mainly the family to which it belongs:
CISC (Complex Instruction Set Computer: a choice of instructions as close as possible to a high-level language);
RISC (Reduced Instruction Set Computer: a choice of simpler instructions and a structure allowing very fast execution);
VLIW (Very Long Instruction Word);
DSP (Digital Signal Processor).
The last family (DSP) is rather specific, however. A processor is a programmable component and is therefore in principle capable of running any type of program; yet for the sake of optimization, specialized processors are designed and adapted to certain types of computation (3D, sound, etc.). DSPs are processors specialized for computations related to signal processing. For example, it is not uncommon to see Fourier transforms implemented on a DSP.

A processor has three types of buses: a data bus, which sets the size of the data transferred (regardless of the size of the internal registers); an address bus, which determines the number of memory locations that can be addressed; and a control bus, which carries the processor's management signals (IRQ, RESET, etc.).

The operations of the processor

The role of most CPUs, regardless of the physical form they take, is to execute a series of stored instructions called a program.

The instructions (sometimes broken down into micro-instructions) and the data transmitted to the processor are expressed as binary words (machine code). They are usually stored in memory. The sequencer directs the reading of the memory contents and the formation of the words presented to the ALU, which interprets them.

A set of instructions and data is a program.

The language closest to machine code while remaining readable by humans is assembly language. However, computing has developed a whole range of so-called high-level languages (such as BASIC, Pascal, C, C++, Fortran, Ada, etc.) designed to simplify the writing of programs.

The operations described here are consistent with the von Neumann architecture. The program is represented by a series of instructions that perform operations on the RAM of the computer. There are four steps that nearly all von Neumann architectures use:

fetch - fetching the instruction;
decode - decoding the instruction (operation and operands);
execute - executing the operation;
writeback - writing back the result.

The first stage, FETCH, consists of fetching an instruction from the computer's memory. The location in memory is determined by the program counter (PC), which stores the address of the next instruction in program memory. After an instruction has been fetched, the PC is incremented by the length of the instruction word. In the case of a constant word length, this is always the same number. For example, a CPU with a constant 32-bit word length that uses 8-bit memory words always increments the PC by 4 (except in the case of jumps). Instruction sets that use variable-length instructions, such as x86, increment the PC by the number of memory words corresponding to the length of the last instruction. In addition, in more complex central processing units, incrementing the PC does not necessarily occur at the end of the instruction's execution. This is particularly the case in heavily pipelined and superscalar architectures. Often, the instruction must be fetched from slow memory, stalling the CPU while it waits for the instruction. This issue is largely addressed in modern processors by caches and pipelined architectures.
The instruction that the processor fetches from memory is used to determine what the CPU must do. In the DECODE step, the instruction is split into several parts so that they can be used by the other parts of the processor. The way in which the value of the instruction is interpreted is defined by the processor's instruction set (ISA). Often, one part of the instruction, called the opcode (operation code), indicates which operation is to be performed, for example an addition. The remaining parts of the instruction typically contain the other information needed to execute it, such as, for example, the operands of the addition. These operands can take a constant value, called an immediate value, or contain the location (a register or a memory address) where the operand's value can be found, depending on the addressing mode used. In older designs, the parts of the processor responsible for decoding were fixed and unchangeable because they were hardwired in the circuits. In more recent processors, firmware is often used to translate the instructions into different commands. This firmware can sometimes be modified to change the way the CPU decodes instructions, even after manufacture.

After the fetch and decode stages comes the EXECUTE stage of the instruction. During this phase, different parts of the processor are connected together to carry out the desired operation. For example, for an addition, the arithmetic logic unit (ALU) is connected to the inputs and outputs. The inputs provide the numbers to be added, and the outputs contain the final sum. The ALU contains the circuitry to perform simple arithmetic and logic operations on the inputs (addition, bitwise operations). If the result of an addition is too large to be encoded by the processor, an overflow signal is set in a status register (see the chapter on the coding of numbers below).
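
To make the overflow case concrete, here is a minimal sketch of how such a flag might be set for an 8-bit unsigned addition. The flag name and bit position are made up for illustration and do not correspond to any real instruction set.

```c
#include <stdint.h>

#define FLAG_CARRY 0x01   /* hypothetical carry/overflow bit in the status register */

/* Add two 8-bit values and set the carry flag if the result does not fit in 8 bits. */
uint8_t alu_add8(uint8_t a, uint8_t b, uint8_t *status)
{
    uint16_t wide = (uint16_t)a + (uint16_t)b;  /* compute with extra width          */
    if (wide > 0xFF)
        *status |= FLAG_CARRY;                  /* signal overflow to later instructions */
    else
        *status &= (uint8_t)~FLAG_CARRY;
    return (uint8_t)wide;                       /* keep only the low 8 bits          */
}
```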

The last step, writeback (writing back the result), simply writes the results of the execution stage to memory. Very often the results are written to a register internal to the processor, to benefit from very short access times for the following instructions. In other cases the results are written more slowly to RAM, which is cheaper and can accommodate encodings of larger numbers.

Some types of instructions manipulate the program counter rather than directly producing result data. These instructions are called jumps, and they make it possible to implement loops, conditional execution, and functions (subroutines) in programs. Many instructions also serve to change the state of flags in a status register. These flags can be used to condition the behavior of a program, since they often indicate the outcome of various operations. For example, an instruction comparing two numbers will set a flag in a status register according to the result of the comparison. This flag can then be used by a jump instruction to continue the program flow.

After the instruction has executed and its results have been written, the whole process repeats: the next instruction cycle fetches the next instruction in sequence, since the program counter has been incremented. If the previous instruction was a jump, the jump destination address is written into the program counter. In more complex processors, several instructions can be fetched, decoded, and executed simultaneously; this is called a pipelined architecture, now commonly used in electronic equipment.
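
To make the four-step cycle concrete, here is a deliberately tiny interpreter for an invented accumulator machine. The opcodes, encoding, and memory layout are made up purely for illustration; no real instruction set is implied.

```c
#include <stdint.h>
#include <stdio.h>

/* Invented opcodes for a toy accumulator machine (illustrative only). */
enum { OP_LOAD = 0, OP_ADD = 1, OP_STORE = 2, OP_JNZ = 3, OP_HALT = 4 };

int main(void)
{
    uint8_t mem[16] = {
        /* each instruction is 2 bytes: opcode, operand */
        OP_LOAD, 14,    /* acc = mem[14]      */
        OP_ADD,  15,    /* acc += mem[15]     */
        OP_STORE, 14,   /* mem[14] = acc      */
        OP_HALT, 0,
        0, 0, 0, 0, 0, 0,
        5, 7            /* data at addresses 14 and 15 */
    };
    uint8_t pc = 0, acc = 0;

    for (;;) {
        /* fetch: read the instruction at the address held by the program counter */
        uint8_t opcode = mem[pc], operand = mem[pc + 1];
        pc += 2;                                  /* fixed-length instructions: always +2 */

        /* decode + execute + writeback */
        if (opcode == OP_LOAD)       acc = mem[operand];
        else if (opcode == OP_ADD)   acc = (uint8_t)(acc + mem[operand]);
        else if (opcode == OP_STORE) mem[operand] = acc;
        else if (opcode == OP_JNZ)   { if (acc != 0) pc = operand; }  /* jump rewrites the PC */
        else break;                               /* OP_HALT */
    }
    printf("result at mem[14] = %u\n", (unsigned)mem[14]);  /* prints 12 */
    return 0;
}
```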

Processing Speed

The processing speed of a processor is still sometimes expressed in MIPS (millions of instructions per second) or in megaflops (millions of floating-point operations per second) for the floating-point part, handled by the FPU (Floating Point Unit). Yet today's processors are based on different architectures and parallelization techniques, so these figures are no longer sufficient to determine their performance. Specific performance-evaluation programs (benchmarks) have been developed to obtain comparative execution times of real programs.

Design and Implementation

The coding of numbers

The way a CPU represents numbers is a design choice that deeply affects its basic operation. Some older computers used an electrical model of the decimal number system (base 10). Others chose more exotic number systems, such as ternary (base 3). Modern processors represent numbers in binary (base 2), in which each digit is represented by a physical quantity that can take only two values, such as a "high" or "low" voltage.

The physical concept of voltage is analog in nature, since it can take an infinite number of values. For the purpose of physically representing binary numbers, ranges of voltage are defined as the states 1 and 0. These states result from the operating parameters of the switching elements making up the processor, such as transistor threshold levels.

In addition to the number representation system, one must consider the size and precision of the numbers a processor can handle. In a binary processor, a "bit" corresponds to one position in the numbers the processor can handle. The number of bits (digits) a CPU uses to represent numbers is often called its "word size" (also bit width or data-path width), or "integer precision" when dealing with integers (as opposed to floating-point numbers). This number differs between architectures, and often between the various modules of a single processor. For example, an 8-bit CPU handles numbers that can be represented by 8 binary digits (each digit taking 2 values), i.e. 2^8 = 256 discrete values. Consequently, the integer size sets a limit on the range of integers that software running on the processor can use.
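
A quick way to see this limit is to let an 8-bit quantity wrap around. The snippet below only illustrates the 2^8 = 256-value range mentioned above; the variable names are arbitrary.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t v = 255;          /* largest value an unsigned 8-bit word can hold */
    v = (uint8_t)(v + 1);     /* one more than the maximum wraps around to 0   */
    printf("255 + 1 stored in 8 bits = %u (2^8 = %u distinct values)\n",
           (unsigned)v, 1u << 8);
    return 0;
}
```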

The integer size also affects the number of memory locations the CPU can address (locate). For example, if a binary processor uses 32 bits to represent a memory address and each memory address holds one byte (8 bits), the maximum memory size that can be addressed by that processor is 2^32 bytes, or 4 GB. This is a very simplistic view of a processor's address space, and many designs use much more complex addressing schemes, such as paging, to address more memory than their integer size would allow with a flat address space.
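
The same arithmetic gives the size of a flat address space: with a w-bit address and byte-addressable memory, 2^w bytes are reachable, e.g. 2^32 bytes = 4 GiB. A small check of that figure, purely illustrative:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    unsigned width = 32;                       /* address width in bits (example value) */
    uint64_t bytes = (uint64_t)1 << width;     /* 2^32 byte-sized locations             */
    printf("%u-bit addresses reach %llu bytes (%llu GiB)\n",
           width, (unsigned long long)bytes,
           (unsigned long long)(bytes >> 30)); /* 4 GiB */
    return 0;
}
```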

Larger integer ranges require more elementary structures to handle the additional digits, leading to more complexity, larger size, higher energy consumption and higher cost. It is not uncommon to find 4-bit or 8-bit microcontrollers in modern applications, even though 16-bit, 32-bit, 64-bit and even 128-bit processors are available. To get the benefits of both short and long integer sizes, many CPUs are designed with different widths in different parts of the component. For example, the IBM System/370 has a natively 32-bit CPU but uses a 128-bit precision floating-point unit (FPU) to achieve greater accuracy in floating-point calculations. Many recent processors use a comparable combination of number sizes, especially when the processor is intended for general-purpose use, where a good balance must be struck between integer and floating-point capability.

The clock signal

Most processors, and more generally most sequential logic circuits, operate synchronously. This means that they are designed to operate at the pace of a synchronization signal. This signal is the clock signal. It often takes the form of a periodic square wave. By computing the maximum time it takes an electrical signal to propagate through the various branches of the processor's circuits, the designer can select the appropriate period for the clock signal.
This period must be longer than the time the signal takes to propagate in the worst case. By setting the clock period to a value comfortably above the worst-case propagation delay, it is possible to design the entire CPU, and the way it moves data, around the rising or falling "edges" of the clock signal. This has the advantage of simplifying the CPU significantly, both in its design and in the number of its components. On the other hand, it has the disadvantage of slowing the processor down, since it must adjust its speed to that of its slowest component even though other parts are much faster. These limitations are largely compensated by various methods of increasing CPU parallelism.
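
In other words, the clock period T must satisfy T >= t_worst (the longest propagation delay through any path), so the maximum clock frequency is roughly f_max = 1 / t_worst. A toy calculation with made-up delay figures:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical worst-case propagation delays (in nanoseconds) of a few paths. */
    double delays_ns[] = { 0.45, 0.80, 0.62, 0.73 };
    double worst = 0.0;
    for (int i = 0; i < 4; i++)
        if (delays_ns[i] > worst) worst = delays_ns[i];

    double f_max_ghz = 1.0 / worst;   /* 1 / 0.80 ns = 1.25 GHz */
    printf("worst path %.2f ns -> max clock about %.2f GHz\n", worst, f_max_ghz);
    return 0;
}
```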

Architectural improvements alone cannot resolve all the drawbacks of globally synchronous CPUs. For example, a clock signal is subject to delays, like any other electrical signal. The higher clock frequencies found in increasingly complex processors make it ever harder to keep the clock signal in phase (synchronized) throughout the whole CPU. Consequently, many processors today require several identical clock signals to be distributed, so that the delay of a single signal cannot cause a processor malfunction. The large amount of heat that must be dissipated by the processor is another major problem caused by rising clock frequencies. Each change of state of the clock switches a large number of components, whether or not they are in use at that moment. In general, components that switch use more energy than those that remain in a static state. Thus, the more clock frequencies increase, the more heat dissipation increases as well, so processors require ever more efficient cooling solutions.

The method of clock gating is used to manage the unintended switching of components by inhibiting the clock signal on selected elements, but this practice is difficult to implement and remains reserved for the needs of very low-power circuits.

Another method is to do away with the global clock signal; power consumption and heat dissipation are reduced, but the circuit design becomes more complex. Some designs have been built without a global clock signal, for example some processors of the ARM or MIPS families; others have only asynchronous parts, such as an asynchronous ALU used with superscalar pipelining to achieve gains in arithmetic performance. It is not certain that a fully asynchronous processor can deliver performance comparable to or higher than that of a synchronous processor; while it is clearly better at simple operations, it remains rather confined to embedded applications (handheld computers, game consoles, etc.).

Parallelism


Model of a subscalar processor: it takes 15 cycles to execute three instructions.
The description of a processor's basic mode of operation given in the previous chapter presents the simplest form a CPU can take. This type of processor, called subscalar, executes one instruction on one or two pieces of data at a time.
This process is inefficient and inherent to subscalar processors. Since only one instruction is executed at a time, the whole processor waits for that instruction to complete before moving on to the next, so the CPU stalls on instructions that take more than one clock cycle to run. Adding a second execution unit (see below) does not improve performance much; instead of one execution unit sitting idle, two are, further increasing the number of unused transistors. This design, in which the CPU's execution resources process only one instruction at a time, can only reach scalar performance (one instruction per clock cycle) at best, and usually subscalar performance (less than one instruction per clock cycle).

Attempts to reach scalar performance and beyond have led to various methods that make the CPU behave in a less linear and more parallel way. When speaking of processor parallelism, two terms are used to classify these design techniques:

Instruction Level Parallelism (ILP) - parallelism at the instruction level;
Thread Level Parallelism (TLP) - parallelism at the thread level (a thread being a group of instructions).

ILP seeks to increase the rate at which instructions are executed by a CPU (that is, to increase the utilization of the execution resources present in the integrated circuit). The objective of TLP is to increase the number of threads the CPU can execute simultaneously. The two approaches differ in how they are implemented and in their relative effectiveness at increasing processor performance for a given application.

ILP: Instruction pipelining and superscalar


A 5-stage pipeline. In the best case, this pipeline can sustain an execution rate of one instruction per cycle.

One of the simplest ways to increase parallelism is to start the fetch and decode stages of an instruction before the previous instruction has finished executing. This is the simplest form of the pipelining technique, and it is used in most modern general-purpose processors. Pipelining allows more than one instruction to be in flight at a time by breaking the execution path into separate stages. This division can be compared to an assembly line.
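
The payoff can be seen with simple arithmetic: without a pipeline, N instructions on a k-stage datapath cost roughly N * k cycles (the 15 cycles for 3 instructions in the subscalar figure above), while an ideal k-stage pipeline costs about k + (N - 1) cycles. A small sketch of that calculation, ignoring hazards and stalls:

```c
#include <stdio.h>

int main(void)
{
    int stages = 5;            /* depth of the pipeline (fetch, decode, execute, ...) */
    int instructions = 3;

    int unpipelined = instructions * stages;        /* 3 * 5 = 15 cycles            */
    int pipelined   = stages + (instructions - 1);  /* 5 + 2 = 7 cycles, ideal case */

    printf("unpipelined: %d cycles, ideally pipelined: %d cycles\n",
           unpipelined, pipelined);
    return 0;
}
```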

Pipelining can create data-dependency conflicts, where the result of the previous operation is needed to perform the following one. To resolve this problem, special care must be taken to detect this type of situation and, when necessary, delay part of the instruction pipeline. Naturally, the additional circuitry required for this adds to the complexity of parallel processors.
