# Numerical analysis | STAT721 Machine representation

## Machine representation

So far, we have described a floating point representation in the abstract. Here are a few more details about how this representation is implemented on a computer. Again, in this section we will discuss the double precision format; the other formats are very similar.

Each double precision floating point number is assigned an 8-byte word, or 64 bits, to store its three parts. Each such word has the form
$$s e_1 e_2 \ldots e_{11} b_1 b_2 \ldots b_{52}$$
where the sign is stored first, followed by 11 bits representing the exponent and then the 52 bits following the binary point, which represent the mantissa. The sign bit $s$ is 0 for a positive number and 1 for a negative number. The 11 exponent bits hold the positive binary integer obtained by adding $2^{10}-1=1023$ to the exponent, at least for exponents between $-1022$ and $1023$. This covers values of $e_1 \ldots e_{11}$ from 1 to 2046, leaving 0 and 2047 for special purposes, which we will return to later.
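The layout described above can be inspected directly. The following is a minimal sketch in Python (not part of the original text), assuming the interpreter's `float` is an IEEE double, which is the case on common platforms. The helper name `decompose` is hypothetical and chosen only for illustration.

```python
import struct

def decompose(x: float):
    """Split a double into (sign bit, 11-bit exponent field, 52-bit mantissa field)."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                        # s
    exponent_field = (bits >> 52) & 0x7FF    # e_1 ... e_11 (biased exponent)
    mantissa_field = bits & ((1 << 52) - 1)  # b_1 ... b_52
    return sign, exponent_field, mantissa_field

# -6.5 = -1.101 (binary) x 2^2
sign, e, m = decompose(-6.5)
print(sign, e - 1023, format(m, "052b")[:8])  # prints: 1 2 10100000
```

Subtracting 1023 from the stored exponent field recovers the true exponent, and the first mantissa bits `101` are the digits after the binary point in $1.101 \times 2^2$.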

The number 1023 is called the exponent bias of the double precision format. It is used to convert both positive and negative exponents to positive binary numbers for storage in the exponent bits. For single and long-double precision, the exponent bias values are 127 and 16383, respectively.
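As a rough check of the bias values, the stored exponent field of 1.0 (whose true exponent is 0) should equal the bias itself. The sketch below is an illustration in Python, assuming IEEE single and double formats as provided by the standard `struct` module. The function names are hypothetical, and long-double precision is left out because the Python standard library does not expose it.

```python
import struct

def exponent_field_double(x: float) -> int:
    """Return the 11-bit stored exponent of x in double precision."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF

def exponent_field_single(x: float) -> int:
    """Return the 8-bit stored exponent of x after conversion to single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF

print(exponent_field_double(1.0))        # 1023, the double precision bias
print(exponent_field_single(1.0))        # 127, the single precision bias
print(exponent_field_double(2.0**-1022)) # 1, the smallest ordinary exponent field
print(exponent_field_double(2.0**1023))  # 2046, the largest ordinary exponent field
```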

MATLAB's format hex consists simply of expressing the 64 bits of the machine number (0.10) as 16 successive hexadecimal, or base 16, numbers. Thus, the first 3 hex numerals represent the sign and exponent combined, while the last 13 contain the mantissa.
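The same 16-digit hexadecimal view can be reproduced outside MATLAB. Here is a small Python sketch, assuming IEEE doubles. The helper `format_hex` is a hypothetical name that imitates, but is not, MATLAB's format hex.

```python
import struct

def format_hex(x: float) -> str:
    """Return the 64 bits of a double as 16 hexadecimal digits, sign bit first."""
    return struct.pack(">d", x).hex()

h = format_hex(1.0)
print(h)                # 3ff0000000000000
print(h[:3], h[3:])     # sign+exponent digits, then the 13 mantissa digits
print(format_hex(-2.0)) # c000000000000000
```

For 1.0 the leading digits `3ff` encode a sign bit of 0 and the exponent field 1023. For -2.0 the leading digits `c00` encode a sign bit of 1 and the exponent field 1024.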

## Addition of floating point numbers

Machine addition consists of lining up the binary points of the two numbers to be added, adding them, and then storing the result again as a floating point number. The addition itself can be done in higher precision (with more than 52 bits) since it takes place in a register dedicated just to that purpose. Following the addition, the result must be rounded back to 52 bits beyond the binary point for storage as a machine number.
For example, adding 1 to $2^{-53}$ would appear as follows:
$$\begin{aligned} & 1.00 \ldots 0 \times 2^0+1.00 \ldots 0 \times 2^{-53} \\ ={} & 1.\underbrace{00 \ldots 0}_{52\ \text{zeros}} \times 2^0 \\ +{} & 0.\underbrace{00 \ldots 0}_{52\ \text{zeros}}1 \times 2^0 \\ ={} & 1.\underbrace{00 \ldots 0}_{52\ \text{zeros}}1 \times 2^0 \end{aligned}$$
This is saved as $1.0 \times 2^0=1$, according to the rounding rule, because the 53rd bit lies exactly halfway between two machine numbers and the tie is broken toward the neighbor whose last bit is 0. Therefore, $1+2^{-53}$ is equal to 1 in double precision IEEE arithmetic. Note that $2^{-53}$ is the largest floating point number with this property; anything larger added to 1 would result in a sum greater than 1 under computer arithmetic.
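This behavior is easy to check numerically. The sketch below uses Python and assumes IEEE double arithmetic with the default round-to-nearest mode. The variable names are illustrative only.

```python
import sys

eps_mach = 2.0**-52   # distance from 1 to the next larger double
tie = 2.0**-53        # exactly half of eps_mach

# The next double above 2^-53; both factors and their product are exact.
slightly_larger = tie * (1.0 + 2.0**-52)

print(sys.float_info.epsilon == eps_mach)          # True
print(1.0 + tie == 1.0)                            # True: the tie rounds back to 1
print(1.0 + slightly_larger == 1.0 + eps_mach)     # True: rounds up to the next double
```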

The fact that $\epsilon_{\text{mach}}=2^{-52}$ does not mean that numbers smaller than $\epsilon_{\text{mach}}$ are negligible in the IEEE model. As long as they are representable in the model, computations with numbers of this size are just as accurate, assuming that they are not added to or subtracted from numbers of unit size.
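One way to see this is to repeat a computation at a much smaller scale. The Python sketch below assumes IEEE doubles and uses the scale factor $2^{-60}$, an arbitrary choice for illustration. Multiplying by a power of 2 changes only the exponent, so the small-scale sum carries exactly the same mantissa, and hence the same relative accuracy, as the unit-scale sum.

```python
scale = 2.0**-60              # far below eps_mach = 2^-52

a, b = 0.1, 0.2
unit_sum = a + b              # computation at unit scale
small_sum = a * scale + b * scale  # the same computation with tiny operands

# The tiny operands give the same result, just scaled: no accuracy is lost.
print(small_sum == unit_sum * scale)   # True

# Accuracy is lost only when a tiny number meets a number of unit size.
print(1.0 + a * scale == 1.0)          # True: the small term is absorbed
```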


