Text Compression

Text compression seems natural for Huffman coding. For example, the probability model for a particular novel will not differ significantly from the probability model for another novel. Similarly, the probability model for one set of FORTRAN programs is not going to be much different from the probability model for a different set of FORTRAN programs. The probabilities in Table 3.26 are the probabilities of the 26 letters (upper- and lowercase counted together) obtained for the U.S. Constitution and are representative of English text. The probabilities in Table 3.27 were obtained by counting the frequency of occurrence of letters in an earlier version of this chapter. While the two documents are substantially different, the two sets of probabilities are very much alike.
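Probability models like those in Tables 3.26 and 3.27 are obtained simply by counting letter frequencies in a document. A minimal Python sketch (the function name and the choice to fold upper- and lowercase together are illustrative, matching how the tables are presented):

```python
from collections import Counter

def letter_probabilities(text):
    # Count only ASCII letters, folding case so 'a' and 'A' share one entry,
    # as in Tables 3.26 and 3.27.
    letters = [c.upper() for c in text if c.isascii() and c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {ch: counts[ch] / total for ch in sorted(counts)}
```

Running this over two different English documents should produce two similar probability tables, which is what makes a single fixed Huffman code usable across documents.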

TABLE 3.26

  Letter  Probability     Letter  Probability
  A       0.057305        N       0.056035
  B       0.014876        O       0.058215
  C       0.025775        P       0.021034
  D       0.026811        Q       0.000973
  E       0.112578        R       0.048819
  F       0.022875        S       0.060289
  G       0.009523        T       0.078085
  H       0.042915        U       0.018474
  I       0.053475        V       0.009882
  J       0.002031        W       0.007576
  K       0.001016        X       0.002264
  L       0.031403        Y       0.011702
  M       0.015892        Z       0.001502
TABLE 3.27

  Letter  Probability     Letter  Probability
  A       0.049855        N       0.048039
  B       0.016100        O       0.050642
  C       0.025835        P       0.015007
  D       0.030232        Q       0.001509
  E       0.097434        R       0.040492
  F       0.019754        S       0.042657
  G       0.012053        T       0.061142
  H       0.035723        U       0.015794
  I       0.048783        V       0.004988
  J       0.000394        W       0.012207
  K       0.002450        X       0.003413
  L       0.025835        Y       0.008466
  M       0.016494        Z       0.001050
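Given such a probability table, a Huffman code is built by repeatedly merging the two least probable entries. The following is a minimal sketch using Python's `heapq`, not the book's own implementation; the dictionary-of-codewords representation is an illustrative choice:

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code from a {symbol: probability} map."""
    tiebreak = count()  # makes heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Pop the two least probable subtrees and merge them,
        # prefixing their codewords with 0 and 1 respectively.
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        merged = {sym: "0" + word for sym, word in code1.items()}
        merged.update({sym: "1" + word for sym, word in code2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]
```

Feeding in the probabilities of Table 3.26 would assign the shortest codewords to frequent letters such as E and T, and the longest to rare letters such as Q and Z.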
