Compressor and Decompressor
by Adam Greenberg: U6 Age ~17
Introduction
Adam had the idea of a program for the compression of text files. He would identify all the words between spaces and/or full stops, and add all those containing more than one letter to a lexicon consisting of an array of string. (We use the word lexicon in favour of dictionary or vocabulary in order to help you to remember the term lexical analysis - the first stage in the compilation process). Each character in a code string of characters would provide an index to the appropriate word in the array, so that a single character could represent a word. The compressed file consisted of the lexicon (with adjacent words separated by a space) on one line of a text file and another line comprising the code string. The decompression program read the words in the compressed file into an array and used the line of code characters to place the words in their correct positions in the decompressed file. Spaces were not saved but were added automatically on decompression.
Adam produced much of the code and ample comments for programs Compressor and Decompressor. We extended them (with Adam's blessing) to handle more punctuation and multiple lines of text. (Adam quite rightly gave precedence to his coursework at this stage of his A2 studies). You can add characters to the PUNCTUATION set constant in both programs to suit your needs.
In program Compressor, characters with ASCII values above 128 are used as indices to the lexicon, and other characters in the string are the original single letter words or punctuation. We are limited in the number of different words to which we can provide indices with a single character, and need to choose our text documents carefully. Two very different examples, a list of chemical reactions and a tennis match commentary, provide the opportunity to convey plenty of information with relatively few different words. We include both the test files reactions.txt and tennis.txt in a zip file together with the source code of programs Compressor and Decompressor. You need to put both programs and the test files in the same folder. We did not compress the downloadable zip file with program Compressor! The compression provided by the standard utilities is much better, even for our contrived text files. However, their code (which works for any type of file) is very much longer and more complicated.
A sample output from the program follows.

Output
The Programs
program Compressor; { Copyright (c) 2011 Adam Greenberg and PPS Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License, as described at http://www.apache.org/licenses/ and http://www.pp4s.co.uk/licenses/ } {$APPTYPE CONSOLE} uses SysUtils, StrUtils; const MAX_CELLS = 150; MAX_LINES = 200; ORIGINAL = 'reactions.txt'; PUNCTUATION = ['.',',',';',':']; var fOriginal, fCompressed : Text; iCount, iArrayCount, iLastLine : integer; sCurrentWord : string; Strings, Codes : array[1 .. MAX_LINES] of string; aWords: array[1 .. MAX_CELLS] of string; bAdd, bFound : Boolean; procedure ReadFileIntoArray; //We read the file into an array of strings, Strings. var iLineNumber : integer; begin iLineNumber := 0; AssignFile(fOriginal, ORIGINAL); Reset(fOriginal); while not eof(fOriginal) do begin inc(iLineNumber); readln(fOriginal, Strings[iLineNumber]); //Append a space to mark the end of the last word in a line Strings[iLineNumber] := Strings[iLineNumber] + ' '; end; CloseFile(fOriginal); iLastLine := iLineNumber; end; procedure AddToArray(sWord : string); {The procedure AddToArray adds a word to our array. See the procedure 'ConvertStringsIntoArray' below it first. This procedure is only called from there.} var iCountWord : integer; begin if length(sWord) > 1 then //We won't bother adding the word to the lexicon if it is only one //character, as we can't achieve any compression with this. begin bAdd:=TRUE; //Below, we check if the word that has been read from the string in the procedure //that called this one is already in the array. If it is, we do not want to add //it again. for iCountWord := 1 to MAX_CELLS do begin if sWord = aWords[iCountWord] then bAdd:=FALSE; end; if bAdd=TRUE then //This would be called only if the word does not already exist in the array. begin aWords[iArrayCount] := sWord; //Add the word to the array. inc(iArrayCount); //Increment a count which stores the number of elements in the array. end; end; end; procedure ConvertStringsIntoArray; {The file has been read in to the program and is now stored as an array of strings, Strings. We now want to go through the strings and separate them into words, and add each word to an array of words (our lexicon), aWords.} var iCountChars, iCountLines : integer; begin for iCountLines := 1 to iLastLine do begin for iCountChars := 1 to length(Strings[iCountLines]) do begin {Below we check if the character is a space or punctuation. If it is then we have reached the end of a word and we can skip this step. If it is not then we need to continue to build up the word, one character at a time, each time appending the next character in the string until we reach punctuation which symbolises the end of a word.} if not (Strings[iCountLines][iCountChars] in PUNCTUATION + [' ']) then begin sCurrentWord := sCurrentWord + Strings[iCountLines][iCountChars]; //Add a letter to sCurrentWord. end else //If the current character is a full stop, comma or space, then we have reached the end of //a word, and so we can now add it to the array. begin AddToArray(sCurrentWord); //Call the procedure 'AddToArray'. sCurrentWord:= ''; //Set sCurrentWord back to a blank string so we can start again. end; end; //for iCountchars end; //for iCountLines {We could remove from the array words used once only and include them in the code string instead. This would improve the compression slightly. Otherwise, we could compress while adding words to the array, but the code may be easier to follow by using these two separate procedures ConvertStringsIntoArray and Compress.} end;//proc procedure Compress; {The file has already been read in as a string and the words have been stored in an array, aWords - this is our lexicon. All the words in the array are in a specific cell in the array. We will now re-read through the array of strings, Strings and replace the words with the number of the cell in the array in which the word is stored.} var iCountChars, iCountWords, iCountLines : integer; charCurrent : char; begin for iCountLines := 1 to iLastline do begin for iCountChars := 1 to length(Strings[iCountLines]) do //We will be reading one character at a time begin bFound:=FALSE; {Below we check if the character is a space, full stop or comma. If it is then we have reached the end of a word and we can skip this step. If it is not then we need to continue to build up the word, one character at a time, each time appending the next character in the string until we reach a space or full stop which symbolises the end of a word.} charCurrent := Strings[iCountLines][iCountChars]; if not (charCurrent in PUNCTUATION + [' ']) then begin sCurrentWord := sCurrentWord + charCurrent; //In the above line we have appended the current character in the string sFile //that we are reading to a string sCurrentWord. end else //If the current character is a full stop or space, then we have reached the end of //a word, and we can now compress that word to a character dependant upon the number of the cell //which stored this word in aWords. begin if length(sCurrentWord) = 1 then //It was not worth saving the word in the lexicon Codes[iCountLines] := Codes[iCountLines] + sCurrentWord else begin iCountWords := 1; //We will loop through the elements in the array looking for the current word. while (iCountWords <= MAX_CELLS) and (bFound = FALSE) do begin if sCurrentWord = aWords[iCountWords] then begin //Here we have found the word at the position iCountWords of the array, aWords. bFound:=TRUE; //Below we start to build up a new string, Codes[iCountLines]. //The number is iCountWords, which is where the word is stored in the array. Codes[iCountLines] := Codes[iCountLines] + chr(iCountWords+128); if charCurrent in PUNCTUATION then Codes[iCountLines] := Codes[iCountLines] + charCurrent; end; inc(iCountWords); end; //while end; //if sCurrentWord:=''; //Set sCurrentWord back to a blank string so we can start again. end; end; end; end; procedure WriteCompressedFile; {The program now has an array of string, Codes, which is strings of characters effectively acting as pointers to the words in the array, aWords. We now need to output the vocabulary and the strings of characters to a new text file} var iCount : integer; begin AssignFile(fCompressed, 'Compressed.txt'); Rewrite(fCompressed); //We will loop through the array aWords, adding each word in the array, each //separated by a space. for iCount := 1 to iArrayCount do begin write(fCompressed, aWords[iCount], ' '); end; writeln(fCompressed); for iCount := 1 to iLastLine do writeln(fCompressed, Codes[iCount]); //Write the lines of code to the new file. CloseFile(fCompressed); end; procedure GetStats; {The program has now compressed the original file into a new file. We will now present the user with statistics of the compression.} var fByteOriginal, fByteCompressed : file; OriginalSize, CompressedSize: Int64; begin AssignFile(fByteOriginal, ORIGINAL); Reset(fByteOriginal, 1); OriginalSize := FileSize(fByteOriginal); CloseFile(fByteOriginal); writeln('Original size in bytes: ',OriginalSize); AssignFile(fByteCompressed, 'Compressed.txt'); Reset(fByteCompressed, 1); CompressedSize := FileSize(fByteCompressed); CloseFile(fByteCompressed); writeln('Compressed size in bytes: ', CompressedSize); writeln('% Compression: ',(OriginalSize - CompressedSize) * 100 / OriginalSize : 4 : 2); writeln('The compressed file is ',((CompressedSize/OriginalSize)*100):2:2,'% of the size of the original file.'); end; procedure init; var iCount : integer; begin for iCount := 1 to MAX_LINES do begin Strings[iCount] := ''; Codes[iCount] := ''; end; end; begin writeln('*** Adam',chr(96),'s Compressor! ***'); writeln(#13#10'Please press enter once you have placed a file',#13#10, ORIGINAL,' in the folder which contains this program.'); readln; //Everything between here and my next comment is not needed for the program to work. write(#13#10'Compressing'); for iCount := 1 to 10 do begin write('.'); sleep(50); end; writeln(#13#10#13#10'Successfully compressed!'); writeln(#13#10'The newly created compressed file named Compressed.txt has been placed'); writeln('in the folder which contains this program.'); write(#13#10'Loading statistics'); for iCount := 1 to 10 do begin write('.'); sleep(50); end; writeln(#13#10#13#10'Compression statistics:'#13#10); //Everything between here and my previous comment is not needed for the program to work. iArrayCount:=1; //We need to start by setting the array counter to 1, rather than the default 0. //Below we simply call the procedures one by one. Init; ReadFileIntoArray; ConvertStringsIntoArray; Compress; WriteCompressedFile; GetStats; readln; end.
program Decompressor; { Copyright (c) 2011 Adam and PPS Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License, as described at http://www.apache.org/licenses/ and http://www.pp4s.co.uk/licenses/ } {$APPTYPE CONSOLE} uses SysUtils, StrUtils; const MAX_CELLS = 150; MAX_LINES = 200; PUNCTUATION = ['.',',',';',':']; var Codes, Strings : array[1 .. MAX_LINES] of string; fDecompressed, fCompressed: Text; iCount, iArrayCount, iLastLine: integer; sArray, sCurrentWord : string; aWords: array[1 .. MAX_CELLS] of string; procedure ReadFileIntoString; {First, we need to read in the compressed file. The compressor program output outputted the array as a string of words separated by a space. then started a new line for each line of code characters. } var iCountLine : integer; begin AssignFile(fCompressed, 'Compressed.txt'); Reset(fCompressed); readln(fCompressed, sArray); iCountLine := 0; while not eof (fCompressed) do begin inc(iCountLine); readln(fCompressed, Codes[iCountline]); end; iLastLine := iCountLine; CloseFile(fCompressed); end; procedure Decompress; //After ReadFileIntoString we can decompress the compressed file. var iCountChars, iCountLines : integer; charCurrent : char; begin {We will start by reading each letter at a time, as we did in the compressor program. As before, when we get to a space we know we have reached the end of the word. We can then build up a lexicon of words like before.} for iCountChars := 1 to length(sArray) do begin if (sArray[iCountChars]=' ') then //If it is a space, we have reached the end of a word. begin aWords[iArrayCount]:=sCurrentWord; //Add the word to our lexicon. inc(iArrayCount); //Increment a count which stores the number of elements in the array. sCurrentWord:=''; //Set sCurrentWord back to a blank string so we can start again. end else begin sCurrentWord := sCurrentWord + sArray[iCountChars]; //Add a letter to sCurrentWord. end; end; {We now have a complete lexicon of words. Remember that we read in each string and replace each character with an ASCII value over 128 with the word stored in that position in our array of words.} for iCountLines := 1 to iLastline do begin for iCountChars := 1 to length(Codes[iCountLines]) do begin charCurrent := Codes[iCountLines][iCountChars]; if ord(charCurrent) < 127 then begin Strings[iCountLines] := Strings[iCountLines] + charCurrent; //Append a space if the current character is a one letter word rather than punctuation. if not (charCurrent in PUNCTUATION) then Strings[iCountLines] := Strings[iCountLines] + ' '; end else begin //Append the word from the lexicon. Strings[iCountLines] := Strings[iCountLines] + aWords[ord(charCurrent)-128]; //Append a space unless the following character is punctuation. if not (Codes[iCountLines][iCountChars+1] in PUNCTUATION) then Strings[iCountLines] := Strings[iCountLines] + ' '; end; end; end; end; procedure WriteDecompressedFile; //The procedure outputs the decompressed file. var iCountLines : integer; begin AssignFile(fDecompressed, 'Decompressed.txt'); Rewrite(fDecompressed); for iCountLines := 1 to iLastLine do writeln(fDecompressed, Strings[iCountLines]); CloseFile(fDecompressed); end; begin writeln('*** Adam',chr(96),'s Decompressor! ***'#13#10); writeln('Please press enter to confirm that you have run program Compressor'); readln; //Everything between here and my next comment is not needed for the program to work. writeln; write('Decompressing'); for iCount := 1 to 10 do begin write('.'); sleep(100); end; writeln(#13#10#13#10'Successfully decompressed!'#13#10); writeln('The newly created decompressed file named Decompressed.txt has been placed'); writeln('in the folder which contains this program.'); writeln; //Everything between here and my previous comment is not needed for the program to work. iArrayCount:=1; //We need to start by setting the array counter to 1, rather than the default 0. //Below we simply call the functions one by one. ReadFileIntoString; Decompress; WriteDecompressedFile; readln; end.
Remarks
Can you write a better compressor and decompressor?