Compressor and Decompressor

by Adam Greenberg: U6 Age ~17

Introduction

Adam had the idea of a program for the compression of text files. He would identify all the words between spaces and/or full stops, and add all those containing more than one letter to a lexicon consisting of an array of string. (We use the word lexicon in favour of dictionary or vocabulary in order to help you to remember the term lexical analysis - the first stage in the compilation process). Each character in a code string of characters would provide an index to the appropriate word in the array, so that a single character could represent a word. The compressed file consisted of the lexicon (with adjacent words separated by a space) on one line of a text file and another line comprising the code string. The decompression program read the words in the compressed file into an array and used the line of code characters to place the words in their correct positions in the decompressed file. Spaces were not saved but were added automatically on decompression.

Adam produced much of the code and ample comments for programs Compressor and Decompressor. We extended them (with Adam's blessing) to handle more punctuation and multiple lines of text. (Adam quite rightly gave precedence to his coursework at this stage of his A2 studies). You can add characters to the PUNCTUATION set constant in both programs to suit your needs.

In program Compressor, characters with ASCII values above 128 are used as indices to the lexicon, and other characters in the string are the original single letter words or punctuation. We are limited in the number of different words to which we can provide indices with a single character, and need to choose our text documents carefully. Two very different examples, a list of chemical reactions and a tennis match commentary, provide the opportunity to convey plenty of information with relatively few different words. We include both the test files reactions.txt and tennis.txt in a zip file together with the source code of programs Compressor and Decompressor. You need to put both programs and the test files in the same folder. We did not compress the downloadable zip file with program Compressor! The compression provided by the standard utilities is much better, even for our contrived text files. However, their code (which works for any type of file) is very much longer and more complicated.

A sample output from the program follows.

Sample output

Output

The Programs

program Compressor;
{
    Copyright (c) 2011 Adam Greenberg and PPS

    Licensed under the Apache License, Version 2.0 (the "License"); you may not
    use this file except in compliance with the License, as described at
    http://www.apache.org/licenses/ and http://www.pp4s.co.uk/licenses/
}
  {$APPTYPE CONSOLE}
uses
  SysUtils, StrUtils;
const
  MAX_CELLS = 150;
  MAX_LINES = 200;
  ORIGINAL = 'reactions.txt';
  PUNCTUATION = ['.',',',';',':'];
var
  fOriginal, fCompressed : Text;
  iCount, iArrayCount, iLastLine : integer;
  sCurrentWord : string;
  Strings, Codes : array[1 .. MAX_LINES] of string;
  aWords: array[1 .. MAX_CELLS] of string;
  bAdd, bFound : Boolean;

procedure ReadFileIntoArray;
  //We read the file into an array of strings, Strings.
var
  iLineNumber : integer;
begin
  iLineNumber := 0;
  AssignFile(fOriginal, ORIGINAL);
  Reset(fOriginal);
  while not eof(fOriginal) do
    begin
      inc(iLineNumber);
      readln(fOriginal, Strings[iLineNumber]);
      //Append a space to mark the end of the last word in a line
      Strings[iLineNumber] := Strings[iLineNumber] + ' ';
    end;
  CloseFile(fOriginal);
  iLastLine := iLineNumber;
end;

procedure AddToArray(sWord : string);

{The procedure AddToArray adds a word to our array.
See the procedure 'ConvertStringsIntoArray' below it first.
This procedure is only called from there.}

var
  iCountWord : integer;
begin
  if length(sWord) > 1 then //We won't bother adding the word to the lexicon if it is only one
                            //character, as we can't achieve any compression with this.
    begin
      bAdd:=TRUE;
      //Below, we check if the word that has been read from the string in the procedure
      //that called this one is already in the array. If it is, we do not want to add
      //it again.
      for iCountWord := 1 to MAX_CELLS do
        begin
          if sWord = aWords[iCountWord] then bAdd:=FALSE;
        end;
      if bAdd=TRUE then //This would be called only if the word does not already exist in the array.
        begin
          aWords[iArrayCount] := sWord; //Add the word to the array.
          inc(iArrayCount); //Increment a count which stores the number of elements in the array.
        end;
    end;
end;

procedure ConvertStringsIntoArray;

{The file has been read in to the program and is now stored as an array of strings, Strings.
We now want to go through the strings and separate them into words, and add each
word to an array of words (our lexicon), aWords.}

var
  iCountChars, iCountLines : integer;
begin
  for iCountLines := 1 to iLastLine do
    begin
      for iCountChars := 1 to length(Strings[iCountLines]) do
        begin
         {Below we check if the character is a space or punctuation. If it is then
         we have reached the end of a word and we can skip this step. If it is not
         then we need to continue to build up the word, one character at a time,
          each time appending the next character in the string until we reach punctuation
          which symbolises the end of a word.}
      if not (Strings[iCountLines][iCountChars] in PUNCTUATION + [' '])  then
        begin
          sCurrentWord := sCurrentWord + Strings[iCountLines][iCountChars]; //Add a letter to sCurrentWord.
        end
      else //If the current character is a full stop, comma or space, then we have reached the end of
             //a word, and so we can now add it to the array.
        begin
          AddToArray(sCurrentWord); //Call the procedure 'AddToArray'.
          sCurrentWord:= ''; //Set sCurrentWord back to a blank string so we can start again.
        end;
    end; //for iCountchars
  end; //for iCountLines
  {We could remove from the array words used once only and include them in the code string instead.
  This would improve the compression slightly. Otherwise, we could compress while adding words to the
  array, but the code may be easier to follow by using these two separate procedures
  ConvertStringsIntoArray and Compress.}
end;//proc

procedure Compress;

{The file has already been read in as a string and the words have been stored in an
array, aWords - this is our lexicon. All the words in the array are in a specific
cell in the array. We will now re-read through the array of strings, Strings and replace the
words with the number of the cell in the array in which the word is stored.}

var
  iCountChars, iCountWords, iCountLines  : integer;
  charCurrent : char;
begin
  for iCountLines := 1 to iLastline do
    begin
      for iCountChars := 1 to length(Strings[iCountLines]) do //We will be reading one character at a time
        begin
          bFound:=FALSE;
          {Below we check if the character is a space, full stop or comma. If it is then
           we have reached the end of a word and we can skip this step. If it is not
           then we need to continue to build up the word, one character at a time,
           each time appending the next character in the string until we reach a space
            or full stop which symbolises the end of a word.}
           charCurrent :=  Strings[iCountLines][iCountChars];
           if not (charCurrent in PUNCTUATION + [' '])  then
             begin
              sCurrentWord := sCurrentWord + charCurrent;
              //In the above line we have appended the current character in the string sFile
               //that we are reading to a string sCurrentWord.
              end
           else  //If the current character is a full stop or space, then we have reached the end of
              //a word, and we can now compress that word to a character dependant upon the number of the cell
              //which stored this word in aWords.
            begin
              if length(sCurrentWord) = 1 then  //It was not worth saving the word in the lexicon
                Codes[iCountLines] := Codes[iCountLines] + sCurrentWord
            else
              begin
                iCountWords := 1;
                //We will loop through the elements in the array looking for the current word.
                while (iCountWords <= MAX_CELLS) and (bFound = FALSE) do
                  begin
                    if sCurrentWord = aWords[iCountWords] then
                      begin
                        //Here we have found the word at the position iCountWords of the array, aWords.
                         bFound:=TRUE;
                        //Below we start to build up a new string, Codes[iCountLines].
                        //The number is iCountWords, which is where the word is stored in the array.
                        Codes[iCountLines] := Codes[iCountLines] + chr(iCountWords+128);
                        if charCurrent in PUNCTUATION then Codes[iCountLines] := Codes[iCountLines] + charCurrent;
                      end;
                   inc(iCountWords);
                  end; //while
              end;  //if
          sCurrentWord:=''; //Set sCurrentWord back to a blank string so we can start again.
        end;
    end;
  end;
end;

procedure WriteCompressedFile;

{The program now has an array of string, Codes, which is strings of characters effectively
acting as pointers to the words in the array, aWords. We now need to output the vocabulary
and the strings of characters to a new text file}

var
  iCount : integer;
begin
  AssignFile(fCompressed, 'Compressed.txt');
  Rewrite(fCompressed);
  //We will loop through the array aWords, adding each word in the array, each
  //separated by a space.
  for iCount := 1 to iArrayCount do
    begin
      write(fCompressed, aWords[iCount], ' ');
    end;
  writeln(fCompressed);
  for iCount := 1 to iLastLine do
    writeln(fCompressed, Codes[iCount]);  //Write the lines of code to the new file.
  CloseFile(fCompressed);
end;


procedure GetStats;
{The program has now compressed the original file into a new file. We will now
present the user with statistics of the compression.}
var
  fByteOriginal, fByteCompressed : file;
  OriginalSize, CompressedSize: Int64;
begin
  AssignFile(fByteOriginal, ORIGINAL);
  Reset(fByteOriginal, 1);
  OriginalSize := FileSize(fByteOriginal);
  CloseFile(fByteOriginal);
  writeln('Original size in bytes: ',OriginalSize);
  AssignFile(fByteCompressed, 'Compressed.txt');
  Reset(fByteCompressed, 1);
  CompressedSize := FileSize(fByteCompressed);
  CloseFile(fByteCompressed);
  writeln('Compressed size in bytes: ', CompressedSize);
  writeln('% Compression: ',(OriginalSize - CompressedSize) * 100 / OriginalSize : 4 : 2);
  writeln('The compressed file is ',((CompressedSize/OriginalSize)*100):2:2,'% of the size of the original file.');
end;

procedure init;
var
  iCount : integer;
begin
  for iCount := 1 to MAX_LINES do
    begin
      Strings[iCount] := '';
      Codes[iCount] := '';
    end;
end;

begin
  writeln('*** Adam',chr(96),'s Compressor! ***');
  writeln(#13#10'Please press enter once you have placed a file',#13#10,
          ORIGINAL,' in the folder which contains this program.');
  readln;
  //Everything between here and my next comment is not needed for the program to work.

  write(#13#10'Compressing');
  for iCount := 1 to 10 do
    begin
       write('.');
       sleep(50);
    end;
  writeln(#13#10#13#10'Successfully compressed!');
  writeln(#13#10'The newly created compressed file named Compressed.txt has been placed');
  writeln('in the folder which contains this program.');
  write(#13#10'Loading statistics');
  for iCount := 1 to 10 do
    begin
       write('.');
       sleep(50);
    end;
  writeln(#13#10#13#10'Compression statistics:'#13#10);
  //Everything between here and my previous comment is not needed for the program to work.
  iArrayCount:=1; //We need to start by setting the array counter to 1, rather than the default 0.
  //Below we simply call the procedures one by one.
  Init;
  ReadFileIntoArray;
  ConvertStringsIntoArray;
  Compress;
  WriteCompressedFile;
  GetStats;
  readln;
end.
program Decompressor;
{
    Copyright (c) 2011 Adam and PPS

    Licensed under the Apache License, Version 2.0 (the "License"); you may not
    use this file except in compliance with the License, as described at
    http://www.apache.org/licenses/ and http://www.pp4s.co.uk/licenses/
}

  {$APPTYPE CONSOLE}
uses
  SysUtils, StrUtils;
const
  MAX_CELLS = 150;
  MAX_LINES = 200;
  PUNCTUATION = ['.',',',';',':'];
var
  Codes, Strings : array[1 .. MAX_LINES] of string;
  fDecompressed, fCompressed: Text;
  iCount, iArrayCount, iLastLine: integer;
  sArray, sCurrentWord : string;
  aWords: array[1 .. MAX_CELLS] of string;

procedure ReadFileIntoString;

  {First, we need to read in the compressed file. The compressor program output
  outputted the array as a string of words separated by a space.
  then started a new line for each line of code characters. }

var
  iCountLine : integer;
begin
  AssignFile(fCompressed, 'Compressed.txt');
  Reset(fCompressed);
  readln(fCompressed, sArray);
  iCountLine := 0;
  while not eof (fCompressed) do
  begin
    inc(iCountLine);
    readln(fCompressed, Codes[iCountline]);
  end;
  iLastLine := iCountLine;
  CloseFile(fCompressed);
end;

procedure Decompress;
  //After ReadFileIntoString we can decompress the compressed file.
var
  iCountChars, iCountLines : integer;
  charCurrent : char;
begin
 {We will start by reading each letter at a time, as we did in the compressor program.
  As before, when we get to a space we know we have reached the end of the word. We can
  then build up a lexicon of words like before.}

   for iCountChars := 1 to length(sArray) do
      begin
        if (sArray[iCountChars]=' ') then //If it is a space, we have reached the end of a word.
          begin
            aWords[iArrayCount]:=sCurrentWord; //Add the word to our lexicon.
            inc(iArrayCount); //Increment a count which stores the number of elements in the array.
            sCurrentWord:=''; //Set sCurrentWord back to a blank string so we can start again.
          end
        else
          begin
            sCurrentWord := sCurrentWord + sArray[iCountChars]; //Add a letter to sCurrentWord.
          end;
   end;
  {We now have a complete lexicon of words. Remember that we read in
  each string and replace each character with an ASCII value over 128 with the word stored in
  that position in our array of words.}
  for iCountLines := 1 to iLastline do
    begin
      for iCountChars := 1 to length(Codes[iCountLines]) do
        begin
          charCurrent := Codes[iCountLines][iCountChars];
          if ord(charCurrent) < 127 then
            begin
              Strings[iCountLines] := Strings[iCountLines] + charCurrent;
              //Append a space if the current character is a one letter word rather than punctuation.
              if not (charCurrent in PUNCTUATION) then Strings[iCountLines] := Strings[iCountLines] + ' ';
            end
          else
            begin
              //Append the word from the lexicon.
              Strings[iCountLines] := Strings[iCountLines] + aWords[ord(charCurrent)-128];
              //Append a space unless the following character is punctuation.
              if not (Codes[iCountLines][iCountChars+1] in PUNCTUATION) then
                Strings[iCountLines] := Strings[iCountLines] + ' ';
            end;
        end;
    end;
end;

procedure WriteDecompressedFile;
//The procedure outputs the decompressed file.
var
  iCountLines : integer;
begin
  AssignFile(fDecompressed, 'Decompressed.txt');
  Rewrite(fDecompressed);
  for iCountLines := 1 to iLastLine do
    writeln(fDecompressed, Strings[iCountLines]);
  CloseFile(fDecompressed);
end;

begin
  writeln('*** Adam',chr(96),'s Decompressor! ***'#13#10);
  writeln('Please press enter to confirm that you have run program Compressor');
  readln;
    //Everything between here and my next comment is not needed for the program to work.
  writeln;
  write('Decompressing');
  for iCount := 1 to 10 do
    begin
       write('.');
       sleep(100);
    end;
  writeln(#13#10#13#10'Successfully decompressed!'#13#10);
  writeln('The newly created decompressed file named Decompressed.txt has been placed');
  writeln('in the folder which contains this program.');
  writeln;
    //Everything between here and my previous comment is not needed for the program to work.
  iArrayCount:=1; //We need to start by setting the array counter to 1, rather than the default 0.
    //Below we simply call the functions one by one.
  ReadFileIntoString;
  Decompress;
  WriteDecompressedFile;
  readln;
end.

Remarks

Can you write a better compressor and decompressor?

Programming - a skill for life!

Student programs to inspire you!