CSSS Coding Competition

CSSS Coding Competition -- October 2002

Problem 2

Write a program to find the longest repeating substring of a given arbitrarily long character string. For example, in the string

        The 123 computers used in this competition to perform matrix
        computations are supplied by Commercial Computer Corporation.

the repeating substrings include "co", "com", "comp", and ""comput". (There are also many trivial substrings consisting of one letter.) The longest repeating substring is "comput". Note that the comparisons are not case sensitive --- "Commercial" and "Computer" are included in these repeating substrings. Only upper and lower case letters are considered to be parts of substrings; blanks and punctuation symbols are ignored.

The input file consists of a string which may extend over more than one line. It ends with a blank line.

The output file consists of a single word: the longest repeating substring in the input.

Example input file:

        The 123 computers used in this competition to perform matrix
        computation are supplied by Commercial Computer Corporation.

The corresponding output file:

        comput

Test Data

First:

The 123 computers used in this competition to perform matrix
computation are supplied by XYZ computer corporation

Expected output:

comput

Second:

The 123 computers used in this competition to perform matrix
computation are supplied by commercial computers corp.

Expected output:

computers

Third:

a abcd d def abcd def abc gh abcd swa abc def abcd ss

Expected output:

abcd

Discussion

I made the mistake of changing this problem shortly before the competition: the original problem had specified case-sensitive matches and I changed this to case-insensitive matches (to make it slightly more difficult). That made the problem statement incorrect (because the answer should have been "computer" rather than "comput") and this may have confused some teams.

Jay supplied the first test data set; I added the second. After the competition started, I decided that both test sets were rather similar to the given example, and I added the third test set.

When solutions started coming in, I realized that some teams were reading only one line of input. The sample data suggests that there could be more than one line of input, but perhaps not strongly enough. I removed new lines from the test data for subsequent trials when I realized that some teams were stuck on this detail. The program, of course is much easier to write if there is only one line of text. Once you realize that there might be several lines of input, it is not hard to write a program that reads to the end of the file, concatenating lines as it goes.

The key to solving this problem quickly is to imagine two copies of the string, one sliding past the other:

The 123 computers used in this competition to perform matrix
                       The 123 computers used in this competition to perform matrix

Starting a scan from "T" in the lower string, we find that "c" is in the same place in both strings. We move along until we discover that "e" and "u" are different and, at that point, we have "comp" as a repeating substring. If this substring is longer than one we have seen before, we record it.

There is a bug in the solution below, but it would nevertheless have passed in the competition.

Code

#include <iostream.h>
#include <fstream.h>
#include <strstrea.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

ifstream in;
ofstream out;

void main(int argc, char *argv[])
{
   in.open(argv[1]);
   out.open(argv[2]);
   char str[4000] = "";
   while(in)
   {
      char buf[500];
      in.getline(buf, 500);
      strcat(str,buf);
      strcat(str, " ");
   }

   for (char *p = str; *p; p++)
      *p = tolower(*p);

   char *fst = NULL;
   char *lst = NULL;
   for (unsigned int d = 1; d < strlen(str); d++)
   {
      char *p = str;
      char *r = str;
      for (char *q = str + d; *q; p++, q++)
      {
         if (!isalpha(*p) || !isalpha(*q) || *p != *q)
         {
            if (p - r > lst - fst)
            {
               fst = r;
               lst = p;
            }
            r = p;
         }
      }
   }
   for (p = fst; p < lst; p++)
      out << *p;
   out << endl;
}