Needleinahaystack

John 'Jake' Jacob | APL

The Problem

A string of text has had all of the spaces removed:

This e-mail is confidential and may be privileged. If you have received it in error, please contact the sender immediately by return e-mail then delete the e-mail and do not disclose its contents to any person.

Becomes

Thise-mailisconfidentialandmaybeprivileged.Ifyouhavereceiveditinerror,pleasecontactthesenderimmediatelybyreturne-mailthendeletethee-mailanddonotdiscloseitscontentstoanyperson.
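To make the starting point concrete, here is a minimal Python sketch of the transformation applied to the first sentence. The variable names are illustrative only.

```python
# The first sentence of the example text, with its spaces intact.
original = "This e-mail is confidential and may be privileged."

# Removing every space gives the unbroken string we have to work with.
squashed = original.replace(" ", "")

print(squashed)  # Thise-mailisconfidentialandmaybeprivileged.
```

Note that the hyphen and the full stop survive, which matters for the rules suggested below.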

How do we get the words back?

Suggested Approach

1. Find a set of rules to split the unbroken string into unique and indivisible parts – let's call them atoms. E.g. be/p/ri/vil/eg/ed/./If/you/ha/ve/re/cei/v/ed/it/in/er/ror/,/p/lea/se

  • Forget about case;
  • Any punctuation (other than hyphens) is a gift, as it implies a following space and marks the start or end of an atom;
  • An atom will be similar to a syllable but not the same, so we may get odd fragments like isolated consonants and vowels;
  • Are there special rules for vowel pairs, consonant pairs or double consonants?
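As a starting point for experiments, the splitting step could be sketched in Python as below. The rule used here is only one hypothetical choice, not a settled answer: an atom is a run of consonants followed by a run of vowels, or a leftover run of consonants, or a single non-letter character. It treats "y" as a consonant and does not reproduce the hand-made split shown above. The names `ATOM_RE` and `atomize` are made up for this sketch.

```python
import re

# Hypothetical atom rule (an assumption for illustration):
#   1. zero or more consonants followed by one or more vowels, or
#   2. a leftover run of consonants, or
#   3. any single non-letter character (punctuation, hyphen, digit).
CONSONANTS = "b-df-hj-np-tv-z"  # every letter except a, e, i, o, u
ATOM_RE = re.compile(
    rf"[{CONSONANTS}]*[aeiou]+|[{CONSONANTS}]+|[^a-z]"
)

def atomize(text):
    """Split an unbroken string into atoms, ignoring case."""
    return ATOM_RE.findall(text.lower())

print(atomize("beprivileged."))
# ['be', 'pri', 'vi', 'le', 'ge', 'd', '.']
```

Even this crude rule produces the isolated consonants mentioned above (the final "d"), and the punctuation mark comes out as its own atom. One weakness is that an atom can straddle a word boundary: "Ifyou" comes out as "i" and "fyou". So the question about special rules for vowel and consonant pairs is still very much open.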

2. Find a set of rules to combine a sequence of atoms back into words.

  • What are the binding rules for atoms, and how sticky are they relative to each other?
  • It may be possible to ‘learn’ the joining rules by comparing how the atoms fit into the correct answer.
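One way the learning idea might be sketched in Python: take a correct answer that has already been cut into atoms, count how often each adjacent pair of atoms sits inside one word rather than across a space, and use that fraction as the pair's stickiness. Everything here is an assumption for illustration: the function names, the input format (a list of words, each a list of atoms), the 0.5 threshold, and the choice to join pairs that were never seen.

```python
from collections import defaultdict

def learn_stickiness(words):
    """words: the correct answer as a list of words, each a list of atoms.
    Returns {(left, right): fraction of sightings inside a single word}."""
    joined = defaultdict(int)
    seen = defaultdict(int)
    prev = None
    for atoms in words:
        for i, atom in enumerate(atoms):
            if prev is not None:
                seen[(prev, atom)] += 1
                if i > 0:  # prev belongs to the same word
                    joined[(prev, atom)] += 1
            prev = atom
    return {pair: joined[pair] / n for pair, n in seen.items()}

def join_atoms(atoms, stickiness, threshold=0.5):
    """Rebuild text, inserting a space between weakly bound atoms.
    Unseen pairs default to 1.0 (joined), an arbitrary choice."""
    if not atoms:
        return ""
    out = [atoms[0]]
    for left, right in zip(atoms, atoms[1:]):
        if stickiness.get((left, right), 1.0) < threshold:
            out.append(" ")
        out.append(right)
    return "".join(out)

# Atoms taken from the example split in step 1, lower-cased.
answer = [["be"], ["p", "ri", "vil", "eg", "ed", "."],
          ["if"], ["you"], ["ha", "ve"]]
stickiness = learn_stickiness(answer)

flat = [atom for word in answer for atom in word]
print(join_atoms(flat, stickiness))  # be privileged. if you have
```

Testing on the same text the rules were learned from is circular, of course. The real question is how well the learned stickiness carries over to text it has not seen, and whether pairs alone are enough or longer runs of atoms are needed.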