The Problem
A string of text has had all of the spaces removed:
This e-mail is confidential and may be privileged. If you have received it in error, please contact the sender immediately by return e-mail then delete the e-mail and do not disclose its contents to any person.
becomes:
Thise-mailisconfidentialandmaybeprivileged.Ifyouhavereceiveditinerror,pleasecontactthesenderimmediatelybyreturne-mailthendeletethee-mailanddonotdiscloseitscontentstoanyperson.
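To pin down exactly what has been lost, here is a minimal sketch of the forward transformation in Python. The function name strip_spaces is made up for illustration. It assumes that only space characters are removed, while case and punctuation survive, as in the example above.

```python
def strip_spaces(text):
    """Remove space characters only; case and punctuation are left untouched."""
    return text.replace(" ", "")


print(strip_spaces("If you have received it in error, please contact the sender."))
# Ifyouhavereceiveditinerror,pleasecontactthesender.
```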
How do we get the words back?
Suggested Approach
1. Find a set of rules to split the unbroken string into unique and indivisible parts, which we will call atoms. E.g. be/p/ri/vil/eg/ed/./If/you/ha/ve/re/cei/v/ed/it/in/er/ror/,/p/lea/se
- Forget about case;
- Any punctuation is a gift: apart from hyphens, it implies a following space and marks the start or end of an atom;
- An atom will be similar to a syllable but not identical, so we may get odd fragments such as isolated consonants and vowels;
- Are there special rules for vowel pairs, consonant pairs or double consonants?
2. Find a set of rules to combine a sequence of atoms back into words.
- What are the binding rules for atoms, and how sticky are they relative to each other?
- It may be possible to ‘learn’ the joining rules by comparing how the atoms fit into the correct answer.
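
The two steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a worked-out solution. Because the real atom-splitting rules are still an open question, single characters stand in for atoms here (they can never straddle a word boundary). The names atomize, learn_stickiness, rejoin and the 0.5 threshold are all hypothetical. 'Stickiness' is taken to mean the fraction of times a pair of adjacent atoms appeared without a space between them in a correctly spaced training text, and the training sentence is a toy example.

```python
import string
from collections import Counter

# Rule from the list above: punctuation implies a following space, except hyphens.
BREAKING = set(string.punctuation) - {"-"}


def atomize(text):
    """Placeholder atom rule: forget case and treat every character as one atom."""
    return [ch for ch in text.lower() if not ch.isspace()]


def learn_stickiness(spaced_text):
    """For each adjacent atom pair, return the fraction of times no space split it."""
    joined, seen = Counter(), Counter()
    prev, gap = None, False
    for ch in spaced_text.lower():
        if ch.isspace():
            gap = True  # remember that a space preceded the next atom
            continue
        if prev is not None:
            seen[(prev, ch)] += 1
            if not gap:
                joined[(prev, ch)] += 1
        prev, gap = ch, False
    return {pair: joined[pair] / seen[pair] for pair in seen}


def rejoin(unspaced_text, stickiness, threshold=0.5):
    """Insert a space wherever punctuation or a low stickiness score calls for one."""
    atoms = atomize(unspaced_text)
    out = atoms[:1]
    for prev, atom in zip(atoms, atoms[1:]):
        if prev in BREAKING:
            split = True
        else:
            # Pairs never seen in training default to 1.0, i.e. they stay joined.
            split = stickiness.get((prev, atom), 1.0) < threshold
        if split:
            out.append(" ")
        out.append(atom)
    return "".join(out)


model = learn_stickiness("the cat sat. the end")
print(rejoin("thecatsat.theend", model))  # the cat sat. the end
```

With atoms this small, a pair that occurs both inside a word and across a word boundary gets an in-between score, so a character-pair model cannot recover every space. That limitation is the motivation for looking for larger, syllable-like atoms.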