Sunday, November 23, 2008
Anagram
Anagram is re-arranging the letters in the word such that it gives a different meaning word. Eg. syrup and pyres.
The problem is to find all the anagrams in the english dictionary. The best solution I found is in the Jon Bentley's book Programming Pearls book.
Here is the algorithm:
Step 1 : Arrange each words in the dictionary in an order(Alphabetic), this gives the signature of the word.
Step 2 : Then sort these signature along with the word.
Step 3 : The words with the same signature are the anagrams.
words | signature( The word sorted in Alphabetical order) | signature with words, | Anagram, words with same signature |
pal | alp | alp pal | |
lap | alp | alp lap | pal lap |
cat | act | act cat |
We will use the unix philosophy for solving this problem .
Unix philosophy states:
Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
We will write a c++ program to generate the signature and the word pair, use the unix sort program to sort the signatures and use AWK script to extract the words with the same signature. The resulting words are the anagrams.
#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <algorithm>
bool charComp (char i, char j)
{
return (i < j);
}
int main (int argc, char * const argv[])
{
std::string signature;
std::ifstream file;
std::istream_iterator<std::string> inputItr;
if(argc > 1)
{
file.open(argv[1]);
inputItr = std::istream_iterator<std::string>(file);
}
else
{
inputItr = std::istream_iterator<std::string>(std::cin);
}
for (; inputItr != std::istream_iterator<std::string>(); inputItr++)
{
signature = *inputItr;
// Sort the word in alphabetical order
std::sort(signature.begin(), signature.end(), charComp);
// Print the signature of the word and the word
std::cout << signature << " " << *inputItr << std::endl;
}
return 0;
}
The program reads the words as the string either from file or through the standard input. Then sorts each word alphabetically which makes the signature of the word and prints the signature and the word.
The sort program will sort the output of the program. It arranges the words with same signature next to each other. The command looks like this:
~ Shankar$ cat /usr/share/dict/words |tr [A-Z] [a-z]| ./Anagram |sort
Here cat is feeding the words in the dictionary to the anagram program. The words are all converted to lower case before feeding by tr. The anagram program will print the signature of each word along with the word onto standard output like this:
aaaabcilllz calabazilla
aaaabcl alcaaba
aaaabcllmnrtu antambulacral
The AWK filter program prints the words with the same signature on a line.
#!/usr/bin/awk -f
# Shankar Chakkere (shankara_c@yahoo.com)
# Given the input of the form
# signature word
#
# prsu spur
# prsuu usurp
# prsuy pursy
# prsuy pyrus
# prsuy syrup
# prsuyy syrupy
# .......
# print all the words which has the same signature
# pursy pyrus syrup
# which are anagrams.
function saveState(state)
{
prevSignature = $1;
prevWord = $2;
matched = state;
}
{
if (NR == 1)
{
saveState(0);
}
else
{
# for all the lines greater than 1
# print the previous word if its signature had matched
if($1 == prevSignature)
{
printf("%s ",prevWord);
saveState(1);
}
else
{
# print the previous word if its signature had matched.
# This is the last word of the anagram list, so print line feed
if(matched == 1)
printf("%s\n ",prevWord);
saveState(0);
}
}
}
Tying them all together:
~ Shankar$ cat /usr/share/dict/words |tr [A-Z] [a-z]| ./Anagram |sort |./printAnagram.awk
Will print the list of anagrams on to the standard output.
Labels: c++