Sunday, November 23, 2008

 

Anagram

Anagram is re-arranging the letters in the word such that it gives a different meaning word. Eg. syrup and pyres.

The problem is to  find all the anagrams in the english dictionary. The best solution I found is in the Jon Bentley's book  Programming Pearls  book.

Here is the algorithm:

Step 1 : Arrange each words in the dictionary in an order(Alphabetic), this gives the signature of the word.
Step 2 : Then sort these signature along with the word.
Step 3 : The words with the same signature are the anagrams.

words
signature( The word sorted in Alphabetical order)
signature with words,
Anagram, words with same signature
pal
alp
alp pal

lap
alp
alp  lap
pal lap
cat
act
act  cat



We will use the unix philosophy for solving this problem .

Unix philosophy states:
Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.

We will write a  c++ program to generate the signature and the word pair, use the unix sort program to sort the signatures and use AWK script to extract the words with the same signature. The resulting words are the anagrams.

#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <algorithm>
bool charComp (char i, char j)
{
 return (i < j);
}
int main (int argc, char * const argv[])
{
 std::string signature;
 std::ifstream file;
 std::istream_iterator<std::string> inputItr;
 if(argc > 1)
 {
  file.open(argv[1]);
  inputItr = std::istream_iterator<std::string>(file);
 }
 else
 {
  inputItr = std::istream_iterator<std::string>(std::cin);
 }
 for (; inputItr != std::istream_iterator<std::string>(); inputItr++)
 {
   signature = *inputItr;
   // Sort the word in alphabetical order
   std::sort(signature.begin(), signature.end(), charComp);
   // Print the signature of the word and the word
   std::cout << signature << "  " << *inputItr   << std::endl;
 }
 return 0;
}
The program reads the words as the string either from file or through the standard input. Then sorts  each word alphabetically which makes the signature of the word and prints the signature and the word.

The sort program will sort the output of the program. It arranges the words with same signature next to each other. The command looks like this:

~ Shankar$ cat /usr/share/dict/words |tr [A-Z] [a-z]| ./Anagram |sort

Here cat is feeding the words in the dictionary to the anagram program. The words are all converted to lower case before feeding by tr. The anagram program will print the signature of each word along with the word onto standard output like this:

aaaabcilllz  calabazilla
aaaabcl  alcaaba
aaaabcllmnrtu  antambulacral


The AWK filter  program prints the words with the same signature on a line.

#!/usr/bin/awk -f
# Shankar Chakkere (shankara_c@yahoo.com)
# Given the input of the form
#    signature word
#
#    prsu  spur
#    prsuu  usurp
#    prsuy  pursy
#    prsuy  pyrus
#    prsuy  syrup
#    prsuyy  syrupy
#   .......
# print all the words which has the same signature
#   pursy pyrus syrup
# which are anagrams.
function saveState(state)
{
 prevSignature = $1;
 prevWord = $2;
 matched = state;
}
{
 if (NR == 1)
 {
 saveState(0);
 }
 else
 {
 # for all the lines greater than 1
 # print the previous word if its signature had matched
 if($1 == prevSignature)
 {
 printf("%s ",prevWord);
 saveState(1);
 }
 else
 {
 # print the previous word if its signature had matched.
 # This is the last word of the anagram list, so print line feed
 if(matched == 1)
 printf("%s\n ",prevWord);
 saveState(0);
 }
 }
}

Tying them all together:

~ Shankar$ cat /usr/share/dict/words |tr [A-Z] [a-z]| ./Anagram |sort |./printAnagram.awk

Will print the list of anagrams on to the standard output.




Labels:


 

Counting Bits

There are many reasons we need to count the number of bits set in a data. It can be used to calculate the parity of a byte. To find the hamming distance etc.There are many ways to do it in C / C++.

Method 1

 

int countBits(unsigned int x)
{
int i=0;
while(x != 0)
{
if(x & 0x1)
{
++i;
}
x >>= 1;
}
return i;
}

How it works?

Unlike the assembly language C or C++ don't have a operator to test if the bit is set. Here we mask the rest of the bits except the least significant bit and test if it is zero or not. It is guaranteed to work for unsigned number because the C++ standard guarantees it.

Shift operators

The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a non-negative value, the value of the result is the integral part of the quotient of E1/2E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.

What it means is , when you shift right, the most significant bit is padded with zero as show in the diagram below.
countBits.jpg

Disadvantage:


For the worst case when least significant and the most significant bits are set eg. 10000001.

Method 2


This method is an exercise in Kernighan and Ritchie's "The C programming Language" book.
int countBits(unsigned int x)
{
int i=0;
while (x != 0)
{
++i;
x &= x - 1; // deletes the rightmost bit that is set in x
}
return i;
}
How it works?

This method uses the property of subtracting a binary number with one. When its done the least significant bit set will be reset.
For decimal 3 which is 011 in binary, the following diagram illustrates the process.
011         value
& 010         value -1     count = 1
------
010

010         value
& 001         value -1     count = 2
------
000
As we see the bit set count is 2.

Advantage:

The number of loops depends on the number of bits set, even for the worst case eg. 10000001 it does it in two loops.

Method 3

Using lookup table. This speeds the operation because we are doing a nibble (4 bits ) at a time instead of a bit at a time.

const int int2bitCount[16] =
{
0,1,1,2,
1,2,2,3,
1,2,3,3,
2,3,4,4
};
int countBits(unsigned int x)
{
int count = 0;
while(x != 0)
{
count += int2bitCount[ x & 0xf ];
x >>= 4;
}
return count;
}

How it works?

The lookup table stores the number of bits that are set for its index. Here we are creating a lookup table for a nibble. 4 bits at a time seems to be a best compromise for space / time. We are also assuming that the integer is a multiple of 4 bits.


Advantage:

The number of loops for the worst case is sizeof( int ) / 4.

This is how you invoke it.

#include <iostream>

int main(int argc, char* argv[])
{
unsigned int x = 0xfffe;
std::cout << std::hex << x << " has " << std::dec << countBits(x) << " bits" << std::endl;
return 0;
}



Labels:


This page is powered by Blogger. Isn't yours?