D2Zim Class Reference

EST Analyzer that uses the d2 algorithm to compute distances between two ESTs.

#include <D2Zim.h>

Public Member Functions

virtual ~D2Zim ()
 The destructor.
virtual void showArguments (std::ostream &os)
 Display valid command line arguments for this analyzer.
virtual bool parseArguments (int &argc, char **argv)
 Process command line arguments.
int initialize ()
 Method to begin EST analysis.
virtual std::string getName () const
 Method to obtain human-readable name for this EST analyzer.
virtual int setReferenceEST (const int estIdx)
 Set the reference EST id for analysis.
virtual int analyze ()
 Method to perform exhaustive EST analysis.

Protected Member Functions

virtual float getMetric (const int otherEST)
 Analyze and obtain a distance metric.
bool compareMetrics (const float metric1, const float metric2) const
 Method to compare two metrics generated by this class.
float getInvalidMetric () const
 Obtain an invalid (or the worst) metric generated by this analyzer.
virtual bool getAlignmentData (int &alignmentData)
 Get alignment data for the previous call to analyze method.
bool isDistanceMetric () const
 Determine if this EST analyzer provides distance metrics or similarity metrics.
void buildWordTable (int *wordTable, const char *s)
 Creates a "word table" mapping integer indices to integer hashes of words, in effect translating the sequence from a sequence of characters to a sequence of n-words (where n = wordSize).
void buildFdHashMaps (int *sed)
 Helper method to build frequency distribution hash map.
void refShiftUpdateFd (int *sed, const int framePos)
 Helper method to update the frequency hash map and squared Euclidean distance after the frame is shifted by 1 character on the reference sequence.
void rightShiftUpdateFd (int *sed, const int framePos)
 Helper method to update the frequency hash map and squared Euclidean distance after a 1 character right-shift on the comparison sequence.
void leftShiftUpdateFd (int *sed, const int framePos)
 Helper method to update the frequency hash map and squared Euclidean distance after a 1 character left-shift on the comparison sequence.

Protected Attributes

int * fdHashMap
 Instance variable that keeps track of the frequency differentials for words found in the current windows on the reference and comparison sequences.
int * s1WordTable
 Instance variable that maps an index in the reference sequence (sequence s1) to the hash of a word.
int * s2WordTable
 Instance variable that maps an index in the comparison sequence (sequence s2) to the hash of a word.

Static Protected Attributes

static int frameShift = 1
 Parameter to define number of characters to shift the frame on the reference sequence, when computing D2.
static size_t wordTableSize = 0
static int BitMask = 0

Private Member Functions

 D2Zim (const int refESTidx, const std::string &outputFileName)

Private Attributes

int alignmentMetric
 Instance variable to track alignment metric computed by the analyze() method.
int NumWordsWin

Static Private Attributes

static arg_parser::arg_record argsList []
 The set of arguments specific to the D2 algorithm.

Friends

class ESTAnalyzerFactory

Detailed Description

EST Analyzer that uses the d2 algorithm to compute distances between two ESTs.

This analyzer provides the mechanism to use the vanilla D2 algorithm to compute the distance between a pair of ESTs. The D2 implementation has been adapted from the implementations of WCD and CLU.

Note:
This D2 analyzer uses a word size of 8 base pairs. This is hard coded into the algorithm in the form of various assumptions (data types of variables etc.). A similar approach is used by other D2 implementations as well. This is a compromise between performance and flexibility of the implementation.

Definition at line 59 of file D2Zim.h.


Constructor & Destructor Documentation

D2Zim::~D2Zim (  )  [virtual]

The destructor.

The destructor frees any dynamic memory allocated by this object for its operations.

Definition at line 64 of file D2Zim.cpp.

References fdHashMap, s1WordTable, and s2WordTable.

D2Zim::D2Zim ( const int  refESTidx,
const std::string &  outputFileName 
) [private]

Definition at line 57 of file D2Zim.cpp.

References fdHashMap, s1WordTable, and s2WordTable.


Member Function Documentation

virtual int D2Zim::analyze (  )  [inline, virtual]

Method to perform exhaustive EST analysis.

This method performs the core task of comparing all ESTs to one another for a full analysis of ESTs. This is an additional feature of PEACE that is not used for clustering but only for offline analysis. Currently, this method merely calls the corresponding base class implementation that performs all the necessary operations.

Returns:
This method returns zero if all the processing proceeded successfully. On errors this method returns a non-zero value.

Reimplemented from FWAnalyzer.

Definition at line 175 of file D2Zim.h.

void D2Zim::buildFdHashMaps ( int *  sed  )  [protected]

Helper method to build frequency distribution hash map.

This is a helper method that is used to build the initial frequency differential hash map for the pair of ESTs currently being compared. This structure maps integer hashes of words to integers denoting the difference in word frequency from sequence 1 to sequence 2. Additionally, it computes the initial squared Euclidean distance (d2's distance measure).

Definition at line 174 of file D2Zim.cpp.

References fdHashMap, FWAnalyzer::frameSize, s1WordTable, s2WordTable, and FWAnalyzer::wordSize.

Referenced by getMetric().
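
The idea described above can be illustrated with a minimal sketch. It assumes the two word tables are already built, that a word from sequence 1 raises a differential by one and a word from sequence 2 lowers it by one, and that the distance is kept up to date incrementally. The function name buildFdSketch and its parameters are hypothetical and are not taken from D2Zim.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: builds frequency differentials for the first
// numWordsWin words of each table and returns the squared Euclidean
// distance between the two windows. fd must be zero-filled and large
// enough to be indexed by every word hash.
int buildFdSketch(std::vector<int>& fd, const std::vector<int>& s1Words,
                  const std::vector<int>& s2Words,
                  const std::size_t numWordsWin) {
    assert(s1Words.size() >= numWordsWin && s2Words.size() >= numWordsWin);
    int sed = 0;
    for (std::size_t i = 0; i < numWordsWin; i++) {
        // A word seen in s1 raises its differential: (f+1)^2 - f^2 = 2f + 1.
        sed += 2 * fd[s1Words[i]] + 1;
        fd[s1Words[i]]++;
        // A word seen in s2 lowers it: (f-1)^2 - f^2 = -2f + 1.
        sed += -2 * fd[s2Words[i]] + 1;
        fd[s2Words[i]]--;
    }
    return sed;
}
```

Updating the distance term by term avoids a second pass over the whole table to sum the squares.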

void D2Zim::buildWordTable ( int *  wordTable,
const char *  s 
) [protected]

Creates a "word table" mapping integer indices to integer hashes of words, in effect translating the sequence from a sequence of characters to a sequence of n-words (where n = wordSize).

Definition at line 148 of file D2Zim.cpp.

References ASSERT, FWAnalyzer::wordSize, and wordTableSize.

Referenced by getMetric(), and setReferenceEST().
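
A minimal sketch of how such a word table might be built follows. It assumes a 2-bit encoding per base (A=0, C=1, G=2, T=3) and a rolling hash masked to wordSize bases. The names encodeBase and buildWordTableSketch are hypothetical, and the actual implementation in D2Zim.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical 2-bit encoding of a nucleotide; not taken from D2Zim.cpp.
inline int encodeBase(const char c) {
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    default:            return 3;  // 'T' and, in this sketch, anything else
    }
}

// Returns one hash per word position: entry i is the hash of the
// wordSize-character word starting at s[i].
std::vector<int> buildWordTableSketch(const std::string& s,
                                      const int wordSize) {
    assert(wordSize > 0 && wordSize <= 14);  // keeps (hash << 2) inside int
    std::vector<int> table;
    if (s.size() < static_cast<std::size_t>(wordSize)) {
        return table;  // sequence too short to contain a single word
    }
    const int bitMask = (1 << (2 * wordSize)) - 1;
    int hash = 0;
    for (std::size_t i = 0; i < s.size(); i++) {
        // Shift in the next base and mask off the base that left the word.
        hash = ((hash << 2) | encodeBase(s[i])) & bitMask;
        if (i + 1 >= static_cast<std::size_t>(wordSize)) {
            table.push_back(hash);
        }
    }
    return table;
}
```

With this encoding every hash is a valid index into a table of 4^wordSize entries, which is why the hash can be used directly as a lookup key.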

bool D2Zim::compareMetrics ( const float  metric1,
const float  metric2 
) const [inline, protected, virtual]

Method to compare two metrics generated by this class.

This method provides the interface for comparing metrics generated by this ESTAnalyzer when comparing two different ESTs. This method returns true if metric1 is comparatively better than or equal to metric2.

Note:
As per the ESTAnalyzer API requirements, EST analyzers that are based on distance measures (such as this D2 analyzer) must override this method.
Parameters:
[in] metric1 The first metric to be compared against.
[in] metric2 The second metric to be compared against.
Returns:
This method returns true if metric1 is comparatively better than metric2.

Reimplemented from ESTAnalyzer.

Definition at line 210 of file D2Zim.h.

bool D2Zim::getAlignmentData ( int &  alignmentData  )  [protected, virtual]

Get alignment data for the previous call to analyze method.

This method can be used to obtain alignment data that was obtained, typically as a byproduct, from the previous call to the analyze() method. This method essentially returns the difference between the two windows that provided the minimum d2 distance value.

Parameters:
[out] alignmentData This parameter is updated with the alignment information generated as a part of the immediately preceding analyze(const int) method call.
Returns:
This method always returns true to indicate that alignment data is computed by this ESTAnalyzer.

Reimplemented from ESTAnalyzer.

Definition at line 258 of file D2Zim.cpp.

References alignmentMetric.

float D2Zim::getInvalidMetric (  )  const [inline, protected, virtual]

Obtain an invalid (or the worst) metric generated by this analyzer.

This method can be used to obtain an invalid metric value for this analyzer. This value can be used to initialize metric values.

Note:
Derived distance-based metric classes (such as this D2 analyzer) must override this method to provide a suitable value.
Returns:
This method returns an invalid (or the worst) metric of 1e7 for this EST analyzer.

Reimplemented from ESTAnalyzer.

Definition at line 227 of file D2Zim.h.
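
As an illustration of how the invalid metric and the metric comparison fit together, the sketch below picks the best of several distances. The stand-in functions are hypothetical, and the strict less-than comparison is an assumption, since the description of compareMetrics() does not settle whether equal metrics count as better.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for getInvalidMetric() and compareMetrics().
inline float invalidMetricSketch() { return 1e7f; }
inline bool betterDistanceSketch(const float m1, const float m2) {
    return m1 < m2;  // assumed: smaller distance is better, ties are not
}

// Returns the index of the best (smallest) distance, or -1 if no entry
// beats the invalid metric.
int bestIndexSketch(const std::vector<float>& metrics) {
    float best = invalidMetricSketch();
    int bestIdx = -1;
    for (std::size_t i = 0; i < metrics.size(); i++) {
        if (betterDistanceSketch(metrics[i], best)) {
            best = metrics[i];
            bestIdx = static_cast<int>(i);
        }
    }
    return bestIdx;
}
```

Starting from the invalid metric means the first real distance always replaces it, so no special case is needed for the first comparison.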

float D2Zim::getMetric ( const int  otherEST  )  [protected, virtual]

Analyze and obtain a distance metric.

This method can be used to compare a given EST with the reference EST (set via a call to the setReferenceEST() method).

Parameters:
[in] otherEST The index (zero based) of the EST with which the reference EST is to be compared.
Returns:
This method returns the distance value reported by the D2 algorithm.

Reimplemented from FWAnalyzer.

Definition at line 197 of file D2Zim.cpp.

References ASSERT, buildFdHashMaps(), buildWordTable(), CHECK_SED_AND_BREAK, frameShift, FWAnalyzer::frameSize, EST::getEST(), EST::getESTCount(), EST::getSequence(), leftShiftUpdateFd(), ESTAnalyzer::refESTidx, refShiftUpdateFd(), rightShiftUpdateFd(), and s2WordTable.

virtual std::string D2Zim::getName (  )  const [inline, virtual]

Method to obtain human-readable name for this EST analyzer.

This method provides a human-readable string identifying the EST analyzer. This string is typically used for display/debugging purposes (particularly via the PEACE Interactive Console).

Returns:
This method returns the string "d2zim", identifying that this d2 analyzer is based on the description in Zimmermann's paper.

Implements ESTAnalyzer.

Definition at line 140 of file D2Zim.h.

int D2Zim::initialize (  )  [virtual]

Method to begin EST analysis.

This method is invoked just before commencement of EST analysis. It creates the working storage used by this analyzer, namely the frequency differential map (fdHashMap) and the two word tables (s1WordTable and s2WordTable).

Returns:
Currently, this method always returns 0 (zero) to indicate initialization was successfully completed.

Reimplemented from FWAnalyzer.

Definition at line 100 of file D2Zim.cpp.

References BitMask, fdHashMap, FWAnalyzer::frameSize, EST::getMaxESTLen(), NumWordsWin, s1WordTable, s2WordTable, FWAnalyzer::wordSize, and wordTableSize.

bool D2Zim::isDistanceMetric (  )  const [inline, protected, virtual]

Determine if this EST analyzer provides distance metrics or similarity metrics.

This method can be used to determine if this EST analyzer provides distance metrics or similarity metrics. If this method returns true, then this EST analyzer returns distance metrics (smaller is better). On the other hand, if this method returns false, then this EST analyzer returns similarity metrics (bigger is better).

Returns:
This method returns true to indicate that this EST analyzer operates using distance metrics.

Reimplemented from ESTAnalyzer.

Definition at line 260 of file D2Zim.h.

void D2Zim::leftShiftUpdateFd ( int *  sed,
const int  framePos 
) [inline, protected]

Helper method to update the frequency hash map and squared Euclidean distance after a 1 character left-shift on the comparison sequence.

Definition at line 307 of file D2Zim.h.

References fdHashMap, NumWordsWin, and s2WordTable.

Referenced by getMetric().

bool D2Zim::parseArguments ( int &  argc,
char **  argv 
) [virtual]

Process command line arguments.

This method is used to process command line arguments specific to this EST analyzer. This method is typically used from the main method just after the EST analyzer has been instantiated. This method consumes all valid command line arguments. If the command line arguments were valid and successfully processed, then this method returns true.

Currently, this EST analyzer does not require any additional command line parameters. Consequently, it simply calls the corresponding method in the base class.

Note:
The ESTAnalyzer base class requires that derived EST analyzer classes must override this method to process any command line arguments that are custom to their operation. When this method is overridden don't forget to call the corresponding base class implementation to process common options.
Parameters:
[in,out] argc The number of command line arguments to be processed.
[in,out] argv The array of command line arguments.
Returns:
This method returns true if the command line arguments were successfully processed. Otherwise this method returns false. This method checks to ensure that a valid frame size and a valid word size have been specified.

Reimplemented from FWAnalyzer.

Definition at line 85 of file D2Zim.cpp.

References ESTAnalyzer::analyzerName, arg_parser::check_args(), and frameShift.

void D2Zim::refShiftUpdateFd ( int *  sed,
const int  framePos 
) [inline, protected]

Helper method to update the frequency hash map and squared Euclidean distance after the frame is shifted by 1 character on the reference sequence.

Definition at line 284 of file D2Zim.h.

References fdHashMap, NumWordsWin, and s1WordTable.

Referenced by getMetric().

void D2Zim::rightShiftUpdateFd ( int *  sed,
const int  framePos 
) [inline, protected]

Helper method to update the frequency hash map and squared Euclidean distance after a 1 character right-shift on the comparison sequence.

Definition at line 295 of file D2Zim.h.

References fdHashMap, NumWordsWin, and s2WordTable.

Referenced by getMetric().
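
The shift helpers can be pictured with a minimal sketch of a one-word right shift on the comparison sequence. It assumes that words from the comparison sequence contribute negatively to the differentials and that framePos indexes the first word of the current window. The name rightShiftSketch and its parameter list are hypothetical and differ from the actual inline methods in D2Zim.h.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: slides the comparison window one word to the right,
// from framePos to framePos + 1, updating fd and *sed in constant time.
void rightShiftSketch(std::vector<int>& fd, const std::vector<int>& s2Words,
                      const std::size_t numWordsWin,
                      const std::size_t framePos, int* sed) {
    assert(sed != nullptr);
    assert(framePos + numWordsWin < s2Words.size());
    // The leftmost word leaves the s2 window, so its differential rises.
    const int out = s2Words[framePos];
    *sed += 2 * fd[out] + 1;
    fd[out]++;
    // The next word enters the s2 window, so its differential falls.
    const int in = s2Words[framePos + numWordsWin];
    *sed += -2 * fd[in] + 1;
    fd[in]--;
}
```

A left shift would mirror this by removing the rightmost word and adding the word just before the window, and a shift on the reference sequence would use the opposite signs.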

int D2Zim::setReferenceEST ( const int  estIdx  )  [virtual]

Set the reference EST id for analysis.

This method is invoked just before a batch of ESTs is analyzed via a call to the analyze(EST *) method. This method currently saves the index in the instance variable for later lookup. Next, it creates a "word table" mapping integer indices to integer hashes of words, in effect translating the sequence from a sequence of characters to a sequence of n-words (where n = wordSize). This word table is kept until the reference EST is changed, which reduces overhead.

Note:
This method must be called only after the initialize() method is called.
Returns:
This method returns zero if estIdx was within the valid range of values. Otherwise this method returns a non-zero value as the error code.

Reimplemented from FWAnalyzer.

Definition at line 130 of file D2Zim.cpp.

References buildWordTable(), ESTAnalyzer::chain, EST::getEST(), EST::getESTCount(), EST::getSequence(), ESTAnalyzer::refESTidx, s1WordTable, and HeuristicChain::setReferenceEST().

void D2Zim::showArguments ( std::ostream &  os  )  [virtual]

Display valid command line arguments for this analyzer.

This method must be used to display all valid command line options that are supported by this analyzer. Currently, this analyzer does not require any special command line parameters.

Note:
The ESTAnalyzer base class requires that derived EST analyzer classes must override this method to display help for their custom command line arguments. When this method is overridden don't forget to call the corresponding base class implementation to display common options.
Parameters:
[out] os The output stream to which the valid command line arguments must be written.

Reimplemented from FWAnalyzer.

Definition at line 77 of file D2Zim.cpp.


Friends And Related Function Documentation

friend class ESTAnalyzerFactory [friend]

Definition at line 60 of file D2Zim.h.


Member Data Documentation

int D2Zim::alignmentMetric [private]

Instance variable to track alignment metric computed by the analyze() method.

This instance variable is used to hold the alignment metric that was computed in the previous analyze method call. By default this value is set to zero. The alignment metric is computed as the difference in the window positions (on the two ESTs being analyzed) with the minimum d2 distance.

Definition at line 410 of file D2Zim.h.

Referenced by getAlignmentData().
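
A minimal sketch of how such an alignment value could be tracked while windows are compared is shown below. The struct, the function name, and the sign convention (reference position minus comparison position) are assumptions made for illustration and are not taken from the D2Zim sources.

```cpp
#include <cassert>

// Hypothetical record of the best window pair seen so far.
struct BestWindowSketch {
    int minSed;     // smallest squared Euclidean distance seen so far
    int alignment;  // window position difference at that minimum
};

// Updates best when a strictly smaller distance is found. The sign
// convention (s1 position minus s2 position) is assumed.
inline void recordWindowSketch(BestWindowSketch& best, const int sed,
                               const int s1FramePos, const int s2FramePos) {
    assert(sed >= 0);  // a sum of squares is never negative
    if (sed < best.minSed) {
        best.minSed = sed;
        best.alignment = s1FramePos - s2FramePos;
    }
}
```

Using a strict comparison keeps the first window pair that reaches the minimum, which is one possible tie-breaking choice among several.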

arg_parser::arg_record D2Zim::argsList[] [static, private]

Initial value:
 {
    {"--frameShift", "Frame Shift (default=1)",
     &D2Zim::frameShift, arg_parser::INTEGER},
    {NULL, NULL, NULL, arg_parser::BOOLEAN}
}

The set of arguments specific to the D2 algorithm.

This class variable contains a static list of arguments that are specific only to the D2 analyzer class. This argument list is statically defined and shared by all instances of this class.

Note:
Use of static arguments and parameters makes the D2 class hierarchy not MT-safe.

Definition at line 378 of file D2Zim.h.

int D2Zim::BitMask = 0 [static, protected]

Definition at line 364 of file D2Zim.h.

Referenced by initialize().

int* D2Zim::fdHashMap [protected]

Instance variable that keeps track of the frequency differentials for words found in the current windows on the reference and comparison sequences.

These frequency differentials are used in the calculation of the D2 distance between two windows.

This variable is created in the initialize() method and contains 4^wordSize entries.

Definition at line 322 of file D2Zim.h.

Referenced by buildFdHashMaps(), D2Zim(), initialize(), leftShiftUpdateFd(), refShiftUpdateFd(), rightShiftUpdateFd(), and ~D2Zim().
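
To make the table size concrete, the sketch below computes 4^wordSize with a shift, since each base of a four-letter alphabet needs two bits. The function name fdTableEntriesSketch is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of distinct words of length wordSize over a
// four-letter alphabet, computed as 4^wordSize = 2^(2 * wordSize).
inline std::size_t fdTableEntriesSketch(const int wordSize) {
    assert(wordSize > 0 && wordSize <= 15);  // stays within 32 bits
    return static_cast<std::size_t>(1) << (2 * wordSize);
}
```

With the hard-coded word size of 8 this gives 65536 entries, one for every possible 8-base word.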

int D2Zim::frameShift = 1 [static, protected]

Parameter to define number of characters to shift the frame on the reference sequence, when computing D2.

This parameter is used to enable D2-asymmetric behavior. The default value is 1, which means D2 symmetric: all frames in both sequences will be compared. Higher values mean that the algorithm will shift by more than one character when shifting the frame on the reference sequence, resulting in fewer computations but a possible loss of accuracy from not comparing every frame in both sequences.

Definition at line 360 of file D2Zim.h.

Referenced by getMetric(), and parseArguments().
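
The effect of this parameter on the amount of work can be sketched by listing the reference frame positions that would be visited. The function below is a hypothetical illustration that assumes frames start at position 0 and must fit entirely inside the sequence; the loop in getMetric() may handle the sequence ends differently.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: start positions of the reference frames visited for
// a sequence of seqLen characters when stepping by frameShift.
std::vector<int> referenceFrameStartsSketch(const int seqLen,
                                            const int frameSize,
                                            const int frameShift) {
    assert(frameSize > 0 && frameShift > 0);
    std::vector<int> starts;
    for (int pos = 0; pos + frameSize <= seqLen; pos += frameShift) {
        starts.push_back(pos);
    }
    return starts;
}
```

For a 10-character sequence and a frame size of 4, a shift of 1 visits 7 frames while a shift of 3 visits only 3, which shows the trade between fewer computations and skipped frames.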

int D2Zim::NumWordsWin [private]

Definition at line 412 of file D2Zim.h.

Referenced by initialize(), leftShiftUpdateFd(), refShiftUpdateFd(), and rightShiftUpdateFd().

int* D2Zim::s1WordTable [protected]

Instance variable that maps an index in the reference sequence (sequence s1) to the hash of a word.

This hash can then be used as an index in the fdHashMap to get the frequency differential for that word.

The word table is created in the initialize() method and filled in using the buildWordTable() method. For the reference EST, buildWordTable() is called in the setReferenceEST() method, meaning it does not need to be rebuilt every time we analyze a new comparison sequence.

Definition at line 335 of file D2Zim.h.

Referenced by buildFdHashMaps(), D2Zim(), initialize(), refShiftUpdateFd(), setReferenceEST(), and ~D2Zim().

int* D2Zim::s2WordTable [protected]

Instance variable that maps an index in the comparison sequence (sequence s2) to the hash of a word.

This hash can then be used as an index in the fdHashMap to get the frequency differential for that word.

The word table is created in the initialize() method and filled in using the buildWordTable() method. For the comparison EST, buildWordTable() must be called in the analyze() method because a new comparison sequence is given every time analyze() is called.

Definition at line 347 of file D2Zim.h.

Referenced by buildFdHashMaps(), D2Zim(), getMetric(), initialize(), leftShiftUpdateFd(), rightShiftUpdateFd(), and ~D2Zim().

size_t D2Zim::wordTableSize = 0 [static, protected]

Definition at line 362 of file D2Zim.h.

Referenced by buildWordTable(), and initialize().


The documentation for this class was generated from the following files:

D2Zim.h
D2Zim.cpp

Generated on 19 Mar 2010 for PEACE by  doxygen 1.6.1