Thinking in C++ - Practical Programming - A completely reusable tokenizer

Thinking in C++ Vol 2 - Practical Programming
Prev	Home	Next

A completely reusable tokenizer

C# Essentials
eBook

$9.99

eBookFrenzy.com

The word list examples use different approaches to extract tokens from a stream, neither of which is very flexible. Since the STL containers and algorithms all revolve around iterators, the most flexible solution will itself use an iterator. You could think of the TokenIterator as an iterator that wraps itself around any other iterator that can produce characters. Because it is certainly a type of input iterator (the most primitive type of iterator), it can provide input to any STL algorithm. Not only is it a useful tool in itself, the following TokenIterator is also a good example of how you can design your own iterators.[108]

The TokenIterator class is doubly flexible. First, you can choose the type of iterator that will produce the char input. Second, instead of just saying what characters represent the delimiters, TokenIterator will use a predicate that is a function object whose operator( ) takes a char and decides whether it should be in the token. Although the two examples given here have a static concept of what characters belong in a token, you could easily design your own function object to change its state as the characters are read, producing a more sophisticated parser.

The following header file contains two basic predicates, Isalpha and Delimiters, along with the template for TokenIterator:

//: C07:TokenIterator.h

#ifndef TOKENITERATOR_H

#define TOKENITERATOR_H

#include <algorithm>

#include <cctype>

#include <functional>

#include <iterator>

#include <string>

struct Isalpha : std::unary_function<char, bool> {

bool operator()(char c) { return std::isalpha(c); }

};

class Delimiters : std::unary_function<char, bool> {

std::string exclude;

public:

Delimiters() {}

Delimiters(const std::string& excl) : exclude(excl) {}

bool operator()(char c) {

return exclude.find(c) == std::string::npos;

}

};

template<class InputIter, class Pred = Isalpha>

class TokenIterator : public std::iterator<

std::input_iterator_tag, std::string, std::ptrdiff_t> {

InputIter first;

InputIter last;

std::string word;

Pred predicate;

public:

TokenIterator(InputIter begin, InputIter end,

Pred pred = Pred())

: first(begin), last(end), predicate(pred) {

++*this;

}

TokenIterator() {} // End sentinel

// Prefix increment:

TokenIterator& operator++() {

word.resize(0);

first = std::find_if(first, last, predicate);

while(first != last && predicate(*first))

word += *first++;

return *this;

}

// Postfix increment

class CaptureState {

std::string word;

public:

CaptureState(const std::string& w) : word(w) {}

std::string operator*() { return word; }

};

CaptureState operator++(int) {

CaptureState d(word);

++*this;

return d;

}

// Produce the actual value:

std::string operator*() const { return word; }

const std::string* operator->() const { return &word; }

// Compare iterators:

bool operator==(const TokenIterator&) {

return word.size() == 0 && first == last;

}

bool operator!=(const TokenIterator& rv) {

return !(*this == rv);

}

};

#endif // TOKENITERATOR_H ///:~

The TokenIterator class derives from the std::iterator template. It might appear that some kind of functionality comes with std::iterator, but it is purely a way of tagging an iterator, to tell a container that uses it what it can do. Here, you can see input_iterator_tag as the iterator_category template argument this tells anyone who asks that a TokenIterator only has the capabilities of an input iterator and cannot be used with algorithms requiring more sophisticated iterators. Apart from the tagging, std::iterator doesn t do anything beyond providing several useful type definitions. You must implement all other functionality yourself.

The TokenIterator class may look a little strange at first, because the first constructor requires both a begin and an end iterator as arguments, along with the predicate. Remember, this is a wrapper iterator that has no idea how to tell when it s at the end of its input, so the ending iterator is necessary in the first constructor. The reason for the second (default) constructor is that the STL algorithms (and any algorithms you write) need a TokenIterator sentinel to be the past-the-end value. Since all the information necessary to see if the TokenIterator has reached the end of its input is collected in the first constructor, this second constructor creates a TokenIterator that is merely used as a placeholder in algorithms.

The core of the behavior happens in operator++. This erases the current value of word using string::resize( ) and then finds the first character that satisfies the predicate (thus discovering the beginning of the new token) using find_if( ). The resulting iterator is assigned to first, thus moving first forward to the beginning of the token. Then, as long as the end of the input is not reached and the predicate is satisfied, input characters are copied into word. Finally, the TokenIterator object is returned and must be dereferenced to access the new token.

The postfix increment requires an object of type CaptureState to hold the value before the increment, so it can be returned. Producing the actual value is a straightforward operator*. The only other functions to define for an output iterator are the operator== and operator!= to indicate whether the TokenIterator has reached the end of its input. You can see that the argument for operator== is ignored it only cares about whether it has reached its internal last iterator. Notice that operator!= is defined in terms of operator==.

A good test of TokenIterator includes a number of different sources of input characters, including a streambuf_iterator, a char*, and a deque<char>::iterator. Finally, the original word list problem is solved:

//: C07:TokenIteratorTest.cpp {-g++}

#include <fstream>

#include <iostream>

#include <vector>

#include <deque>

#include <set>

#include "TokenIterator.h"

#include "../require.h"

using namespace std;

int main(int argc, char* argv[]) {

char* fname = "TokenIteratorTest.cpp";

if(argc > 1) fname = argv[1];

ifstream in(fname);

assure(in, fname);

ostream_iterator<string> out(cout, "\n");

typedef istreambuf_iterator<char> IsbIt;

IsbIt begin(in), isbEnd;

Delimiters delimiters(" \t\n~;()\"<>:{}[]+-=&*#.,/\\");

TokenIterator<IsbIt, Delimiters>

wordIter(begin, isbEnd, delimiters), end;

vector<string> wordlist;

copy(wordIter, end, back_inserter(wordlist));

// Output results:

copy(wordlist.begin(), wordlist.end(), out);

*out++ = "-----------------------------------";

// Use a char array as the source:

char* cp = "typedef std::istreambuf_iterator<char> It";

TokenIterator<char*, Delimiters>

charIter(cp, cp + strlen(cp), delimiters), end2;

vector<string> wordlist2;

copy(charIter, end2, back_inserter(wordlist2));

copy(wordlist2.begin(), wordlist2.end(), out);

*out++ = "-----------------------------------";

// Use a deque<char> as the source:

ifstream in2("TokenIteratorTest.cpp");

deque<char> dc;

copy(IsbIt(in2), IsbIt(), back_inserter(dc));

TokenIterator<deque<char>::iterator,Delimiters>

dcIter(dc.begin(), dc.end(), delimiters), end3;

vector<string> wordlist3;

copy(dcIter, end3, back_inserter(wordlist3));

copy(wordlist3.begin(), wordlist3.end(), out);

*out++ = "-----------------------------------";

// Reproduce the Wordlist.cpp example:

ifstream in3("TokenIteratorTest.cpp");

TokenIterator<IsbIt, Delimiters>

wordIter2(IsbIt(in3), isbEnd, delimiters);

set<string> wordlist4;

while(wordIter2 != end)

wordlist4.insert(*wordIter2++);

copy(wordlist4.begin(), wordlist4.end(), out);

} ///:~

When using an istreambuf_iterator, you create one to attach to the istream object and one with the default constructor as the past-the-end marker. Both are used to create the TokenIterator that will produce the tokens; the default constructor produces the faux TokenIterator past-the-end sentinel. (This is just a placeholder and is ignored.) The TokenIterator produces strings that are inserted into a container of string here a vector<string> is used in all cases except the last. (You could also concatenate the results onto a string.) Other than that, a TokenIterator works like any other input iterator.

When defining a bidirectional (and therefore also a random access) iterator, you can get reverse iterators for free by using the std::reverse_iterator adaptor. If you have already defined an iterator for a container with bidirectional capabilities, you can get a reverse iterator from your forward-traversing iterator with lines like the following inside your container class:

// Assume "iterator" is your nested iterator type

typedef std::reverse_iterator<iterator> reverse_iterator;

reverse_iterator rbegin() {return reverse_iterator(end());

reverse_iterator rend() {return reverse_iterator(begin());

The std::reverse_iterator adaptor does all the work for you. For example, if you use the * operator to dereference your reverse iterator, it automatically decrements a temporary copy of the forward iterator it is holding in order to return the correct element, since reverse iterators logically point one position past the element they refer to.

Thinking in C++ Vol 2 - Practical Programming
Prev	Home	Next