"Regular Expressions Library (REX)"

REX is Automata Compiler - it turns regular expressions into Automaton (NFA or DFA). REX does not support back references. REX doesn't support anchors directly, but the support can be added externally. The REX library is based on the theory of Deterministic Finite Automata (DFA). Regular expressions like '[a-zA-Z_][a-zA-Z0-9_]*' are first compiled to Non-deterministic Finite Automata(NFA) and then transformed to DFA, using algorithm similar to the DFA subset construction method.


How to use.

To implement matching using REX, the user has to run the input through the DFA states until the automaton arrives at accepting state. For example, if the input is a NULL terminated string:

while (*str) {
    nstate = REX_DFA_NEXT(dfa, nstate, *str);
    if (REX_DFA_STATE(dfa, nstate)->type == REX_STATETYPE_DEAD) {
        /* Did not match */
    } else if (REX_DFA_STATE(dfa, nstate)->type == REX_STATETYPE_ACCEPT) {
        /* The DFA is in accepting state, lets find out what exactly is being accepted. */
    } else {
        /* Keep going... */

REX also provides some API for matching and searching. Example:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "rex/rexregex.h"

 To build:
 gcc -o rexregex rexregex.c -I/usr/include/rpatk -lrex -lrlib
#define BUF_SIZE 0x10000

int main(int argc, char *argv[])
        rexregex_t *regex;
        char *buf, *end;
        const char *where;
        int line = 0;

        if (argc < 2) {
                fprintf(stderr, "Usage:\n cat <file> | %s <regexp>\n", argv[0]);
                return 0;
        regex = rex_regex_create_s(argv[1]);
        buf = calloc(BUF_SIZE, 1);
        while (fgets(buf, BUF_SIZE - 1, stdin)) {
                for (end = buf; *end; ++end);
                if (end > buf)
                        *(end - 1) = '\0';
                if (rex_regex_scan(regex, REX_ENCODING_UTF8, buf, end, &where) > 0)
                        printf("%d: %s\n", line, buf);
        return 0;

The JavaScript tokenizer example js-tokenizer::c is a simple demonstration how to use the REX library for lexical analysis of UTF8 encoded text.