11.20.2010

Micro parser

I finished the design work on a recursive, pattern-tree-based parser for my student project. It'll be an integral part of ITK and basically consists of three functions: one for parsing a single pattern, one for determining whether the character currently being read is part of a word description (a bit like a character class in a regular expression), and one for checking a pattern for ambiguous rules and logical errors (these can happen during pattern design). You define a pattern as an ordered set of words and further patterns, each of which can be optional and repeat. There are some interesting options like source read control, so you can create explicit patterns (two words that define start and end and belong only to this pattern) and implicit ones (the start/end characters of other patterns can be part of their own pattern, giving you patterns that aren't made of limiting/delimiting words or characters). It's quite a nice and rather subtle design, and it should be possible to parse a lot of tree-based, sequenced structures with it: programming languages, PGN blocks, XML files and so on. Using this, it won't be a problem to define the PGN parse patterns the same way you'd describe any other language with it. There are limitations, of course - there's no way to define a fork of rule sets depending on what was written in the file. But what you can do instead is make words and patterns inside a pattern optional and recursive, so you can nest a structure in itself if you want (LISP parser ahoy).

The parser itself is OK, but it only recognizes words as arbitrarily mixed sequences of the characters defined per pattern/word. The next step could be implementing regular expression checks, which will get a bit more difficult. I don't have much experience with regular expressions since I mostly don't need them, but I remember that they can't evaluate recursively nested expressions because their format doesn't allow it. That made me think: can't I parse words with my micro parser, too? Yes I can, by using patterns in patterns and so on. But that'd be uncomfortable and rather limited, being only a concatenation of characters with heavy CPU usage. So I thought again and found the regexp way of splitting everything into concatenations and forks very similar to structured text files, if not exactly the same - with the exception of recursive nesting, which makes plain regexps rather useless for parsing programming languages and other more openly structured files. So I realized I don't need a pimped version of my micro parser, but a generalized system of recursive regular expressions, giving you the possibility to nest regexps of different abstraction layers inside other regexps. Quite awesome stuff, as it would make it possible to parse every kind of nested and sequenced repetition of characters/patterns. So this is my next goal for ITK after I've done the thread implementation.

I'm sure it's also possible to use multithreading for parallel evaluation of regular expression alternatives, stopping a thread as soon as another one has already found the right match. Though it'd be questionable because of the message checking (whether the other threads have already finished) and the parallel access to the same string data. I doubt it'd be THAT much faster, but let's see if I ever go that far into it. Anyway, I need to read some more about implementing regular expressions and how it's usually done, so I can decide whether a generalized version is worth implementing.
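To make the recursive regexp idea a bit more concrete for myself, here's a rough C sketch of what such nested expressions could look like. This is not ITK code and all the names (rrx_expr, rrx_elem, rrx_match) are made up on the spot; it only shows the principle that an element of an expression is either a plain character set or a reference to another, lower-level expression:

    #include <string.h>

    /* Rough sketch only, not ITK code. An element is either a plain set of
       allowed characters (a normal regexp-like atom) or a reference to
       another expression, so expressions can nest to any depth. */
    typedef enum { ELEM_CHARSET, ELEM_SUBEXPR } elem_type;

    typedef struct rrx_elem {
        elem_type        type;
        const char      *charset;   /* used if type == ELEM_CHARSET */
        struct rrx_expr *subexpr;   /* used if type == ELEM_SUBEXPR */
        int              optional;  /* element may be skipped       */
        int              repeat;    /* element may occur repeatedly */
    } rrx_elem;

    typedef struct rrx_expr {
        const char *name;           /* e.g. "identifier", "number"  */
        rrx_elem   *elems;          /* concatenation of elements; forks
                                       are left out to keep it short */
        int         elem_count;
    } rrx_expr;

    /* Returns the number of characters matched at the start of text,
       or -1 if the expression doesn't match there. */
    static int rrx_match(const rrx_expr *ex, const char *text)
    {
        int pos = 0;
        for (int i = 0; i < ex->elem_count; ++i) {
            const rrx_elem *el = &ex->elems[i];
            int matched = 0;
            for (;;) {
                int len;
                if (el->type == ELEM_CHARSET)
                    len = (text[pos] != '\0' &&
                           strchr(el->charset, text[pos]) != NULL) ? 1 : -1;
                else
                    len = rrx_match(el->subexpr, text + pos); /* the recursion */
                if (len < 0)
                    break;
                pos += len;
                matched = 1;
                if (!el->repeat || len == 0)  /* avoid endless zero-length loops */
                    break;
            }
            if (!matched && !el->optional)
                return -1;  /* a mandatory element is missing */
        }
        return pos;
    }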
All you'd need to do is define some operations between the given segments, possibly including other recursive regexp checks on a lower abstraction level.
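Back to the micro parser as it exists today: to show what I mean by the pattern definitions described above, here's roughly what they could look like as C structures and prototypes. Again, these are hypothetical names and not the actual ITK interfaces, just the shape of the idea:

    /* Sketch only, not ITK code. A word description: the set of characters
       a word may consist of. */
    typedef struct mp_word {
        const char *chars;
    } mp_word;

    /* One entry of a pattern: either a word or a nested pattern, optionally
       skippable and/or repeating. Because an entry can point back to its
       own pattern, a structure can be nested in itself (the LISP case). */
    typedef struct mp_entry {
        const mp_word           *word;  /* set if the entry is a word...   */
        const struct mp_pattern *sub;   /* ...or set if it's a sub-pattern */
        int                      optional;
        int                      repeat;
    } mp_entry;

    /* A pattern: an ordered set of entries, plus start/end words for the
       explicit case (left empty for implicit patterns). */
    typedef struct mp_pattern {
        const char *name;
        mp_word     start;
        mp_word     end;
        mp_entry   *entries;
        int         entry_count;
    } mp_pattern;

    /* The three functions the parser basically consists of: */
    int mp_parse_pattern(const mp_pattern *p, const char *src, int pos);
    int mp_char_in_word(const mp_word *w, char c);
    int mp_check_pattern(const mp_pattern *p);  /* find ambiguous rules */

The explicit/implicit distinction then just comes down to whether start and end are filled in, or whether the pattern has to be recognized from its entries alone.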

But the recursive regexp stuff is still a long way off. For today, I'll simply enjoy my simple approach. If I ever need forks, I can still change the rules on the fly via callbacks. The whole interpretation of the micro parser is implemented through callbacks, so I can - no matter what the syntax looks like in detail - simply watch the components coming in and leave the syntax errors to the parser. Of course I have an error callback, too.
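Just to illustrate what I mean, here's a sketch of what the callback side could look like, building on the structs from the sketch above (again made-up names, not the real interface):

    #include <stdio.h>

    /* Sketch only, not ITK code. Called whenever the parser has
       recognized a complete word of a pattern. */
    typedef void (*mp_word_cb)(const struct mp_pattern *p, const char *word,
                               int length, void *user);

    /* Called when the source doesn't fit any rule at the current position. */
    typedef void (*mp_error_cb)(const struct mp_pattern *p, int position,
                                const char *message, void *user);

    typedef struct mp_callbacks {
        mp_word_cb  on_word;
        mp_error_cb on_error;
        void       *user;    /* handed through untouched */
    } mp_callbacks;

    /* Example: a callback that just prints whatever comes flying in. */
    static void print_word(const struct mp_pattern *p, const char *word,
                           int length, void *user)
    {
        (void)p;
        (void)user;
        printf("word: %.*s\n", length, word);
    }

So the parser drives everything and I only react to what it hands me; if something doesn't match, the error callback gets the position and a message, and I can decide there what to do.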
