Occasionally you will want to parse the syntax of a programming language, for example to generate manuals or to build a syntax highlighter. This is not always easy to achieve. For tasks like these, one of the easiest approaches is to use a regular expression. I have written an expression that handles the syntax of the Pascal language. It uses a separate capture group for each part of the syntax, so you can determine which part was matched by checking which group has content. Note that the order of the alternatives is very important: the regex engine tries them from left to right, so the more specific pattern for {$compiler directives} has to come before the general pattern for {comments}, otherwise every directive would be matched as a comment.
(?:(//(?:(?!$).)*)|(\{\$[^\}]*\})|(\{(?:.*?\}|.*))|(\(\*(?:.*?\*\)|.*))|('(?:(?!$|').)*'?)|(#\d+)|(\d+(?:\.\d+)?|\$[0-9A-F]+)|([A-Z_][A-Z0-9_]*))
Capture groups:
- 1: // One-line comments
- 2: {$COMPILER DIRECTIVES}
- 3: {Multi-line comments}
- 4: (*Multi-line comments*)
- 5: 'Strings'
- 6: Characters (#125)
- 7: Numbers (268, 569.56 or hexadecimal such as $1F)
- 8: Words (compare with list of keywords afterwards)
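You can easily create similar regexes for other languages that work the same way. Note that for this regex to work, you have to set the regex engine so that the dot matches newlines and matching is case-insensitive. The $ anchor also has to match at the end of every line, so that one-line comments and strings stop at the line break; in engines where it does not do this by default, enable multi-line mode as well.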
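
As an illustration, here is a minimal sketch of how the expression might be used from Python. The choice of Python's re module is an assumption, since the post does not depend on a particular engine. In that module the re.MULTILINE flag is needed in addition to the two options mentioned above, so that $ matches at the end of each line. The names PASCAL_TOKEN, GROUP_NAMES, KEYWORDS and tokenize are hypothetical, and the keyword set is only a small sample, not the full Pascal list.

```python
import re

# The expression from this post, typed with plain ASCII apostrophes.
PASCAL_TOKEN = re.compile(
    r"(?:(//(?:(?!$).)*)|(\{\$[^\}]*\})|(\{(?:.*?\}|.*))|(\(\*(?:.*?\*\)|.*))"
    r"|('(?:(?!$|').)*'?)|(#\d+)|(\d+(?:\.\d+)?|\$[0-9A-F]+)|([A-Z_][A-Z0-9_]*))",
    # DOTALL: the dot matches newlines. IGNORECASE: caseless matching.
    # MULTILINE: in Python's re, lets $ match at the end of every line.
    re.DOTALL | re.IGNORECASE | re.MULTILINE,
)

# Capture group number -> kind of syntax, following the list above.
GROUP_NAMES = {
    1: "line comment",
    2: "directive",
    3: "brace comment",
    4: "paren comment",
    5: "string",
    6: "character",
    7: "number",
    8: "word",
}

# Illustrative sample only; a real highlighter needs the complete keyword list.
KEYWORDS = {"program", "begin", "end", "var", "if", "then", "else"}


def tokenize(source):
    """Return a list of (kind, text) pairs for every match in source."""
    tokens = []
    for match in PASCAL_TOKEN.finditer(source):
        # Exactly one of the eight groups takes part in each match,
        # so lastindex tells us which part of the syntax was found.
        kind = GROUP_NAMES[match.lastindex]
        text = match.group()
        # Words are compared with the keyword list afterwards.
        if kind == "word" and text.lower() in KEYWORDS:
            kind = "keyword"
        tokens.append((kind, text))
    return tokens


if __name__ == "__main__":
    sample = "begin\n  x := $FF; // set x\n  WriteLn('Hi', #13);\nend."
    for kind, text in tokenize(sample):
        print(kind, repr(text))
```

Characters that none of the alternatives cover, such as operators and whitespace, are simply skipped by finditer; a highlighter would copy them to its output unchanged.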