#*

ABOUT

  Creating an [ebnf] style language with [nom] as the compile target.
  I will use a W3C ebnf style with no commas between tokens. It would
  be nice to have a more natural language that *targets* [nom].

  This script compiles simple [ebnf] to [nom]. It is the first example
  of using nom as the target of a nom script. Another strange
  oed://corollary arises: we can use this new language to implement
  a compiler for itself.

BUGS

  lit: 'a'; compiles the same as a. Is this ok?

  I really need "replace ^'text' 'new';" and "replace 'old'$ 'new';"
  where ^ and $ are anchors to the start and end of the workspace.
  Without this, a lot of my compilation is potentially buggy. For
  example, with lookahead compilation, I want to replace the first few
  tokens of a sequence, but I can't be sure I am only replacing the first.

DONE

  Compiling uneven alternation sequences using the tape variable LHS
  and the ';' token attribute to save the partially compiled code.

  - finished attrule parsing to assign to @2 @3 etc.
  - variables like $server, which will then interpolate in strings
  - but user defined vars don't interpolate.

TODO

TESTING

  * first working program
  >> pep -f syntagma.pss -i "[:alpha:]+{color:'blue'|'green';delete;}digit:[0-9]+;lit:';';delete; option = color digit ';'; eof { print 'at eof\n';exit 0;} " > junk.pss

  The syntagma program below seems to compile correctly with this
  syntagma.pss script. See the phrases section for lots of syntagma syntax.
  * an example program, with lexing and a grammar rule
  --------
    # comments can be written with '#'
    # * multiline comments are also ok, following an ebnf style * #

    [:alpha:]+ {
      color:'blue'|'green'|'red'|'orange';
      word: *;   # define default token within a lex block
    }
    space: [:space:];  # a space token contains a single \r\n\t or ' ' etc
    integer: [0-9]+ ;
    # double quotes, single quotes or classes can be used
    # lit is a special lexing keyword
    lit: ';'|":"|[@#$];
    # keywords like 'not','empty' or 'delete' are not case sensitive
    NOT EMPTY { delete; }

    # parse rules start here. The syntagma grammar knows how to
    # work out where lexing ends and parsing begins. The parse rule
    # assignment operator is '='
    option = color integer ';' ;
    # parse rules can have alternations with |
    # literal tokens like ':' can be used in parse rules, but must be defined
    # above with the 'lit' or 'literal' lexing keyword
    position = integer ':' integer | integer ';' integer ;
    # parse-rules can have code that executes when the rule matches
    position = integer ':' integer | integer ';' integer {
      print "found position at line $line!\n";
    }
    EOF { print 'at eof\n'; exit 0;}
  ,,,,

  >> pep -f syntagma.pss -i 'com = word param; block = word newword;'

  * sample output of syntagma.pss when compiling with nom script
  ------+
  # sample input BNF rules (white-space doesnt matter):
  #  com = word param ;
  #  block = word newword ;
  # output (produced by this script)
  pop;pop;
  "word*param*" { clear; add "com*"; push; .reparse }
  push;push;
  pop;pop;
  "word*newword*" { clear; add "block*"; push; push; push; .reparse }
  push;push
  ,,,,

  This is pretty cool, because we now have an ebnf-to-nom compiler that
  produces executable and translatable (to go/java/tcl/python/ruby etc)
  [nom] code. The language has a lexing and parsing syntax.

  The syntagma language may not be as efficient as hand coded [nom]
  because it does redundant "pushes" nom://push and "pops" nom://pop
  between code blocks, but it is easier to write and probably less
  prone to errors.
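The shape of the compiled output above can be modelled outside of nom. The following Python sketch is only an illustration, not how syntagma.pss works internally: the function name compile_rule is hypothetical, and it assumes a rule with a single token sequence, no alternation and no code block. It emits one "pop;" per right-hand-side token, a quoted test built from the tokens with the '*' delimiter, and a matching list of "push;" commands after the block.

```python
def compile_rule(rule: str) -> str:
    """Model the nom text emitted for one simple rule like 'com = word param ;'.

    Hypothetical helper for illustration only: no alternation, no rule block.
    """
    lhs, rhs = rule.strip().rstrip(";").split("=", 1)
    tokens = rhs.split()
    # nom parse tokens end with the '*' delimiter
    pattern = "".join(tok + "*" for tok in tokens)
    pops = "pop;" * len(tokens)
    pushes = "push;" * len(tokens)
    block = '{ clear; add "' + lhs.strip() + '*"; push; .reparse }'
    return pops + ' "' + pattern + '" ' + block + " " + pushes


if __name__ == "__main__":
    for rule in ["com = word param ;", "block = word newword ;"]:
        print(compile_rule(rule))
```

The repeated pops and pushes around every block are the redundancy mentioned above: each rule restores the stack before the next rule pops it again.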
  * compiling syntax for syntagma
  ----
    link = quotedtext url { @1 = "$2"; }
  ,,,,

  I may also allow '.' as a string concatenator. $1 refers to the
  attribute of the first token on the RHS (right-hand side) of the bnf
  grammar rule. The compiling block takes the place of the ';' in the
  syntax above.

ALTERNATION

  Alternation has now been implemented, including for unequal length
  alternation branches (with no following code block) (18 may 2026).
  Alternation where all branches have the same length also works with
  code blocks, such as

  >> a = b e | c d { print "parsing"; }

  I have been thinking about how to implement alternation in syntagma
  and nom* for quite some time, and I have come up with a few promising
  ideas. Originally I thought that this was going to be impossible or
  nearly impossible (but it isn't).

  * parsing acrobatics and alternation
  --------
   example1: a = b c | d e f ;
   compile:
     pop;pop;
     "b*c*" { clear; add "a*"; push; .reparse }
     push;push;
     pop;pop;pop;
     "d*e*f*" { clear; add "a*"; push; .reparse }
     push;push;push;

   example2: a = b c | d e { @1 = "$1 and $2"; }
   compile:
     pop;pop;
     "b*c*","d*e*" {
       clear; get; add " and "; ++; get; --; put;
       clear; add "a*"; push; .reparse
     }
     push;push;
  ,,,,

  The second example is probably only useful if there are the same
  number of parse tokens in each branch of the alternation?

  * use a variable on the tape, like this in nom
  >> begin { mark 'LHS'; ++; }

NOTES

  Below is a remarkably simple way to implement 'lookahead' in syntagma.

  I had the idea of lookahead grouping in rules. This is implemented,
  but not ruleblocks yet.

  >> a = b +(c|d|e) ;

  So b will reduce to a, but only if b is followed by c, d or e.
  The parse stack would be: sequence '=' rsequence lookahead ;
  This could compile as

  >> "b*c*","b*d*","b*e*" { replace "b*" "a*"; push; push; .reparse }

  but there is a problem with the replace if there is another b* pattern.
  I could solve that with some fancy putting and getting.
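The lookahead compilation and the replace problem noted above can be sketched in Python. This is a model under stated assumptions, not the code in this script: the helper names compile_lookahead and replace_at_start are hypothetical, each lookahead branch is assumed to be a single token (as in the example rule), and Python's str.replace is assumed to behave like nom's replace in changing every occurrence.

```python
def compile_lookahead(lhs, rsequence, lookahead):
    """Model the nom text for a rule like: lhs = rsequence +(t1|t2|...) ;"""
    seq = "".join(tok + "*" for tok in rsequence)
    # one quoted test per lookahead token, joined with nom's ',' OR operator
    tests = ",".join('"' + seq + tok + '*"' for tok in lookahead)
    body = 'replace "' + seq + '" "' + lhs + '*"; push; push; .reparse'
    return tests + " { " + body + " }"


def replace_at_start(workspace, old, new):
    """Hypothetical anchored replace: rewrite `old` only when it is a prefix."""
    if workspace.startswith(old):
        return new + workspace[len(old):]
    return workspace


if __name__ == "__main__":
    print(compile_lookahead("a", ["b"], ["c", "d", "e"]))
    # the rule  a = b +(b) ;  leaves the workspace holding two b tokens
    workspace = "b*b*"
    print(workspace.replace("b*", "a*"))             # every b token is rewritten
    print(replace_at_start(workspace, "b*", "a*"))   # the lookahead token survives
```

With an unanchored replace the lookahead token is rewritten along with the reduced sequence, which is the bug the start anchor ^ would avoid.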
  This system preserves the lookahead tokens (which is of course necessary).

  Phantom tokens within blocks are nice. They can be used to enforce
  what sort of things can go in that block, eg, lexrules, textrules.

  Since the nom* engine or pep* is completely text based, there is
  only one data type, so [:digit:]+ matches a string of digits, but they
  remain text.

  Classes are very simple, such as [a-z] or [abc] or [:alnum:], so you
  can't actually combine them like [a-gxyz]. That won't work. This is
  because nom classes are (too) simple.

TOKENS

  textrules* can only be used inside a block like [:alpha:]+ { ... }
  because the compiler first has to read a block of text to match
  multi-character text. lexrules can occur anywhere in the lexing
  section, or in blocks in the parsing section. Actions can be multiple.

  LHS*          left-hand-side of the ebnf rule, before the '='
  RHS*          right-hand-side of the rule
  alt*          an alternation of sequences, eg: a | b | c
  var*          a variable like $counter $lines $server. can be user defined.
  attvar*       a numeric variable like $1,$2 etc referring to a token attribute
  class*        a simple class of characters [a-z] [abcd] [:space:]
  charquoted*   is a single quoted character like: 'x'
  quoted*       text between quotes: 'and'
  interp*       makes special vars interpolate in text
  sequence*     a sequence/list of tokens before the '=' in a rule
  rsequence*    a sequence of parse tokens after the '='
  lattribute*   attribute of token on LHS of rule.
  token*        one grammar token (alphabetic word)
  action*       print,delete,quit etc can go in the lex block or rule block
  attrule*      an assignment to a token attribute, eg @1 = $1.$2;
  ruleblock*    code within the {...} after a rule*
  rule*         one grammar parse rule like 'command = name semicolon ;'
  ruleset*      a list of grammar rules
  lexrule*      lexing rule, eg: word: [:alpha:]+ ;
  lexruleset*   a set of lexing rules
  textrule*     lexing rule involving text like 'and'
  textruleset*  a set of textrules (and lexrules) - equivalent to
                ruleblock* for the lexing blocks.
  notset*       used in lookaheads for negativity.
  andset*       using AND logic with classes/charquoted/quoted
                example: [a-z] and begins 'x'
  orset*        an OR set of classes,quotes etc eg, 'a'|'word'|[a-z]|'b'
  charset*      an OR set of chars eg: 'a'|'b'|'\n'
  :=*           for attribute assignment.

  literal tokens:
    to           for lexing up to and including end delimiter
    between      for lexing before an end delimiter
    begins ends  for text begins-with and ends-with
    and          for AND logic
    ... many others
    {}           grouping
    ()           for grouping
    +(...)       lookahead groups.
    |            alternation or OR logic
    +            for classes and lexing
    =            for grammar reduction
    :            for tokenisation (lexing) assignment
    ;            for statement end

PROPOSED PHRASES

    # create a time variable
    var $time;
    var $server = 'ssh://etc';

PHRASES

  Lexing rules are indicated by the ':' assignment, and grammar rules by
  the '=' assignment.

  The following phrases appear to be compiling
  ------
    # single line comments allowed
    # multiline comments between these #
    print 'hello'; exit 3; quit; delete;
    # at eof, delete the pattern space, print text and exit with code '4'
    EOF { delete; print 'yes'; exit 4; }
    # delete all instances of 'green' in the pattern space text.
    delete 'green';
    # delete one char from the left of the pattern space
    ltrim;
    print 'line: $line, char: $char';  # interpolate line number with $line etc
    # use the accumulator counter
    print 'counter is $counter';
    # double quotes are allowed.
    print "hello";
    # print text with a newline at the end
    println "hi";
    # interpolate special variables in the print string.
    println "the line count is $line";
    # make token 'capword' if the text begins with A-Z
    [:alpha:]+ { capword: begins [A-Z]; }
    lit: 'x';
    lit: [0-4]|'a'|'b'|';' ;
    literal: [(){}] ;   # braces as literal tokens
    begins '<' and ends '>' { print "tag"; tag: *; }
    # make token 'word' for all alpha numeric sequences
    word: [:alnum:]+ ;
    name: * ;   # default lex rule
    match empty { exit; }
    match 'abcd' { print 'hi'; exit; }
    match not empty { print 'Extra char on line $line'; exit 2; }
    [A-Z]+ {
      # match a,aa,aaa,aaaa etc, same as [a]
      alist: only 'a';
      # match a,ba,ab,aa,bb, etc, same as [ab]
      ablist: only 'ab';
      list: [ab];   # the same
    }
    [:digit:]+ {
      # 'not only' doesnt work because it compiles to ![0]
      0number: begins '0' and not only [0];
    }
    # for all alphabetic sequences, if the text begins with '<'
    # and ends with '>' then, if the text begins with '<' make a
    # "link" parse token, and if not, make a "tag" parse token
    [:alpha:]+ { match begins '<' and ends '>' { link: begins ' a = b < x y | p q > c;
    # lookahead syntax with +(...)
    a = ex '*' ex +('/'|'*') ;
    # look ahead with negative rules but tokens must be quoted which
    # is silly unless we are dealing with literal tokens
    a b = c d +(not "f" and not "j");
    a = c d +(not ';' and not '.');
    o = colour shape +(';' | block) ;
    option = name digit ';' ;   # use literal char token in parse rule
    object = colour ':' shape;  # lit token, but must define earlier
    () = space word;            # just delete tokens in the parse section
    # check if the stack contains only a list token at end of file
    eof { stack (list) { print "list found\n"; }}
    # check if the parse stack is list or number or float.
    eof { parse (list|number|float) { print "list found\n"; }}
    EOF: words = words word;    # only reduces at end of stream
    EOF { name = first second; }
  ,,,,

HISTORY

  25 may 2026
    working on lookahead rules with a rule code block. nearly complete
    except for lookahead attribute copy. Just need to get 'push;' list
    from the LHS token, but may need a variable.
    Or do a fancy "clop;" etc using .reparse continually until only
    "push;" left????

  24 may 2026
    need to turn "charquoted*" into "token*" on the RHS of parse rules.
    Then 'not token*' becomes 'nottoken'.
    optionals seem easier. the hardest is lookahead with rule blocks.
    Optionals can have a block, but it can only have actions and lexrules,
    not attrules, because we don't know how long the sequence is.

  23 may 2026
    made begin blocks and vardefs etc
    made alternation groups, working for stack(altgroup) but need to
    parse with " lhs = rsequence (altgroup| etc ". This is so I can build
    the nom compiled code. I can store the compiled code in the altgroup
    token.
    made a println printline function. made print and println work
    with interpolated text (itext*) token. reformed the comparison syntax
    to allow no comparison == operator. made some debug rules in the
    error section.
    todo:
    - begin blocks
    - variable declarations with "var $name;" or "var $name := 'text'; "
    - var decs should go in the begin block.

  22 may 2026
    made a string interpolation token interp*
    need to make begin blocks. redo $1=='green' parsing to make it more
    flexible. need to do var declarations in the begin block.

    * made a check attribute value and variable syntax like this
    -------
      "green" == $2 { ...}
      [:space:] == $2 { ...}
      not begins "the" == $story { ...}
    ,,,,

  21 may 2026
    lookahead rules with no ruleblock seem to be compiling well. Need
    to add ruleblock, also alternations. Also, need to add +(not token)
    syntax, and +(not ';' and not x) which is a negative lookahead syntax.

  19 may 2026
    made @1,@2 etc. work
    had the idea of lookahead grouping in rules eg
    >> a = b (c|d|e);
    so b will reduce to a, but only if b is followed by c, d or e.
    The parse stack would be: sequence '=' rsequence lookahead ;
    This would compile as
    >> "b*c*","b*d*","b*e*" { replace "b*" "a*"; push; push; .reparse }
    but there is a problem with the replace if there is another b* pattern.

  18 may 2026
    wrote the example script /eg/s.url.pss which shows lots of nice
    syntagma syntax.
    implemented unequal length alternation lists with no following
    code block, such as
    >> a = b c d | e f | g | h | i j ;

    The compilation technique is nothing short of amazing even to me,
    who wrote it. The parse-reduction is actually 'tail-wise' so that the
    branches of the alternation start reducing when the ';' literal token
    is seen. Each branch has a list of 'pop's saved in the preceding '|'
    token attribute, which also indicates how many tokens are in that
    branch. For example, with "...| h | i j ..." the "i j" branch has
    "pop;pop;" saved in the previous '|' literal token and the "h" branch
    has "pop;" saved in its previous '|' token.

    So the nom code compares the 2 pop lists in each '|' token and if
    they are different (meaning the token sequence lists are of different
    lengths) then it immediately compiles the nom code for "i j" and saves
    it in the ';' token attribute (actually it adds it to that attribute).
    So the following
    >> pop;pop; "i*j*"{ ...LHS...}
    is added to the ';' attribute and parsing continues. But in order to
    get the code for the LHS* token it actually has to use a "tape
    variable", which is just a named tape array cell at the top of the
    tape. This is because of the following parse sequence
    >> LHS '=' rsequence | rsequence | ... | rsequence | rsequence ';'
    Because of the tail-reduction, nom has no idea where LHS is on the
    stack, and we have to do tail-reductions because of code blocks.

  17 may 2026
    making attrules for assigning token attributes with
    @1 := "$2 .. $3";
    lots of progress. alternations with code blocks working.
    rewrote rule parsing, which is now much better and allows alternation.
    I think the nom://until command should really have a class argument
    as well as text, eg: until [abc];

  15 may 2026
    added double quotes eg token: "a"|"b";
    still can't do alternation in parse rules like:
    >> colour = r g b | c m k ;
    but see the alternation notes section for a way to do it.
tidied up parsing of 'a' to 'b' etc. made 'match' sometimes optional (need to complete). made ignore rule. made better grammar* final token parsing. still need a way to match parse stack at eof? or try: ------ eof { token = token { print 'yes'; }} ,,,, 14 may added 'only': only 'a' means [a]+ add AND logic, eg: 123number: [:digit:] AND begins [123] which lexes the token '123number' if the text consists only of digits and begins with 1,2 or 3. added "begins" and "ends" and "not begins" and "begins not" and so on. 13 may 2026 I think this is almost good enough to write a sed syntax checker as an example of what it can do. also need to do, actual composition rules like >> obj = colour shape { $0 := $1.'\n'.$2 ; } lots of new syntax, eg: match, star '*' match empty {} {} = space word ; delete token sequences word: * ; # default lexing rules, matches everything even empty 12 may 2026 started to adapt this from the toybnf.pss script. Alot of progress, all sorts of lexing syntax is now working - see the phrases section above. lots and lots of progress - literal tokens, actions like print,exit,delete etc *# begin { # I need this variable for variable length sequence alternation # such as: a = b c | d ; mark "LHS"; clear; ++; mark "LHSpush"; clear; ++; } read; put; # line-relative char numbers, but this is overridden by # the [:space:] hoover. [\n] { nochars; } # multiline comments follow the format of nom. (* ... *) look nicer # but I may want to do something with () later "#" { (eof) { clear; .reparse } read; !B"#*" { "#\n" { clear; .reparse } whilenot [\n]; } B"#*" { clear; add "starting at line "; lines; put; clear; until "*#"; !E"*#" { clear; add "unterminated multiline comment '#* ... 
*#'\n"; get; print; quit; } } put; clear; add "comment*"; push; .reparse } # ignore white-space [:space:] { while [:space:]; clear; } # literal tokens, () for lookahead token set grouping # many of these literal tokens contain "pop;" list which is # put there by the rsequence token rules and the notset token # so I will clear the attribute [@0+:{}|()] { add "*"; push; --; put; ++; .reparse } # these are used for optionals. like () and +() and | they # can also contain a pop; list which indicates the length of the # rsequence which follows. '<','>' { add "*"; push; --; put; ++; .reparse } # lex '=' '==' etc '=' { while [=]; add "*"; push; --; put; ++; .reparse } # I store unequal alternation sequence compiled code here ';' { clear; add " "; put; clear; add ";*"; push; } # alternation corresponds directly to noms ',' operator # I store pop; lists here, so I need to add the "," nom OR # operator by hand '|' { clear; put; clear; add "|*"; push; } # the star means everything or anything, not sure about this? '*' { clear; add "!''"; put; clear; add "star*"; push; } # variables # examples: $1 $2 or $name # "$" { clear; while [:alnum:]; put; [:digit:] { nop; clip; !"" { clear; add "Attribute values ($1,$2,$3...) maximum $9\n"; print; quit; } get; # mushroom replacement technique replace "9" "++;8;--"; replace "8" "++;7;--"; replace "7" "++;6;--"; replace "6" "++;5;--"; replace "5" "++;4;--"; replace "4" "++;3;--"; replace "3" "++;2;--"; replace "2" "++;1; --"; replace "1" " get"; add ";"; # remove extra space from lone get. " get;" { clop; } put; clear; add "attvar*"; push; .reparse } # special variables. 
Maybe should have a different syntax "line","char","counter","text" { # integer accumulator "counter" { clear; add "count;"; } # automatic number of lines read from input "line" { clear; add "lines;"; } # automatic number of characters read from input "char" { clear; add "chars;"; } # text of current tape cell "text" { clear; add "get;"; } put; clear; add "var*"; push; .reparse } # I can make the fetch code here, or make it when the var* token # is actually used. Same applies above. I am relying on # replace '; get;' '; put;'; # for assignment?? clear; add 'mark "here"; go "'; get; add '"; get; go "here";'; put; clear; add "var*"; push; .reparse } # digits for token attribute assignment 1-9, [1-9] { put; clear; add "digit*"; push; .reparse } # [:digit:] { while [:digit:]; put; clear; add "number*"; push; .reparse } [:alpha:] { # add the default nom parse token delimiter '*' while [:alpha:]; put; # these are keywords, but I dont like the capital errors # case insensitive. This means that tokens cant use these # words?? lower; "begin","parse","stack","only","and","begins","ends", "var","match","txt","empty","not","to", "check","ignore","next","between","twixt","lit","literal","eof", "print","println","trim","ltrim","rtrim","delete","del","exit","quit" { # put the nom command in the attribute # fix: divide into 'commands' and others. # but should function work on variables, like trim($1) etc???? 
"ltrim" { clear; add "clop"; put; clear; add "ltrim"; } "rtrim" { clear; add "clip"; put; clear; add "rtrim"; } "trim" { clear; add "clip; clop"; put; clear; add "trim"; } "exit","quit" { clear; add "quit"; put; clear; add "exit"; } "del","delete" { clear; add "clear"; put; clear; add "delete"; } "var" { clear; add "declare"; } "parse" { clear; add "stack"; } "and" { clear; add "."; put; clear; add "and"; } "begins" { clear; add "B"; put; clear; add "begins"; } "ends" { clear; add "E"; put; clear; add "ends"; } "not" { clear; add "!"; put; clear; add "not"; } "empty" { clear; add "''"; put; clear; add "empty"; } "eof" { clear; add "(eof)"; put; clip; clop; } "twixt" { clear; add "between"; } "literal" { clear; add "lit"; } add "*"; push; .reparse } # case sensitive clear; get; # normal token add "*"; put; clear; add "token*"; push; } "'" { until "'"; put; "''" { clear; add "empty single quote\n"; print; quit; } !E"'" { clear; add "unfinished single quote\n"; print; quit; } clip; clop; clip; # either 'x' or '\n' etc "","\\" { clear; add "charquoted*"; push; .reparse } clear; add "quoted*"; push; .reparse } # try to allow double quotes '"' { until '"'; '""' { clear; add "empty double quote\n"; print; quit; } !E'"' { clear; add "unfinished double quote\n"; print; quit; } # convert to single quotes clip; clop; unescape '"'; escape "'"; put; clear; add "'"; get; add "'"; put; clip; clop; clip; # either 'x' or '\n' etc "","\\" { clear; add "charquoted*"; push; .reparse } clear; add "quoted*"; push; .reparse } "[" { until "]"; put; "[]" { clear; add "empty class\n"; print; quit; } !E"]" { clear; add "unfinished class\n"; print; quit; } clear; add "class*"; push; .reparse } !"" { put; clear; add "! [syntagma]\n"; add " bad character '"; get; add "'"; add " at line:"; lines; add " char:"; chars; add "\n"; add " I just can't go on... sorry, goodbye"; print; quit; } parse> # show the parse-stack reductions. 
a doubled hash makes it easier # to remove from the output with sed '/^##/d' add "## line:"; lines; add " char:"; chars; add " "; print; clear; unstack; print; stack; (eof) { add " EOF"; } # show last attribute value if required for debugging. add " ("; --; get; ++; add ")"; replace "\n" "\n## "; add "\n"; print; clear; # --------------- # ERROR parsing. search for 'one token' etc to find these # ------------------- # errors: one token pop; # ------------------- # errors: two tokens pop; !B"var*".!B"attvar*".!B"lattribute*".!B"token*".!B"lit*" { E"*:*" { clear; add " # ---------------------------------------- # Syntagma : # ':' is the lexing assignment operator. It should be # preceded by a token name or the keyword 'lit'. # ':' is also used in the attribute assignment token := # Please dont use 'lit' any other keyword as a token name # because the syntagma script will not compile, sorry. # example: lit: '.'|','; # correct # example: name: '.'|','; # correct # example: @1 := '$1 / $2';# correct # wrong: var: [a-z]+; # var is a keyword # # keywords are: # begin,parse,stack,only,and,begins,ends,var,match,txt,empty,not,to, # check,ignore,next,between,twixt,lit,literal,eof, # print,println,trim,ltrim,rtrim,delete,del,exit,quit # ---------------------------------------- "; replace "\n " "\n"; add "\n"; add " ?* - "; get; add "\n"; add " :* - "; ++; get; --; add "\n"; print; quit; } } "begins*class*" { clear; add "# The 'begins' keyword cannot be combined with text classes \n"; add "# on line "; lines; add "\n"; print; quit; } "|*)*" { clear; add " # ---------------------------------------- # Syntagma : # an alternation | followed by a group bracket ) is # probably an error. What do you thing? 
# ---------------------------------------- "; replace "\n " "\n"; add "\n"; add " |* - "; get; add "\n"; add " )* - "; ++; get; --; add "\n"; print; quit; } # ------------------- # errors: three tokens pop; B"+(*lookgroup*",B"(*altgroup*" { !"+(*lookgroup*".!"(*altgroup*" { !E")*".!E"|*" { replace "*" " "; ++; ++; ++; put; --; --; --; clear; add " # ---------------------------------------- # Syntagma: # brackets appear to be mismatched: +( and ( should be # terminated with ) # ---------------------------------------- "; replace "\n " "\n"; add "\n"; add " (* or +(* - "; get; add "\n"; add " group* - "; ++; get; --; add "\n"; add " ?* - "; ++; ++; get; --; --; add "\n"; add "parse stack - "; ++; ++; ++; get; --; --; --; add "\n"; print; quit; }} } B"<*altgroup*".!"<*altgroup*" { !E">*".!E"|*" { replace "*" " "; ++; ++; ++; put; --; --; --; clear; add " # ---------------------------------------- # Syntagma: # brackets appear to be mismatched: [ should be # terminated with ] # ---------------------------------------- "; replace "\n " "\n"; add "\n"; add " <* - "; get; add "\n"; add " altgroup* - "; ++; get; --; add "\n"; add " ?* - "; ++; ++; get; --; --; add "\n"; add "parse stack - "; ++; ++; ++; get; --; --; --; add "\n"; print; quit; } } # ------------------- # errors: four tokens pop; # ------------------- # errors: five tokens pop; # ------------------- # errors: six tokens pop; # ------------------- # errors: seven tokens pop; # ------------------- # errors: 8 tokens or less pop; # ------------------- # errors: 9 tokens or less pop; # ------------------- # errors: 10 tokens or less pop; # ------------------- # errors: 11 tokens or less pop; # no lexing (eof) { # incomplete programs "token*","sequence*","notset*","nottoken*","notsequence*","var*","attvar*", "attrule*","tomatch*","betweenmatch*","pattern*","orset*", "charset*","andset*","class*","quoted*","charquoted*","number*" { swap; add " is a '"; get; add "'"; add "\n\n"; print; quit; } 
"LHS*=*rsequence*+(*lookgroup*)*{*ruleblock*" { clear; add " # ---------------------------------------- # Syntagma: # lookahead tokens with code # ---------------------------------------- "; replace "\n " "\n"; add "\n"; add " LHS* - "; get; add "\n"; add " =* - "; ++; get; add "\n"; add " rsequence* - "; ++; get; add "\n"; add " +(* - "; ++; get; add "\n"; add " lookgroup* - "; ++; get; add "\n"; add " )* - "; ++; get; add "\n"; add " {* - "; ++; get; add "\n"; add " ruleblock* - "; ++; get; add "\n"; print; quit; } "LHS*=*rsequence*<*altgroup*>*rsequence*", "LHS*=*rsequence*<*rsequence*>*rsequence*", "LHS*=*rsequence*<*notset*>*rsequence*" { clear; add " # ---------------------------------------- # Syntagma: # optionals between <...> # ---------------------------------------- "; replace "\n " "\n"; add "\n"; add " LHS* - "; get; add "\n"; add " =* - "; ++; get; add "\n"; add " rsequence* - "; ++; get; add "\n"; add " <* - "; ++; get; add "\n"; add "rseq/group* - "; ++; get; add "\n"; add " >* - "; ++; get; add "\n"; add " rsequence* - "; ++; get; add "\n"; print; quit; } # print each token and attribute value. "textmatch*{*" { clear; add "textmatch* - "; get; add "\n"; add " {* - "; ++; get; --; add "\n"; print; quit; } # left hand side of parse rule. "LHS*=*" { clear; add " LHS* - "; get; add "\n"; add " =* - "; ++; get; add "\n"; print; quit; } # print each token and attribute value. +( should have a list of pops # this should help debugging lookahead syntax "(*altgroup*)*","(*altgroup*|*","<*altgroup*>*","<*altgroup*|*" { clear; add " # ---------------------------------------- # Syntagma: # alternation groups are used for optionals, lookaheads, and # alternation within a rule. Usually each 'branch' of the alternation # needs to have the same number of tokens or literal characters. 
# ---------------------------------------- "; replace "\n " "\n"; add "\n"; add " <* or (* - "; get; add "\n"; add " altgroup* - "; ++; get; --; add "\n"; add ">* |* or (* - "; ++; ++; get; --; --; add "\n"; print; quit; } # print each token and attribute value. +( should have a list of pops # this should help debugging lookahead syntax "+(*lookgroup*)*","+(*lookgroup*|*" { clear; add " +(* - "; get; add "\n"; add "lookgroup* - "; ++; get; --; add "\n"; add " |* or )* - "; ++; ++; get; --; --; add "\n"; print; quit; } # alternation groups and optionals "LHS*=*rsequence*(*altgroup*)*","LHS*=*rsequence*<*altgroup*>*", "LHS*=*rsequence*(*rsequence*)*","LHS*=*rsequence*<*rsequence*>*" { clear; add " # ---------------------------------------- # Syntagma: # nearly a parse rule. [..] is used for optionals and # (..) is used for grouping alternatives # ---------------------------------------- "; replace "\n " "\n"; add "\n"; add " LHS* - "; get; add "\n"; add " =* - "; ++; get; add "\n"; add "rsequence* - "; ++; get; add "\n"; add " <* or (* - "; ++; get; add "\n"; add " altgroup* - "; ++; get; add "\n"; add " >* or )* - "; ++; get; add "\n"; print; quit; } # debug lookahead groups "LHS*=*rsequence*+(*lookgroup*)*;*" { clear; add " # ---------------------------------------- # Syntagma: # ---------------------------------------- "; replace "\n " "\n"; add "\n"; add " LHS* - "; get; add "\n"; add " =* - "; ++; get; add "\n"; add "rsequence* - "; ++; get; add "\n"; add " +(* - "; ++; get; add "\n"; add "lookgroup* - "; ++; get; add "\n"; add " )* - "; ++; get; add "\n"; add " ;* - "; ++; get; add "\n"; print; quit; } "textmatch*==*var*" { clear; add " # ---------------------------------------- # Syntagma syntax: # the program is incomplete. # ---------------------------------------- "; replace "\n " "\n"; # print each token and attribute value. 
add "textmatch* - "; get; add "\n"; add " ==* - "; ++; get; --; add "\n"; add " var* - "; ++; ++; get; --; --; add "\n"; print; quit; } "LHS*=*rsequence*" { clear; add " # ---------------------------------------- # Syntagma: # you wrote a partial program. add a ';' to complete the rule # and a lexing rule as well. # ---------------------------------------- "; replace "\n " "\n"; add " LHS* - "; get; add "\n"; add " =* - "; ++; get; --; add "\n"; add "rsequence* - "; ++; ++; get; --; --; add "\n"; print; quit; } "{*action*}*" { clear; add " # ---------------------------------------- # Syntagma syntax: # the program is incomplete. # ---------------------------------------- "; replace "\n " "\n"; # print each token and attribute value. add " {* - "; get; add "\n"; add " action* - "; ++; get; --; add "\n"; add " }* - "; ++; ++; get; --; --; add "\n"; print; quit; } "(*notset*)*" { clear; add " # ---------------------------------------- # Syntagma syntax: # a 'notset' is for negative lookaheads and groups # example: (not (a b) and not (b c)) # ---------------------------------------- "; replace "\n " "\n"; # print each token and attribute value. 
add " (* - "; get; add "\n"; add " notset* - "; ++; get; --; add "\n"; add " )* - "; ++; ++; get; --; --; add "\n"; print; quit; } "rule*","ruleset*" { clear; add "\n"; add "# ----------------------------------------\n"; add "# Syntagma friendly advice: \n"; add "# You need at least 1 lexing rule with your parsing rules\n"; add "# Example (a well-known esoteric language):\n"; add "# lit: '['|']'; # lex literal tokens []\n"; add "# inst: [-+><.,]; # lex instructions -+><.,\n"; add "# block = '[' inst ']' | '[' prog ']' | '[' ']'; # a parse rule \n"; add "# prog = inst inst | inst block | prog inst | prog block; \n"; add "# eof { ()=prog { print 'valid BF program \\n'; exit;}} \n"; add "# ----------------------------------------\n\n"; get; add "\n\n"; print; quit; } "textrule*","textruleset*" { clear; add " # ---------------------------------------- # Syntagma syntax therapy: # text-rules are for using within blocks either in the lexing phase # or the parsing phase, here is an example: # [:alpha:]+ { shape: 'circle'|'square'; word:*; } # ---------------------------------------- "; replace "\n " "\n"; get; add "\n\n"; print; quit; } # interpolated text "print*itext*" { clear; add " # ---------------------------------------- # Interpolated text: # example: # print '$1:$2'; # ---------------------------------------- "; replace "\n " "\n"; ++; get; --; add "\n\n"; print; quit; } } push;push;push;push;push;push;push;push;push;push;push; # end of error parsing #----------------------- # 1 token parsing pop; # currently ignoring comments but it would be nice to transfer # to compiled nom code. 
"comment*" { clear; .reparse } #----------------------- # 2 token parsing pop; "not*token*" { clear; add '!"'; ++; get; --; add '"'; put; clear; add "nottoken*"; push; .reparse } "(*nottoken*","+(*nottoken*","(*notsequence*","+(*notsequence*" { clear; add "(*notset*"; push; push; .reparse } # a phantom beginblock "begin*{*" { add "beginblock*"; push; push; push; .reparse } # a phantom beginblock "beginblock*action*","beginblock*lexrule*","beginblock*vardef*" { clear; get; !"" { add "\n"; } ++; get; --; put; clear; add "beginblock*"; push; .reparse } # integrate the begin block into the script. "start*lexrule*" { clear; get; ++; get; --; put; clear; add "lexruleset*"; push; .reparse } # some simple literal token combinations # +( will be the lookahead group token. This will also store the # list of pop;pop;... just like = and | and ( - if I do alternation groups # which I will. "+*(*" { clear; put; add "+(*"; push; .reparse } # this is the token attribute assignment operator ":*=*" { clear; add ":=*"; push; .reparse } # simplifying parse rules with context token unification # example: begins 'x' and ends 'y' { # example: 'a'|'b'|[1-9] { "andset*{*","charset*{*","orset*{*","quoted*{*","star*{*", "empty*{*","charquoted*{*","class*{*" { clear; add "textmatch*{*"; push; push; .reparse } # simplifying parse rules with context token unification # example: begins 'x' and ends 'y' == $1 {...} # example: "green" == $colour {...} # compile: clear; mark "here"; go "colour"; get; go "here"; "green" {...} # example: begins "green" == $3 {...} # compile: clear; ++;++; get; --;--; B"green" {...} # comparisons of variables with textmatches like classes, strings, etc # or just comparison with the pattern-space B"andset*",B"orset*",B"quoted*",B"star*",B"empty*",B"charquoted*",B"class*" { E"==*",E"var*",E"attvar*" { E"==*" { clear; add "textmatch*==*"; } E"var*" { clear; add "textmatch*var*"; } E"attvar*" { clear; add "textmatch*attvar*"; } push; push; .reparse } } # example: $count 
'1' (or) == empty (or) $1 [0-9] B"==*",B"var*",B"attvar*" { E"andset*",E"orset*",E"quoted*",E"star*",E"empty*",E"charquoted*",E"class*" { push; clear; add "textmatch*"; push; .reparse } } # reverse the order of comparisons "var*textmatch*","attvar*textmatch*" { clear; get; ++; swap; --; put; clear; add "textmatch*var*"; push; push; .reparse } # reinsert the comparison operator == and copy attribute # the comparison operator only serves to make the parse stack more # comprehensible. "textmatch*var*" { clear; ++; get; ++; put; --; clear; put; --; clear; add "textmatch*==*var*"; push; push; push; .reparse } # use a lexblock phantom token here? no because that allows # empty lex rule blocks which seems silly. # context-induced parse-token simplification "class*and*","quoted*and*","charquoted*and*" { clear; add "andset*and*"; push; push; .reparse } # this is a nice way to ensure that only the right sort of # tokens can go into a block {...} that follows a parse-reduction rule # I am not sure if I should allow lex rules here but I will for now. # example: a = b c { exit; } "ruleblock*action*","ruleblock*textrule*","ruleblock*attrule*", "ruleblock*lexrule*" { # join token with newline unless the 1st is a phantom ruleblock clear; get; !"" { add "\n"; } ++; get; --; put; clear; add "ruleblock*"; push; .reparse } # interpolate variables into text. 
"print*quoted*" { add "interp*"; push; push; push; .reparse }
"print*charquoted*" { clear; add "print*quoted*"; push; push; .reparse }
"println*quoted*" { add "interp*"; push; push; push; .reparse }
# keep the 'println' keyword so that the trailing newline is not lost
"println*charquoted*" { clear; add "println*quoted*"; push; push; .reparse }

# This is a way to interpolate variables into a string.
"quoted*interp*" {
  clear; get;
  # need to normalise quotes for interpolation
  B"'".E"'" { clip;clop; }
  B'"'.E'"' { clip;clop; }
  put; clear; add 'add "'; get; add '"'; put;
  # special line and char and counter 'variables'
  # the number of lines read from the input stream
  replace "$line" '"; lines; add "';
  # the number of chars read from the input stream
  replace "$char" '"; chars; add "';
  # access the pep machine accumulator
  replace "$counter" '"; count; add "';
  # text is the text in the current tape cell
  replace "$text" '"; get; add "';
  # get the parse-stack?
  # replace "$stack" "'; ++;++;++;put;--;--;--; d;stack;swap;get; add '";
  # can I replace any variable here?
  # the $n variables which are token attribute values
  replace "$1" '"; get; add "';
  replace "$2" '"; ++; get; --; add "';
  replace "$3" '"; ++;++; get; --;--; add "';
  replace "$4" '"; ++;++;++; get; --;--;--; add "';
  replace "$5" '"; ++;++;++;++; get; --;--;--;--; add "';
  replace "$6" '"; ++;++;++;++;++; get; --;--;--;--;--; add "';
  # an optimisation!! remove empty add commands
  # replace "add '';" "";
  replace 'add "";' '';
  # remove extra space from lone get.
  " get;" { clop; }
  add ";"; put;
  clear; add "itext*"; push; .reparse
}

# LHS token attribute assignment
# example: @3
# compile: ++;++; put; --;--;
# example: @4
# compile: ++;++;++; put; --;--;--;
"@*digit*" {
  clear; add "@"; ++; get; --;
  replace "@9" "++;@8;--";
  replace "@8" "++;@7;--";
  replace "@7" "++;@6;--";
  replace "@6" "++;@5;--";
  replace "@5" "++;@4;--";
  replace "@4" "++;@3;--";
  replace "@3" "++;@2;--";
  replace "@2" "++; @1; --";
  replace "@1" "put";
  add ";"; put;
  clear; add "lattribute*"; push; .reparse
}

# variable length alternation sequences will compile code into
# this '{' token attribute, so I want to make sure that it is
# empty. no, fix:
"rsequence*{*" {
  clear; ++; put; --; add "rsequence*{*";
  # dont reparse because you get an infinite loop
}

# set up rsequence parsing, also in alternation-groups
# example: a = b c ( e f | g h ) i j ; # alternation group
# example: a = b c +( e f | g h ); # lookahead group
# example: a = b < e | h > x y; # rsequence in and after optional
"=*token*","|*token*","(*token*",")*token*",
"+(*token*","<*token*",">*token*" {
  push;
  # store pop list in = attribute
  clear; --; add "pop;"; put; ++;
  clear; add '"'; get; add '"';
  # dont double-wrap not-tokens in quotes
  B'"!"'.E'""' { clip; clop; }
  put;
  # reverse not ends with. fix:
  B'E!' { clop; clop; put; clear; add "!E"; get; }
  put;
  clear; add "rsequence*"; push; .reparse
}

# just put a pop; list in | this is used by lookgroups etc
"|*charquoted*" {
  clear; add "pop;"; put;
  clear; add "|*charquoted*";
}

# set up rsequence parsing with literals, also in alternations
# see the 3 token rule for '|*charquoted*' etc
# also for lookahead groups
# example: '=' rsequence = '=' name ;
# example: '(' rsequence = '(' name ;
"=*charquoted*","(*charquoted*","+(*charquoted*","<*charquoted*" {
  push;
  # convert 'x' to x* for literal tokens
  clear; get;
  # fix: also handle negated literal characters? these are
  # useful in lookaheads and other circumstances.
  # but I think I need a separate token. notcharquoted*
  # example: !";" -> !";*" ????
  clip; clop; add "*";
  # fix:
  # B"'",B"!'" { add "'"; }
  put;
  # store pop list in = or ( or +( or [ attribute
  clear; --; add "pop;"; put; ++;
  clear; add '"'; get; add '"'; put;
  clear; add "rsequence*"; push; .reparse
}

# get the next character into the pattern space or nothing if EOF
# example: next;
# compile: !(eof) { read; }
"next*;*" {
  clear; add "!(eof) { read; }"; put;
  clear; add "lexrule*"; push; .reparse
}

"between*to*","between*not*","between*ends*","between*begins*" {
  replace "between*" ""; clip; put;
  clear; add "cant mix 'between' and '"; get;
  add "' key words (at line "; lines; add ")\n";
  print; quit;
}

"to*between*","to*not*","to*ends*","to*begins*" {
  replace "to*" ""; clip; put;
  clear; add "cant mix 'to' and '"; get;
  add "' key words (at line "; lines; add ")\n";
  print; quit;
}

# reduce number of tokens
# example: [a-z] to -> parse: pattern*to*
"class*to*","charquoted*to*","quoted*to*" {
  clear; add "pattern*to*"; push; push; .reparse
}

# example: to '/end' -> parse: to*pattern*
# no classes here, because nom://until cant do it.
"to*charquoted*","to*quoted*" {
  clear; add "to*pattern*"; push; push; .reparse
}

# example: [a-z] between -> parse: pattern*between*
"class*between*","charquoted*between*","quoted*between*" {
  clear; add "pattern*between*"; push; push; .reparse
}

# example: between [:space:] -> parse: between*pattern*
"between*charquoted*","between*quoted*" {
  clear; add "between*pattern*"; push; push; .reparse
}

# the 'match' keyword is actually optional, it is just supposed
# to emphasise that some text is being matched.
"match*class*","match*quoted*","match*charquoted*", "match*eof*","match*empty*","match*orset*","match*andset*", "match*tomatch*","match*betweenmatch*" { clop;clop;clop;clop;clop;clop; push; get; --; put; ++; clear; .reparse } # syntactic sugar, # example: only 'abc' or 'a' # compile: [abc] or [a] "only*quoted*","only*charquoted*" { clear; ++; get; --; B"B","E","!" { clear; add "cant combine 'only' with 'begins/ends/not'\n"; print; quit; } clip; clop; put; clear; add "["; get; add "]"; put; clear; add "class*"; push; .reparse } # allow negation of classes etc "not*class*","not*quoted*","not*charquoted*","not*empty*","not*eof*" { replace "not*" ""; ++; ++; put; --; --; clear; add "!"; ++; get; --; put; clear; ++; ++; get; --; --; push; .reparse } # allow negation of tokens, wrap in quotes "not*token*" { clear; add '!"'; ++; get; --; add '"'; put; clear; add "token*"; push; .reparse } # text begins with and text ends with. "begins*quoted*","begins*charquoted*","ends*quoted*","ends*charquoted*" { B"ends*" { replace "ends*" ""; } B"begins*" { replace "begins*" ""; } push; --; get; ++; get; # 'begins-not' needs to be 'not-begins' in nom etc B"E!" { clop; clop; put; clear; add "!E"; get; } B"B!" 
{ clop; clop; put; clear; add "!B"; get; } # print; quit; --; put; ++; clear; .reparse } "comment*comment*" { clear; get; add "\n"; ++; get; --; put; clear; add "comment*"; push; .reparse } # how to include comments #* "comment*lexrule*","lexrule*comment*","lexruleset*comment*" { clear; get; add "\n"; ++; get; --; put; clear; add "lexruleset*"; push; .reparse } *# "token*token*","sequence*token*" { # count tokens to calculate "push;" later a+; clear; get; ++; get; --; put; clear; add "sequence*"; push; .reparse } # allow literal chars in sequences if they have already been # declared with lit: [abc]; (or) lit: ';'|':'; # eg: option = word number ';' ; "token*charquoted*","sequence*charquoted*" { # count tokens to calculate "push;" later a+; # convert 'x' to x* for literal tokens clear; ++; get; clip; clop; add "*"; put; --; clear; get; ++; get; --; put; clear; add "sequence*"; push; .reparse } # allow literal chars to begin sequences if they have already been # declared with lit: [abc]; (or) lit: ';'|':'; # eg: option = '(' obj ')' ; # charquoted.sequence should not occur. "charquoted*token*","charquoted*charquoted*","charquoted*sequence*" { # count tokens to calculate "push;" later a+; # convert 'x' to x* for literal tokens clear; get; clip; clop; add "*"; put; clear; get; ++; get; --; put; clear; add "sequence*"; push; .reparse } # eg: opt = '(' ')' ; "charquoted*charquoted*" { a+; # convert 'x' to x* for literal tokens clear; get; clip; clop; add "*"; put; clear; ++; get; clip; clop; add "*"; put; --; clear; get; ++; get; --; put; clear; add "sequence*"; push; .reparse } # need to construct the LHS here. using 'stack' is much easier # but feels a bit lazy, and I quite like being reminded how many # tokens I am pushing. 
# example: a b c = # compile: "clear; add 'a*b*c*'; push;push;push; .reparse" # or: "clear; add 'a*b*c*'; stack; .reparse" "token*=*","sequence*=*" { # later have to transform this count number into # push; or push;push; etc clear; get; a+; count; put; clear; # reset the token counter for the RHS zero; clear; add 'clear; add "'; get; add '#;'; # 6 token limit for left-hand-side which is more than enough # look-ahead or context? replace "1#;" '"; push;'; replace "2#;" '"; push;push;'; replace "3#;" '"; push;push;push;'; replace "4#;" '"; push;push;push;push;'; replace "5#;" '"; push;push;push;push;push;'; replace "6#;" '"; push;push;push;push;push;push;'; add " .reparse"; put; # save into top of tape for variable length alternations mark "here"; go "LHS"; put; go "here"; clear; add "LHS*=*"; push; push; .reparse } #* no, old rule, remove "token*;*","sequence*;*" { clear; get; a+; count; put; clear; zero; add "RHS*"; push; .reparse } *# # just simplify parse rules, while maintaining the separation # between the lexing and parsing sections. "lexrule*rule*" { clear; add "lexruleset*rule*"; } "lexruleset*rule*" { clear; add "lexruleset*ruleset*"; } "lexruleset*ruleset*".(eof) { clear; add "# -------------------------------------\n"; add "# nom script created by www.nomlang.org/eg/syntagma.pss\n\n"; add "begin { nop; }\nread; put; \n"; get; # if the parser doesn't consume or delete character from the # input stream, then it is an error. stop the show. 
add "\n!'' { \n"; add " put; clear; add 'unlexed character \"'; get; add '\" ';\n"; add " add 'at line '; lines; add ' of input.\\n'; \n"; add " add 'All characters in the input should be lexed or ignored\\n'; \n"; add " print; clear; zero; a-; a-; quit; \n"; add "}"; add "\n\nparse>\n"; add "# show the parse-stack reductions.\n"; add 'add "## line:"; lines; add " char:"; chars; '; add 'add " "; print; clear; \n'; add 'unstack; print; stack; (eof) { add " EOF"; } \n'; add '# show last attribute if required.\n'; add '# add " ("; --; get; ++; add ")"; \n'; add '# replace "\\n" "\\n## ";\n'; add 'add "\\n"; print; clear;\n'; ++; get; --; put; clear; add "grammar*"; push; .reparse } # lists of textrules (eg: keyword:'to'|'is'|'a';) # if we mix lexrules with text then they become textrules "textrule*action*","textruleset*action*", "textrule*textrule*","textrule*lexrule*","textruleset*textrule*", "lexrule*textrule*","lexruleset*textrule*","textruleset*lexrule*" { # dont add a newline to a phantom block clear; get; !"" { add "\n"; } ++; get; --; put; clear; add "textruleset*"; push; .reparse } "lexrule*lexrule*","lexruleset*lexrule*","action*lexrule*", "lexruleset*lexrule*","lexrule*action*","lexruleset*action*" { clear; get; add "\n"; ++; get; --; put; clear; add "lexruleset*"; push; .reparse } "rule*rule*","ruleset*rule*","rule*action*", "action*rule*","ruleset*action*" { clear; get; add "\n"; ++; get; --; put; clear; add "ruleset*"; push; .reparse } "delete*;*","trim*;*","ltrim*;*","rtrim*;*" { clear; get; add "; put; "; put; clear; add "action*"; push; .reparse } "exit*;*" { clear; get; add ";"; put; clear; add "action*"; push; .reparse } "action*action*" { clear; get; add "\n"; ++; get; --; put; clear; add "action*"; push; .reparse } # do not allow actionblock to contain attribute rules. 
"actionblock*action*","actionblock*lexrule*" { clear; get; add "\n"; ++; get; --; put; clear; add "actionblock*"; push; .reparse } #----------------------- # 3 token parsing pop; "notset*and*notsequence*","notset*and*nottoken*" { clear; get; ++; ++; add "."; get; --; --; put; clear; add "notset*"; push; .reparse } # an actionblock* cannot contain attrules* because we dont know the # length of the sequence. "altgroup*>*{*",">*rsequence*{*" { push; push; push; clear; put; add "actionblock*"; push; .reparse } "declare*var*;*" { # the var already has fetch code in it... clear; ++; get; --; replace 'mark "here"; go' 'mark'; replace 'get; go "here";' ''; add '++;'; put; clear; add "vardef*"; push; .reparse } # reverse the order "var*==*textmatch*","attvar*==*textmatch*" { clear; get; ++;++; swap; --;--; put; clear; add "textmatch*==*var*"; push; push; push; .reparse } # A phantom textruleset to start the block "==*attvar*{*","==*var*{*","rsequence*)*{*" { push; push; push; put; add "textruleset*"; push; .reparse } # Let check for empty brackets (because of the phantom token above. # fix: put in the error section? "{*textruleset*}*","{*ruleblock*}*" { ++; swap; "" { add "Empty block braces {} found at line "; lines; add "\n"; print; quit; } swap; --; } # for deleting tokens and maybe checking, I was using 0 but I need that # for a number. # example: () = a b ; # compile: "a*b*" { clear; .reparse } "(*)*=*" { clear; add "clear; .reparse"; put; clear; add "LHS*=*"; push; push; .reparse } #* # a lookahead grouping, for a single token sequence. this is not as # useful as (a|b|c) for lookaheads. 
# example: a = b (c d); # compile: "b*c*" { replace "b*c*d*" "a*c*d*"; push; push; .reparse } # or: B"b*".E"c*" { replace "b*" "a*"; push; push; .reparse } now look at the more complicated example: a b = c d e (a b|c e); # must be equal length alternations compile: pop;pop;pop; B"c*d*e*".!"c*d*e*" { # add a start marker like '#' E"a*b*",E"c*e*" { # add start marker, somehow replace "#c*d*e*" "a*b*"; push;push;push;push; # !! now need to copy attributes from a b and c e to new # positions. this will be challenging. .reparse } } push;push;pus; *# # sequence alternations within (..) and [...] are called altgroups # example: ('.' x | ',' y ) "(*rsequence*|*","<*rsequence*|*" { replace "rsequence*" "altgroup*"; push; push; push; clear; --; --; get; put; ++; ++; clear; .reparse } "+(*rsequence*)*","+(*rsequence*|*" { replace "rsequence*" "lookgroup*"; push; push; push; clear; --; --; add "E"; get; # reverse not-ends-with, for not tokens for example B'E!' { clop; clop; put; clear; add "!E"; get; } put; ++; ++; clear; .reparse } # I need this avoid a class with lexing alternations (charset* token) # because a charset = charset | charset; "rsequence*|*charquoted*","altgroup*|*charquoted*","lookgroup*|*charquoted*" { push; push; # convert 'x' to "x*" for literal tokens # convert !'x' to !'x*' for negated literal tokens clear; get; clip; clop; add "*"; put; clear; add '"'; get; add '"'; put; # store pop list in = attribute clear; --; add "pop;"; put; ++; clear; add "rsequence*"; push; .reparse } # part of the new rule parsing code. An rsequence is a sequence # of tokens on the right hand side of the = # example: = a b c # compile: = rseq # example: + ( '.' b | c d ) # compile: +(*rsequence*|*rsequence*)* "=*rsequence*token*","|*rsequence*token*", "(*rsequence*token*","+(*rsequence*token*","<*rsequence*token*", ")*rsequence*token*",">*rsequence*token*" { # save the context token push; # store pop list in the '=' or '|' attribute. 
This will be used # for compilation later, but also to check rsequence lengths clear; --; get; add "pop;"; put; ++; # wrap sequence in quotes clear; get; clip; ++; get; add '"'; --; put; clear; add "rsequence*"; push; .reparse } # rule sequences with literals # example: = a 'c' # compile: = rseq # example: ( a b '#' "=*rsequence*charquoted*","|*rsequence*charquoted*", "(*rsequence*charquoted*","+(*rsequence*charquoted*", "<*rsequence*charquoted*" { # save the context token push; # convert 'x' to x* for literal tokens clear; ++; get; clip; clop; add "*"; put; --; # store pop list in the '=' or '|' attribute. This will be used # for compilation later, but also to check rsequence lengths clear; --; get; add "pop;"; put; ++; # wrap sequence in quotes clear; get; clip; ++; get; add '"'; --; put; clear; add "rsequence*"; push; .reparse } # the second item can be a string because we compile to 'until' # but second item cant be a class because of 'untils' limitations # this compiles a incomplete snippet that will be completed later # example: '[' to ']' # example: [:;] to '/end' # compile: '[' { until ']'; put; "pattern*to*pattern*" { clear; get; add ' { until '; ++; ++; get; --; --; add '; put; '; put; clear; add "tomatch*"; push; .reparse } # this compiles a incomplete snippet that will be completed later # example: '[' between [:space:] # compile: '[' { whilenot [:space:]; put; # example: [:;] until '/' # bug: this is allowing 'a' between 'ab' because everything is a # pattern. 
"pattern*between*pattern*" {
  clear; ++; ++; get;
  # convert from quoted to class
  B"'".E"'" {
    clip; clop;
    "]" { clear; add "\\]"; }
    put; clear; add "["; get; add "]"; put;
  }
  --; --;
  clear; get; add ' { whilenot '; ++; ++; get; --; --;
  add '; put; '; put;
  clear; add "betweenmatch*"; push; .reparse
}

# and logic, but this cannot be mixed with OR | logic
# remember that the quoted* class* and charquoted* attributes can
# actually contain 'not' logic and 'begins with' logic etc, so these
# tokens are somewhat badly managed.
# fix: I can remove all the class.and.quoted rules etc because this
# is dealt with by:
# >> andset and = class and | quoted and | charquoted and ;
"andset*and*quoted*","andset*and*class*","andset*and*charquoted*",
"class*and*quoted*","class*and*class*","class*and*charquoted*",
"quoted*and*quoted*","quoted*and*class*","quoted*and*charquoted*",
"charquoted*and*quoted*","charquoted*and*class*",
"charquoted*and*charquoted*" {
  clear; get; ++; get; ++; get; --; --; put;
  clear; add "andset*"; push; .reparse
}

"delete*quoted*;*","delete*charquoted*;*" {
  clear; add "replace "; ++; get; --; add " '';"; put;
  clear; add "action*"; push; .reparse
}

# print statements
# example: print 'error at line: $line \n';
# compile: clear; add 'error at line:'; lines; add "\n"; print; clear;
"print*itext*;*" {
  clear; add "clear; "; ++; get; --; add " print; clear;"; put;
  clear; add "action*"; push; .reparse
}

# the same but adds a newline
"println*itext*;*" {
  clear; add "clear; "; ++; get; --; add " add '\\n'; print; clear;"; put;
  clear; add "action*"; push; .reparse
}

# example: print 'x';
"print*charquoted*;*" {
  clear; add "clear; add "; ++; get; --; add "; print; clear;"; put;
  clear; add "action*"; push; .reparse
}

# delete from the input stream all following matching chars
"ignore*class*;*" {
  clear; add "# ignore-rule \n"; add "while "; ++; get;
  add "; "; get; --; add " { clear; }"; put;
  clear; add "lexrule*"; push; .reparse
}

# eg: EOF: name = capital lowerchars; #
example: eof: print "hi"; "eof*:*rule*","eof*:*action*" { replace ":*" "{*"; add "}*"; push; push; push; push; .reparse } # simplify lex parsing, # textrules and lexrules can only occur in the lexing phase of # the syntagma script. "{*textrule*}*","{*lexrule*}*","{*lexruleset*}*" { clear; add "{*textruleset*}*"; push; push; push; .reparse } # indent code in braces "{*textruleset*}*","{*action*}*","{*ruleset*}*", "{*ruleblock*}*","{*beginblock*}*" { push; push; push; add "\n"; --; --; get; replace "\n" "\n "; put; ++; ++; clear; pop; pop; pop; } # orsets "quoted*|*quoted*","quoted*|*charquoted*","quoted*|*class*" { clear; get; add ","; ++; ++; get; --; --; put; clear; add "orset*"; } "charquoted*|*quoted*","class*|*quoted*" { clear; get; add ","; ++; ++; get; --; --; put; clear; add "orset*"; } "orset*|*quoted*","orset*|*charquoted*","orset*|*class*" { clear; get; add ","; ++; ++; get; --; --; put; clear; add "orset*"; } # but these should be able to be expressed by classes like [ab\n] # charsets eg: 'a'|'b'|'\n' # eg: [a-z]|'x'|'y' "charquoted*|*charquoted*","charquoted*|*class*","charset*|*charquoted*", "class*|*charquoted*","charset*|*class*","class*|*class*" { clear; get; add ","; ++; ++; get; --; --; put; clear; add "charset*"; } # eg: exit 4; # compile: zero; a+; a+; a+; a+; quit; "exit*digit*;*" { clear; add "zero; "; ++; get; --; add "#"; # a rather silly trick, todo, negative numbers replace "5#" "4# a+;"; replace "4#" "3# a+;"; replace "3#" "2# a+;"; replace "2#" "1# a+;"; replace "1#" "0# a+;"; replace "0#" ""; add " quit;"; put; clear; add "action*"; push; .reparse } #----------------------- # 4 token parsing pop; # allow negation of sequences of tokens on the right-hand-side of # a parse rule. These can be used in "notsets" which are and logic # sets of negated tokens or sequences of tokens. 
# example: e = e '*' e +(not ('*' e) and not ('/' e)); "not*(*rsequence*)*" { clear; ++; ++; add "!"; get; --; --; put; # put the pop; list in the previous invisible token ( ) | = etc # clear; ++; get; --; --; put; ++; clear; add "notsequence*"; push; .reparse } # like awks begin blocks "begin*{*beginblock*}*" { clear; add "begin {"; ++;++; get; --;--; add "\n}\n"; put; clear; add "start*"; push; .reparse } # alternation group parsing. need to check for unequal length sequences. # example: (a b | c '.') # compile: "a*b*","c*.*" "altgroup*|*rsequence*|*","altgroup*|*rsequence*)*", "altgroup*|*charquoted*|*","altgroup*|*charquoted*)*", "altgroup*|*rsequence*>*","altgroup*|*charquoted*>*" { # a push list is already in | - see 2 token rule for charquoted. replace "|*rsequence*" ""; replace "|*charquoted*" ""; push; push; # workspace is clear. get the pop; lists in ( and | .The (* token # is just before the altgroup* token, but not visible here. --; get; --; --; !(==) { clear; add "\n"; add "The sequences in the alternation group were of unequal length\n"; add "(rule on line "; lines; add ") \n"; add "This is currently not allowed in alternation groups (x|y|x) \n"; print; quit; } ++; ++; ++; clear; --; --; get; ++; ++; add ","; get; --; --; put; ++; ++; clear; .reparse } # lookahead group parsing. need to check for unequal length sequences. # example: a b | c '.' | (or) a b | c '.' ) # compile: E"a*b*",E"c*.*" "lookgroup*|*rsequence*|*","lookgroup*|*rsequence*)*" { # here do "lookgroup|charquoted) as well" but need to put # a push list in | - see 2 token rule replace "|*rsequence*" ""; replace "|*charquoted*" ""; push; push; # workspace is clear. get the pop; lists in +( and | .The +(* token # is just before the lookgroup* token, but not visible here. --; get; --; --; !(==) { clear; add "\n"; add "The sequences in the lookahead alternation were of unequal length\n"; add "(rule on line "; lines; add ") \n"; add "This is not allowed in lookahead groups +( ...) 
\n"; print; quit; } ++; ++; ++; clear; --; --; get; ++; ++; add ",E"; get; --; --; put; ++; ++; clear; .reparse } # make a phantom 'ruleblock*' token, to help with parsing. A phantom # token is a token created with an empty attribute value, and without actually # parsing anything from the input stream. It must be created in a # particular context, and must avoid interfering with other parse rules. # I need to use this also in the lexblocks because it is so good "LHS*=*rsequence*{*","LHS*=*alt*{*","+(*lookgroup*)*{*","(*altgroup*)*{*" { push; push; push; push; clear; put; add "ruleblock*"; push; .reparse } #* # assign text to LHS token attributes by getting attributes from the RHS # example: a = b c { @1 := "$1 : $2"; } compile: pop;pop; "b*c*" { clear; get; add " : "; ++;get;--; put; clear; add "a*"; push; .reparse } push;push; *# # example: @1 := "$1 : $2"; # compile: clear; get; add " : "; ++;get;--; put; clear; "lattribute*:=*quoted*;*" { clear; ++; ++; add "clear; add "; get; # special line and char and counter 'variables' # the number of lines read from the input stream replace "$line" "'; lines; add '"; # the number of chars read from the input stream replace "$char" "'; chars; add '"; # access the pep machine accumulator replace "$counter" "'; count; add '"; # text is the text in the current tape cell replace "$text" "'; get; add '"; # print the parse-stack replace "$stack" "'; ++;++;++;put;--;--;--; d;stack;swap;get; add '"; # the $n variables replace "$1" "'; get; add '"; replace "$2" "'; ++; get; --; add '"; replace "$3" "'; ++;++; get; --;--; add '"; replace "$4" "'; ++;++;++; get; --;--;--; add '"; replace "$5" "'; ++;++;++;++; get; --;--;--;--; add '"; replace "$6" "'; ++;++;++;++;++; get; --;--;--;--;--; add '"; replace "$7" "'; ++;++;++;++;++;++; get; --;--;--;--;--;--; add '"; replace "$8" "'; ++;++;++;++;++;++;++; get; --;--;--;--;--;--;--; add '"; replace "$9" "'; ++;++;++;++;++;++;++;++; get; --;--;--;--;--;--;--;--; add '"; # put; # now get @n 
code --; --; add ";\n"; get; # an optimisation!! remove empty add commands replace "add '';" ""; replace 'add "";' ''; put; clear; add "attrule*"; push; .reparse } # new LHS/RHS rule parsing # in the parse token attributes for '=' and '|', just preceding # the rsequence, we have stored the pop list 'pop;pop;etc'. we # can compare these to check if the sequences are the same length. # if they are different lengths, then the compilation procedure is # quite different, and in the case of '{' possibly non-sensicle. # if they are unequal we will store a flag in the 1st "UNEQUAL" # this needs some rethought... # example: a = b c | e f ; # compile: pop;pop; "b*c*","e*f*" { clear; add "a*"; push; .reparse } push;push; # example: a = b | e f ; # compile: # pop; "b*" { clear; add "a*"; push; .reparse } push; # pop;pop; "e*f*" { clear; add "a*"; push; .reparse } push;push; # # as can be seen, the second compilation is more tricky # because we have to separate into 2 blocks. I believe that the # 2nd example requires a variable LHS stored on the tape, because we # need to grab that var as soon as we find an unequal sequence.... "rsequence*|*rsequence*;*","rsequence*|*alt*;*" { # save token sequence in tape cell above ';' attribute ++;++;++;++; put; --;--;--;--; clear; --; get; ++; ++; # tape test # a trick to keep the poplist but flag the alternation as having # unequal length sequences. # Here I could compile uneven sequences into the '{' token attribute # and remove "|*alt*" Then when completing the rule, I check '{' for # compiled code and include it. 
# !(==) { --; --; replace "pop;" "unequal;"; put; ++; ++; } # -------------------------------- # attempting to compile unequal alternations to the ';' attribute !(==) { clear; # get the pop; list from the '|' token attribute --; ++; add "\n"; get; add "\n"; # add the token match list ++; get; add " {\n "; mark "here"; go "LHS"; get; go "here"; add "\n}"; ++; swap; get; put; # print; quit; --; --; # get the pop; list from '|' attribute and make a push; list clear; add "\n"; get; replace "pop;" "push;"; ++; ++; swap; get; --; --; # put all compiled code into new ';' attribute put; --; clear; add "rsequence*;*"; push; push; .reparse } # print; quit; # restore token sequence. dont need to reparse clear; --; ++;++;++;++; get; --;--;--;--; } "rsequence*|*rsequence*;*","rsequence*|*alt*;*" { # compose alternation clear; get; add ","; ++; ++; get; --; --; put; # copy ';' attribute down, this may contain compiled code # for unequal length alt sequences clear; ++; ++; ++; get; --; --; put; --; clear; add "alt*;*"; push; push; .reparse } # I think variable length sequences in alternations for rules # that have a composition block {} is non sensical so I will disallow # it here "rsequence*|*rsequence*{*","rsequence*|*alt*{*" { # save token sequence in { or ; attribute ++;++;++; put; --;--;--; clear; --; get; ++; ++; !(==) { clear; add "\n"; add "The sequences in the alternation were of unequal length\n"; add "(alternation on line "; lines; add ") \n"; add "This is not allowed in parse-token reduction rules that \n"; add "have a following block\n"; print; quit; } # restore token sequence. dont need to reparse clear; --; ++; ++; ++; get; --;--;--; } # tail-reduction of RHS token sequences before '{' or ';' # this is quite elegant because the rsequences have already been # wrapped in quotes. # example: ... 
c d | e g { # compile: "c*d*","e*g*" { "rsequence*|*rsequence*{*","rsequence*|*alt*{*" { clear; get; add ","; ++; ++; get; --; --; put; clear; add "alt*{*"; push; push; .reparse } # this could also be an unequal alternation list with code in # the ';' attribute. # compile a complete rule. The pop;pop; list is stored in the '=' # LHS should already have its compiled code # NOTE: that the ';' token attribute will contain code for unequal # length sequences, and so should be added here. # example: a = d e ; # compile: pop;pop; "d*e*" { clear; add "a*"; push; .reparse } "LHS*=*rsequence*;*" { clear; ++; get; add "\n"; ++; get; add " "; --; --; add "{\n "; get; add "\n}\n"; put; # here: build the push;push; list and add to nom code clear; ++; get; --; replace "pop;" "push;"; swap; get; # add the unequal sequence compiled code (from the ';' attribute) ++; ++; ++; get; --; --; --; put; clear; add "rule*"; push; .reparse } # normally the rsequences can be same or different lengths. # the compilation for unequal length sequences is pretty special. # it involves creating separate blocks for each branch and # compiled nom code is saved in the ';' attribute and copied with # that token. # example: a = b c | e f ; # compile: # pop;pop; "b*c*","e*f*" { clear; add "a*"; push; .reparse } push;push; # example: a = b | e f ; # compile: # pop; "b*" { clear; add "a*"; push; .reparse } push; # pop;pop; "e*f*" { clear; add "a*"; push; .reparse } push;push; # # as can be seen, the second compilation is more tricky "LHS*=*alt*;*" { #* remove: # check for unequal length sequences in the alternation # this is obsolete code, since unequal sequences are compiled clear; ++; get; B"unequal;" { clear; add "\n"; add "The sequences in the alternation were of unequal length\n"; add "(alternation on line "; lines; add ") "; add "... 
\n"; print; quit; } --; *#
  # build code with pop;pop; list and token match list
  clear; ++; get; add "\n"; ++; get; add " "; --; --;
  add "{\n "; get; add "\n}\n"; put; clear;
  # here: build the push;push; list and do swap;get;
  ++; get; --; replace "pop;" "push;"; swap; get;
  # add the unequal sequence compiled code (from the ';' attribute)
  ++; ++; ++; get; --; --; --; put;
  clear; add "rule*"; push; .reparse
}

# eg: match '<' to '>' { tag: ''|''; }
# compile:
# '<' { until [>]; put; 'green','blue','x'
# { clear; add "tag*"; push; .reparse } }
"tomatch*{*textruleset*}*","betweenmatch*{*textruleset*}*",
"tomatch*{*action*}*","betweenmatch*{*action*}*" {
  clear; add "# lex-rule \n"; get; replace "{" "{\n ";
  # not needed here???
  replace "while !" "whilenot ";
  # indenting is done above
  ++; ++; get; --; --; add '\n}'; put;
  clear; add "lexrule*"; push; .reparse
}

# the second item can be a string because we compile to 'until'
# but second item cant be a class because of 'untils' limitations
# example: register: '[' to ']' ;
# example: register: [:;] to '/end' ;
# compile: '[' { until ']'; put; clear; add "register*"; }
"token*:*tomatch*;*" {
  clear; ++; ++; get; --; --; add ' clear; add "'; get;
  add '"; push; .reparse }'; put;
  clear; add "lexrule*"; push; .reparse
}

# example: register: '[' between [:space:] ;
# compile: '[' { whilenot [:space:] ; put; clear; add "register*"; }
"token*:*betweenmatch*;*" {
  clear; ++; ++; get; --; --; add ' clear; add "'; get;
  add '"; push; .reparse }'; put;
  clear; add "lexrule*"; push; .reparse
}

# this allows a default token for all text, in this context
# I want the * to create a default token name even if the
# pattern space is empty. But in 'match * { ... }' it only matches
# if pattern space is not empty, silly????
fix # example: shape: * ; # compile: !'' { clear; add "shape*"; push; .reparse } "token*:*star*;*" { clear; add "clear; add '"; get; add "'; push; .reparse"; put; clear; add "lexrule*"; push; .reparse } # example: lit: [,.;]; # compile: [,.;] { add "*"; push; .reparse } # example: lit: ','|':'|'x' ; # compile: ',',':','x' { add "*"; push; .reparse } "lit*:*class*;*","lit*:*charset*;*","lit*:*charquoted*;*" { clear; ++; ++; get; --; --; add " { add '*'; push; .reparse }"; put; clear; add "lexrule*"; push; .reparse } # for empty do 'match empty { etc }' # eg: EOF { name = capital lowerchars; } "eof*{*rule*}*","eof*{*ruleset*}*" { clear; add "(eof) {"; ++; ++; get; --; --; add "\n}\n"; put; clear; add "rule*"; push; .reparse } # example: eof { print 'yes'; } "eof*{*action*}*" { clear; add "(eof) {"; ++; ++; get; --; --; add "\n}\n"; put; clear; add "action*"; push; .reparse } # eg: EOF { letter: [a-z]; print 'bye'; exit 2; } "eof*{*textruleset*}*" { clear; add "(eof) {"; ++; ++; get; --; --; add "\n}\n"; put; clear; add "lexrule*"; push; .reparse } # lex tokens with AND and OR | logic # example: keyword: 'is'; # compile: 'is' { clear; add "keyword*"; push; .reparse } # example: logic: 'is'|'or'|'and'; # compile: 'is','or','and' { clear; add "keyword*"; push; .reparse } # example: 0number: [:digit] AND begins '0' # compile: [:digit:].B'0' { clear; add "0number*"; push; .reparse } "token*:*quoted*;*","token*:*orset*;*","token*:*andset*;*" { clear; ++; ++; get; --; --; add ' { put; clear; add "'; get; add '"; push; .reparse }'; put; clear; add "textrule*"; push; .reparse } # example: char: [:alpha:]; # compile: [:alpha:] { clear; add "char*"; push; .reparse } "token*:*class*;*" { clear; ++; ++; get; --; --; add ' { clear; add "'; get; add '"; push; .reparse }'; put; clear; add "lexrule*"; push; .reparse } # example: space: ' '; # compile: ' ' { clear; add "space*"; } "token*:*charquoted*;*","token*:*charset*;*" { clear; ++; ++; get; --; --; add ' { clear; add "'; get; 
    add '"; push; .reparse }';
    put;
    clear; add "lexrule*"; push; .reparse
  }

  #* fix: remove
  "andset*{*textruleset*}*","andset*{*action*}*",
  "charset*{*textruleset*}*","charset*{*action*}*",
  "orset*{*textruleset*}*","orset*{*action*}*",
  "quoted*{*textruleset*}*","quoted*{*action*}*",
  "star*{*textruleset*}*","star*{*action*}*",
  "empty*{*textruleset*}*","empty*{*action*}*",
  "charquoted*{*textruleset*}*","charquoted*{*action*}*",
  "class*{*textruleset*}*","class*{*action*}*" {
    clear; get; add " {"; ++; ++; get; --; --; add "\n}"; put;
    clear; add "lexrule*"; push; .reparse
  }
  *#

  # example: match 'ok' { print 'bye!'; exit 0; }
  # compile: 'ok' { ... }
  "textmatch*{*textruleset*}*","textmatch*{*action*}*" {
    clear; get; add " {"; ++; ++; get; --; --; add "\n}"; put;
    clear; add "lexrule*"; push; .reparse
  }

  #-----------------------
  # 5 token parsing
  pop;

  # allow negation of sequences of tokens on the right-hand-side of
  # a parse rule. These can be used in "notsets" which are and-logic
  # sets of negated tokens or sequences of tokens.
  # example: e = e '*' e +(not ('*' e) and not ('/' e));
  B";*",B"<*",B">*",B"(*",B")*",B"|*",B"=*" {
    E"not*(*rsequence*)*" {
      clear; ++; ++; add "!"; get; --; --; put;
      # put the pop; list in the previous token ( ) | = etc
      clear; ++; get; --; --; put; ++;
      clear; add "notsequence*"; push; .reparse
    }
  }

  # this should not interpolate the quoted text
  "declare*var*:=*quoted*;*","declare*var*:=*charquoted*;*" {
    # the var already has fetch code in it...
    clear; ++; get; --;
    replace 'mark "here"; go' 'mark';
    replace 'get; go "here";' '';
    add "add "; ++; ++; ++; get; --; --; --; add '; ++;';
    put;
    clear; add "vardef*"; push; .reparse
  }

  # example: word: [:alpha:]+ ;
  # compile: [:alpha:] { while [:alpha:]; put; clear; add "word*"; }
  # example: word: ![a-z]+ ;
  # compile: ![a-z] { whilenot [a-z]; put; clear; add "word*"; }
  "token*:*class*+*;*" {
    clear; ++; ++; get; add ' { while '; get; --; --;
    # while ![a-z]; is not valid nom syntax (sadly)
    replace "while !" "whilenot ";
    add '; put; clear; add "'; get; add '"; push; .reparse }';
    put;
    clear; add "lexrule*"; push; .reparse
  }

  # eg: [a-z]+ { colour: 'green'|'blue'|'x'; }
  # compile:
  #   [a-z] {
  #     while [a-z]; put;
  #     'green','blue','x' { clear; add "colour*"; push; .reparse }
  #   }
  "class*+*{*textrule*}*","class*+*{*textruleset*}*",
  "class*+*{*lexrule*}*","class*+*{*lexruleset*}*" {
    clear; get; add ' {\n while '; get;
    replace "while !" "whilenot "; add '; put;';
    # indenting is done above
    ++; ++; ++; get; --; --; --; add '\n}'; put;
    clear; add "lexrule*"; push; .reparse
  }

  #-----------------------
  # 6 token parsing
  pop;

  # a syntax to check the value of a token attribute
  # I am avoiding reversing the test because that will match "textmatch{...}"
  # example: "green" == $2 { ... }
  # compile: clear; ++; get; --; "green" { ... }
  "textmatch*==*attvar*{*textruleset*}*","textmatch*==*var*{*textruleset*}*" {
    # change '++;++; get; --; etc' to '++; swap; --;'
    # change 'go "xxx"; get; go "here" etc' to 'go "xxx"; swap; '
    # then add a swap at the end of the block. This preserves the token sequence?
    clear; ++;++; get; replace "get;" "swap;"; put; --;--;
    clear; add "clear;\n"; ++;++; get; add '\n'; --;--;
    get; add ' { '; ++;++;++;++; get; --;--;--;--; add "\n}\n";
    ++;++; get; --;--;
    put;
    clear; add "textrule*"; push; .reparse
  }

  # compile a complete rule with a following block.
  # The pop;pop; list is stored in the '='
  # LHS should already have its compiled code. The code in the block
  # should be compiled before the LHS code.
  # example: a = d e { print 'reduced!\n'; }
  # compile:
  #   pop;pop; "d*e*" {
  #     clear; add "reduced!\n"; print; clear;
  #     clear; add "a*"; push; .reparse
  #   }
  "LHS*=*rsequence*{*ruleblock*}*","LHS*=*alt*{*ruleblock*}*" {
    # make push list and store in the '{' attribute
    clear; ++; get; replace "pop;" "push;"; ++; ++; put; --; --; --; clear;
    ++; get; add "\n"; ++; get; add " "; --; --; add "{ ";
    # add block code
    ++; ++; ++; ++; get; --; --; --; --; add "\n ";
    # add LHS code
    get; add "\n}\n";
    # add the saved push list
    ++; ++; ++; get; --; --; --;
    put;
    clear; add "rule*"; push; .reparse
  }

  #*
  # compile a complete rule with alternation and a following block.
  # example: a = d e | f g { print 'reduced!\n'; }
  # compile:
  #   pop;pop; "d*e*","f*g*" {
  #     clear; add "reduced!\n"; print; clear;
  #     clear; add "a*"; push; .reparse
  #   }
  *#

  #-----------------------
  # 7 token parsing
  pop;

  "stack*(*rsequence*)*{*textruleset*}*","stack*(*altgroup*)*{*textruleset*}*",
  "stack*(*rsequence*)*{*ruleblock*}*","stack*(*altgroup*)*{*ruleblock*}*" {
    clear; add "clear; unstack;\n"; ++;++; get; ++;++;++; add " {"; get;
    add "\n}\nstack;"; --;--;--;--;--; put;
    clear; add "rule*"; push; .reparse
  }

  # parse optionals where there is no following sequence
  # andsets may be of some use here, but I won't worry for now.
  # example: a = b c [ d | e];
  # this is compiled into 2 separate nom blocks.
  # actually just delegate to rs < altgroup > rs ;
  "LHS*=*rsequence*<*altgroup*>*;*","LHS*=*rsequence*<*rsequence*>*;*",
  "LHS*=*rsequence*<*notset*>*;*" {
    # make a push list for rsequence and optionals, save in ';'
    clear; ++; get; ++; ++; get; ++; ++; ++;
    replace "pop;" "push;"; put;
    --; --; --; --; --; --; clear;
    # get pop; list and sequence
    ++; get; add "\n"; ++; get; add " {\n "; --; --; get; add "\n}\n";
    # now get optional pop; list and sequence alternation
    ++; ++; ++; get; add "\n";
    # begins-with rsequence
    --; add "B"; get; add ".!"; get; add " {\n ";
    # ends-with optional sequence alternation
    # I believe the replace below is safe because "," won't occur anywhere
    # else in the compiled code. Also, works for [rsequence] but not
    # for andsets.
    ++; ++; add "E"; get; replace '","' '",E"'; add " {\n ";
    # add LHS compiled code
    --; --; --; --; get; add "\n }\n}\n";
    # get the saved push list
    ++; ++; ++; ++; ++; ++; get; add "\n"; --; --; --; --; --; --;
    put;
    clear; add "rule*"; push; .reparse
  }

  # I may have to keep the LHS push; list in a variable
  # because I need to access it separately to align the tape
  # pointer to the start of the lookahead group; Or use stack?
  "LHS*=*rsequence*+(*lookgroup*)*;*","LHS*=*rsequence*+(*andset*)*;*" {
    E"andset*)*;*" {
      # this is a bit dubious..fix:
      clear; ++;++;++;++; get; replace "!'" "!E'";
      # also add a 1 pop list for the andset.
      put; --; clear; add "pop;"; put; --;--;--;
    }
    # temporarily fix the LHS
    clear; get; replace "clear; add " ""; replace " .reparse" ""; put;
    # make an rsequence push; list from '=' and store in ';'
    clear; ++; get; replace "pop;" "push;"; ++;++;++;++;++; put;
    --; --; --; --; --; --; clear;
    # make a lookgroup push; list from '+(' and store in ')'
    clear; ++; ++; ++; get; replace "pop;" "push;"; ++; ++; put;
    --; --; --; --; --; clear;
    # construct pop; list at top of parse block
    # this list consists of rsequence length + lookahead length
    ++; get; ++; ++; get; add "\n";
    # match the rsequence
    # example: B"a*b*c*".!"a*b*c*" {
    --; add "B"; get; add ".!"; get; add " {\n ";
    # match the lookahead group
    # example: E"x*y*",E"p*q*" {
    ++; ++; get; add " {\n";
    # build the replace command, and push list. The push list is
    # the LHS length + Lookahead length.
    # example: replace "a*b*c*" "new*"; push;push;push;
    # I am just going to check for rsequence in lookahead and halt if true.
    # but this clobbers the current attribute
    add " put; replace "; --; --; swap; clop; swap;
    add '"*'; get; add ' ""; !(==) {\n';
    add " clear; add 'lookahead contains reduction sequence.\\n';\n";
    add " add 'This is an error condition. Please modify \\n';\n";
    add " add 'the syntagma grammar. \\n'; print; quit;\n";
    add " }\n";
    add ' replace "'; get; --; --; add " ";
    # here I could try to use a trick to avoid multiple replace
    # but I can't do it, because LHS has '"a*b*"; push;push;'
    # example:
    #   replace "*a*b*" "****a*b*";
    #   replace "a*b*" "new*";
    #   replace "****a*b*" "*a*b*";
    #   push;push;push;
    # This trick should avoid replacing sequences that don't start
    # the workspace. But a ^ anchor for replace would be better.
    # build the push list
    get; add "\n "; ++; ++; ++; ++; ++; get; add " .reparse";
    #*
    # ?? copy down all attributes in lookgroup\n";
    # need to realign to the end of rsequence...do this by
    # subtracting the push; list in LHS, but this push list also
    # has the name,...
    A dodgy strategy:
    add "add ";get LHS attrib, get +( attribute,
    now we have
    >> add "l*g*"; push;push;pop;pop;
    >> replace '"; push;' '";clear;push;';
    now we have
    >> add "l*g*";clear;push; ... pop;
    >> replace "push;" "++;"; replace "pop;" "--;";
    now we have
    >> add "...";clear;--;--;++;++;
    and this will realign the pointer?
    *#
    add "\n }\n}\n";
    # build final push; list from ')' and ';'
    # which is rsequence + lookahead lengths
    get; ++; get; add "\n";
    # print;
    --; --; --; --; --; --; put;
    # clear; add "LHS*=*rsequence*+(*lookgroup*)*;*";
    clear; add "rule*"; push; .reparse
  }

  # ---------------------
  # 8 token parsing
  pop;

  # parse optionals
  # example: a = b c [ x y | e f ] c;
  # this is compiled into 2 separate nom blocks.
  "LHS*=*rsequence*<*altgroup*>*rsequence*;*",
  "LHS*=*rsequence*<*rsequence*>*rsequence*;*",
  "LHS*=*rsequence*<*notset*>*rsequence*;*" {
    # make a push list for rsequence and optionals, save in ';'
    clear; ++; get; ++; ++; get; ++; ++; get; ++; ++;
    replace "pop;" "push;"; put;
    --; --; --; --; --; --; --; clear;
    # get pop; list and sequence
    ++; get; add "\n"; ++; get; add " {\n "; --; --; get; add "\n}\n";
    # now get optional pop; list and sequence alternation
    ++; ++; ++; get; add "\n";
    # begins-with rsequence
    --; add "B"; get; add ".!"; get; add " {\n ";
    # ends-with optional sequence alternation
    # I believe the replace below is safe because "," won't occur anywhere
    # else in the compiled code. Also, works for [rsequence] but not
    # for andsets.
    ++; ++; add "E"; get; replace '","' '",E"'; add " {\n ";
    # add LHS compiled code
    --; --; --; --; get; add "\n }\n}\n";
    # get the saved push list
    ++; ++; ++; ++; ++; ++; get; add "\n"; --; --; --; --; --; --;
    put;
    clear; add "rule*"; push; .reparse
  }

  # ---------------------
  # 9 token parsing
  pop;

  # lookahead with a rule block. To achieve this we need to copy attributes
  # from the lookahead tokens to their new positions in the parse stack,
  # if the LHS sequence is shorter than the RHS sequence (rsequence).
  # This copy procedure is somewhat verbose. I think I will prohibit the
  # LHS token sequence being longer than the RHS sequence, because it
  # complicates the attribute copy, and it doesn't seem very useful anyway.
  # The tokens '=' and '+(' contain a list of pops which indicates the
  # length of the following sequence, or alternation group members.
  # I will probably use a variable to hold the pop; or push; list for
  # the LHS since there is nowhere to save it.
  # example: a = b c +(x y | p q) { @1 := "$1/$2"; println 'found a'; }
  # example: n m = b c +('.' | ',') { @1 := "$1/$2"; println 'found a'; }
  "LHS*=*rsequence*+(*lookgroup*)*{*ruleblock*}*" {
    clear;
    # first create a complete pop; list for rsequence and lookgroup
    # and save it in the '{' token.
    ++; get; ++; ++; get; ++; ++; ++; put; mark "poplist";
    --; --; --; --; --; --; add "\n";
    # build block test eg: B"b*c*" { E"x*y*",E"p*q*" { ...
    ++; ++; add "B"; get; add " {\n "; ++; ++; get; add " {\n";
    # build the replace, eg: replace "b*c*" "a*";
    --; --; add " replace "; get; add " "; --; --;
    swap; replace "clear; add " ""; replace " .reparse" "";
    # ------------------------------------
    # some serious juggling here. need the push later to
    # calculate the length difference between the LHS and rsequence.
    replace '"; push;' '"; #push;'; swap; get; add "\n ";
    # build code to: save new token sequence lower on tape and push later.
    mark "here"; go "poplist";
    swap; replace "pop;" "++;"; swap; get; add " put; ";
    swap; replace "++;" "--;"; swap; get;
    go "here";
    # add compiled ruleblock code
    add "\n # ----------";
    add "\n # code block in {}";
    ++;++;++;++;++;++;++;
    # re-indent the code
    swap; replace "\n" "\n "; swap; get;
    --;--;--;--;--;--;--;
    # copy lookahead token attributes. this is the trickiest part.
    add "\n # ----------";
    add "\n # copy lookahead attributes";
    # first create a "diff" pop; list (difference in length between LHS
    # and RHS sequences.)
    # compare LHS and rsequence push/pop lists
    # we need to build this somewhere else.
    add "\n #";
    #*
    swap;
    # let's pretend 'LHS' contains just push; list for LHS
    ++; get;
    # eg: "pop;pop;push;push;push;" for "a b= c d e..."
    replace "push;push;pop;pop;" ""; replace "push;pop;" "";
    # now only "pop;" list. eg "pop;pop;"
    replace "pop;" "++;"; swap; get; add " get;";
    swap; replace "++;" "--;"; swap; get; add "put; ";
    # save code eg: "++;++; get; --;--; put;" (for 2 token difference)
    put;
    # get +( pop; list (lookahead length)
    # if 1 lookahead token
    "pop;" { go "diff"; get; }
    # if 2 lookahead tokens
    "pop;pop;" {
      go "diff"; get; add "\n";
      add "++;"; get; add "--;"; add "\n";
    }
    # etc for more lookahead tokens
    --; swap;
    *#
    #replace '"; push;' '";\npush;'; swap; get;
    # build code to: get saved new token sequence from tape and push.
    add "\n clear; "; mark "here"; go "poplist";
    swap; replace "--;" "++;"; swap; get; add " get; ";
    swap; replace "++;" "--;"; swap; get;
    go "here";
    add " stack;\n }\n}\n";
    # get the pop list and convert to push list
    ++;++;++;++;++;++; swap; replace "--;" "push;"; swap; get;
    --;--;--;--;--;--;
    put;
    clear; add "rule*"; push; .reparse
  }

  # ---------------------
  # 10 token parsing
  pop;

  # some errors at eof. no see above
  (eof) { nop; }

  (eof) {
    "start*","action*","grammar*" {
      clear; get; add "\n\n"; print; quit;
    }
    # if no parse rules, make an empty one and a grammar
    "lexruleset*","lexrule*" {
      push; add "\n# empty rule added to grammar\n"; put;
      clear; add "rule*"; push; .reparse
    }
    add "\n[strange parse]\n"; print; quit;
  }

  push;push;push;push;push;push;push;push;push;push;