Pep & Nom


text is king

Everything important in the field of human knowledge and study begins and ends with text: plain text, alphabetic, or grapheme cluster text, ascii text, utf8, utf16, unadorned bytes of linguistic semantics. If it is not text, on a computer, then you probably can't see it. If the data is not text when you get it, at some point you are going to have to convert it into text.

The founders of unix understood this. They based their system on filters over streams, and those streams were often text. All the most important data interchange formats are text: xml, json, and comma separated values. In LISP, code is data, and data is code, but it all starts off as text: the ascii or utf8 source code.

Compilers are just mappings from one language to another, and each of those languages is text. A text file is a list data structure where each line is an item in the list, and the separator is the newline character. A FORTH program is a list data structure where each word is an item in the list and whitespace is the separator.
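
As a small illustration of the "text is a list" idea, here is a sketch in Python (Python is just a convenient choice for the example, and the sample strings are made up). Splitting on the newline gives the lines of a file, and splitting on whitespace gives the words of a FORTH-style program.

```python
# A text file seen as a list: the newline character separates the items.
text = "first line\nsecond line\nthird line"
lines = text.split("\n")
print(lines)    # ['first line', 'second line', 'third line']

# A FORTH-style program seen as a list: whitespace separates the words.
# (The program text is only sample data; nothing here executes FORTH.)
forth = ": square dup * ;\n5 square ."
words = forth.split()
print(words)    # [':', 'square', 'dup', '*', ';', '5', 'square', '.']

# Joining the items with the separator gives back the original text.
assert "\n".join(lines) == text
```

Nothing is lost in the round trip for the line list, which is why line-oriented filters can be chained so freely.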

Text is the original “open data” format. Nobody tried, or could be bothered, to buy the ascii standard. That would be like trying to patent a steering wheel. This simple open data format bred other ones which had more linguistic structure: xml, json, csv, html, markdown. The entire web runs on text protocols. You can, or could, sit at a telnet terminal and have awkward conversations with web servers or mail servers.

Everything I have said is so obvious it may seem not worth saying, but the power and importance of text is somehow overlooked, or forgotten. The core tools for manipulating raw text are quite primitive. If a programmer is asked to parse data from a text string, he or she will often try to compose a regular expression that does the job. This seems strange considering that regular expressions are really only capable of parsing the simplest of formal text patterns, that is, “regular languages”. Moreover, regular expressions can be quite cryptic, and the programmer generally has no idea how they are compiled into executable code. Even these regular expressions only became popular outside of the Unix environment in the late 1990s, through the use of perl to write CGI web scripts.
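
To make the limit of “regular languages” concrete, here is a minimal Python sketch. The names `balanced` and `TWO_LEVELS` are hypothetical and have nothing to do with pep or nom. Arbitrarily nested brackets are the textbook example of a pattern that is not regular: a regular expression in the formal sense can be written for any fixed nesting depth, but not for every depth at once, whereas a few lines of code with a counter handle all depths.

```python
import re

def balanced(text):
    """Return True if every '(' in text is closed by a later ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:       # a ')' arrived before its '('
                return False
    return depth == 0           # nothing may be left open

# A pattern that only understands two levels of nesting.
# Each extra level would need the pattern itself to grow.
TWO_LEVELS = re.compile(r"(?:\((?:\(\))*\))*")

print(balanced("(()())"), TWO_LEVELS.fullmatch("(()())") is not None)
print(balanced("((()))"), TWO_LEVELS.fullmatch("((()))") is not None)
```

The first line printed shows both approaches agreeing on a two-level string. The second shows the pattern rejecting a perfectly balanced three-level string that the counter accepts.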

Pep, nom and syntagma are designed to provide a better way to deal with text patterns. The syntagma language which I am currently implementing (in the nom script at /eg/syntagma.pss) uses a syntax similar to extended Backus-Naur form grammar rules. Already, I find it much easier to write than nom itself. Writing nom is similar to (but slightly easier than) writing in an intermediate language like assembler. It is possible, but error-prone, and it is harder to see the flow of logic and ideas than in a high level language. The syntagma language compiles into nom, and its capabilities and ideas all come from nom and the pep machine. Since syntagma compiles to nom, it is possible to translate a syntagma program to any language for which a nom translator exists (eg: rust, go, lua, perl, dart, and others). This can be done as follows:

translating a syntagma script to lua

   pep -f syntagma input.syn > temp.pss
   pep -f /tr/nom.tolua.pss temp.pss > temp.lua
   echo "abcd" | lua temp.lua   # and run it with some input
  

The power of syntagma is that it permits the expression of grammatical ideas in a concise, readable language. A simple JSON “syntax checker” (/eg/s.json.syn) written in syntagma is about 20 lines long. The equivalent written in nom is at least 100 lines, and the logic of the grammar is much less evident in the nom script than in the syntagma script.
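
For readers who have not seen a grammar-driven syntax checker, here is a hand-written Python sketch of the idea. It is not the /eg/s.json.syn script and does not use syntagma or nom syntax; every name in it (`TOKEN`, `tokenize`, `parse_value`, `parse_pair`, `parse_members`, `check_json`) is hypothetical. It assumes the usual JSON rules: a value is a string, number, literal, array or object; an array is a comma separated list of values; an object is a comma separated list of string ":" value pairs.

```python
import re

# One token: punctuation, a string, a number, or a literal.
TOKEN = re.compile(r'''
    [ \t\r\n]*(?:
        (?P<punct>[{}\[\],:])
      | (?P<string>"(?:[^"\\\x00-\x1f]|\\["\\/bfnrt]|\\u[0-9a-fA-F]{4})*")
      | (?P<number>-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)
      | (?P<literal>true|false|null)
    )''', re.VERBOSE)

def tokenize(text):
    """Turn text into a list of (kind, lexeme) pairs or raise ValueError."""
    tokens, pos = [], 0
    text = text.rstrip(" \t\r\n")
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"bad token at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def expect(tokens, i, lexeme):
    """Consume one punctuation token or raise ValueError."""
    if i < len(tokens) and tokens[i] == ("punct", lexeme):
        return i + 1
    raise ValueError(f"expected {lexeme!r}")

def parse_members(tokens, i, closer, parse_item):
    """members := closer | item (',' item)* closer"""
    if i < len(tokens) and tokens[i] == ("punct", closer):
        return i + 1
    while True:
        i = parse_item(tokens, i)
        if i < len(tokens) and tokens[i] == ("punct", ","):
            i += 1
            continue
        return expect(tokens, i, closer)

def parse_pair(tokens, i):
    """pair := string ':' value"""
    if i >= len(tokens) or tokens[i][0] != "string":
        raise ValueError("expected a string key")
    return parse_value(tokens, expect(tokens, i + 1, ":"))

def parse_value(tokens, i):
    """value := string | number | literal | array | object"""
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    kind, lexeme = tokens[i]
    if kind in ("string", "number", "literal"):
        return i + 1
    if lexeme == "[":
        return parse_members(tokens, i + 1, "]", parse_value)
    if lexeme == "{":
        return parse_members(tokens, i + 1, "}", parse_pair)
    raise ValueError(f"unexpected {lexeme!r}")

def check_json(text):
    """Return True if text is exactly one value under the rules above."""
    try:
        tokens = tokenize(text)
        return parse_value(tokens, 0) == len(tokens)
    except ValueError:
        return False

print(check_json('{"a": [1, 2.5e3, true, null], "b": {}}'))
print(check_json('[1, 2,]'))
```

Each grammar rule survives only as a docstring here, and the code around it is bookkeeping. A grammar language lets the rules themselves be the program.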

Even at this early stage, I feel that in the future I will be writing many more scripts in syntagma than in nom. As usual this blog post has become too specific and technical, rather than philosophical and general.