A syntax for IRON

Back in 2004 I jotted some notes on requirements for IRON types.

Since then I've been drifting somewhat towards looser typing, in the Lisp model; having that as the underlying system provides for more expressive programming power, while optional type declarations as assertions, where required, can bring back the statically checkable safety and runtime efficiency of a strict type system.

But that's not what I'm posting about in the current insomniac haze - I've been thinking about written syntax.

IRON is a data model for values. Although I'm still deciding how the mutable data structures like queues fit into things (specifications of them are definitely needed for TUNGSTEN, but whether they count as part of IRON or not is something I'm still debating), I think I may have settled on a basic syntax for written values.

Now, the key requirement here is that IRON is, in the manner of s-expressions, usable to express just about anything, from source code to boring data. Creating a written data syntax that's pleasant enough to use day in and day out is quite a challenge. S-expressions come pretty close, but are deficient in a few areas. YAML is pretty good, but I wouldn't want to write source code in it.

The main thing I'm adding over s-expressions is Smalltalk-like syntax, which I will explain in detail below.

So, without further ado, here's a basic IRON syntax.

Representation

Herein I define representations of IRON values in terms of sequences of characters. How those characters are encoded is another matter entirely.

Atoms

Integers

IRON integers can be anything from negative to positive infinity, subject only to the constraints of implementation. An implementation must be able to support at least what you'd get from 32 two's-complement signed bits, but should be able to support arbitrary precision integers, if memory permits.

There are several representations available:

  • A sequence of decimal digits, with an optional leading - symbol, represents an integer in base 10
  • The sequence 0x followed by a sequence of hexadecimal digits represents an integer in base 16. Negative integers can be represented with a leading -0x, not 0x-.
  • The sequence 0b followed by a sequence of binary digits represents an integer in binary. Again, negative binary numbers are written starting with -0b, not 0b-.
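The three integer forms above can be sketched as a small Python reader. This is only an illustration: the name parse_integer is hypothetical, and accepting upper-case hexadecimal digits is an assumption the text does not settle.

```python
import re

# Optional leading "-", then 0x + hex digits, 0b + binary digits, or decimal digits.
# Assumption: upper-case hex digits are accepted; the text does not say either way.
_INT_RE = re.compile(r"(-?)(?:0x([0-9a-fA-F]+)|0b([01]+)|([0-9]+))")


def parse_integer(text):
    """Parse one written integer (hypothetical helper, not part of any IRON library)."""
    match = _INT_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not an integer literal: {text!r}")
    sign, hex_digits, bin_digits, dec_digits = match.groups()
    if hex_digits is not None:
        magnitude = int(hex_digits, 16)
    elif bin_digits is not None:
        magnitude = int(bin_digits, 2)
    else:
        magnitude = int(dec_digits, 10)
    # The sign always precedes the base prefix, so -0x1f is valid and 0x-1f is not.
    return -magnitude if sign else magnitude
```

Python integers are arbitrary precision, so this sketch comfortably exceeds the 32-bit minimum.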

Floating point numbers

Floats can be written in decimal, distinguishing themselves from integers with the presence of a . at some point; they may also optionally have an exponent appended, of the form e followed by a decimal integer with an optional leading -.
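As a rough sketch, the float form could be recognised in Python like this. The text leaves some details open, so the following are assumptions: a leading - is allowed on the number itself, at least one digit must sit next to the ., and the exponent marker is a lower-case e only. The name parse_float is hypothetical.

```python
import re

# Assumptions (not settled by the text): optional leading "-", at least one
# digit adjacent to the ".", and a lower-case "e" exponent with optional "-".
_FLOAT_RE = re.compile(r"-?(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:e-?[0-9]+)?")


def parse_float(text):
    """Parse one written float; the "." is what tells it apart from an integer."""
    if _FLOAT_RE.fullmatch(text) is None:
        raise ValueError(f"not a float literal: {text!r}")
    return float(text)
```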

Rational numbers

Rational numbers consist of a signed integer known as the numerator, and a positive non-zero integer known as the denominator. They are written using any of the above integer syntaxes, with the numerator first, then a /, then the denominator.
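A minimal Python sketch of reading a rational follows. The helper names are hypothetical, and returning a fractions.Fraction (which reduces to lowest terms) is an assumption; the text does not say whether IRON normalises rationals.

```python
from fractions import Fraction

_DIGITS = {16: "0123456789abcdefABCDEF", 10: "0123456789", 2: "01"}


def _parse_int(text):
    """Read an integer in decimal, 0x or 0b form, with an optional leading '-'."""
    sign = -1 if text.startswith("-") else 1
    body = text[1:] if sign < 0 else text
    if body.startswith("0x"):
        digits, base = body[2:], 16
    elif body.startswith("0b"):
        digits, base = body[2:], 2
    else:
        digits, base = body, 10
    if not digits or any(ch not in _DIGITS[base] for ch in digits):
        raise ValueError(f"not an integer literal: {text!r}")
    return sign * int(digits, base)


def parse_rational(text):
    """Parse numerator/denominator; the denominator must be positive and non-zero."""
    numerator_text, slash, denominator_text = text.partition("/")
    if not slash:
        raise ValueError(f"missing '/': {text!r}")
    numerator = _parse_int(numerator_text)
    denominator = _parse_int(denominator_text)
    if denominator <= 0:
        raise ValueError("denominator must be a positive non-zero integer")
    # Assumption: Fraction reduces to lowest terms, which the text does not require.
    return Fraction(numerator, denominator)
```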

Characters

An IRON character can be any Unicode character. I'm going to have to look in the Unicode specs for the exact terminology, but by 'character' I explicitly disallow control codes, things like BOMs, surrogates, and combining characters. Those are things used to implement characters and representational details of strings, not characters themselves.

A character can be represented in several ways:

  • A sequence of the form #' followed by the character then terminated by '. E.g., #'x'
  • A sequence of the form #ucs( followed by the UCS codepoint in hexadecimal then a closing ). E.g., #ucs(12ee)
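Both character forms can be read with a short Python sketch. The exact Unicode terminology for the excluded characters is left open in the text, so the mapping used here is purely an assumption: general categories Cc (control), Cs (surrogate), Cf (format, which covers the BOM) and M* (combining marks) are rejected. The name parse_character is hypothetical.

```python
import re
import unicodedata

_CHAR_RE = re.compile(r"#'(.)'|#ucs\(([0-9a-fA-F]+)\)", re.DOTALL)


def parse_character(text):
    """Parse #'x' or #ucs(HEX) into a one-character Python string."""
    match = _CHAR_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a character literal: {text!r}")
    literal, hex_code = match.groups()
    if literal is not None:
        char = literal
    else:
        code = int(hex_code, 16)
        if code > 0x10FFFF:
            raise ValueError("codepoint out of range")
        char = chr(code)
    # Assumption: these categories approximate "control codes, BOMs,
    # surrogates, and combining characters".
    category = unicodedata.category(char)
    if category in ("Cc", "Cs", "Cf") or category.startswith("M"):
        raise ValueError(f"U+{ord(char):04X} is not allowed as a character")
    return char
```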

Booleans

Booleans may have only two values, true and false.

Boolean true can be written #t and false #f.

Nil

The nil value can be written #nil.

Symbols

An IRON symbol is a complex beast in some ways, and very simple in others.

At heart, it is a list of strings. s-expression symbols are just strings, but IRON symbols are contained within namespaces, which are hierarchical sequences of names. In fact, IRON symbols are names in the CARBON directory, but that's irrelevant here.

However, the internal structure of symbols is generally unimportant, except in certain low-level operations. The important operation upon symbols is testing them for equality, and as such, implementations are encouraged to "intern" symbols into a global hash table, and just use pointers to each symbol's unique representation in memory as the symbol value, so that identity comparison is just a pointer test.

Two symbols should compare equal if and only if they are identical lists of identical strings (case sensitive).
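The interning idea can be sketched in Python as follows, with a module-level dictionary standing in for the global hash table. The names Symbol and intern_symbol are hypothetical.

```python
class Symbol:
    """A symbol is, at heart, a list of strings; stored here as a tuple."""

    __slots__ = ("path",)

    def __init__(self, path):
        self.path = path

    def __repr__(self):
        return "<" + "/".join(self.path) + ">"


# Stand-in for the global hash table of interned symbols.
_symbol_table = {}


def intern_symbol(*components):
    """Return the unique Symbol object for this exact (case-sensitive) path."""
    path = tuple(components)
    symbol = _symbol_table.get(path)
    if symbol is None:
        symbol = _symbol_table[path] = Symbol(path)
    return symbol
```

Because every path maps to exactly one object, equality reduces to an identity test: intern_symbol("foo", "bar") is intern_symbol("foo", "bar") holds, while a path differing only in case yields a different object.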

Symbols may not contain whitespace characters of any kind. No component of a symbol may start with a digit.

There are a few written syntaxes for symbols:

  • The full unambiguous absolute path: every element of the path, in order, separated by / characters, bracketed by < and >. Special characters (/, \, <, and >) in each path component can be escaped with prefixed \ characters.

  • Relative symbols, which refer to the current namespace, are just written as a sequence of one or more path components, separated by /, with :, / and \ characters escaped with prefixed \ characters. The actual value of the symbol is found by appending the supplied path components to the current namespace.

  • Prefixed symbols, which refer to a declared namespace, are written as the namespace name (which has the syntax of a symbol path component) followed by : then one or more path components, separated by /. As before, :, / and \ characters may be escaped with a prefixed \. The actual value of the symbol is found by appending the supplied path components to the namespace bound to the supplied namespace name.

The declaration of the current namespace, or named namespaces, is explained later.
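To make the three written forms concrete, here is a hedged Python sketch that resolves a written symbol to its full path. The function name resolve_symbol and its parameters are hypothetical, symbols are modelled as tuples of strings, and the treatment of edge cases (for instance an unescaped : after the first /, which is kept as a literal character here) is an assumption.

```python
def resolve_symbol(text, default_ns=(), namespaces=None):
    """Resolve absolute (<a/b>), relative (a/b) or prefixed (ns:a/b) syntax."""
    namespaces = namespaces or {}
    # Simplification: a relative symbol that happens to start with < and end
    # with an escaped > would be misread as absolute.
    absolute = len(text) >= 2 and text[0] == "<" and text[-1] == ">"
    body = text[1:-1] if absolute else text
    prefix = None
    components, current = [], []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            current.append(body[i + 1])  # escaped character, taken literally
            i += 2
            continue
        if ch == "/":
            components.append("".join(current))
            current = []
        elif ch == ":" and not absolute and prefix is None and not components:
            prefix = "".join(current)  # namespace name before the first ":"
            current = []
        else:
            current.append(ch)
        i += 1
    if current or components:
        components.append("".join(current))
    if absolute:
        base = ()
    elif prefix is not None:
        base = namespaces[prefix]  # KeyError if the name was never declared
    else:
        base = default_ns
    return tuple(base) + tuple(components)
```

For example, with the default namespace set to ("foo", "bar"), the relative symbol baz resolves to ("foo", "bar", "baz"), and with m bound to ("lib", "math"), m:sin resolves to ("lib", "math", "sin").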

The empty list

The empty list, as in s-expressions, is written as ().

Core compound types

Pairs

IRON has pairs, also known as cons cells, just as Lisp does.

Likewise, it has Lisp's syntax for pairs (and syntactic sugar for lists).

A basic pair can be written as ( followed by the first value, then ., then the second value, then ).

Linked lists of pairs are written just as in s-expressions: ( followed by a space-separated list of elements then ). Improper lists can be written as ( followed by a space-separated list of elements then . followed by the tail and then ).
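The pair and list notation can be illustrated with a small Python writer. Pair, EMPTY and write_value are hypothetical names; the empty list is modelled as an empty tuple, and atoms are simply printed with str, which is a simplification.

```python
class Pair:
    """A cons cell holding a first and a second value."""

    def __init__(self, first, second):
        self.first = first
        self.second = second


EMPTY = ()  # stand-in for the empty list


def write_value(value):
    """Write pairs using list sugar where the chain of seconds allows it."""
    if isinstance(value, Pair):
        parts = []
        while isinstance(value, Pair):
            parts.append(write_value(value.first))
            value = value.second
        if value != EMPTY:
            # The chain did not end in the empty list: improper list or plain pair.
            parts.extend([".", write_value(value)])
        return "(" + " ".join(parts) + ")"
    if value == EMPTY:
        return "()"
    return str(value)  # simplification: atoms are written via str()
```

A chain ending in the empty list comes out as (1 2 3), while Pair(1, 2) comes out as (1 . 2) and Pair(1, Pair(2, 3)) as (1 2 . 3).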

Maps

We inherit more from YAML than s-expressions in having a map type in the core. Maps are notionally sets of pairs, with the constraint that no two pairs in the set may share the first element. They might be implemented as a hash table, but there are many situations in which they should not be.

The written representation for them is { followed by zero or more elements, written as space-separated pairs of values; the elements themselves are separated by spaces, and the map is terminated with }.
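As a sketch of the reading side, suppose the parser has already collected the values found between { and } into a flat Python list. Pairing them up might look like the following; build_map is a hypothetical name, a dict is just one possible implementation, and treating a repeated key as an error (rather than letting the last one win) is an assumption.

```python
def build_map(values):
    """Pair up a flat sequence key1 value1 key2 value2 ... into a dict."""
    if len(values) % 2 != 0:
        raise ValueError("a map needs an even number of values")
    result = {}
    for key, value in zip(values[0::2], values[1::2]):
        # Assumption: a repeated first element is rejected, since no two
        # pairs in a map may share it.
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result
```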

Records

Records, however, inherit more from Smalltalk, at least in syntax.

The notional representation of a record is as a symbol followed by a list of values. The symbol is known as the 'type' of the record, and the list of zero or more values is known as the 'fields'.

However, the written representation is somewhat special, and attaches special meaning to colons in the last component of the type symbol.

The simplest representation is for records whose type symbol has no colons in the last component. These are written as a [ followed by the type symbol (in any of the symbol representations listed above), then some whitespace, then the space-separated list of fields, terminated with a ].

For example, [+ 1 2 3].

However, if the last component of the type symbol contains colons, then the last character of the component must itself be a colon, or else the symbol cannot be used as a record type symbol. The last component of the type symbol can be considered as a concatenated sequence of colon-terminated strings known as the field names.

The written representation of a record with such a type symbol again starts with [, followed by a fragment of the type symbol that stops after the first field name. This fragment can be written in any of the symbol syntaxes listed above, but it does not necessarily represent a symbol, merely part of one. After the fragment comes the value of the first field, separated by whitespace. There can then follow a whitespace-separated list of zero or more further fields, each represented as the field name (written as if it were a symbol relative to the current namespace, even though it is only a fragment of a symbol) followed by the field's value.

For example, [if: [= x 1] then: [' hello] else: [' goodbye]] represents a record with the type symbol if:then:else: (relative to the current namespace), and fields [= x 1], [' hello], and [' goodbye], each of which is itself a record.

Alternatively, [<foo/bar/baz:> 1 bam: 2] represents a record with the type symbol foo/bar/baz:bam: and fields 1 and 2.
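The way the field names recombine into the type symbol can be sketched in Python. Here assemble_record is a hypothetical helper that receives the last path component of the leading fragment (for example "if:" or "+") plus the remaining items already parsed from between the brackets; namespace resolution of the leading fragment is left out.

```python
def assemble_record(head, rest):
    """Return (last component of the type symbol, list of fields)."""
    if ":" not in head:
        # Plain form, e.g. [+ 1 2 3]: everything after the type is a field.
        return head, list(rest)
    if not head.endswith(":"):
        raise ValueError("a component containing colons must end with a colon")
    if len(rest) % 2 == 0:
        raise ValueError("expected a first value, then name/value pairs")
    names, fields = [head], [rest[0]]
    for name, value in zip(rest[1::2], rest[2::2]):
        if not (isinstance(name, str) and name.endswith(":")):
            raise ValueError(f"expected a field name, got {name!r}")
        names.append(name)
        fields.append(value)
    # Concatenating the colon-terminated field names rebuilds the component.
    return "".join(names), fields
```

Applied to the second example above, assemble_record("baz:", [1, "bam:", 2]) gives ("baz:bam:", [1, 2]).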

Vectors

Vectors are pretty similar to lists - but, don't forget, IRON has no explicit list type, just pairs. And pairs are, at best, a special case of vectors.

There are two basic kinds of vectors: general and homogeneous.

General vectors contain lists of arbitrary values, while homogeneous vectors restrict every element to a single type.

General vectors are written as #< followed by a whitespace-separated list of values, terminated by >. For example, #<a b 1 2 3>.

Homogeneous vectors are usually written as # followed by a type name then <, the whitespace-separated list of values, then >. Valid type names are float, symbol, u8, s8, u16, s16, u32, s32, u64, s64, or char. The u8 and friends represent particular limited integer types: unsigned or signed (two's complement) integers of the specified number of bits. For example, #u8<1 2 3 4>, #symbol<a b c>, or #char<#'a' #'b' #'c'>.
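For vectors with an integer type name such as u8 or s16, the permitted range follows directly from the bit width. A small Python sketch (integer_range and write_int_vector are hypothetical names, and elements are written in decimal only):

```python
_INT_TYPES = ("u8", "s8", "u16", "s16", "u32", "s32", "u64", "s64")


def integer_range(type_name):
    """Return (lowest, highest) for an unsigned or two's-complement type name."""
    if type_name not in _INT_TYPES:
        raise ValueError(f"not an integer type name: {type_name!r}")
    bits = int(type_name[1:])
    if type_name[0] == "s":
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def write_int_vector(type_name, values):
    """Write e.g. #u8<1 2 3 4>, rejecting elements outside the type's range."""
    low, high = integer_range(type_name)
    for value in values:
        if not low <= value <= high:
            raise ValueError(f"{value} does not fit in {type_name}")
    return f"#{type_name}<{' '.join(str(value) for value in values)}>"
```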

However, vectors of characters can be represented more compactly by just enclosing the verbatim character sequence in " characters, after escaping any \ or " characters in the string by prefixing them with \. For example, the last example can be written "abc".
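The escaping rule for the compact string form is small enough to sketch in both directions. write_string and read_string are hypothetical names; only \ and " are treated specially, as described above, and an escaped character is simply taken literally.

```python
def write_string(chars):
    """Escape backslashes and double quotes, then wrap in double quotes."""
    # Backslashes must be escaped first, or the quote escapes would be doubled.
    escaped = chars.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def read_string(text):
    """Inverse of write_string: strip the quotes and undo the escapes."""
    if len(text) < 2 or text[0] != '"' or text[-1] != '"':
        raise ValueError("a string must be enclosed in double quotes")
    body, out, i = text[1:-1], [], 0
    while i < len(body):
        if body[i] == "\\":
            if i + 1 >= len(body):
                raise ValueError("dangling escape at end of string")
            out.append(body[i + 1])
            i += 2
        elif body[i] == '"':
            raise ValueError("unescaped double quote inside string")
        else:
            out.append(body[i])
            i += 1
    return "".join(out)
```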

Other types

In general, other types can be represented by introducing extra syntax of the form # followed by a name then optionally (, some content, then ). Currently used names are t, f, nil, and ucs.

The top level

Given a sequence of characters to parse, the IRON written form parser considers the sequence to be some whitespace followed by a single value. After the value has been successfully parsed, any remaining characters will be left for later consumption if the sequence is a sequential-access stream. If the sequence is a fixed-length string or other random-access character sequence, then it is an error for anything other than whitespace to remain.
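The difference between the stream case and the fixed-length case can be sketched in Python. Everything here is illustrative: parse_top_level is a hypothetical name, the value parser is passed in as a callback taking (text, position) and returning (value, end position), whitespace is whatever Python's str methods consider whitespace, and toy_parse_digits is only a stand-in for a real value parser.

```python
def parse_top_level(text, parse_value, is_stream=False):
    """Skip leading whitespace, parse one value, then apply the trailing rule."""
    start = len(text) - len(text.lstrip())
    value, end = parse_value(text, start)
    if is_stream:
        # Sequential-access stream: leave the rest for later consumption.
        return value, end
    if text[end:].strip():
        raise ValueError("only whitespace may follow the value")
    return value, len(text)


def toy_parse_digits(text, pos):
    """Stand-in value parser that only understands decimal digits."""
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    if end == pos:
        raise ValueError("expected a value")
    return int(text[pos:end]), end
```

With these, parse_top_level(" 42 7", toy_parse_digits, is_stream=True) stops after 42 and reports where it stopped, while the same input treated as a fixed-length string is rejected.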

Namespaces

Symbols may be represented compactly in the IRON written form by reusing common prefixes of the symbol path, by declaring them as namespaces.

The IRON written form parser maintains a current namespace environment, consisting of a default namespace symbol and a map from namespace names to namespace symbols. At the start of parsing an IRON written form, the default namespace is <> (the empty symbol) and the map is {} (the empty map).

Before a value is parsed, the current namespace environment is saved, and restored after the parsing of the value.

The default namespace may be changed at any point in the parsing of a value where whitespace may appear, by introducing the syntax !defns followed by actual whitespace then a symbol which thereafter becomes the new default namespace. This namespace applies until overridden with another !defns declaration, or until parsing of the current value ends and the previous namespace environment is restored.

Named namespace bindings may be created, or existing ones overridden, by introducing the syntax !ns followed by actual whitespace, then a namespace name, some more actual whitespace, then a symbol which is thereafter bound to that name in the namespace map. Again, this binding will remain in effect until overridden, or the namespace environment is restored by the end of the current value.

Namespace bindings in the whitespace before a value in the top-level character sequence given to the parser are considered part of the value parsed, but ones found in whitespace separating elements of a record, map, list, vector, or other compound value are considered to be part of the compound value rather than simply part of the next value, and as such only "disappear from scope" at the end of that value.
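The save-and-restore behaviour of the namespace environment can be sketched in Python with a context manager. NamespaceEnvironment and its method names are hypothetical, and symbols are modelled as tuples of strings; a parser would enter value_scope around each value it parses and call defns or ns when it meets !defns or !ns.

```python
from contextlib import contextmanager


class NamespaceEnvironment:
    """Default namespace plus a map from namespace names to namespace symbols."""

    def __init__(self):
        self.default = ()   # the empty symbol <>
        self.bindings = {}  # the empty map {}

    def defns(self, symbol):
        """Effect of !defns: replace the default namespace."""
        self.default = tuple(symbol)

    def ns(self, name, symbol):
        """Effect of !ns: create or override a named binding."""
        self.bindings[name] = tuple(symbol)

    @contextmanager
    def value_scope(self):
        """Save the environment before a value is parsed and restore it after."""
        saved = (self.default, dict(self.bindings))
        try:
            yield self
        finally:
            self.default, self.bindings = saved
```

Declarations made while parsing the elements of a compound value happen inside that compound value's scope, so they stay visible to later elements and disappear only when the compound value ends.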

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales