Magic Pipes (by alaric)
One of the neat things in Unix is that you can make 'shell pipelines' of commands, from a suite of tools that come with most Unix systems, by feeding the output of one command into the input of another.
For example, the du
tool lists the space taken by each file or subdirectory beneath you if you run it like du -sk *
. My home directory yields:
608 SN9C325 Series WebCam Driver OSX 1[1].0.zip
1776 SONiX USB 2[1][1].0 PC Camera Driver OSX 1.1.zip
4 alaric-adventure
113120 chicken-scheme
152 dwm
39306 kitten-technologies.co.uk
13462 n2n
3472 netbsd.lzma
76 pdb
398596 perplexity-pkgsrc
22296 projects
2352 sonix201_for_MAC.rar
8 test
2 test.c
15892 test.ugarit
38 tmp
2 ugarit.conf
2 ugarit2.conf
But often I want to see them sorted by size. The sort
tool can do that - if invoked as sort -n
it will read lines in, find a number at the start of each line, and sort the lines by those numbers. So I can feed the output of du -sk *
into sort -n
with one command, du -sk * | sort -n
, where the |
tells the shell to pipe the output of the first command into the second. Lo, the output of the command is:
2 test.c
2 ugarit.conf
2 ugarit2.conf
4 alaric-adventure
8 test
38 tmp
76 pdb
152 dwm
608 SN9C325 Series WebCam Driver OSX 1[1].0.zip
1776 SONiX USB 2[1][1].0 PC Camera Driver OSX 1.1.zip
2352 sonix201_for_MAC.rar
3472 netbsd.lzma
13462 n2n
15892 test.ugarit
22296 projects
39306 kitten-technologies.co.uk
113120 chicken-scheme
398596 perplexity-pkgsrc
If I wanted it largest first, I could use sort -nr
instead to reverse the sort.
Now, the set of tools that come with Unix for making up such pipelines tend to be line-oriented, and have relatively crude support for operations within the line - sorting by a number midway through each line, as might be found in a complex table, is hard work, especially as there's a number of ways the table could be formatted (fixed width columns, CSV with any of a number of quoting conventions, etc).
And line-oriented tools are a pain when dealing with files written in formats that deserve the title "language", such as HTML, XML, or source code, as they tend to have complex structure that can be rearranged over lines to suit the asthetic tastes of the programmer.
Indeed, such a case came up on IRC lately, with somebody wanting to quickly analyse the dependencies between Chicken eggs, which are directories with certain specially-named metadata files inside that list the other eggs they require. He cobbled something together with grep
that found most of the dependencies, but doing it properly and handling dependency lists that went onto multiple lines and so on would need proper s-expression parsing.
But a thought came to me. Why not put together some shell commands to match the line-oriented grep
, sort
, sed
, awk
and friends, but treating input and output files as lists of s-expressions? They'd be trivial to write.
Here's an initial idea for a set:
filter
accepts a Scheme expression on the command line, and useseval
to evaluate it on each s-expression read from stdin in turn (binding the read s-expression toINPUT
), and outputs the s-expression if the expression evaluates to a true value.echo "a b c (foo)" | filter "(pair? INPUT)"
would output(foo)
.map
accepts a Scheme expression on the command line, and likewise useseval
to evaluate it on each s-expression read from stdin in turn (binding the read s-expression toINPUT
), and outputs the result of the evaluation.echo "(1 2) (3 4)" | map "(car INPUT)"
would output1 3
.fold
accepts two Scheme expressions on the command line, referred to askons
andknil
, and evaluatesknil
witheval
to obtain the initial accumulator value. It then evaluateskons
in turn for each s-expression read from stdin, withINPUT
bound to the read s-expression andSTATE
bound to the current accumulator value, and makes the result of the evaluation be the new accumulator value. When standard input reports EOF, the final accumulator value is output.echo "1 2 3" | fold "(+ INPUT STATE)" "0"
would output6
(the sum of the input numbers).sort
accepts two Scheme expressions on the command line, but both are optional, with defaults. One we shall callextract
which defaults toINPUT
, the other we shall callcompare
which defaults to(< A B)
. It reads all the s-expressions from standard input into a list, then sorts it, usingcompare
as the comparison function, withA
bound to the result of evaluatingextract
withINPUT
bound to the first s-expression being compared, andB
bound to the result of evaluatingextract
withINPUT
bound to the second s-expression being compared. The resulting list is then output. If we use-c
to introduce the comparison function and-e
to introduce the extraction function, thenecho "(food \"cheese\") (death \"burning\") (cheese \"edam\")" | sort -e "(cadr INPUT)" -c "(string<? A B)"
would output(death "burning") (food "cheese") (cheese "edam")
- sorting by the second element of each list as a string.flatten
reads s-expressions from its input, and if they are proper lists, outputs each member of the list in turn, otherwise outputs the whole s-expression. Note that it does no recurse down into lists, it only flattens the top level.echo "1 (2) (3 4) (5 (6))" | flatten
would output1 2 3 4 5 (6)
.group
accepts a Scheme expression on the command line, which is the grouping function. It reads an s-expression from the input and evaluates the grouping function withINPUT
bound to the input s-expression, and saves the result as the "current group value", and puts the input s-expression into a list called the "current group". It then iterates over the remaining s-expressions in the input, again evaluating the grouping function for each; if the result is different to the current group value then the current group is output as a list s-expression, then the new group value becomes the current group value and the current line becomes the only member of the new current group. Soecho "1 2 foo bar 3 4" | group "(symbol? INPUT)"
would output(1 2) (foo bar) (3 4)
.
Clearly, for security, we'll need to just use plain R5RS read
to input s-expressions - no special read syntax that might execute arbitrary code can be allowed, so a sandboxed readtable will be needed in implementations like Chicken that have a lot of special read syntax. But the full read
can be used for parsing the Scheme expressions supplied as arguments, in order to take advantage of the nice syntactic sugar they provide (as it's arbitrary code we're reading anyway).
As well as the arguments listed above, there should probably be arguments that all the tools accept, to set things like input and output character encodings for strings, to supply a Scheme expression to evaluate before anything else happens to define or load any utility functions, and to choose between print
as the output function (with a newline after each s-expression) or a pretty-printer, etc. Also, anything the Scheme expressions the user provides output to standard-output-port
should be sent to stderr so they appear on the console rather than ending up in the output stream.
Can anyone think of any important tools I've missed? map
does the job of much of sed
, awk
, cut
, etc. in line-based shell pipelines. (consider how the match
pattern-matching macro inside a map
compares to sed
...) I'm wondering about an extract
that uses some abbreviated syntax (something like an xpath?) to extract a subexpression out of every s-expression - there could be multiple matches, so it should output a list of matches for every input s-expression, meaning that inputs that have no matches are represented by an empty list in the output. If this information is not required, a quick extra flatten
in your pipeline will discard it.
Perhaps we need some tools to convert from non-sexpr notations and back again - perhaps a parse
tool that takes a regexp and a Scheme expression, matches each input line against the regexp, then outputs the result of evaluation the Scheme expression with $1
etc. bound to the groups within the regexp, and another tool that outputs nothing to stdout itself, but evaluates a command-line-supplied Scheme expression upon each line with standard-output-port
actually sending to stdout, so people can use things like the fmt
egg to format s-expressions into raw text output (or just display
for simple needs).
Maybe the next step would be s-expression-outputting versions of ls
and ps
to make them easier to handle 🙂
By Ben, Thu 25th Jun 2009 @ 11:51 am
Check out Windows PowerShell. Instead of returning text, it returns objects, which can then be operated on by the next command in sequence.
Actually looks quite shiny.
By @ndy, Tue 30th Jun 2009 @ 7:54 am
sort -t, -k3
In the cheese example, what is cadr? | sort -e "(cadr INPUT)" -c "(string<? A B)" Should it be cdr?
By alaric, Tue 30th Jun 2009 @ 9:41 am
Ah, cadr is a shorthand function for (car (cdr ...))
If INPUT is (a b), then (car INPUT) is a, (cdr INPUT) is (b), and (cadr INPUT) is (car '(b)) which is just b.
By alaric, Thu 13th Aug 2009 @ 10:12 am
foof (author of the aforementioned fmt egg) made a good suggestion on IRC: As well as supporting the
-e
flag for a sort expression, for the common case of-e "(foo INPUT)"
, we can have a shorthand of-k foo
, wherefoo
is any function (eg,cadr
).By Michal, Thu 6th Sep 2012 @ 8:47 am
Hello I have found name of my webcam in your example. This is only one place where I can see something related to the camera and OSX. Do you know anything about driver for the camera for MAC OSX? Please give me a hint where I can look for driver.