Magic Pipes (by )

One of the neat things in Unix is that you can make 'shell pipelines' of commands, from a suite of tools that come with most Unix systems, by feeding the output of one command into the input of another.

For example, the du tool lists the space taken by each file or subdirectory beneath you if you run it like du -sk *. My home directory yields:

608     SN9C325 Series WebCam Driver OSX 1[1].0.zip
1776    SONiX USB 2[1][1].0 PC Camera Driver OSX 1.1.zip
4       alaric-adventure
113120  chicken-scheme
152     dwm
39306   kitten-technologies.co.uk
13462   n2n
3472    netbsd.lzma
76      pdb
398596  perplexity-pkgsrc
22296   projects
2352    sonix201_for_MAC.rar
8       test
2       test.c
15892   test.ugarit
38      tmp
2       ugarit.conf
2       ugarit2.conf

But often I want to see them sorted by size. The sort tool can do that - if invoked as sort -n it will read lines in, find a number at the start of each line, and sort the lines by those numbers. So I can feed the output of du -sk * into sort -n with one command, du -sk * | sort -n, where the | tells the shell to pipe the output of the first command into the second. Lo, the output of the command is:

2       test.c
2       ugarit.conf
2       ugarit2.conf
4       alaric-adventure
8       test
38      tmp
76      pdb
152     dwm
608     SN9C325 Series WebCam Driver OSX 1[1].0.zip
1776    SONiX USB 2[1][1].0 PC Camera Driver OSX 1.1.zip
2352    sonix201_for_MAC.rar
3472    netbsd.lzma
13462   n2n
15892   test.ugarit
22296   projects
39306   kitten-technologies.co.uk
113120  chicken-scheme
398596  perplexity-pkgsrc

If I wanted it largest first, I could use sort -nr instead to reverse the sort.

Now, the set of tools that come with Unix for making up such pipelines tend to be line-oriented, and have relatively crude support for operations within the line - sorting by a number midway through each line, as might be found in a complex table, is hard work, especially as there's a number of ways the table could be formatted (fixed width columns, CSV with any of a number of quoting conventions, etc).

And line-oriented tools are a pain when dealing with files written in formats that deserve the title "language", such as HTML, XML, or source code, as they tend to have complex structure that can be rearranged over lines to suit the asthetic tastes of the programmer.

Indeed, such a case came up on IRC lately, with somebody wanting to quickly analyse the dependencies between Chicken eggs, which are directories with certain specially-named metadata files inside that list the other eggs they require. He cobbled something together with grep that found most of the dependencies, but doing it properly and handling dependency lists that went onto multiple lines and so on would need proper s-expression parsing.

But a thought came to me. Why not put together some shell commands to match the line-oriented grep, sort, sed, awk and friends, but treating input and output files as lists of s-expressions? They'd be trivial to write.

Here's an initial idea for a set:

  • filter accepts a Scheme expression on the command line, and uses eval to evaluate it on each s-expression read from stdin in turn (binding the read s-expression to INPUT), and outputs the s-expression if the expression evaluates to a true value. echo "a b c (foo)" | filter "(pair? INPUT)" would output (foo).
  • map accepts a Scheme expression on the command line, and likewise uses eval to evaluate it on each s-expression read from stdin in turn (binding the read s-expression to INPUT), and outputs the result of the evaluation. echo "(1 2) (3 4)" | map "(car INPUT)" would output 1 3.
  • fold accepts two Scheme expressions on the command line, referred to as kons and knil, and evaluates knil with eval to obtain the initial accumulator value. It then evaluates kons in turn for each s-expression read from stdin, with INPUT bound to the read s-expression and STATE bound to the current accumulator value, and makes the result of the evaluation be the new accumulator value. When standard input reports EOF, the final accumulator value is output. echo "1 2 3" | fold "(+ INPUT STATE)" "0" would output 6 (the sum of the input numbers).
  • sort accepts two Scheme expressions on the command line, but both are optional, with defaults. One we shall call extract which defaults to INPUT, the other we shall call compare which defaults to (< A B). It reads all the s-expressions from standard input into a list, then sorts it, using compare as the comparison function, with A bound to the result of evaluating extract with INPUT bound to the first s-expression being compared, and B bound to the result of evaluating extract with INPUT bound to the second s-expression being compared. The resulting list is then output. If we use -c to introduce the comparison function and -e to introduce the extraction function, then echo "(food \"cheese\") (death \"burning\") (cheese \"edam\")" | sort -e "(cadr INPUT)" -c "(string<? A B)" would output (death "burning") (food "cheese") (cheese "edam") - sorting by the second element of each list as a string.
  • flatten reads s-expressions from its input, and if they are proper lists, outputs each member of the list in turn, otherwise outputs the whole s-expression. Note that it does no recurse down into lists, it only flattens the top level. echo "1 (2) (3 4) (5 (6))" | flatten would output 1 2 3 4 5 (6).
  • group accepts a Scheme expression on the command line, which is the grouping function. It reads an s-expression from the input and evaluates the grouping function with INPUT bound to the input s-expression, and saves the result as the "current group value", and puts the input s-expression into a list called the "current group". It then iterates over the remaining s-expressions in the input, again evaluating the grouping function for each; if the result is different to the current group value then the current group is output as a list s-expression, then the new group value becomes the current group value and the current line becomes the only member of the new current group. So echo "1 2 foo bar 3 4" | group "(symbol? INPUT)" would output (1 2) (foo bar) (3 4).

Clearly, for security, we'll need to just use plain R5RS read to input s-expressions - no special read syntax that might execute arbitrary code can be allowed, so a sandboxed readtable will be needed in implementations like Chicken that have a lot of special read syntax. But the full read can be used for parsing the Scheme expressions supplied as arguments, in order to take advantage of the nice syntactic sugar they provide (as it's arbitrary code we're reading anyway).

As well as the arguments listed above, there should probably be arguments that all the tools accept, to set things like input and output character encodings for strings, to supply a Scheme expression to evaluate before anything else happens to define or load any utility functions, and to choose between print as the output function (with a newline after each s-expression) or a pretty-printer, etc. Also, anything the Scheme expressions the user provides output to standard-output-port should be sent to stderr so they appear on the console rather than ending up in the output stream.

Can anyone think of any important tools I've missed? map does the job of much of sed, awk, cut, etc. in line-based shell pipelines. (consider how the match pattern-matching macro inside a map compares to sed...) I'm wondering about an extract that uses some abbreviated syntax (something like an xpath?) to extract a subexpression out of every s-expression - there could be multiple matches, so it should output a list of matches for every input s-expression, meaning that inputs that have no matches are represented by an empty list in the output. If this information is not required, a quick extra flatten in your pipeline will discard it.

Perhaps we need some tools to convert from non-sexpr notations and back again - perhaps a parse tool that takes a regexp and a Scheme expression, matches each input line against the regexp, then outputs the result of evaluation the Scheme expression with $1 etc. bound to the groups within the regexp, and another tool that outputs nothing to stdout itself, but evaluates a command-line-supplied Scheme expression upon each line with standard-output-port actually sending to stdout, so people can use things like the fmt egg to format s-expressions into raw text output (or just display for simple needs).

Maybe the next step would be s-expression-outputting versions of ls and ps to make them easier to handle 🙂

5 Comments

  • By Ben, Thu 25th Jun 2009 @ 11:51 am

    Check out Windows PowerShell. Instead of returning text, it returns objects, which can then be operated on by the next command in sequence.

    Actually looks quite shiny.

  • By @ndy, Tue 30th Jun 2009 @ 7:54 am

    sort -t, -k3

    In the cheese example, what is cadr? | sort -e "(cadr INPUT)" -c "(string<? A B)" Should it be cdr?

  • By alaric, Tue 30th Jun 2009 @ 9:41 am

    Ah, cadr is a shorthand function for (car (cdr ...))

    If INPUT is (a b), then (car INPUT) is a, (cdr INPUT) is (b), and (cadr INPUT) is (car '(b)) which is just b.

  • By alaric, Thu 13th Aug 2009 @ 10:12 am

    foof (author of the aforementioned fmt egg) made a good suggestion on IRC: As well as supporting the -e flag for a sort expression, for the common case of -e "(foo INPUT)", we can have a shorthand of -k foo, where foo is any function (eg, cadr).

  • By Michal, Thu 6th Sep 2012 @ 8:47 am

    Hello I have found name of my webcam in your example. This is only one place where I can see something related to the camera and OSX. Do you know anything about driver for the camera for MAC OSX? Please give me a hint where I can look for driver.

Other Links to this Post

RSS feed for comments on this post.

Leave a comment

WordPress Themes

Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales
Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales