2010/12/30

Perl - IO::Async - version 0.34

There have been four releases of IO::Async since I last wrote about version 0.30. Here's a rough summary of the more important changes and additions between then and version 0.34:
  • New Notifier class IO::Async::Timer::Absolute, to invoke events at a fixed point in the future.

  • New Notifier class IO::Async::PID, to watch a child process for exit(2).

  • New Notifier class IO::Async::Protocol::LineStream, to implement stream protocols that use lines of plain text.

  • New method on IO::Async::Protocol that wraps connect(2) functionality, allowing for simpler network protocol client modules.

  • IO::Async::Loop->connect's on_connect_error and IO::Async::Loop->listen's on_listen_error continuations now both receive errno information.

  • New direct name resolution methods on IO::Async::Resolver for getaddrinfo(3) and getnameinfo(3). The resolver is now directly accessible from the IO::Async::Loop.

  • IO::Async::Resolver supports deadline timeouts.

  • IO::Async::Stream->write supports taking a CODE reference to dynamically generate data for the stream on-demand (see the sketch after this list).

  • IO::Async::Stream->write supports an on_flush callback.

  • The IO::Async::Loop->new magic constructor now caches the loop. This is useful for wrapping modules, other event system integration, etc.

  • Documentation has been rearranged to add new EVENTS sections, documenting the events that Notifier classes can fire either as callbacks in coderefs, or as methods on subclasses.

  • Various bugfixes and other documentation additions.
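To illustrate the last two of those Stream changes, a brief sketch, assuming an already-constructed $stream (as I read the IO::Async::Stream docs, the CODE reference form is invoked whenever the stream wants more data, and returning undef indicates the end of the data):
# Dynamically-generated data: shift on the emptied array eventually
# returns undef, which ends this write
my @lines = ( "first\n", "second\n", "third\n" );
$stream->write( sub { return shift @lines } );

# An on_flush callback, invoked once this data has actually been written
$stream->write( "goodbye\n",
   on_flush => sub { print "All pending data has been written\n" },
);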

2010/12/24

Event loops and Jenga; or 24 Advent Calendar Events in One Go

There are many event loop systems in Perl. Do they play together?

I was thinking about this recently, at my LPW2010 talk about IO::Async. At the hackathon the following day, I managed to write IO::Async::Loop::POE; a way to run IO::Async atop POE.

So I started thinking further: if you can run one event loop system on top of another, how high can we stack them? Can we build a tower, putting each atop the previous, growing ever taller? Each new layer would be harder to add, increasing the chances of the whole thing crashing down. Sort of like a Jenga tower.

So what would a Perl event loop Jenga tower look like?

My attempt looks like this: (326 lines, jenga.pl)

The output looks something like this:
$ perl jenga.pl
AnyEvent resolved 127.0.0.1:80
Glib reads Hello world!
POE reads Hello world!
POE resolved 127.0.0.1:80
IO::Async reads Hello world!
AnyEvent reads Hello world!
AnyEvent listener accepted
POE listener accepted
IO::Async resolved 127.0.0.1:80
AnyEvent connected received
POE connected received
IO::Async listener accepted
IO::Async connected received
Glib child exited 0
POE child exited 0
IO::Async child exited 0
AnyEvent child exited 0
Glib timer
POE timer
IO::Async timer
AnyEvent timer
^CIO::Async SIGINT
AnyEvent SIGINT
POE SIGINT
Stopping...
That's 24 events. Count them. It combines Glib, POE, IO::Async and AnyEvent. It performs a basic filehandle read, a child process watch, and a timed wait in each of these four systems. Because Glib lacks signal watching, only the other three perform this. The other three are also used to perform name resolution, socket listening, and socket connecting.

Everyone seems to be doing Advent Calendar blogs this year. 24 daily posts, each showing one small thing. Someone suggested I should write a Perl Event systems advent calendar. So perhaps here, consider this to be one. Except it has 24 windows all in one go.

As it turns out, it's possible to make this tower a little higher. There's a module to run Event beneath Glib; that is, it replaces the core polling function of Glib to use Event instead. And I suspect it may just about be possible to run Tk on Glib, and then POE on Tk.

At some point in the new year, I have some plans to turn this one-program script into a more useful resource of examples and translations. The Rosetta Stone for Unix provides a cross-reference for looking up Unix concepts between different systems. I feel that a similar attempt at Perl event loops could be quite useful too.

2010/12/19

Perl - CPS - version 0.11

CPS is a Perl module that provides several utilities for writing programs in Continuation Passing Style.

In a nutshell, CPS is a method of control flow within a program, which uses values passed around as continuations as a replacement for the usual call/return semantics. In Perl, this is done by passing CODE references; calling a function and passing in a CODE reference to be invoked with the result, rather than having the function return it. While at first this doesn't seem very useful, most of its power comes from the observation that the function doesn't need to invoke its continuation immediately; if it performs some IO operation or similar, it can perform this in an asynchronous fashion, invoking its continuation later. This style of coding is often associated with nonblocking or asynchronous event-driven programming. It is typical of such event systems as IO::Async.
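As a tiny illustration of the difference, here is a sketch; kmangle is a hypothetical continuation-passing counterpart of an ordinary mangle function:
# Ordinary call/return: the result comes back to the caller
my $wibble = mangle( $frob );

# Continuation-passing style: kmangle invokes the CODE reference with the
# result, perhaps not until some asynchronous IO has later completed
kmangle( $frob, sub {
   my ( $wibble ) = @_;
   ...
} );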

A typical problem with implementing CPS control flow is that all of the usual Perl control-flow mechanisms are built for immediate call/return semantics, which the use of CPS gets in the way of. The CPS module addresses this by providing a set of functions that replace the usual Perl control-flow keywords. For example, the Perl control structure of a foreach loop
foreach my $frob ( @frobs ) {
   my $wibble = mangle( $frob );
   say "Mangled $frob looks like $wibble";
}
say "All done";
becomes a call to CPS::kforeach
use CPS qw( kforeach );

kforeach( \@frobs, sub {
   my ( $frob, $knext ) = @_;
   kmangle( $frob, sub {
      my ( $wibble ) = @_;
      say "Mangled $frob looks like $wibble";
      $knext->();
   } );
}, sub {
   say "All done";
} );
We haven't really gained anything by doing this though. If the process of mangling a frob involves some IO tasks, perhaps talking to some remote server, then we'll spend most of our time waiting for each response, when we could be sending multiple requests and waiting on all their responses at once. We could likely save some time by running them concurrently.

I gave a talk at LPW2010 about Continuation Passing Style and CPS, the slides of which are available here.

After discussing CPS and IO::Async at LPW, I was talking with mst about his IPC::Command::Multiplex module. He came up with the idea for another control-flow function; a combination of kpar and kforeach, which I called kpareach. In use it looks exactly like kforeach, except that it starts the loop body for each item all at the same time, in parallel, rather than waiting for each one to finish before invoking the next.
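In use it might look like this sketch, reusing the hypothetical kmangle from the kforeach example above:
use CPS qw( kpareach );

kpareach( \@frobs, sub {
   my ( $frob, $kdone ) = @_;
   kmangle( $frob, sub {
      my ( $wibble ) = @_;
      say "Mangled $frob looks like $wibble";
      $kdone->();
   } );
}, sub {
   say "All done";
} );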

This is a new addition to CPS version 0.11, which is now on CPAN. It is also one of the first new control-flow structures that doesn't have a direct plain-Perl counterpart; a demonstration of the usefulness of CPS in event-driven programming.

2010/11/21

General Updates

I have no big specific updates today. Instead, a list of lots of little things I've been working on:
  • IO::Socket::IP now has preliminary non-blocking connect support (version 0.05_003). This isn't quite a perfect solution because of blocking name resolvers, but see also Net::LibAsyncNS below.

  • Created a new CPAN dist, wrapping libasyncns, called Net::LibAsyncNS. This allows a simple way to asynchronise name resolver lookups.

  • Have fixed a few bugs in IO::KQueue, relating to dodgy handling of Perl scalars in the udata field. Some memory leak bugs still exist, but I believe these to be the kernel's fault. See below.

  • Spent some time on freebsd-hackers@ arguing about kqueue and managing user data pointers. Long story short, I believe the kqueue API itself is missing a feature, making it impossible to wrap generically from any high-level language, or even properly from C libraries.

  • Both talks I submitted for LPW2010 were accepted; on the subjects of CPS and IO::Async.

  • Net::Async::HTTP now has SSL support and can stream response body content as it arrives, rather than waiting for the whole response (version 0.08).
That's all for a quick update, but I may write about any or all of these topics in more detail later...

2010/10/31

Perl - Test::Identity

Today I uploaded a new module to CPAN; Test::Identity. It's possibly the quickest module I've ever written, from when I decided to write it, to when it was actually uploaded:
(18:02) sub Test::Identity::identical { is refaddr $_[0], refaddr $_[1], $_[2] } <== I'm about to write such in a module, unless anyone can suggest me a module that already has it
...
(19:21) * GumbyPAN CPAN Upload: Test-Identity-0.01 by PEVANS
I won't spend a long time explaining why; I'll just quote the docs:

This module provides a single testing function, identical. It asserts that a given reference is as expected; that is, it either refers to the same object or is undef. It is similar to Test::More::is except that it uses refaddr, ensuring that it behaves correctly even if the references under test are objects that overload stringification or numification.
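A typical successful use might look like this sketch (My::Class and get_instance are hypothetical):
use Test::More tests => 1;
use Test::Identity;

my $obj = My::Class->new;

identical( get_instance(), $obj, 'get_instance returns the singleton object' );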

It also provides better diagnostics if the test fails:
$ perl -MTest::More=tests,1 -MTest::Identity -e'identical [], {}'
1..1
not ok 1
# Failed test at -e line 1.
# Expected an anonymous HASH ref, got an anonymous ARRAY ref
# Looks like you failed 1 test of 1.

$ perl -MTest::More=tests,1 -MTest::Identity -e'identical [], []'
1..1
not ok 1
# Failed test at -e line 1.
# Expected an anonymous ARRAY ref to the correct object
# Looks like you failed 1 test of 1.

2010/10/18

Perl - IO::Socket::IP

IO::Socket::IP is a subclass of IO::Socket that provides a protocol-independent way to use IPv4 and IPv6 sockets. It provides an API compatible with the IPv4-only IO::Socket::INET, but does so in a way that ensures properly transparent IPv6 support.

The following example shows it has an identical API to the ::INET module:
use IO::Socket::IP;

my $sock = IO::Socket::IP->new(
   PeerHost => "www.google.com",
   PeerPort => "www",
) or die "Cannot construct socket - $@";
At this point, $sock is just another IO::Socket-derived filehandle, and supports all the usual methods and IO functionality. The only difference here is that, where IO::Socket::INET would have used a legacy gethostbyname call and made a PF_INET socket, IO::Socket::IP will use getaddrinfo (via Socket::GetAddrInfo), and will use either PF_INET or PF_INET6 as appropriate.

It's not yet a complete 100% API clone of ::INET though. While it supports all the methods, there are a few constructor arguments not yet supported, namely Blocking and Timeout. The way that getaddrinfo returns a list of candidate addresses, to be tried in order, makes nonblocking support hard to do, and complicates the model of what a timeout really means. For nonblocking connect support, better solutions already exist, such as IO::Async's Connector, which has always supported IPv6 via getaddrinfo. As for timeouts, eventually IO::Socket::IP should support them, but for now a local'ised $SIG{ALRM} and alarm call should suffice.
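For example, a minimal sketch of that alarm-based timeout workaround:
my $sock = eval {
   local $SIG{ALRM} = sub { die "Timed out\n" };
   alarm( 10 );
   IO::Socket::IP->new(
      PeerHost => "www.google.com",
      PeerPort => "www",
   ) or die "Cannot construct socket - $@";
};
alarm( 0 ); # cancel the pending alarm whether we connected or not
die $@ if !$sock;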

For almost all use cases, switching to using IO::Socket::IP should be a simple matter, because of the API similarities. Just adding an extra dependency (because ::IP isn't core), and substituting the package name in source should be enough.

Finally, in case you don't want to pull in an extra hard dependency, you might consider the following fragment I've used quite successfully:
my $class = eval { require IO::Socket::IP   and "IO::Socket::IP"   } ||
            do   { require IO::Socket::INET and "IO::Socket::INET" };

$socket = $class->new(
   PeerHost => $host,
   PeerPort => $port,
) or die "Cannot connect - $@\n";
There. Now you have no excuse for not being IPv6-ready.

2010/09/23

Perl - IO::Async - version 0.30

Yesterday, I put the next version of IO::Async on CPAN; version 0.30. This was primarily an update to add some new features, though also a few minor bugfixes and documentation updates were included too. Here I want to focus on a few of these new features.

The first of these new features is nothing groundbreaking in itself, but feeds into the others. It's simply the addition of IO::Async::Socket, a notifier subclass to contain a socket that isn't necessarily a stream (primarily SOCK_DGRAM or SOCK_RAW sockets such as UDP, PF_PACKET or PF_NETLINK). This neatens up a few rough edges with trying to put such sockets directly in IO::Async::Handle objects.

The second main new feature is the creation of the IO::Async::Protocol class, and IO::Async::Protocol::Stream subclass. These derive directly from IO::Async::Notifier rather than IO::Async::Handle, and are intended to be abstract containers of code, and not perform any IO operations directly. Instead, they contain a Handle or Stream object as a child notifier. By exposing an API identical to IO::Async::Stream, the IO::Async::Protocol::Stream should be a drop-in replacement for any modules trying to implement a network protocol.

With the addition of IO::Async::SSL, not every stream-like connection can be represented by IO::Async::Stream, so separating the transport layer from the protocol layer is required. This wasn't possible by subclassing, whereas object containment makes it much simpler.

Net::Async::FTP, Net::Async::HTTP, and Net::Async::IRC have all been updated to use it, and most other use cases should be simple to change.

The final main change is that $loop->connect and IO::Async::Listener now support direct on_stream or on_socket continuations, which will be passed an instance of Stream or Socket directly, rather than requiring the invoked code to wrap one. This can then easily be configured as an IO::Async::Protocol's transport.
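For example, a sketch of connecting a Protocol using on_stream; the host name and error handlers here are invented for illustration:
$loop->connect(
   host     => "irc.example.com",
   service  => "6667",
   socktype => 'stream',

   on_stream => sub {
      my ( $stream ) = @_; # a ready-constructed IO::Async::Stream

      $protocol->configure( transport => $stream );
      $loop->add( $protocol );
   },

   on_resolve_error => sub { die "Cannot resolve\n" },
   on_connect_error => sub { die "Cannot connect\n" },
);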

Having made this change, it leads the way to transparent SSL support across all protocols, and possibly other concerns like SOCKS proxies, by extending the arguments to $loop->connect or Listener. But that's for another post...

Finally, I should announce that I've now started a channel on irc.perl.org called #ioasync, as the official IRC home for IO::Async. Feel free to drop by if you have any issues, comments, questions,...

2010/09/20

Perl - overload::substr

overload allows an object class to provide methods which Perl should use to implement certain operators, like numerical addition or string concatenation. One operator that overload doesn't allow to be provided is substr.

overload::substr allows this to be overloaded. This allows objects that behave like strings to specify to Perl how they will handle the substr operator.
$ cat example.pl 
#!/usr/bin/perl

use strict;
use feature qw( say );

package ExampleString;

use overload::substr;

sub new { return bless [ @_ ]; }

sub _substr
{
   my $self = shift;
   my ( $offs, $len, $replace ) = @_;

   return sprintf ">> %s between %d and %d <<", $self, $offs, $offs+$len;
}

package main;

my $str = ExampleString->new( "Hello, world" );

say substr( $str, 2, 5 );

$ perl example.pl
>> ExampleString=ARRAY(0x86dd9c8) between 2 and 7 <<
The module is still in its early days, but the basics appear to be working on all Perl versions back to 5.8. I also want to try extending it, so that split() and regexp matches with m// and substitutions with s/// also use the substr operation. The identity that
$1 eq substr( $str, $-[1], $+[1] - $-[1] )
is sure to be useful here.

I need a good example to show it off with sometime. I have in mind a string-alike object with real positional cursors, which remember their contextual position even after edits in other parts of the string. But more on that later...

2010/09/06

Module name suggestions: A proper IO::Socket for IPv4/IPv6 duality

I currently don't have a good name for a module I'd like to write; I think it is very much required right now.

We have IO::Socket::INET. It wraps PF_INET, thus making it IPv4 only.

We have IO::Socket::INET6. It wraps either PF_INET or PF_INET6, despite its name. It also uses Socket6, thus restricting it to only working on machines capable of supporting IPv6.

Thus any author wanting to write code to communicate with the internet (apparently that's some new fad everyone's talking about this week) is presented a moral dilemma: support IPv6 at the cost of not working on older v4-only machines, or support older machines but be incapable of using IPv6.

I originally partially solved this problem some years ago by the creation of Socket::GetAddrInfo, a module that presents the interface of RFC2553's getaddrinfo(3) and getnameinfo(3) functions. This however is not enough for actually connecting and using sockets.

I'd therefore like to propose a new IO::Socket subclass that uses these and only these functions, for converting between addresses and name/service pairs.
use IO::Socket::YourNameHere;

my $sock = IO::Socket::YourNameHere->new(
   PeerHost    => "www.google.com",
   PeerService => "www",
);

printf "Now connected to %s:%s\n", $sock->peerhost, $sock->peerservice;

...
Since it would use Socket::GetAddrInfo, it can transparently support IPv4 or IPv6. Since it would only use Socket::GetAddrInfo, it will work in a v4-only mode on machines incapable of supporting IPv6, and will not be restricted to only IPv4 or IPv6 if and when some new addressing family comes along to replace IPv6 one day; as v6 is now trying to do with v4.

In order to provide an easy transition period, I'd also support additional IO::Socket::INET options where they still make sense; e.g. accepting {Local/Peer}Port as a synonym for {Local/Peer}Service. The upshot here ought to be that you can simply
sed -i s/IO::Socket::INET/IO::Socket::YourNameHere/
and suddenly your code will JustWork on IPv6 in a good >90% of cases.

Can anyone suggest me a better module name for this?


Edit 2010/09/07: We seem to be settling on IO::Socket::IP for this currently.


Edit 2010/09/23: We did indeed settle on IO::Socket::IP; this is now up on CPAN, and will be the subject of a future posting...


This was cross-posted from module-authors@perl.

2010/08/15

Test to assert object identity

I've just copypasted the following test function into about the fifth different test script:
use Scalar::Util qw( refaddr );

sub identical
{
   my ( $got, $expected, $name ) = @_;

   my $got_addr = refaddr $got;
   my $exp_addr = refaddr $expected;

   ok( !defined $got_addr && !defined $exp_addr ||
       $got_addr == $exp_addr,
       $name ) or
      diag( "Expected $got and $expected to refer to the same object" );
}
Rather than continuing to copypaste it around some more, can anyone suggest a standard Test:: module that contains it? Failing that, if it's really honestly the case that nobody has yet felt it necessary to provide one, could someone suggest a suitable module to contain it?

This behaviour cannot be implemented using, say, is( $obj, $expected ), because that compares the two values directly, which of course can misbehave for any object that overloads stringification, numification, or comparison operators.

2010/08/10

Perl - Config::XPath - new version 0.16

Config::XPath is a Perl module for accessing configuration files using XPath queries. It provides some wrapping around XML::XPath for convenience in using a single config file, and easily fetching string, list or map values from it. It plays nicely alongside, for example, Module::PluginFinder (about which I shall write more another day) for easily building powerful configuration-driven plugin-based programs.
use Config::XPath;
use Module::PluginFinder;

my $conf = Config::XPath->new( filename => 'foomangler.conf' );
my $finder = Module::PluginFinder->new(
   search_path => 'FooMangler::Plugin',
   typefunc    => 'TYPE',
);

my %plugins;

foreach my $plugin_conf ( $conf->get_sub_list( '/plugin' ) ) {
   my $name = $plugin_conf->get_string( '@name' );
   my $type = $plugin_conf->get_string( '@type' );

   $plugins{$name} = $finder->construct( $type, $plugin_conf );
}
Given a config file that perhaps looks like
<foomangler>
  <plugin type="hello" name="hello_world">
    <message>Hello, world</message>
  </plugin>
</foomangler>
We can implement a plugin for this system quite simply, and have it be automatically discovered by the plugin system, instances created, and passed in its configuration from the config file:
package FooMangler::Plugin::Hello;
use constant TYPE => "hello";

sub new
{
   my $class = shift;
   my ( $config ) = @_;

   my $message = $config->get_string( 'message' );
   ...
}

As well as providing one-shot reading support, it also has a subclass, Config::XPath::Reloadable, which allows for convenient reloading of config files. It keeps track of which XML nodes it has already seen, based on some defined key attribute, so it can determine additions and deletions. It will invoke callback functions when items are added or deleted, or when their underlying config may have changed.
use Config::XPath::Reloadable;

my $conf = Config::XPath::Reloadable->new( filename => 'foomangler.conf' );
my $finder = Module::PluginFinder->new( ... );

$SIG{HUP} = sub { $conf->reload };

my %manglers;

$conf->associate_nodeset( '/mangler', '@name',
   add => sub {
      my ( $name, $mangler_conf ) = @_;
      my $type = $mangler_conf->get_string( '@type' );

      $manglers{$name} = $finder->construct( $type, $mangler_conf );
   },

   keep => sub {
      my ( $name, $mangler_conf ) = @_;

      $manglers{$name}->reconfigure( $mangler_conf );
   },

   remove => sub {
      my ( $name ) = @_;

      delete $manglers{$name};
   },
);
Now, whenever a SIGHUP signal is received, the config file is re-read. The configurations for all the current manglers are updated, new ones added, and old ones deleted.

I've just uploaded a new release, 0.16. This release finally gets rid of the awkward Error-based exceptions, instead using plain-old Carp-based string exceptions. This removes a dependency on the old, deprecated, and unsupported Error distribution.

I've also manually set the configure_requires element to require Module::Build version 0.2808, which is what Perl 5.10.0 shipped with, rather than letting Module::Build pick its own version, which would set it to 0.36. Hopefully this should avoid awkward "please upgrade Module::Build" failures on clean-slate installs. If this works out OK I might start applying it by default across all my dists (where appropriate). It does seem a little awkward, but then I can't really think of a neater way for it to detect the required version - it's hard for it to know, for example, about random methods or functionality invoked during the Build.PL file itself, or bugs/features implicitly relied upon. Something to think about for next time, I feel...

2010/07/30

Perl - List::UtilsBy

List::UtilsBy is a module containing a number of list utility functions which all take a block of code to control their behaviour. Among its delights are a neat wrapping of sort by a custom accessor, plus optimisation and rearrangement functions. They are a loose collection of functions I've written or found useful over the past few months. I won't give a full overview here, you can read the docs yourselves; but I will give a brief description of a few functions.

One question we often get in #perl on Freenode concerns how to sort a list of items by some property of the items, perhaps the value of an object accessor or the result of some regexp extraction. Sometimes the answer comes in variants on a theme of
@sorted_items = sort { $a->accessor cmp $b->accessor } @items;
@sorted_items = sort { ( $a =~ m/^(\d+)/ )[0] <=> ( $b =~ m/^(\d+)/ )[0] } @items;
Sometimes a mention of the Schwartzian Transform comes up.

I decided to take this often-used pattern and find a nicer way to represent it. The result is the sort_by functional.
@sorted_items = sort_by { $_->accessor } @items;
@sorted_items = sort_by { ( $_ =~ m/^(\d+)/ )[0] } @items;
As well as neatness of code, this also has the advantage of invoking the accessor only once per item, rather than once per item pair.

An operation I've often wanted to perform is to search a list of items for the best match by some metric. For this, there is max_by and variations.
$longest = max_by { length $_ } @strings;
$closest = min_by { distance_between( $_->location, $here ) } @places;

Finally, as a replacement for the often-used pattern
@array = grep { wanted $_ } @array;
we have
extract_by { not wanted $_ } @array;
As noted in the documentation, this is implemented by splicing the unwanted elements out, not by assigning a new list, so it is safe to use on lists containing weak references or tied variables.
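Because extract_by returns the elements it removed, the same call can also partition a list in place; a small sketch:
use List::UtilsBy qw( extract_by );

my @numbers = ( 1 .. 10 );
my @evens = extract_by { $_ % 2 == 0 } @numbers;
# @evens now holds (2, 4, 6, 8, 10); @numbers holds only the odd values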

2010/07/19

My current Perl project - Circle

Recently chromatic wrote that we should "tell the world, what are you working on with Perl?"

So, to answer this then, my current project is Circle, an IRC client. Actually, it's much more than an IRC client, but that will do as a first approximation.

Rather than being Just Another IRC client, this one is split into two programs; a backend server that runs on some machine somewhere, likely your co-located shell hosting box, or home server. This maintains all the connections to the IRC networks, persists the scrollback and so on; it is the guts of the logic. Then there's the frontend program, a lightweight GTK application that draws the UI for the backend logic. The frontend doesn't really understand IRC, the backend has no knowledge of GTK. Several readers may recognise something of the MVC pattern about this.

Without going into too much detail here (you can read the above link), this gives you the advantages of a real local native-UI client, plus the advantages of a persistent server. The UI interactions are local, no network latency or bandwidth to get in the way of line editing, backbuffer scrolling, window switching, and so on; yet all the data is persisted in the server so you can just disconnect the thin client and reconnect it from anywhere else.

A common way people solve this sort of problem is to run irssi in a screen session, and reconnect over SSH. The primary downside of this setup is that it requires a low-latency, high-bandwidth connection to the server, as every keypress of the line editor to send your next line will have to round-trip over that network. Every backbuffer operation, scrolling up and down, or switching between windows, has to redraw over the link. If that link has high latency, or low bandwidth, the user experience will suffer. If the network charges for bandwidth, you will end up paying many times over to keep re-sending the same screenful of scrollback as you switch windows. By not having a real presence on the local desktop, irssi-in-screen also cannot take advantage of local desktop features such as notification sounds or highlight popups, nor can it access the local filesystem to perform DCC transfers or similar.

Another solution to remote persistent IRC is to run an IRC proxy server or bouncer, and point a regular IRC client at that. These either don't support backbuffer refills at all, or save and replay events, possibly by prefixing timestamps in the message text. They suffer many shortcomings by being a hacked-on proxy in front of an existing IRC client, which overall doesn't really support the disconnect/reconnect model. These solutions are also almost exclusively IRC-specific, and cannot integrate non-IRC services (such as Instant Messaging) alongside.

Right now this is pretty-much all there is to it, though the design is such that it can accommodate much more. There's also a plain telnet-alike backend module, but it could quite easily accept Instant Messaging, Email, PIM, whatever. Right now, the only frontend is GTK, but nothing says one couldn't also be written for Qt, Windows, or any other GUI toolkit. I'm also slowly in the process of writing a terminal-mode one.

The code is available on CPAN. Patches, they say, are Welcome.

2010/07/01

Good code, bad tests

I've been working on IO::Async::SSL recently. Both the previous upload, and the next one I shall shortly do, are fixes for failing smoke tests. Moreover, they're fixes for failing tests of correct code.

The first concerns the semantics of the END block. The code looked roughly like
use Test::More;
system( 'socat -help' ) == 0 or plan skip_all => "No socat";
plan tests => 3;

open my $kid, "-|", "socat", .... or die;
END { kill TERM => $kid }
The idea of this code was to ensure that socat is no longer running when the test exits. Of course, if the socat program isn't installed, the entire test is skipped. And the END block is still run. The result of this is that since $kid is undefined, the containing perl process is sent SIGTERM instead, and the test harness exits with an error. Ooops.

Now what I actually wanted here was
END { kill TERM => $kid if defined $kid }
I wonder if this situation would warrant a new block-alike; perhaps let's call it ENDX, which is only executed at END time if the line declaring it was actually reached in code. Perhaps it can be hacked up by
my @ends;
sub ENDX(&) {
   push @ends, $_[0];
}
END { $_->() for @ends }
...
ENDX { kill TERM => $kid };
The second failure is a curious one that starts off looking like an OS-specific bug in the code. All the Linux smoke boxes were giving fails on a particularly innocent-looking test
is( $client->peeraddr, $server->sockaddr, 'Client socket address' );
On closer inspection, it seems that the way Linux returns socket addresses from the kernel doesn't initialise the "holes" in the address structure, whereas most other OSes do.

The result of this is that rendered as strings of bytes, the two addresses don't necessarily contain all the same bytes as each other, even though they represent the same address. The way to fix this one is to unpack the addresses, known to be AF_INET addresses, and use is_deeply instead:
use Socket qw( unpack_sockaddr_in );

is_deeply( [ unpack_sockaddr_in $client->peeraddr ],
           [ unpack_sockaddr_in $server->sockaddr ],
           'Client socket address' );
This doesn't look too neat this way, but perhaps there's some scope for considering a Test::Sockaddr module or somesuch, to neaten it up. Perhaps a little special-purpose though...

2010/06/03

Why you should never use indirect object notation

On the subject of indirect object notation, kappa writes "Do not use it when it causes problems".

It always causes problems.
$ perl -MO=Deparse -e 'package Container;
sub new { shift->BUILD }
sub BUILD { my $contained = new Contained() }'
package Container;
sub new {
    shift(@_)->BUILD;
}
sub BUILD {
    my $contained = new(Contained());
}
-e syntax OK
What you meant, perhaps, was
$ perl -MO=Deparse -e 'package Container;
sub new { shift->BUILD }
sub BUILD { my $contained = Contained->new() }'
package Container;
sub new {
    shift(@_)->BUILD;
}
sub BUILD {
    my $contained = 'Contained'->new;
}
-e syntax OK
Basically, indirect notation breaks any time you're actually writing an object class. If you start indirectly invoking constructors (likely), you'll accidentally hit your own. Any time you go adding a new method to your own class - bang; you've just broken any indirect calls to that named method on any other object you'd be using. That sounds excessively fragile to me. Code that used to mean one thing suddenly means a different thing when you add seemingly-unrelated code in another place. Fragile breakage by action-at-a-distance. It even depends on the order of the source code:
$ perl -MO=Deparse -e 'package Container;
sub BUILD { my $contained = new Contained() }
sub new { shift->BUILD }'
package Container;
sub BUILD {
    my $contained = 'Contained'->new;
}
sub new {
    shift(@_)->BUILD;
}
-e syntax OK
And probably, most of the time you're writing code, you're writing objects, right? Even if the function you're writing currently isn't in an object class, it probably uses objects. Maybe one day you'll decide that function is useful in an object, so you'll copypaste that code elsewhere. This means most of your code ought not use indirect notation. If most of it already isn't using it, just get out of the habit NOW, before it is too late, and pretend that indirect notation does not exist at all. You'll save yourself a lot of pain later on.

2010/05/27

Weasels in the Code

I've recently written the following utility method on an object:
use Scalar::Util qw( weaken );

sub _capture_weakself
{
   my $self = shift;
   my ( $code ) = @_; # actually bare method names work too

   weaken $self;

   return sub { $self->$code( @_ ) };
}
It's quite a useful little method for creating a new CODE ref around either a given code ref, or a method on the object. This CODE ref captures a reference to the object, passing it as the first argument. This makes it useful for passing around elsewhere, perhaps as an event callback (as it happens, this method lives in IO::Async::Notifier). Because the object ref is stored weakly in this closure, it means the returned closure can safely be stored in the object itself without creating a cycle.

It's sufficiently useful that I feel sure this technique must have a name, but so far I'm failing to find one. I keep reading the name here as "capture weasel". This gives me an idea - perhaps it could be said this closure has been weaseled; short for wea(k)sel(f)ed.
my $weaseled_code = $notifier->_capture_weakself( sub {
   my $self = shift;
   my ( $x, $y ) = @_;
   ....
} );

$weaseled_code->( 123, 456 );
What does anyone think here? Does this technique have a name already? If not; does this seem suitable? I find it unlikely this name already exists somewhere else in CompSci (FOLDOC doesn't have a use in this sense), so it would be a good unique name...

2010/05/18

PF_PACKET, Linux Socket Filters, and IPv6

For diagnosing network-related problems, it's often useful to be able to capture packets transmitted or received by a machine. Linux implements a socket family, PF_PACKET, to this end. Sockets in this family receive raw datagrams containing packets received or transmitted on network interfaces. A network capture program creates such a socket, then sits in a loop receiving datagrams. Each datagram contains the bytes of the packet. The AF_PACKET address format gives information about, among other things, packet direction and interface number.

In situations where a machine is passing lots of traffic, such as a busy internet router, but the problem being diagnosed concerns only a narrow selection of this traffic, capturing every single packet would be very CPU-intensive. The application could inspect each packet and discard all of those not of interest to it. But since each packet requires a recvfrom(2) system call, this gets quite expensive in context switches, and wasted buffer space. A far more efficient scheme is to have the kernel filter the packets, or at least throw away most of the uninteresting ones.

In Linux, this is done using the SO_ATTACH_FILTER socket option. This option attaches a filter program, which executes on a simple virtual machine within the kernel. This machine is given access to inspect the bytes of the packet, and can return a value indicating whether the packet should be kept, or discarded. This machine is based on BSD's BPF mechanism, with some extensions.

This virtual machine is a register machine. It has a single accumulator (A), a single index register (X), and the various usual sorts of arithmetic and logical operations one would expect. To inspect the actual packet contents it has instructions to load data from the captured packet, in 8, 16, or 32bit unsigned quantities, into the accumulator. The program gives its answer by returning a number to the kernel, which should be the number of bytes of the packet to capture, or 0 to discard it entirely.

For example, let's consider the following filter, which selects any TCP packets:
LD BYTE[9]
JEQ 6, 0, 2

LD len
RET A

RET 0
The LD BYTE[offs] instruction loads a byte from the packet (at offset 9, being where IPv4 headers keep their protocol number) into the A register. JEQ is a conditional jump instruction which compares the A register to the immediate constant (6, the protocol number of TCP). If they are equal it jumps to the first label; if not, the second. Both jump labels are unsigned integers, being the number of instructions forwards to skip. In the true case (i.e. the protocol is TCP), the length of the packet is loaded into the accumulator and returned from the filter. In the false case, the number 0 is returned, to discard the packet.
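As an aside, it may help to see how such a filter gets attached from Perl. The following is a hedged sketch for Linux: the SO_ATTACH_FILTER value is my reading of asm-generic/socket.h, the opcode encodings come from linux/filter.h, and $sock is assumed to be an already-created PF_PACKET socket:
use Socket qw( SOL_SOCKET );

use constant SO_ATTACH_FILTER => 26; # assumption: from asm-generic/socket.h

# struct sock_filter { __u16 code; __u8 jt; __u8 jf; __u32 k; }
sub sock_filter { pack "S C C L", @_ }

my @insns = (
   sock_filter( 0x30, 0, 0, 9 ), # LD BYTE[9]
   sock_filter( 0x15, 0, 2, 6 ), # JEQ 6, 0, 2
   sock_filter( 0x80, 0, 0, 0 ), # LD len
   sock_filter( 0x16, 0, 0, 0 ), # RET A
   sock_filter( 0x06, 0, 0, 0 ), # RET 0
);

my $filter = join "", @insns;

# struct sock_fprog { unsigned short len; struct sock_filter *filter; }
my $fprog = pack "S x![P] P", scalar @insns, $filter;

setsockopt( $sock, SOL_SOCKET, SO_ATTACH_FILTER, $fprog )
   or die "Cannot attach filter - $!";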

Of course, this filter isn't much use yet, because we don't know for sure it's even an IPv4 packet we've caught. The original BSD BPF doesn't define any metadata access scheme, so Linux has invented a way for programs to access this. It reserves the topmost 4KiB of packet buffer, to provide some virtual "registers" containing metadata. Since the load instruction offsets are 32bit integers, it is unlikely any real packet would ever be anywhere near this size, so in practice this works well.

In this extra data area, called the Ancillary Data, the first offset stores the ethertype of the packet (this field address has a symbolic name, SKF_AD_PROTOCOL). In IPv4's case, this will be 0x0800. We can extend our filter to look at this protocol number (not to be confused with IPv4's protocol field), and check it.
LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 4

LD BYTE[9]
JEQ 6, 0, 2

LD len
RET A

RET 0
We can further extend our filter, looking only for a certain TCP port number (such as 80, for http). Presuming an IPv4 header with no extra options, it is 20 bytes long, and therefore, the TCP header will start at offset 20:
LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 8

LD BYTE[9]
JEQ 6, 0, 6

LD HALF[20]
JEQ 80, 2, 0
LD HALF[22]
JEQ 80, 0, 2

LD len
RET A

RET 0
Of course, this filter isn't much use in the real world, because we made the rather large assumption that an IPv4 header would be 20 bytes long. This is where the index register becomes useful. We can load the index register, X, with the size of the IPv4 header, and access the TCP header relative to this. BPF provides a very special instruction, LDMSHX, for exactly this purpose. Given the address of a byte, it loads that byte, masks it to keep only the bottom 4 bits, and shifts it up by 2 bits; that is, it multiplies the low nibble by 4. Applied to the first byte of an IPv4 header, this calculates the header's size in bytes: a typical first byte of 0x45 (version 4, header length 5 words) yields 5 * 4 = 20 bytes.
LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 9

LD BYTE[9]
JEQ 6, 0, 7

LDMSHX BYTE[0]
LD HALF[X+0]
JEQ 80, 2, 0
LD HALF[X+2]
JEQ 80, 0, 2

LD len
RET A

RET 0
Finally, this filter is only of any use on a SOCK_DGRAM socket; a socket where the kernel will throw away the link-level header (such as the Ethernet or PPP framing), and present only the network-level header. A packet capturing program would very likely be interested in those link-level bytes too, so would be using a SOCK_RAW socket instead. In this case, we don't directly know the offset of the IPv4 header, but once again, Linux comes to our rescue. It provides a "virtual view" over the part of the packet from the start of the network header. It provides a virtual offset, NET, where load instructions read relative to the start of the network header, rather than from the start of the packet buffer as a whole.

On a SOCK_RAW socket, this filter would probably be appropriate to select TCP port 80 over IPv4:
LD WORD[AD_PROTOCOL]
JEQ 0x0800, 0, 9

LD BYTE[NET+9]
JEQ 6, 0, 7

LDMSHX BYTE[NET+0]
LD HALF[NET+X+0]
JEQ 80, 2, 0
LD HALF[NET+X+2]
JEQ 80, 0, 2

LD len
RET A

RET 0
This is internally implemented by carving up a further extra data area, 1MiB from the top of the packet buffer (defined by a constant SKF_NET_OFF). This offset gives a virtual view of the bytes in the packet, starting at the network protocol header.

Of course, I have been looking at IPv4 quite specifically here. With the slowly-growing popularity of IPv6, it's inevitable that packet capture programs might want to capture IPv6 packets too.

IPv6 follows a different style from IPv4 in terms of its header options. In IPv4, all the IP-level options are stored in the header, one after another. The "header length" field in the header gives the total size of the header, with all these options. In IPv6, the header contains fewer fields (because things like fragmentation are now options). At the end of this header is the "next header" number, which is either a protocol number such as TCP or UDP, or gives an IPv6 extension header number (such as fragmentation control, or IPsec's authenticating header). Each option links on to the next with its own "next header" field.

This presents us something of a problem when it comes to packet capture filters. Recall how, for the JEQ instruction, the branch labels in both cases are unsigned integers? This is how BPF guarantees termination of a program in finite time - every jump has to be forwards. There can be no loops. Without loops, the program is guaranteed to terminate.

But now how do we parse these IPv6 headers? We can't write a while() loop in the program to walk down the headers, until we find a TCP one. Furthermore, IPv6 doesn't define a standard header layout for all headers. Each header type puts its "next header" field in possibly a different place. Some headers are fixed-length, some carry a "header length" field of their own. It's a total mess, from the point of view of packet filtering.

What I would propose, is to create two new metadata constants:

  • A new Ancillary Data area field, SKF_AD_TRANSPROTO to store the transport level protocol.

  • A new data area offset, SKF_TRANS_OFF, to give a virtual view of the transport header.

This will allow us to very easily write a packet filter to capture TCP port 80, say, agnostic of IPv4 or IPv6. The filter program would look like:
LD WORD[AD_PROTOCOL]
JEQ 0x0800, 1, 0
JEQ 0x86dd, 0, 8

LD WORD[AD_TRANSPROTO]
JEQ 6, 0, 6

LD HALF[TRANS+0]
JEQ 80, 2, 0
LD HALF[TRANS+2]
JEQ 80, 0, 2

LD len
RET A

RET 0
We've now created a filter which can detect the transport protocol of TCP, and inspect the transport header, without having to directly calculate its offset from hard-coded knowledge of the network protocol.

I would like to propose that Linux adopts these two constants, and finds a way to implement them. I have some thoughts on implementation but I will defer these to a later post; as this one has gone on quite long enough already. :)

2010/04/30

Perl - ExtUtils::H2PM

I've spent a lot of time lately writing modules that wrap Linux kernel features in some way or another. They all boil down to basically the same thing - export a bunch of constants, and structure packing/unpacking functions. Various bits of extra fancy interface code are sometimes nice, but most of the time these can be written in Pure Perl once the base bits are done.

It's always annoyed me that one has to write an XS module just to obtain these. It's a lot of extra work and bother, for something that ought to be so simple. So instead I started thinking: how can I make this much simpler from Perl?

What I came up with is ExtUtils::H2PM; a module to make it trivially easy to write this sort of boring module to wrap some constants and structures from the OS's header files.

As a brief example, consider the following

use ExtUtils::H2PM;

module "Fcntl";

include "fcntl.h";

constant "O_RDONLY";

write_output $ARGV[0];

This is it. This is all the code you, as a module author, have to write. You store that in some file, let's call it Fcntl.pm.PL. You let the build system go about converting that into Fcntl.pm, which on my system now looks like this:

package Fcntl;
# This module was generated automatically by ExtUtils::H2PM from -e

push @EXPORT_OK, 'O_RDONLY';
use constant O_RDONLY => 0;

1;

This is a plain standard Perl module that can be installed and used, to provide that constant.

We can also create pack/unpack-style functions to wrap structure types too. Consider

use ExtUtils::H2PM;

module "Time";

include "time.h";

structure "struct timespec",
members => [
tv_sec => member_numeric,
tv_nsec => member_numeric,
];

write_output $ARGV[0];

For this we obtain:

package Time;
# This module was generated automatically by ExtUtils::H2PM from -e

use Carp;
push @EXPORT_OK, 'pack_timespec', 'unpack_timespec';

sub pack_timespec
{
   @_ == 2 or croak "usage: pack_timespec(tv_sec, tv_nsec)";
   my @v = @_;
   pack "q q ", @v;
}

sub unpack_timespec
{
   length $_[0] == 16 or croak "unpack_timespec: expected 16 bytes";
   my @v = unpack "q q ", $_[0];
   @v;
}

1;

This was done entirely automatically too.
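A caller might then use the generated functions like so; a sketch, calling them fully-qualified since the exporter boilerplate H2PM emits is elided above:
use Time;

my $packed = Time::pack_timespec( 1285000000, 500_000_000 );

my ( $sec, $nsec ) = Time::unpack_timespec( $packed );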

In a real use-case, you'd be using this to wrap things like socket option constants/structures, and so on; cases where existing Perl functions already wrap the syscalls, and you simply want to provide new option values or structures. I've now written several modules mostly or entirely relying on ExtUtils::H2PM to build them, freeing me of having to write all that XS code to implement them.

2010/04/26

Order matters even when it doesn't

Revision control diffs are most readable when they aren't noisy. Operations that disturb the order of many lines in the file create noise which makes it hard to read the interesting change. YAML specifies mappings (hashes, to us Perl types), which are unordered associations of keys to values. Even though YAML doesn't put an ordering on those, sometimes we'd like to pretend that it does, so as to preserve the order when we load a file, edit the data, then dump it back to the file.

At work we store a YAML document in Subversion, which describes a lot of details about IPsec tunnels. In an ideal world this would be the initial source of the information. The world, as you may have observed, is not yet ideal, so this file is in fact back-scraped from information in the actual config files, to keep it up to date. Naturally that's done in Perl.

The YAML file stores a big mapping, each entry itself being a record-like mapping, containing details in named keys. This causes great trouble for our load/edit/dump script, because YAML doesn't specify an ordering in mapping keys. They'll be dumped in "no particular order". This wouldn't normally be a problem, except that because it's stored in Subversion, a commit changing one line of actual detail might suddenly produce hundreds of lines of false diff, because of reordered keys.

To solve this, I had to apply Much Evil Hackery. The YAML Perl module, it turns out, has a data structure tied to a hash, which remembers the order of keys. By subclassing YAML::Loader and replacing its method to read a mapping into a hash ref, we can force it to use this structure instead. This alteration is transparent to the perl code inbetween, it just sees a normal hash. However, YAML::Dumper sees the ordering and preserves it when it writes out.

The upshot: Load/edit/dump of trees of mappings in YAML preserves ordering, allowing cleaner commits into revision control.

This has been suggested as a wishlist bug against YAML; see also https://rt.cpan.org/Ticket/Display.html?id=56741

2010/04/06

Why I won't tell you what to do

Often in #perl we get people asking a question, whether explicit or implied, that requires us to pick a solution for them. I try hard not to do this. The closest I'll get is to listen to their description of the problem, and suggest some things which I think might help. Hardly ever do I suggest just one thing.

I do this because it's hard for us to know the entire surrounding context of a problem. If anyone else is like me, then there'll be days, weeks, maybe even months of history behind it; various attempts they've tried already, other bits and pieces of code, system, whatever, that they haven't been able to explain in the 5 minutes they tried to give us the problem. You can't explain a 1-month problem in 5-minutes, nor can I give a solution to it in similar timeframe.

What I can do is name a few things that I'd include in a shortlist of things to think about in more detail, were I to find myself with a similar problem. They can then go away and think about these things. Maybe they were already aware of them, and we've just given some more confidence that those might be correct. Or maybe they weren't, so we've given them something new to read about. Either way, we've helped guide the decision process, without outright saying "thou shalt do this" - because, without having that month of context around it, for all we know it could be completely wrong. But it's a good start.

2010/03/31

Perl - IO::Async

IO::Async is:
  • a Perl module distribution
  • a generic eventing framework
  • suitable for all kinds of IO-bound tasks
  • a way to achieve IO concurrency
  • portable to Linux, Solaris, BSDs, other POSIXes
  • mostly-working on cygwin
  • able to use OS-specific optimisations where appropriate
  • able to use Glib for running GTK2-based GUI applications
IO::Async serves as a base layer for all kinds of IO-heavy programs.

Its design is centred around passing code references as continuations; handlers to invoke when a certain event happens, or after a certain operation has completed. By relying on lexical variable capture in these code references, these continuations can easily store state in normal variables.
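For instance, a minimal sketch of counting lines arriving on a stream, with the state kept in an ordinary captured lexical (the on_read signature follows the IO::Async::Stream documentation; $loop and a connected $socket are assumed):
my $linecount = 0;

my $stream = IO::Async::Stream->new(
   read_handle => $socket,

   on_read => sub {
      my ( $self, $buffref, $closed ) = @_;

      return 0 unless $$buffref =~ s/^(.*)\n//;

      $linecount++; # state lives in the captured lexical
      print "Line $linecount: $1\n";
      return 1;
   },
);

$loop->add( $stream );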

It's a different approach than that taken by POE, the other eventing system for Perl, but I find this approach fits better in my head. If you dislike storing state in a hash to pass it around disjoint named functions loosely connected by event names, you might just want to give it a try...