2013/10/23

Parallel Name Resolving using IO::Async

Perl has a variety of modules and frameworks for running multiple operations in parallel. Some are specific to one kind of task, and some are more generic. One of the larger general-purpose ones is IO::Async.

IO::Async provides an abstraction around the system name resolver, getaddrinfo(), allowing it to be called asynchronously to resolve a number of names at once, and returning results later as they arrive.

To do this, start with the resolver object itself. This can be obtained from the underlying IO::Async::Loop object. We do not actually need to keep a reference to the Loop, as the resolver will keep that itself.

use IO::Async::Loop;

my $resolver = IO::Async::Loop->new->resolver;

Next, call the getaddrinfo method on it, passing in the details of the name lookup required, and collect the result list. We need to pass a hint to the method so that it returns just one kind of socket address rather than iterating all the possible kinds. Since we only care about the IP address and not the service port number it doesn't matter too much what hint we pass, but one of the simpler ones is to ask for the stream socket type (i.e. TCP ports).

my @results = $resolver->getaddrinfo(
   host     => "www.google.com",
   socktype => 'stream',
)->get;
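The effect of the socktype hint can be seen using the plain Socket::getaddrinfo function, which the resolver wraps. This is only a rough sketch, not part of the example itself: it assumes the numeric loopback address 127.0.0.1 together with the AI_NUMERICHOST flag so that no real lookup takes place, and the number of results returned without a hint will vary by platform. Note that the Socket function takes the SOCK_STREAM constant, whereas the IO::Async method above accepts the name 'stream'.

```perl
use strict;
use warnings;
use Socket qw( getaddrinfo AI_NUMERICHOST SOCK_STREAM );

# Assumption: a numeric address, so nothing is actually resolved
my $host = "127.0.0.1";

# Without a socktype hint: typically one result per kind of socket
my ( $err_all, @all ) = getaddrinfo( $host, "", { flags => AI_NUMERICHOST } );
die "getaddrinfo failed: $err_all" if $err_all;

# With the hint: only stream socket addresses are returned
my ( $err_stream, @stream ) = getaddrinfo( $host, "",
   { flags => AI_NUMERICHOST, socktype => SOCK_STREAM } );
die "getaddrinfo failed: $err_stream" if $err_stream;

printf "no hint: %d result(s)\n", scalar @all;
printf "stream:  %d result(s)\n", scalar @stream;
```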

The getaddrinfo method is intended for creating packed socket address structures for passing directly to connect() or bind(), so to obtain a human-readable string each address needs converting back to a printable numeric form using Socket::getnameinfo(). We need to pass in the NI_NUMERICHOST flag in order to have it return a plain numeric IP address instead of reverse resolving that address back into a name. The numeric address string comes back as the second element of the list returned by getnameinfo() (the first being an error value), so we use a list slice to pick out just that.

use Socket qw( getnameinfo NI_NUMERICHOST );

my @addrs = map { ( getnameinfo $_->{addr}, NI_NUMERICHOST )[1] }
            @results;

print "$_\n" for @addrs;

This yields the list of IP addresses for this one hostname:

2a00:1450:4009:809::1014
173.194.34.112
173.194.34.115
173.194.34.113
173.194.34.116
173.194.34.114
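The list slice used above throws away the first value that getnameinfo() returns, which is an error indication. As a hedged sketch of what checking it might look like, the following builds a socket address by hand with pack_sockaddr_in and converts it back to a string. The address 192.0.2.1 and port 80 are placeholders standing in for a real resolver result, and the helper name addr_to_string is made up for this illustration.

```perl
use strict;
use warnings;
use Socket qw( getnameinfo pack_sockaddr_in inet_aton
               NI_NUMERICHOST NI_NUMERICSERV );

# Hypothetical helper: turn a packed address into a numeric string, or die
sub addr_to_string {
   my ( $addr ) = @_;

   # getnameinfo returns ( $err, $host, $service ); numeric flags avoid lookups
   my ( $err, $host ) = getnameinfo( $addr, NI_NUMERICHOST | NI_NUMERICSERV );
   die "getnameinfo failed: $err" if $err;

   return $host;
}

# Stand-in for the {addr} field of one getaddrinfo result
my $addr = pack_sockaddr_in( 80, inet_aton( "192.0.2.1" ) );

print addr_to_string( $addr ), "\n";
```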

The reason for the get method on the getaddrinfo call is that, like (almost) all of the IO::Async methods that perform a single asynchronous operation, the getaddrinfo method returns a Future. A Future is an object representing an outstanding operation that may not yet be complete. In this first example we just wanted to wait for that operation to complete, so we forced it by calling the get method on it. This method waits for the Future to complete, then returns its result.
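The lifecycle of a Future can be seen in isolation, without any event loop. This is a small sketch assuming the Future module is installed (it comes along as a dependency of IO::Async). The addresses are placeholders, and the future is completed by hand where in the real program the resolver would do it.

```perl
use strict;
use warnings;
use Future;

# A pending future: the operation it represents has not finished yet
my $f = Future->new;
print "ready before done? ", ( $f->is_ready ? "yes" : "no" ), "\n";

# Whoever performs the operation marks it complete and supplies the result
$f->done( "192.0.2.1", "192.0.2.2" );
print "ready after done?  ", ( $f->is_ready ? "yes" : "no" ), "\n";

# get returns the result list; it returns at once here as the future is ready
my @result = $f->get;
print "$_\n" for @result;
```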

Of course, the entire reason for our using IO::Async was to perform multiple operations at the same time, and wait concurrently for them all to complete. So rather than calling get on each individual getaddrinfo future, we can combine them all together into a single future that needs them all to complete before it itself is considered completed.

use Future;

my @hosts = qw( www.google.com www.facebook.com www.iana.org );

my @futures = map {
   my $host = $_;
   $resolver->getaddrinfo(
      host     => $host,
      socktype => 'stream',
   )
} @hosts;

my @results = Future->needs_all( @futures )->get;

my @addrs = map { ( getnameinfo $_->{addr}, NI_NUMERICHOST )[1] }
                @results;

print "$_\n" for @addrs;

This now yields:

2a00:1450:4009:809::1011
173.194.41.180
173.194.41.177
173.194.41.179
173.194.41.176
173.194.41.178
2a03:2880:f00a:401:face:b00c:0:1
31.13.72.65
2620:0:2d0:200::8
192.0.32.8

Oh dear. Unfortunately, the needs_all future has simply concatenated all of the individual results together, so we have lost track of which host has which addresses. To solve this, we can make each individual host future return not a list of its results, but a two-element list containing its hostname and an ARRAY ref of the IP addresses it resolved to. That way, when we fetch the results of the overall needs_all future we will have an even-sized name-value list, perfect for assigning into a hash.

To do this we can have each host future be a two-stage operation, consisting of first the getaddrinfo call, and then altering its result using a code block passed to the transform method.

my @futures = map {
   my $host = $_;
   $resolver->getaddrinfo(
      host     => $host,
      socktype => 'stream',
   )->transform(
      done => sub {
         my @results = @_;
         my @addrs = map { (getnameinfo $_->{addr}, NI_NUMERICHOST)[1] }
                         @results;
         return ( $host, \@addrs );
      }
   );
} @hosts;

my %addrs = Future->needs_all( @futures )->get;

use Data::Dump 'pp';
print STDERR pp(\%addrs);

Now we retain the mapping from hostnames to the list of IP addresses they resolved to:

{
  "www.facebook.com" => ["2a03:2880:f00a:201:face:b00c:0:1", "31.13.72.1"],
  "www.google.com"   => [
                          "2a00:1450:4009:809::1010",
                          "173.194.41.177",
                          "173.194.41.178",
                          "173.194.41.180",
                          "173.194.41.179",
                          "173.194.41.176",
                        ],
  "www.iana.org"     => ["2620:0:2d0:200::8", "192.0.32.8"],
}
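Both the flattening problem and the fix can be reproduced without any network access by using futures that are already complete. This sketch assumes only the Future module; the hostnames and addresses are invented placeholders standing in for real resolver results.

```perl
use strict;
use warnings;
use Future;

# Placeholder data standing in for what the resolver would return
my %fake = (
   "a.example" => [ "192.0.2.1", "192.0.2.2" ],
   "b.example" => [ "198.51.100.7" ],
);
my @hosts = sort keys %fake;

# Plain futures: needs_all concatenates every result into one flat list
my @flat = Future->needs_all(
   map { Future->done( @{ $fake{$_} } ) } @hosts
)->get;
print scalar(@flat), " addresses, host association lost\n";

# Transformed futures: each yields ( $host, \@addrs ), giving a key/value list
my %by_host = Future->needs_all(
   map {
      my $host = $_;
      Future->done( @{ $fake{$host} } )
            ->transform( done => sub { return ( $host, [ @_ ] ) } );
   } @hosts
)->get;

print "$_ => @{ $by_host{$_} }\n" for @hosts;
```

Capturing $host in a lexical inside the map block matters here, since the transform callback needs to hold on to the right name for its own future.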

Now, this post is fairly obviously written in response to "Parallel DNS lookups using AnyEvent", but from the perspective of IO::Async instead. Aside from the choice of event system, two important differences should be observed:

  • Through the use of futures, this example manages both the flow of control and the data along with it. It does not need separate variables, captured by callback functions, to carry the data alongside an object that only handles the flow of control. Each future object yields its result, and the individual futures can form linear flows by using transform or other methods to return different results, or needs_all or other methods to combine individual futures into larger ones.
  • Nowhere in the above did I mention DNS. This is intentional. IO::Async's getaddrinfo resolver really is an asynchronous wrapper around the underlying Socket function of the same name. Because of this it uses the system's standard name resolver as provided by libc, ensuring it will yield identical results to any other program using the system resolver, regardless of whether libc is configured to use DNS, files, LDAP, or any other resolution method. It also automatically handles IPv6 if the underlying system does, returning a mixture of IPv4 and IPv6 addresses in the host's preferred order. The caller does not need to be aware of the subtle distinctions of RFC 3484 sorting order, for example.