Fetch the content of a given URL in Perl using LWP::UserAgent

Tested on

Debian (Lenny, Squeeze)
Ubuntu (Lucid, Precise, Trusty)

Objective

To fetch the content located at a given URL in Perl using the module LWP::UserAgent

Scenario

Suppose that you wish to fetch the content of the URL http://www.example.com/ and write it to STDOUT. You have chosen to do this using the module LWP::UserAgent.

Method

The first step is to create an instance of the class LWP::UserAgent:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

There are a number of ways in which the behaviour of this object can be customised, either during construction or afterwards. For this example the default settings will suffice, but for anything more substantial you should consider setting the user agent string, supplying authentication credentials, sending and accepting cookies, and using a proxy server.

See below for how to implement these variations.

The content of the URL can now be fetched using the get method of the user agent instance:

my $response = $ua->get('http://www.example.com/');

The response is an object of type HTTP::Response. (This is so even if the transfer was not performed using HTTP: the response is presented in the same manner for any URL scheme.) From this it is possible to determine whether the transfer was successful:

unless ($response->is_success) {
        die $response->status_line;
}
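
The success check can be tried without any network access, because HTTP::Response objects can be constructed by hand. The following is a minimal sketch only: the describe helper and the two hand-built responses are assumptions made for illustration and are not part of the method above.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: summarise the outcome of a request in one line.
sub describe {
    my ($response) = @_;
    return $response->is_success
        ? 'ok: ' . $response->status_line
        : 'failed: ' . $response->status_line;
}

# Hand-built responses standing in for what $ua->get would return.
my $found   = HTTP::Response->new(200, 'OK');
my $missing = HTTP::Response->new(404, 'Not Found');

print describe($found),   "\n";
print describe($missing), "\n";
```

The status line combines the numeric status code and the reason phrase, which is usually enough for an error message. The code alone is available from the code method if the program needs to distinguish between different kinds of failure.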

The raw content can be extracted by calling the content method of the response:

binmode STDOUT,':raw';
print $response->content;

however it is usually better to use the decoded_content method so that the character set (for text-based formats) and content encoding (if any) are handled transparently:

my $content = $response->decoded_content();
if (utf8::is_utf8($content)) {
	binmode STDOUT,':utf8';
} else {
	binmode STDOUT,':raw';
}
print $content;

(The reason for calling utf8::is_utf8 is to determine whether $content is a sequence of bytes or a sequence of characters. In the latter case it may contain wide characters, therefore STDOUT should be configured with an encoding layer that can handle wide characters.)
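
The difference between the two methods can be seen with a hand-built response, so no network access is needed. This is a sketch under stated assumptions: the response, its Content-Type header and its body are all made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response whose body is the UTF-8 encoding of "café":
# four characters, but five bytes because "é" occupies two.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    "caf\xC3\xA9",
);

my $raw     = $response->content;          # bytes as received
my $decoded = $response->decoded_content;  # Perl character string

printf "raw length: %d\n",     length $raw;
printf "decoded length: %d\n", length $decoded;
```

The raw form is what you would want when saving the content to a file unchanged, whereas the decoded form is more convenient for searching or otherwise processing text.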

Here is the method as a complete Perl script:

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://www.example.com/');
unless ($response->is_success) {
	die $response->status_line;
}
my $content = $response->decoded_content();
if (utf8::is_utf8($content)) {
	binmode STDOUT,':utf8';
} else {
	binmode STDOUT,':raw';
}
print $content;

Once an instance of LWP::UserAgent has been created it may be reused to make any number of requests.
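
As a sketch of this reuse, the steps of the method can be wrapped in a subroutine that shares a single user agent between calls. The fetch_url helper is a hypothetical name, and to keep the example runnable without network access it fetches two temporary local files through file: URLs, which LWP::UserAgent handles in the same manner as HTTP.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use LWP::UserAgent;
use URI::file;

# One user agent, shared by every request made below.
my $ua = LWP::UserAgent->new;

# Hypothetical helper: return the decoded content of a URL, or die.
sub fetch_url {
    my ($url) = @_;
    my $response = $ua->get($url);
    die "$url: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Create two small local files and build a file: URL for each
# (stand-ins for real web pages).
my @urls;
for my $text ("first\n", "second\n") {
    my ($fh, $path) = tempfile(SUFFIX => '.txt', UNLINK => 1);
    print {$fh} $text;
    close $fh;
    push @urls, URI::file->new_abs($path)->as_string;
}

my @contents = map { fetch_url($_) } @urls;
print @contents;
```

Any settings applied to the user agent, such as those described under Variations, then take effect for every request made through the helper.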

Variations

Setting the user agent string

When a request is made using HTTP the user agent should identify itself to the server by means of a User-Agent header. For LWP::UserAgent this defaults to libwww-perl/x.xx (where x.xx is the version number); however, you can specify an alternative string by calling the agent method of the user agent instance:

my $ua = LWP::UserAgent->new;
$ua->agent('Examplebot/0.9');
my $response = $ua->get('http://www.example.com/');
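
The string can also be supplied when the user agent is constructed, and a string that ends with a space has the default libwww-perl identifier appended to it. The sketch below shows both forms; Examplebot/0.9 is the same placeholder name as above, and nothing is sent over the network.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Set the string at construction time instead of afterwards.
my $ua = LWP::UserAgent->new(agent => 'Examplebot/0.9');
my $plain = $ua->agent;
print "$plain\n";

# A trailing space asks LWP::UserAgent to append its default identifier.
$ua->agent('Examplebot/0.9 ');
my $combined = $ua->agent;
print "$combined\n";
```

The second form is useful if you want to identify your own program while still declaring which library made the request.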

Authentication

If the web site uses HTTP Basic authentication, and if the required username and password are known in advance, then you can present them to the server by constructing an HTTP::Request object then calling its authorization_basic method (this sets a Basic Authorization header only, so it is not suitable for Digest authentication):

my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'http://www.example.com/');
$req->authorization_basic('user','xyzzy');
my $response = $ua->request($req);
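
To see what authorization_basic adds to the request, the header can be inspected before anything is sent. This is a small sketch using the same placeholder credentials as above; it makes no network connection.

```perl
use strict;
use warnings;
use HTTP::Request;
use MIME::Base64 qw(decode_base64);

my $req = HTTP::Request->new(GET => 'http://www.example.com/');
$req->authorization_basic('user', 'xyzzy');

# The header holds the scheme name followed by base64 of "user:password".
my $header = $req->header('Authorization');
print "$header\n";

# Base64 is an encoding, not encryption: decoding recovers the credentials.
my ($encoded) = $header =~ /^Basic (.+)$/;
my $decoded = decode_base64($encoded);
print "$decoded\n";
```

Because the credentials are trivially recoverable from the header, Basic authentication is best used over an encrypted connection.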

Alternatively you could override the get_basic_credentials method of LWP::UserAgent. This requires somewhat more code, but allows for the possibility of interactive prompting if it is not known in advance that a password will be needed.
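
One possible shape for such an override is sketched below. The class name PromptingUA and the $ask hook are hypothetical names chosen for this illustration, and the password is echoed to the terminal as it is typed, which a real program would probably want to avoid.

```perl
use strict;
use warnings;
use LWP::UserAgent;

{
    package PromptingUA;
    our @ISA = ('LWP::UserAgent');

    # Hypothetical hook: asks a question on STDERR, reads the answer
    # from STDIN. Replace it to obtain credentials some other way.
    our $ask = sub {
        my ($question) = @_;
        print STDERR $question;
        my $answer = <STDIN>;
        chomp $answer if defined $answer;
        return $answer;
    };

    # Called by LWP::UserAgent when a server demands authentication.
    sub get_basic_credentials {
        my ($self, $realm, $uri, $isproxy) = @_;
        my $where = "'$realm' at " . $uri->host_port;
        my $user = $ask->("Username for $where: ");
        return unless defined $user && length $user;
        my $pass = $ask->("Password for $where: ");
        return ($user, $pass);
    }
}

# Used exactly like a plain LWP::UserAgent; the prompt appears only
# if a server responds with an authentication challenge.
my $ua = PromptingUA->new;
```

Returning an empty list tells the user agent that no credentials are available, in which case the 401 response is passed back to the caller.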

Kerberos authentication using the Negotiate authentication mechanism is handled automatically and transparently provided that the LWP::Authen::Negotiate plugin has been installed (it need not be explicitly loaded).

Sending and accepting cookies

By default LWP::UserAgent neither sends nor accepts any cookies, but this can be changed by creating a ‘cookie jar’. This can either be a temporary one stored in memory:

my $ua = LWP::UserAgent->new;
$ua->cookie_jar({});
my $response = $ua->get('http://www.example.com/');

or a persistent one that is written to disc:

my $ua = LWP::UserAgent->new;
$ua->cookie_jar({
    file => "$ENV{HOME}/.cookies.dat",
    autosave => 1});
my $response = $ua->get('http://www.example.com/');
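
The cookie jar is an HTTP::Cookies object, and its effect can be sketched without a network connection by feeding it a hand-built response. The session cookie and both requests below are invented for illustration; in normal use the user agent calls the jar for you on each request and response.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->cookie_jar({});
my $jar = $ua->cookie_jar;    # the HTTP::Cookies object just created

# Hand-built reply standing in for a server that sets a cookie.
my $first = HTTP::Request->new(GET => 'http://www.example.com/');
my $reply = HTTP::Response->new(
    200, 'OK', [ 'Set-Cookie' => 'session=abc123; Path=/' ],
);
$reply->request($first);
$jar->extract_cookies($reply);

# A later request to the same site has the stored cookie added to it.
my $second = HTTP::Request->new(GET => 'http://www.example.com/page');
$jar->add_cookie_header($second);
my $cookie_header = $second->header('Cookie');
print "$cookie_header\n";
```

A cookie without an expiry time, like this one, lasts only for the current session, so it would not normally be written to a persistent jar.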

Using a proxy server

The simplest way to support the use of a proxy server is to call the env_proxy method of the user agent instance:

my $ua = LWP::UserAgent->new;
$ua->env_proxy;
my $response = $ua->get('http://www.example.com/');

This allows the use of environment variables such as http_proxy and no_proxy to control the behaviour of the user agent. The alternative if you want direct control is to call the proxy and no_proxy methods:

my $ua = LWP::UserAgent->new;
$ua->proxy('http','http://192.168.0.1:8080/');
$ua->no_proxy('localhost');
my $response = $ua->get('http://www.example.com/');
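
When debugging a proxy configuration it can help to read the settings back. The sketch below relies on the proxy method returning the current setting when called with a scheme alone; the proxy address is the same placeholder as above, and no request is made.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# One proxy can be assigned to several schemes at once.
$ua->proxy(['http', 'https'], 'http://192.168.0.1:8080/');
$ua->no_proxy('localhost');

# Called with only a scheme, proxy returns the current setting
# (undefined if requests for that scheme go direct).
for my $scheme (qw(http https ftp)) {
    my $proxy = $ua->proxy($scheme);
    print "$scheme: ", (defined $proxy ? $proxy : 'direct'), "\n";
}
```

The same check works after calling env_proxy, which makes it a quick way to confirm which environment variables were picked up.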
