Fetch the content of a given URL in Perl using LWP::UserAgent
Tested on: Debian (Lenny, Squeeze); Ubuntu (Lucid, Precise, Trusty)
Objective
To fetch the content located at a given URL in Perl using the module LWP::UserAgent
Scenario
Suppose that you wish to fetch the content of the URL http://www.example.com/ and write it to STDOUT. You have chosen to do this using the module LWP::UserAgent.
Method
The first step is to create an instance of the class LWP::UserAgent:
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
There are a number of ways in which the behaviour of this object can be customised, either during construction or afterwards. For this example the default settings will suffice, but for anything more substantial you should consider:
- whether a custom user-agent string should be provided;
- whether there is a need to support authentication;
- whether the user agent should send and accept cookies; and
- whether it might be necessary to access the URL through a proxy server.
See below for how to implement these variations.
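Several of these settings can also be supplied as constructor options. The following is a minimal sketch rather than a recommended configuration: the agent string, the 30-second timeout and the decision to enable an in-memory cookie jar are all illustrative assumptions.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# All values below are illustrative assumptions, not recommendations.
my $ua = LWP::UserAgent->new(
    agent      => 'Examplebot/0.9',  # custom User-Agent string
    timeout    => 30,                # give up after 30 seconds of inactivity
    cookie_jar => {},                # temporary in-memory cookie jar
);

# The same settings can be read back (or changed) through accessor methods.
printf "agent: %s, timeout: %d seconds\n", $ua->agent, $ua->timeout;
```

Settings passed to the constructor behave the same as those applied afterwards through the corresponding methods, so the choice between the two is largely one of style.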
The content of the URL can now be fetched using the get method of the user agent instance:
my $response = $ua->get('http://www.example.com/');
The response is an object of type HTTP::Response. (This is so even if the transfer was not performed using HTTP: the response is presented in the same manner for any URL scheme.) From this it is possible to determine whether the transfer was successful:
unless ($response->is_success) {
    die $response->status_line;
}
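Besides is_success and status_line, the response object also exposes the numeric status code. The sketch below builds two HTTP::Response objects by hand so that it can run without network access. The helper name describe_response is hypothetical.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: summarise the outcome of a request in one line.
sub describe_response {
    my ($response) = @_;
    return $response->is_success
        ? 'OK: ' . $response->status_line
        : 'FAILED: ' . $response->status_line;
}

# Hand-built responses standing in for what $ua->get would return.
my $ok      = HTTP::Response->new(200, 'OK');
my $missing = HTTP::Response->new(404, 'Not Found');

print describe_response($ok), "\n";
print describe_response($missing), "\n";
print 'Numeric code: ', $missing->code, "\n";
```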
The raw content can be extracted by calling the content method of the response:
binmode STDOUT, ':raw';
print $response->content;
However, it is usually better to use the decoded_content method so that the character set (for text-based formats) and content encoding (if any) are handled transparently:
my $content = $response->decoded_content();
if (utf8::is_utf8($content)) {
    binmode STDOUT, ':utf8';
} else {
    binmode STDOUT, ':raw';
}
print $content;
(The reason for calling utf8::is_utf8 is to determine whether $content is a sequence of bytes or a sequence of characters. In the latter case it may contain wide characters, therefore STDOUT should be configured with an encoding layer that can handle wide characters.)
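The difference between content and decoded_content can be seen without network access by constructing a response by hand. This is only a sketch: the response body and its Content-Type header are made up for illustration, and the :encoding(UTF-8) output layer is one possible choice.

```perl
use strict;
use warnings;
use HTTP::Response;

# A hand-built response standing in for one fetched from a server.
# The body is "cafe" with an acute accent on the final letter,
# encoded as UTF-8 (5 bytes, 4 characters).
my $response = HTTP::Response->new(200, 'OK');
$response->header('Content-Type' => 'text/plain; charset=UTF-8');
$response->content("caf\xC3\xA9");

my $raw     = $response->content;          # undecoded bytes
my $decoded = $response->decoded_content;  # characters, charset applied

printf "raw length: %d, decoded length: %d\n", length($raw), length($decoded);

# The decoded string holds a character above 0x7F, so select an
# encoding layer before printing it.
binmode STDOUT, ':encoding(UTF-8)';
print $decoded, "\n";
```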
Here is the method as a complete Perl script:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Create the user agent and fetch the URL.
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://www.example.com/');
unless ($response->is_success) {
    die $response->status_line;
}

# Decode the content, then select an output layer to match.
my $content = $response->decoded_content();
if (utf8::is_utf8($content)) {
    binmode STDOUT, ':utf8';
} else {
    binmode STDOUT, ':raw';
}
print $content;
Once an instance of LWP::UserAgent has been created it may be reused to make any number of requests.
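As a rough illustration of reuse, the sketch below makes two requests with a single user agent. To keep it runnable without network access it fetches file: URLs that point at temporary files. That choice, and the file contents, are assumptions made purely for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use LWP::UserAgent;
use URI::file;

# Create two small local files to fetch, so no network access is needed.
my @urls;
for my $text ("first document\n", "second document\n") {
    my ($fh, $path) = tempfile(SUFFIX => '.txt', UNLINK => 1);
    print {$fh} $text;
    close $fh or die "Cannot close $path: $!";
    push @urls, URI::file->new_abs($path);
}

# One user agent instance serves every request.
my $ua = LWP::UserAgent->new;
my @results;
for my $url (@urls) {
    my $response = $ua->get($url);
    die "$url: " . $response->status_line unless $response->is_success;
    push @results, $response->content;
}
print @results;
```

The loop would look the same for http: URLs, since the response has the same form whatever the scheme.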
Variations
Setting the user agent string
When a request is made using HTTP the user agent should identify itself to the server by means of a User-Agent header. For LWP::UserAgent this defaults to libwww-perl/x.xx (where x.xx is the version number); however, you can specify an alternative string by calling the agent method of the user agent instance:
my $ua = LWP::UserAgent->new;
$ua->agent('Examplebot/0.9');
my $response = $ua->get('http://www.example.com/');
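The agent method can also be called without arguments to read the current string. In addition, a string that ends with a space has the default identifier appended to it, which is one way to keep the library name visible to the server. A short sketch, using the same example bot name:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
print 'Default: ', $ua->agent, "\n";    # libwww-perl/x.xx

# A string ending in a space has the default identifier appended to it.
$ua->agent('Examplebot/0.9 ');
print 'Combined: ', $ua->agent, "\n";   # Examplebot/0.9 libwww-perl/x.xx
```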
Authentication
If the web site uses HTTP Basic authentication (this technique does not apply to Digest authentication, where the credentials are sent in response to a challenge from the server), and if the required username and password are known in advance, then you can present them to the server by constructing an HTTP::Request object then calling its authorization_basic method:
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'http://www.example.com/');
$req->authorization_basic('user','xyzzy');
my $response = $ua->request($req);
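To see what authorization_basic actually does to the request, the header it sets can be inspected before anything is sent. The sketch below does so without contacting a server, using the same example username and password as above.

```perl
use strict;
use warnings;
use HTTP::Request;

my $req = HTTP::Request->new(GET => 'http://www.example.com/');
$req->authorization_basic('user', 'xyzzy');

# The credentials travel in the Authorization header, base64-encoded
# but not encrypted.
print $req->header('Authorization'), "\n";

# In list context, calling the method without arguments returns the
# username and password that were set.
my ($user, $password) = $req->authorization_basic;
print "user=$user\n";
```

Because the header is only encoded, it is usually sensible to combine this form of authentication with HTTPS.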
Alternatively you could override the get_basic_credentials method of LWP::UserAgent. This requires somewhat more code, but allows for the possibility of interactive prompting if it is not known in advance that a password will be needed.
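One possible shape for such an override is sketched below. The subclass name ExampleUserAgent, the realm name and the %credentials lookup table are all hypothetical. A real program might prompt the user at the marked point instead of consulting a table.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

# Hypothetical credential store, keyed by "host:port" and then realm.
my %credentials = (
    'www.example.com:80' => { 'Example Realm' => [ 'user', 'xyzzy' ] },
);

{
    package ExampleUserAgent;    # hypothetical subclass name
    use parent -norequire, 'LWP::UserAgent';

    # Called by LWP when a server demands authentication. Returning an
    # empty list abandons the attempt to authenticate.
    sub get_basic_credentials {
        my ($self, $realm, $uri, $isproxy) = @_;
        return if $isproxy;
        my $entry = $credentials{ $uri->host_port }{$realm} or return;
        # An interactive program could prompt the user here instead.
        return @$entry;
    }
}

my $ua = ExampleUserAgent->new;

# LWP normally makes this call itself; it is made directly here so that
# the sketch runs without network access.
my ($user, $password) = $ua->get_basic_credentials(
    'Example Realm', URI->new('http://www.example.com/'), 0);
print "user=$user\n";
```

An instance of the subclass is then used exactly like an ordinary LWP::UserAgent, for example by calling its get method.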
Kerberos authentication using the Negotiate authentication mechanism is handled automatically and transparently provided that the LWP::Authen::Negotiate plugin has been installed (it need not be explicitly loaded).
Sending and accepting cookies
By default LWP::UserAgent neither sends nor accepts any cookies, but this can be changed by creating a ‘cookie jar’. This can either be a temporary one stored in memory:
my $ua = LWP::UserAgent->new;
$ua->cookie_jar({});
my $response = $ua->get('http://www.example.com/');
or a persistent one that is written to disc:
my $ua = LWP::UserAgent->new;
$ua->cookie_jar({ file => "$ENV{HOME}/.cookies.dat", autosave => 1 });
my $response = $ua->get('http://www.example.com/');
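The jar created this way is an HTTP::Cookies object, and its behaviour can be explored offline. The sketch below simulates a response that sets a cookie, then shows the jar adding it to a later request. The cookie name and value and the /page path are made up for the example. In normal use the user agent calls extract_cookies and add_cookie_header on your behalf.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
print 'Jar by default: ', (defined $ua->cookie_jar ? 'yes' : 'no'), "\n";

$ua->cookie_jar({});    # in-memory jar, an HTTP::Cookies object
my $jar = $ua->cookie_jar;

# Simulate a server response that sets a cookie (hypothetical name and value).
my $response = HTTP::Response->new(200, 'OK');
$response->request(HTTP::Request->new(GET => 'http://www.example.com/'));
$response->header('Set-Cookie' => 'session=abc123; path=/');
$jar->extract_cookies($response);

# A later request to the same site picks the cookie up from the jar.
my $next = HTTP::Request->new(GET => 'http://www.example.com/page');
$jar->add_cookie_header($next);
print 'Cookie header: ', $next->header('Cookie'), "\n";
```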
Using a proxy server
The simplest way to support the use of a proxy server is to call the env_proxy method of the user agent instance:
my $ua = LWP::UserAgent->new;
$ua->env_proxy;
my $response = $ua->get('http://www.example.com/');
This allows the use of environment variables such as http_proxy and no_proxy to control the behaviour of the user agent. The alternative if you want direct control is to call the proxy and no_proxy methods:
my $ua = LWP::UserAgent->new;
$ua->proxy('http', 'http://192.168.0.1:8080/');
$ua->no_proxy('localhost');
my $response = $ua->get('http://www.example.com/');
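Both methods accept more than one value, and proxy can be used to read a setting back. The following sketch reuses the proxy address from above. The ftp scheme and the example.org domain are illustrative assumptions.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# One proxy for several schemes: pass an array reference of scheme names.
$ua->proxy(['http', 'ftp'], 'http://192.168.0.1:8080/');

# Bypass the proxy for several domains (the second name is an
# illustrative assumption).
$ua->no_proxy('localhost', 'example.org');

# Called with a scheme alone, proxy returns the current setting.
print 'HTTP proxy: ', $ua->proxy('http'), "\n";
print 'FTP proxy:  ', $ua->proxy('ftp'), "\n";
```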
Further reading
- Gisle Aas, LWP (module documentation)
- Gisle Aas, LWP::UserAgent (module documentation)