Fetch the content of a given URL in Perl
Objective
To fetch the content located at a given URL in Perl
Methods
Overview
There are at least four different methods for fetching the content of a URL in Perl:
- using LWP::UserAgent (or one of its derivatives);
- using LWP::Simple;
- using WWW::Curl; or
- using IO::All.
Of these, LWP::UserAgent would be the author’s recommendation for general use. It supports a wide range of features, but those you do not need can mostly be ignored, and its API is not excessively complicated when handling simple cases.
Using LWP::UserAgent
Features of LWP::UserAgent include:
- support for the http, https, gopher, ftp, news, file and mailto URL schemes;
- HTTP authentication (including Basic, Digest and Negotiate);
- sending and accepting cookies (either stored in memory or written to disc);
- use of a proxy server; and
- access to inbound and outbound HTTP headers.
Useful variants of LWP::UserAgent include LWP::RobotUA (a user agent with built-in support for robots.txt) and WWW::Mechanize (for stateful navigation of a web site, with the ability to follow links and complete forms).
See Fetch the content of a given URL in Perl using LWP::UserAgent for further details.
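As a rough illustration of the simple case, the following sketch fetches a URL with LWP::UserAgent and dies if the request fails. The subroutine name fetch_url, the 30-second timeout and the command-line handling are choices made for this example rather than anything the module prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# fetch_url is a hypothetical helper name chosen for this example.
sub fetch_url {
    my ($url) = @_;

    # The timeout value is an arbitrary choice for illustration.
    my $ua = LWP::UserAgent->new(timeout => 30);

    my $response = $ua->get($url);
    die "Unable to fetch $url: " . $response->status_line . "\n"
        unless $response->is_success;

    # decoded_content undoes any Content-Encoding and charset encoding.
    return $response->decoded_content;
}

# Fetch the URL named on the command line, if one was given.
print fetch_url($ARGV[0]) if @ARGV;
```

Because the response object carries the status line and headers, a caller that later needs more detail about a failure can obtain it without switching to a different module.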
Using LWP::Simple
LWP::Simple provides a simplified interface to LWP::UserAgent. Unfortunately it is rather too simple for many purposes, and if you do hit one of its limitations then it is usually necessary to start again with a different module. For this reason LWP::Simple is probably best avoided when writing non-trivial programs, but for simple throw-away scripts its brevity may compensate for any shortcomings.
See Fetch the content of a given URL in Perl using LWP::Simple for further details.
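A minimal sketch of the LWP::Simple approach is shown below. The subroutine name fetch_simple is hypothetical. Note that get returns undef on failure without saying why, which is one example of the kind of limitation described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# fetch_simple is a hypothetical helper name chosen for this example.
sub fetch_simple {
    my ($url) = @_;

    # get returns the content, or undef if the fetch failed.
    my $content = get($url);
    die "Unable to fetch $url\n" unless defined $content;

    return $content;
}

# Fetch the URL named on the command line, if one was given.
print fetch_simple($ARGV[0]) if @ARGV;
```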
Using WWW::Curl
WWW::Curl provides a Perl binding to libcurl, a widely used file transfer library that can be used from many different programming languages. It presents two separate interfaces, WWW::Curl::Easy and WWW::Curl::Multi, but both are more complex to use than LWP::UserAgent.
The functionality provided by WWW::Curl is generally narrower but deeper than that of LWP::UserAgent. For example, it supports a wider range of URL schemes (21 according to the libcurl website), but provides nothing comparable to LWP::RobotUA or WWW::Mechanize.
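For comparison, a fetch through the WWW::Curl::Easy interface might look something like the sketch below, which follows the pattern in the module's own synopsis. The subroutine name fetch_with_curl and the decision to treat HTTP status codes of 400 and above as failures are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Curl::Easy;

# fetch_with_curl is a hypothetical helper name chosen for this example.
sub fetch_with_curl {
    my ($url) = @_;

    my $curl = WWW::Curl::Easy->new;
    $curl->setopt(CURLOPT_URL, $url);

    # Collect the response body in a scalar instead of printing it.
    my $body;
    $curl->setopt(CURLOPT_WRITEDATA, \$body);

    # perform returns 0 on success, or a libcurl error code.
    my $retcode = $curl->perform;
    die "Unable to fetch $url: " . $curl->strerror($retcode) . "\n"
        unless $retcode == 0;

    # A completed transfer can still carry an HTTP error status.
    # Treating 400 and above as fatal is an assumption of this sketch.
    my $status = $curl->getinfo(CURLINFO_HTTP_CODE);
    die "Unable to fetch $url: HTTP status $status\n" if $status >= 400;

    return defined $body ? $body : '';
}

# Fetch the URL named on the command line, if one was given.
print fetch_with_curl($ARGV[0]) if @ARGV;
```

The separate checks for transfer errors and HTTP status give some idea of why this interface is more involved than LWP::UserAgent for simple cases.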
Using IO::All
IO::All is a unifying framework for performing many different types of input/output through a common interface. In addition to files and URLs it can be used to interact with entities such as strings, sockets and processes. This is both a strength and a weakness.
For some types of program the ability to use URLs in the same manner as pathnames can be a useful convenience, and IO::All allows this to be provided without adding any significant complexity to a program. However, this functionality can be dangerous if the URL came from an untrusted source, so IO::All is usually best avoided in security-sensitive applications such as CGI scripts.
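The sketch below shows both the convenience and the risk. It assumes that the separately distributed IO::All::LWP extension is installed to provide URL support, and that slurp behaves for URLs as it does for files. The subroutine name read_source is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::All;

# read_source is a hypothetical helper name chosen for this example.
sub read_source {
    my ($name) = @_;

    # The same call is used whether $name is a pathname or a URL
    # (URL support is assumed to come from IO::All::LWP).
    # This is why an untrusted $name is dangerous: a caller who was
    # expected to supply a URL could supply a local pathname instead.
    return scalar io($name)->slurp;
}

# Read the pathname or URL named on the command line, if one was given.
print read_source($ARGV[0]) if @ARGV;
```

A program that accepts such names from outside would need to validate them itself, for example by checking the URL scheme before calling io.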
Tags: perl