bget

Package: WA2L/edrc 1.5.57
Section: User Contributed Perl Documentation (3)
Updated: 2006-06-07
 

NAME

bget - basic HTTP get tool  

DESCRIPTION

Basic tool to make HTTP GET requests and monitor the results. Unlike LWP GET, it does not require special Perl modules, and by virtue of being cruder makes HTTP headers easier to spy on.

Only URLs of the forms

     http://hostname/[localpart]
     http://hostname:port/[localpart]

are supported.
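
The two URL forms can be illustrated with a small parser. This is a sketch in Python rather than bget's own Perl; the helper name split_url, the default port of 80, and the tolerance for a missing trailing "/" are assumptions made for the example.

```python
import re

# Hypothetical helper: accepts only http://hostname[:port]/[localpart].
_URL_RE = re.compile(r"^http://([^/:]+)(?::(\d+))?(/.*)?$")

def split_url(url):
    """Return (host, port, path), or raise ValueError for unsupported forms."""
    m = _URL_RE.match(url)
    if m is None:
        raise ValueError("unsupported URL: " + url)
    host, port, path = m.group(1), m.group(2), m.group(3)
    # Assumption: port 80 when none is given, "/" when no localpart is given.
    return host, int(port) if port else 80, path or "/"
```

Anything else, such as an https:// URL, is rejected by this sketch.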

Options:

*
-a --autoname  

Save output automatically based on URI. Will not warn if the file already exists. This overrides the -o (--out) option. The preferred output name is everything after the last / in the URL, or 'dir-default' if the URL ends with a /.
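
The naming rule can be sketched in a few lines of Python. The function name autoname is hypothetical, and the sketch follows only the rule stated above; it does not claim to mirror bget's actual code.

```python
def autoname(url):
    """Everything after the last '/', or 'dir-default' when the URL ends with '/'."""
    name = url.rsplit("/", 1)[-1]
    return name if name else "dir-default"
```

Under this rule, http://example.com/docs/index.html is saved as index.html and http://example.com/docs/ as dir-default.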

*
-B --no-body   

Don't print the body of the response.

*
-b --browser NAME

What browser to emulate. Use --emulations to list available browser headers.

*
-c --cookie VALUE

Set the cookie header with VALUE.

*
-d --dontdechunk

As of version 1.2, when the response headers indicate a Transfer-Encoding of 'chunked', bget will rename the header (prefixing it with 'Xbget-') and unchunk the response. This is desirable so that chunked responses from HTTP/1.1 servers look right. In some cases, however, it may be desirable to see the raw output from the server, so this behavior can be turned off.
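
For readers unfamiliar with the chunked format, the following Python sketch shows what unchunking a body involves. It is an illustration of the HTTP/1.1 encoding, not bget's implementation; the function name dechunk is hypothetical, and trailers after the final chunk are simply ignored.

```python
def dechunk(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body; trailers after the last chunk are ignored."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        # The size line is hexadecimal and may carry ';extension' parameters.
        size = int(body[pos:eol].split(b";", 1)[0].strip(), 16)
        if size == 0:
            break
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends the chunk data
    return bytes(out)
```

With -d (--dontdechunk) the size lines and CRLF separators that this sketch strips would remain visible in the output.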

*
-e --head

Make a HEAD request instead of a GET. Note that this does not imply -h (--heads) to print the headers, nor -B (--no-body) to suppress printing any body content. (Some servers, e.g. www.yahoo.com, treat HEAD like a GET.)

*
-F --file FILE

Read URLs from FILE (one per line) instead of from command line. Use filename "-" for standard input.

If there are two URLs on a line, the first one is used as the referer URL. The referer will remain in use until the next line with two URLs.

If there is an additional field after the URL, that will be used as an -o (--out) output file until the next line with an output file. An output file should not begin with ``http:/'' or ``https:/''.

Fields on each line of the URL file are whitespace separated.
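
One way to read the URL file rules above is sketched below in Python. The function name parse_url_file is hypothetical, and two details are assumptions: fields are classified as URLs purely by their "http:/" or "https:/" prefix, and lines without any URL are skipped.

```python
def parse_url_file(lines):
    """Yield (url, referer, outfile) for each line that names a URL."""
    referer = None
    outfile = None
    for line in lines:
        fields = line.split()  # fields are whitespace separated
        urls = [f for f in fields if f.startswith(("http:/", "https:/"))]
        others = [f for f in fields if not f.startswith(("http:/", "https:/"))]
        if not urls:
            continue  # assumption: blank or URL-less lines are ignored
        if len(urls) >= 2:
            referer = urls[0]   # stays in use until the next two-URL line
            target = urls[1]
        else:
            target = urls[0]
        if others:
            outfile = others[0]  # stays in use until the next line naming a file
        yield target, referer, outfile
```

A line such as "http://a.example/ http://a.example/page1 page1.html" would then fetch page1 with http://a.example/ as the referer and save it to page1.html.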

*
-f --follow

Follow redirects. If printing headers, the redirecting headers and the destination headers will be printed. (No loop detection is attempted.) If printing bodies and not saving via autoname, the redirecting body and the destination body will be printed. If saving via autoname, a new file will be opened for each request made. Some redirects (e.g. loops) may cause the autonaming to pick the same filename as a previous request, which will cause the earlier file to be clobbered.

*
-H --host HOST[:P]

Connect to HOST for request (useful for testing virtual hosts before a DNS change or use with -l for proxies).

*
-h --heads     

Print the response headers.

*
-L --language LANG

Use LANG for Accept-Language: header. See --languages for a small list.

*
-l --long      

Use the long address on the GET line (the full http://... format). HTTP/1.1 servers MUST accept this form to be compliant, and it is handy with -H for proxies.
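
The difference between the short and long forms of the request line can be sketched as follows. This is an illustrative Python fragment; the function name request_line and the fixed "HTTP/1.1" version string are assumptions, since the actual version sent may depend on the browser emulation in use.

```python
def request_line(url, long_form=False, method="GET"):
    """Build the first line of a request in short or long (absolute URL) form."""
    rest = url[len("http://"):]
    hostport, _, path = rest.partition("/")
    # Short form sends only the local part; long form sends the whole URL.
    target = url if long_form else "/" + path
    return "%s %s HTTP/1.1" % (method, target)  # assumption: HTTP/1.1
```

For http://www.example.com/index.html the short form is "GET /index.html HTTP/1.1" and the long form is "GET http://www.example.com/index.html HTTP/1.1". The long form is what a proxy named with -H needs in order to know which origin server to contact.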

*
-o --out FILE

Write output to FILE. Unlike -a (--autoname) this will not use a different file for each request. The autoname option has precedence over this option. Filenames in a -F (--file) URL file will also override this.

*
-p --post STRING

Use STRING as the contents of a POST form (forms of type application/x-www-form-urlencoded only).
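
A STRING in application/x-www-form-urlencoded format can be produced with Python's standard library, as in this sketch. The field names and the helper name form_body are made up for the example.

```python
from urllib.parse import urlencode

def form_body(fields):
    """Encode a dict of form fields as application/x-www-form-urlencoded."""
    return urlencode(fields)

# Hypothetical form fields, for illustration only.
body = form_body({"name": "Eli the Bearded", "q": "a&b=c"})
print(body)
```

The printed string is suitable as the argument to -p; remember to quote it for the shell, since "&" is a shell metacharacter.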

*
-P --filepost FILE

Use the contents of FILE as the contents of a POST form. If the first line is of the form ``Content-Type: foo/bar'' it will be used to set the Content-Type: header. More than just the MIME type is allowed, but it must be all on one line. Typical POST content types are:

   application/x-www-form-urlencoded

      Encoded like a typical CGI URL.

   multipart/form-data

      Each form element is in a separate MIME part; needed for
      file uploads. This type requires a boundary parameter on
      the Content-Type: header.

There is a similar allowance for setting the Transfer-Encoding: header. This must be on the second line if Content-Type: is set, and on the first line if not. When Transfer-Encoding: contains the string 'chunked', Content-Length: will not be set for a post. Note that Apache 1.3.x (at least) does not allow chunked POST requests.

You may find the tool "mkpost" (available in the scripts section of CPAN) to be helpful in creating CGI interface files for this option. For other content types, like ``text/xml'' for XML-RPC interface requests, other tools will be needed.
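
The layout of a -P file (optional Content-Type: line, then optional Transfer-Encoding: line, then the body) can be sketched as a parser. This Python fragment is an illustration only; the function names are hypothetical, and the case-insensitive header matching and the removal of the header lines from the body are assumptions.

```python
def split_post_file(text):
    """Return (content_type, transfer_encoding, body) from a post file's text."""
    content_type = None
    transfer_encoding = None
    lines = text.split("\n")
    if lines and lines[0].lower().startswith("content-type:"):
        content_type = lines.pop(0).split(":", 1)[1].strip()
    # Transfer-Encoding: is line two if Content-Type: was given, else line one.
    if lines and lines[0].lower().startswith("transfer-encoding:"):
        transfer_encoding = lines.pop(0).split(":", 1)[1].strip()
    return content_type, transfer_encoding, "\n".join(lines)

def post_headers(content_type, transfer_encoding, body):
    """List the entity headers implied by the parsed file."""
    headers = []
    if content_type:
        headers.append(("Content-Type", content_type))
    if transfer_encoding:
        headers.append(("Transfer-Encoding", transfer_encoding))
    # Content-Length: is omitted when the transfer encoding is chunked.
    if not (transfer_encoding and "chunked" in transfer_encoding):
        headers.append(("Content-Length", str(len(body.encode("utf-8")))))
    return headers
```

A file whose first line is not one of the two recognized headers is treated entirely as body by this sketch.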

*
-R --refer VALUE

Set the (initial) referer header with VALUE.

*
-r --request   

Print the request headers.

*
-s --status CODE     

After fetching a page --- including following redirects and printing bits of the response as controlled by other options --- if the HTTP status code is not exactly the one given, bget will exit (returning code 3 to the shell). Useful for looping until one hits a 404 or the like.

*
-t --time N        

Use the Benchmark module to time making the command line request(s) N times.

*
-C --count N

Just like -t/--time, but optimizations apply: if neither heads nor bodies are requested, nothing will be fetched. If body is not requested only heads will be fetched.

*
-u --user USER:PW

Basic authentication in the form {username}:{password}.
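
HTTP Basic authentication sends the USER:PW string base64-encoded in an Authorization: header. The Python sketch below shows the encoding; the function name basic_auth_header is hypothetical and the credentials are the well-known example pair from the HTTP specifications.

```python
import base64

def basic_auth_header(userpw):
    """Build the Authorization: header for a 'username:password' string."""
    token = base64.b64encode(userpw.encode("utf-8")).decode("ascii")
    return "Authorization: Basic " + token

print(basic_auth_header("Aladdin:open sesame"))
```

Because base64 is an encoding and not encryption, the password is effectively sent in the clear and will show up when printing request headers with -r.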

*
-w --wait N        

Wait N seconds between fetching each URL.

*
-w --wait A,D      

Wait a random number of seconds, with average A and standard deviation D, between fetching each URL. Requires the Math::Random module. Useful for being subtle when fetching a lot of pages, along with emulating a browser and using per-page referer headers via the -F (--file) method.
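
The two forms of the -w argument can be sketched as follows. This Python fragment assumes a normal (Gaussian) distribution and clamps negative draws to zero; neither detail is stated above, and the function names are hypothetical.

```python
import random

def parse_wait(arg):
    """Parse 'N' or 'A,D' into (mean, stddev)."""
    if "," in arg:
        mean, stddev = arg.split(",", 1)
        return float(mean), float(stddev)
    return float(arg), 0.0

def next_delay(mean, stddev, rng=random):
    """Seconds to sleep before the next fetch."""
    if stddev == 0:
        return mean
    # Assumption: normally distributed delay, never negative.
    return max(0.0, rng.gauss(mean, stddev))
```

With "-w 10,3" most delays under this sketch would fall between roughly 4 and 16 seconds, which looks less mechanical to a server than a fixed interval.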

*
--help

Show a help message and exit.

*
--version

Print version and exit.

*
--emulations

Print list of available browser emulations.

*
--languages

Print a sample of language codes.

 

Note

If -H (--host) is used with multiple URLs, all connections are made to the specified HOST (and port) even if different hosts are used in the URLs. This can be used to fetch files through an HTTP proxy if -l (--long) is also used.

With -L (--language) the Accept-Language: header will not be added if the browser has not been observed to use it.

EMULATIONS

The following browsers are recognized for header emulation. This might not be the definitive list. Check --emulations for that. Some have comments to help identify them.
*
Amaya-8.1

Amaya is the W3C's combination browser page editor.

*
links-0.84

Text mode browser for Unix. <http://artax.karlin.mff.cuni.cz/~mikulas/links> Version 0.84 does not do cookies or referer headers, so we might misemulate it that way.

*
elinks-0.5pre4-linux

Forked from links, this is another text mode browser. Quirks include giving away a good deal about the system, including window size, in the User-Agent: and including a 'Referer' header in URLs entered by hand. The User-Agent for this is from a Red Hat 7.1 x86 system in an 80x24 window. <http://elinks.or.cz/>

*
w3c-5.2.8

Command line web tool that uses libwww. <http://www.w3.org/ComLine/>

*
w3m-beta99

Text mode browser for Unix. <http://ei5nazha.yz.yamagata-u.ac.jp/~aito/w3m/>

*
Dillo-0.8.4

Dillo is a Linux browser, under current development, that focuses on speed, small size, and protocol correctness. It can do cookies, but it defaults to not accepting them. It does not do Referer: headers, but bget may misemulate it on that point. <http://www.dillo.org/>

*
Linux-Mosaic-2.6

The browser that started the rush, compiled for Linux. This is an archaic browser. It doesn't do Host: headers or Cookies:. bget can misemulate the Cookies: part, but won't do the Host: header. Many modern sites require this for proper operation, so expect problems. The headers this thing spits out are longer even than the Lynx ones.

*
Qweb-1.3

Qweb was an early X11 style-sheet capable browser. Too bad it didn't do JavaScript (needed for some stylesheets) or even Host: headers. bget will misemulate this if you use Cookies, but won't supply a Host: header.

*
X11-Chimera-1.70

The name 'Chimera' has been used by two different browsers. This is the X11 Chimera developed at the University of Nevada, Las Vegas, not the Mac Mozilla derivative Chimera. In authentic use this browser does not have cookies or use Referer: headers.

*
ApacheBench-1.3

ab, the benchmark tool that comes with the Apache httpd package.

*
Opera-3.60

An old version of a popular alternative browser for Windows.

*
Windows-Opera-7beta

More modern (2003) version of Opera.

*
Linux-Opera-6.11

As of Opera 6.x there is a Linux version.

*
lwp-request-1.38

Lib WWW Perl module (these are the default headers).

*
wget-1.6

Command-line bulk page downloading tool for Unix.

*
NetBSD-curl-7.10.4-HTTP1.1

Command-line page upload/download tool for Unix. Prefers HTTP/1.1 but can do HTTP/1.0 upon request. Can do PUTs and DELETEs and other obscure things, too.

*
NetBSD-curl-7.10.4-HTTP1.0

Curl in HTTP/1.0 mode.

*
iCab-pre1.7

Popular alternative browser for Macs.

*
junkbuster-2

Once popular ad- and cookie-filtering proxy. Junkbuster does a bunch of header editing from the actual browser headers, and thus the headers out of it can vary considerably from this. It looks like Accept-* headers are not edited, allowing identification of the underlying browser sometimes. The Accept-* headers here come from a Netscape 4.7. By default, Junkbuster masquerades as Netscape 3.01 (GOLD) for Mac PPC.

*
Lynx-2.8.1

Popular text mode browser, predominantly on Unix.

*
Linux-Mozilla-1.0.0

Mozilla is the open source version of Netscape 7. It exists for many platforms.

*
Linux-Phoenix-0.6-beta

Phoenix (later renamed Firebird) is Mozilla with a different user-interface library. There are Unix, Windows and Mac variants.

*
Konqueror-2.1.1

Konqueror is a mostly-Linux browser based on KDE.

*
OpenOffice-1.0.0

OpenOffice is a StarOffice relation, intended to be a free Unix ``Office'' compatible software bundle. It includes an HTML editor that can download pages to edit, but as such it does things like issue PROPFIND requests that are not emulated here.

*
WindowsNT-Explorer-5.0-as-4.0

Explorer 5.0 can be installed with a compatibility mode that emulates (or claims to emulate) Explorer 4.0.

*
Windows98-Explorer-5.5
*
WindowsNT-ActiveDesktop

This is on a system with IE5.5 installed, but this identifies itself as IE4.01. This one is hard to do right, since in my tests I saw two requests for the test file. The first came with this UA, the second had this instead:

User-Agent: Mozilla/4.0 (compatible; MSIE 4.01; MSIECrawler; Windows NT)

The crawler version had an 'Accept-Language: us-en' as well as a different order to the headers (Accept:, User-Agent:, Accept-Language:, Accept-Encoding:, Host:).

*
WindowsNT-Netscape6
*
WindowsNT-Explorer-5.5
*
Windows98-Explorer-4.0
*
WindowsNT-Explorer-5.0

Normal mode Windows NT IE 5.0.

*
WindowsNT-ExplorerOffline-5.0

IE can optionally crawl pages to cache them for offline browsing. This is Windows NT IE 5.01 in crawl mode.

*
WindowsNT-Netscape-4.6
*
MacPPC-Explorer-4.0

Mac PPC is System 7, 8 or 9 on PowerPC computers.

*
MacPPC-Netscape-4.0
*
MacPPC-Netscape-4.6
*
MacOSX-Safari-1.2.4

Safari is a KHTML (Konqueror) derivative that ships with OS X.

*
MacOSX-Explorer-5.2

Internet Explorer for OS X. (Comes with OS X?)

*
Linux-Netscape-3.0
*
Linux-Netscape-4.51
 

LANGUAGES

In HTTP, languages are identified by the ISO 639 two-letter code, optionally followed by a two-letter country code for national variants. Generic English is 'en', American English is 'en-us', Irish English is 'en-ie', and Australian English is 'en-au'.

Some other languages:

  af    Afrikaans
  sq    Albanian
  eu    Basque
  bg    Bulgarian
  be    Byelorussian
  ca    Catalan
  zh    Chinese
  zh-cn Chinese/China
  zh-tw Chinese/Taiwan
  hr    Croatian
  cs    Czech
  da    Danish
  nl    Dutch
  nl-be Dutch/Belgium
  fo    Faeroese
  fi    Finnish
  fr    French
  fr-be French/Belgium
  fr-ca French/Canada
  fr-fr French/France
  fr-ch French/Switzerland
  gl    Galician
  de    German
  de-at German/Austria
  de-de German/Germany
  de-ch German/Switzerland
  el    Greek
  hu    Hungarian
  is    Icelandic
  id    Indonesian
  ga    Irish
  it    Italian
  ja    Japanese
  ko    Korean
  mk    Macedonian
  no    Norwegian
  pl    Polish
  pt    Portuguese
  pt-br Portuguese/Brazil
  ro    Romanian
  ru    Russian
  gd    Scots Gaelic
  sr    Serbian
  sk    Slovak
  sl    Slovenian
  es    Spanish
  es-ar Spanish/Argentina
  es-co Spanish/Colombia
  es-mx Spanish/Mexico
  es-es Spanish/Spain
  sv    Swedish
  tr    Turkish
  uk    Ukrainian

This list is from the default set of languages in Netscape 4.5. IE has a different set, including more country variations. Note that the country variations are frequently misused. A request with a language header like:

        Accept-Language: en-us, es-mx; q=0.7, fr-ca; q=0.3

would specify a first choice language of US English, second choice Mexican Spanish, third choice Canadian French. If a content-negotiating server only has generic English, generic Spanish, and generic French, then by specification it should return a ``406 Not Acceptable'' error, since it has no languages that match. This could be seen as a deficiency of the spec, but that's the way it is.
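
The matching rule behind this behavior can be sketched in Python. Under RFC 2616, a language range in the request matches a language tag on the server if the two are equal or if the range is a prefix of the tag followed by "-"; so 'en' matches 'en-us', but 'en-us' does not match 'en'. The function names below are hypothetical, and the sketch is a simplified negotiator, not any particular server's algorithm.

```python
def parse_accept_language(value):
    """Return [(range, q)] sorted by descending q (ties keep header order)."""
    ranges = []
    for item in value.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # default quality when no q= parameter is given
        for p in parts[1:]:
            if p.startswith("q="):
                q = float(p[2:])
        ranges.append((parts[0].lower(), q))
    return sorted(ranges, key=lambda r: -r[1])

def range_matches(lang_range, tag):
    """Equal, or the range is a prefix of the tag followed by '-'."""
    lang_range, tag = lang_range.lower(), tag.lower()
    return lang_range == "*" or tag == lang_range or tag.startswith(lang_range + "-")

def pick_language(header, available):
    """Return the first available tag matched by the best range, else None."""
    for lang_range, q in parse_accept_language(header):
        if q <= 0:
            continue
        for tag in available:
            if range_matches(lang_range, tag):
                return tag
    return None  # a strict server would answer 406 Not Acceptable
```

With the header shown above and a server offering only 'en', 'es' and 'fr', pick_language returns None. Sending the generic codes instead ("en, es; q=0.7, fr; q=0.3") would match both generic and country-specific variants.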

REVISION HISTORY

NEW IN VERSION 1.2

By supporting chunked transfer encodings, the author considers bget to be HTTP/1.1 compliant now. A word of warning: some emulations specify various other allowed encodings, like gzipped content. You should be prepared to deal with these outside of bget.

SEE ALSO

"mkpost" --- build bodies for HTTP CGI POST requests  

COPYRIGHT

Copyright 1999-2005 by Eli the Bearded / Benjamin Elijah Griffin. Released under the same license(s) as Perl.  

AUTHOR

Eli the Bearded originally wrote this to spy on headers and to have a low-CPU-impact way to fetch files over HTTP. It evolved from there.

CPAN INFO

 

SCRIPT CATEGORIES

Web  

README

bget - basic HTTP get tool  

PREREQUISITES

This uses the "strict", "vars", "Socket", and "Carp" modules.  

COREQUISITES

This will try to use the "Benchmark" and "Math::Random" modules when run with certain options.  

OSNAMES

Should not be OS dependent. The autoname feature (-a / --autoname) assumes that "/" separates directories; however, this should have minimal impact since it always tries to save in the current directory. Problems will likely only ensue if the automatically chosen name contains a directory separator for the current OS.


 

