Table of Contents

  1. Introduction
  2. Installing wget
  3. Basic Usage
  4. Advanced Options
  5. Scripting with wget
  6. Common Pitfalls & Troubleshooting
  7. wget vs. curl: When to Use Which?
  8. Real‑World Use Cases
  9. Security Considerations
    10 Conclusion
    11 Resources

Introduction

wget—short for World Wide Web GET—is a powerful, non‑interactive command‑line utility designed to retrieve files from the Internet using HTTP, HTTPS, and FTP protocols. Since its first release in 1996 as part of the GNU Project, wget has become a staple in the toolbox of system administrators, developers, DevOps engineers, and hobbyist power users alike.

Why does wget remain relevant in an era dominated by graphical download managers and sophisticated APIs? The answer lies in its simplicity, robustness, and automation‑friendly design:

  • Non‑interactive: Once launched, wget can run unattended, making it ideal for cron jobs, CI pipelines, and remote servers without a graphical interface.
  • Recursive capabilities: It can mirror entire websites, follow links, and rebuild directory structures automatically.
  • Fault tolerance: Built‑in retry mechanisms, resume support, and intelligent handling of network hiccups keep large downloads reliable.
  • Portability: Available on virtually every Unix‑like system, Windows (via Cygwin, WSL, or native ports), and even embedded devices.

This guide dives deep into wget’s feature set, from installation through advanced scripting, and equips you with practical examples you can copy‑paste into your own workflows.


Installing wget

Most Linux distributions ship wget by default, but if you need to install or upgrade it, follow the instructions for your platform.

Debian / Ubuntu

sudo apt-get update
sudo apt-get install wget

Fedora / CentOS / RHEL

# Fedora
sudo dnf install wget

# CentOS / RHEL 7
sudo yum install wget

macOS (Homebrew)

brew install wget

Windows

  • WSL (Windows Subsystem for Linux) – Install any Linux distribution and then follow the Linux steps.
  • Chocolateychoco install wget
  • Cygwin – Include the wget package during setup.

Verify the installation:

wget --version

You should see output similar to:

GNU Wget 1.21.3 built on linux-gnu.
...

Basic Usage

At its core, wget takes a URL and writes the retrieved content to a file in the current directory.

wget https://example.com/file.zip

Common Flags

FlagDescription
-O <file>Write output to file instead of the default filename.
-qQuiet mode (no output).
-nvNon‑verbose output (status messages only).
-cContinue a partially downloaded file.
-t <n>Set number of retries (default 20).
-T <seconds>Set network timeout.
-U <agent>Specify a custom User‑Agent string.

Example – Download with a custom filename and silent mode:

wget -q -O latest-release.tar.gz https://example.org/releases/v2.5.0.tar.gz

Advanced Options

Recursive Downloads & Mirroring

One of wget’s hallmark features is its ability to recursively traverse links and download entire directory trees or whole websites.

wget --recursive --no-parent https://example.com/docs/
  • --recursive (-r) tells wget to follow links.
  • --no-parent prevents climbing to parent directories.
  • --level=<n> limits recursion depth (default: infinite).

Mirroring a Site

The --mirror shortcut combines several flags to produce a faithful local copy:

wget --mirror \
     --convert-links \
     --adjust-extension \
     --page-requisites \
     --no-parent \
     https://example.com/blog/
  • --convert-links rewrites links to point to local files.
  • --adjust-extension adds appropriate extensions (e.g., .html).
  • --page-requisites grabs CSS, images, scripts needed for proper rendering.

Timestamping & Conditional Requests

Avoid re‑downloading unchanged files with -N (or --timestamping):

wget -N https://example.com/data/dataset.csv

wget adds an If-Modified-Since header; the server replies with 304 Not Modified if the file hasn’t changed, saving bandwidth.

Bandwidth Limiting

When sharing a network connection with other services, throttle wget using --limit-rate:

wget --limit-rate=500k https://largefile.com/backup.iso

Units can be k, m, or g (kilobytes, megabytes, gigabytes per second).

Authentication & Cookies

HTTP Basic/Digest Authentication

wget --user=alice --password=secret https://secure.example.com/report.pdf

For security, avoid plain‑text passwords on the command line; instead, use a .wgetrc file with appropriate permissions:

user = alice
password = secret

Cookies

If a site requires login via a cookie, you can export the cookie from a browser (e.g., using the “Export Cookies” extension) and supply it:

wget --load-cookies cookies.txt https://example.com/private/data.json

Proxy Support

Set environment variables or use wget options:

export http_proxy="http://proxy.example.com:8080"
export https_proxy="http://proxy.example.com:8080"
wget https://internal.example.com/resource

Or directly:

wget -e use_proxy=yes -e http_proxy=http://proxy.example.com:8080 \
     https://internal.example.com/resource

HTTPS, FTP, and Other Protocols

wget supports:

  • HTTPS – Encrypted downloads; use --secure-protocol=auto (default) or --secure-protocol=TLSv1_2 for strict TLS versions.

  • FTP – Classic file transfer; example:

    wget ftp://ftp.example.org/pub/software.tar.gz
    
  • SFTP – Not native to wget (requires curl or ssh), but you can invoke wget via a wrapper script if needed.

Resuming Interrupted Downloads

Large downloads are prone to interruption. Use -c (continue) to resume:

wget -c https://mirror.example.com/large.iso

If the remote server does not support byte ranges, wget will restart from the beginning.

Robots.txt and Ethical Scraping

By default, wget obeys the robots.txt policy of a site. Override with:

wget --no-robots https://example.com/

Caution: Ignoring robots.txt may violate a site’s terms of service and can lead to IP bans. Always respect crawling policies, and consider using a delay:

wget --wait=2 --random-wait --limit-rate=200k \
     --recursive https://example.com/data/

Output Control & Logging

Redirect wget’s log output to a file for later analysis:

wget -o download.log -a append.log https://example.com/file.zip
  • -o writes a new log file.
  • -a appends to an existing log (useful for batch jobs).

You can also suppress the progress bar while keeping status messages:

wget --quiet --show-progress URL

Scripting with wget

wget shines in automation. Below are common patterns for inclusion in shell scripts, cron jobs, or CI pipelines.

Example 1: Daily Backup of Remote Assets

#!/usr/bin/env bash
# backup.sh – pull remote assets nightly

BASE_URL="https://assets.example.com/releases"
DEST="/var/backups/assets"
TODAY=$(date +%F)

mkdir -p "$DEST/$TODAY"

wget --quiet \
     --recursive \
     --no-parent \
     --timestamping \
     --directory-prefix="$DEST/$TODAY" \
     "$BASE_URL"

# Log rotation
find "$DEST" -maxdepth 1 -type d -mtime +30 -exec rm -rf {} \;

Add to cron (crontab -e):

0 2 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1

Example 2: Parallel Downloads with GNU Parallel

wget itself is single‑threaded per URL, but you can parallelize multiple downloads:

cat urls.txt | parallel -j 4 wget -c -nv {}

Example 3: Conditional Download Based on HTTP Status

#!/usr/bin/env bash
URL="https://example.com/report.pdf"
TMPFILE=$(mktemp)

# Perform a HEAD request first
if wget --spider -S "$URL" 2>&1 | grep -q "200 OK"; then
    wget -O report.pdf "$URL"
else
    echo "File not available (status != 200)" >&2
fi

rm -f "$TMPFILE"

Common Pitfalls & Troubleshooting

SymptomLikely CauseFix
wget: unable to resolve host addressDNS misconfiguration or no internet connectivity.Verify /etc/resolv.conf, test with dig or ping.
ERROR 403: ForbiddenServer blocks non‑browser User‑Agents.Use --user-agent="Mozilla/5.0" or proper authentication.
ERROR 404: Not Found after a recursive crawlLink was relative and wget rewrote it incorrectly.Use --adjust-extension and --convert-links; check --no-parent.
Downloads restart from the beginning despite -cRemote server doesn’t support byte‑range requests.Use --continue only on servers that advertise Accept-Ranges: bytes.
Excessive bandwidth usageMissing --limit-rate or --wait.Add throttling options, or schedule during off‑peak hours.
SSL/TLS handshake failuresOut‑dated wget version or missing CA certificates.Update wget or install ca-certificates package.

Debugging tip: Increase verbosity with -d (debug) to see the exact HTTP exchange.

wget -d https://example.com/file.tar.gz

wget vs. curl: When to Use Which?

Both tools are ubiquitous, but they excel in different scenarios.

Featurewgetcurl
Recursive downloading / mirroring✅ (native)❌ (requires external scripts)
Resume support
POST/PUT with complex payloads❌ (limited)✅ (full HTTP method support)
HTTPS with client certificates✅ (limited)✅ (robust)
Downloading to stdout❌ (requires -O -)✅ (default)
Progress barSimple text barAdvanced, configurable bar
Scripting convenienceGood for simple GET/recursive tasksBetter for API interactions

Rule of thumb: Use wget for straightforward file retrieval, bulk downloads, or site mirroring. Reach for curl when you need fine‑grained HTTP control, custom headers, or to interact with REST APIs.


Real‑World Use Cases

  1. Automated Dataset Collection
    Researchers often need to pull large CSV or image collections from public repositories. wget’s --timestamping ensures only new files are downloaded each night.

  2. Continuous Integration Artifact Retrieval
    CI pipelines can fetch pre‑built binaries from an internal Nexus repository:

    wget --auth-no-challenge --user=ci --password=$CI_PASS \
         https://nexus.example.com/repository/releases/app-1.2.3.tar.gz
    
  3. Offline Documentation Mirroring
    Companies ship internal documentation to air‑gapped environments by mirroring a Confluence space:

    wget --mirror --convert-links --adjust-extension \
         --page-requisites --no-parent \
         https://docs.internal.company.com/
    
  4. Backup of Remote Log Files
    System administrators can rotate remote log archives via a cron job that pulls .log.gz files from a central logging server.

  5. IoT Firmware Updates
    Embedded Linux devices with limited UI can invoke wget to download new firmware images over HTTPS, then flash them automatically.


Security Considerations

While wget is reliable, misuse can expose systems to risk:

  • Untrusted URLs – Downloading executables from unknown sources may lead to malware. Always verify checksums (sha256sum) after download.
  • Plain‑text credentials – Avoid passing passwords on the command line; use .netrc or environment variables with restricted permissions.
  • TLS verification – By default, wget validates server certificates. Disabling verification (--no-check-certificate) defeats this protection and should be reserved for testing.
  • Directory traversal – When mirroring, ensure --no-parent and proper --reject patterns to avoid unintentionally downloading sensitive files.
  • Rate limiting – Excessive crawling may trigger DDoS protection on target sites. Respect robots.txt and use --wait/--random-wait.

Conclusion

wget remains a cornerstone of command‑line networking, offering a blend of simplicity, power, and reliability that few modern tools can match. Whether you’re backing up a critical dataset, mirroring an entire website for offline access, or automating nightly builds, mastering wget unlocks a world of efficient, scriptable file retrieval.

Key takeaways:

  • Install the latest version for TLS and security improvements.
  • Leverage recursive and mirroring options for large‑scale downloads.
  • Use -c, -N, and --limit-rate to make downloads resilient and network‑friendly.
  • Combine wget with shell scripting, cron, or GNU Parallel for robust automation pipelines.
  • Always respect ethical scraping practices and secure your credentials.

Armed with the concepts and examples in this guide, you’re ready to harness wget in both everyday tasks and complex production workflows.


Resources

  1. GNU Wget Manual – Comprehensive official documentation.
    https://www.gnu.org/software/wget/manual/

  2. Linux man page for wget – Quick reference for all command‑line options.
    https://man7.org/linux/man-pages/man1/wget.1.html

  3. Wget Tips & Tricks (DigitalOcean) – Practical recipes for common scenarios.
    https://www.digitalocean.com/community/tutorials/how-to-use-wget-to-download-files-from-the-internet

  4. Understanding robots.txt (Google Developers) – Guidance on ethical crawling.
    https://developers.google.com/search/docs/advanced/robots/intro

  5. Comparing wget and curl (Stack Overflow) – Community insights on when to use each tool.
    https://stackoverflow.com/questions/10261931/difference-between-wget-and-curl