How to Archive a Website: Our Mammoth Guide to Saving Your Site
Maintaining your website involves having a dedicated backup strategy. While backups are essential, they’re not the only way to preserve your site. Archiving a website is a natural extension of backing it up, and the two processes complement each other.
There are several flexible ways to archive a website. The great news is that most of them are user-friendly and accessible. You merely have to pick the right solution for your needs and requirements.
In this post, we’ll take a look at how to archive a website. We’ll also explore different archiving types you’ll come across, round up a few of the most prominent site archiving tools, and discuss some tips for archiving your site.
An Introduction to Website Archiving
Archiving a website means preserving the content, data, and media for future reference. Using a dedicated service such as the Wayback Machine (though we’ll get onto other solutions later), you can view older versions of a website.
On a technical level, crawlers take snapshots of a website, and together those snapshots constitute the archive itself. You’re able to access it using a simple calendar and view each iteration in a timeline format if you wish.
As for why solutions such as the Wayback Machine exist, we have to go back to the early 2000s. The dot-com bubble had all but burst, and many businesses were going under. Some popular websites were shut down or abandoned, with few memories left behind.
Much like other media formats before the internet, such as music and television, these websites held historical and nostalgic value. To save them meant to give future internet users a glimpse at how far we’d come from earlier technology.
The Internet Archive launched the Wayback Machine to help preserve websites. If a site has been archived there, you can see how it has evolved over the years.
Many crawlers are required for archiving a website, including huge individual crawls that could take years to complete. The grunt needed to carry out crawling “expeditions” and store the resultant snapshots is immense.
However, not everyone is comfortable with the work the Internet Archive is doing. There have been several discussions and legal challenges over whether archiving a website infringes copyright.
Still, given the considerable growth in the number of archives stored, there’s a clear desire to preserve websites.
Why You’d Want to Archive a Website
There are plenty of reasons to archive a website beyond simple nostalgia. For a real-world analogy, look at GitHub.
GitHub stores a project’s repositories, along with every “commit” made. To compare this to internet archiving, the repositories represent the whole archive, and the commits are the snapshots.
In the same way that Git repositories are valuable, so is an archive. For example, you can look at previous iterations of your site—even from many years ago—to influence your current design choices.
Also, you may be legally obligated to archive your site, especially if you’re in the financial or legal industries.
Finally, if you’re unfortunate enough to be involved in litigation surrounding your site, your archives will be valuable evidence. If you can present clear and complete site archives, you may be able to resolve disputes before the courts get involved.
The Difference Between Backups and Archiving
Before we talk about the different types of web archiving available, it’s worth coming back to a topic we touched on earlier. On paper, a site backup and website archive appear similar. However, they perform different jobs that complement each other. In a nutshell:
- Backups are data-based. They’re more concerned with preserving the data of your site. Given that backups are vital if you need to restore your site, having a complete backup of your data is paramount.
- Archives preserve context over data. If you trawl through your favorite website’s archive, you’ll notice that the functionality is often patchy. However, the site’s design and static content are usually intact.
It’s worth noting that archiving doesn’t look to eschew data preservation efforts altogether. Indeed, one of the benefits is letting users navigate your site as if it were live. Even so, given that sites such as the Wayback Machine exist as a virtual “memory lane,” keeping the visuals intact takes higher priority than preserving backend functionality.
In short, you’ll want to use both backups and archives for your site—the former as daily protection in case the worst happens, and the latter as an additional way to help document the evolution of your site.
The Different Types of Web Archiving You’ll Encounter
Web archiving doesn’t just come in one flavor. There are a few different types you’ll come across. Here’s a breakdown of each:
- Client-side: It involves the end-user saving a version of the website in question. It’s simple, scalable, and lets you archive a website with no fuss.
- Server-side: The approach of the Wayback Machine and others is classed as server-side archiving. It uses crawlers and other technology to archive a website, but it also requires a level of consent not found in client-side archiving.
- Transaction-based: While this is still based on server-side archiving, it’s more complex and requires explicit consent from the site owner. Essentially, it archives the site transactions between the end-user and server.
For simple websites with static data, coupled with an organized archiving strategy, client-side archiving should fit the bill. However, most other sites will favor server-side archives—transaction-based archiving isn’t necessary for most websites.
Finally—and we’ll discuss this in more detail throughout the post—you’ll also want to consider where and how your archives are stored. For example, a local archive isn’t a poor choice, but you could see it disappear if you have a computer failure. On the flip side, you have less control over what’s archived if you opt for a third-party solution.
As you’d expect, the answer here is to use a multi-faceted approach to archive a website. We suggest treating archives like backups: keep three copies in separate locations, and make sure they stay in sync.
You may want to make one of the archives live, too, so that you can take advantage of any server-side functionality on your site. The result is a website with a robust backup and archive strategy that remains useful to others.
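One way to check that several copies of an archive are still in sync is to compare their checksums. The following is a minimal sketch rather than a complete sync tool: the function names and any file paths are hypothetical, and it assumes each copy is a single file (such as a packaged archive) that you can read locally.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_in_sync(paths):
    """Return True when every copy in `paths` has identical content."""
    digests = {file_digest(path) for path in paths}
    return len(digests) == 1
```

You could run copies_in_sync() over the local, external-drive, and cloud-synced copies of an archive after each snapshot, and investigate whenever it returns False.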
A Beginners’ Guide to Internet Archive Tools and Sites
There is a plethora of solutions available to archive a website. We’ll run down a few of the more popular ones, along with our opinion on how each might suit you.
1. Wayback Machine
First off, let’s discuss the Wayback Machine. It was the first of its kind, so it set the benchmark for other archiving tools.
As such, it’s likely going to be the first place you head when looking to archive a website. It has many ways to create and upload archives, and even a dedicated API to hook into its functionality. It’s worth noting it’s a server-side archive solution too.
That said, due to how it crawls and archives websites, the Wayback Machine might not be able to preserve all of your site’s functionality. Nevertheless, it’s considered the industry standard for web archivists, and it’s entirely free to boot. We’ll show you how to archive a website in more detail using the Wayback Machine later in this article.
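To give a flavor of the API mentioned above, here is a minimal Python sketch that asks whether a page has an archived snapshot. It assumes the public availability endpoint at archive.org/wayback/available and the JSON field names shown, as we understand them from the Internet Archive’s documentation. The helper names are our own, and fetch_closest_snapshot() is a hypothetical convenience wrapper that needs network access.

```python
import json
import urllib.parse
import urllib.request

# Public availability endpoint of the Wayback Machine (assumed unchanged).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Return the query URL asking for the snapshot closest to `timestamp`."""
    params = {"url": page_url}
    if timestamp:  # expected format: YYYYMMDD or YYYYMMDDhhmmss
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response_text):
    """Pull the closest snapshot out of the JSON reply, or return None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return {"url": closest["url"], "timestamp": closest["timestamp"]}


def fetch_closest_snapshot(page_url, timestamp=None):
    """Hypothetical helper: query the API over the network and parse the reply."""
    query_url = build_availability_url(page_url, timestamp)
    with urllib.request.urlopen(query_url, timeout=30) as reply:
        return closest_snapshot(reply.read().decode("utf-8"))
```

Keeping the URL building and the JSON parsing separate from the network call means you can check the logic offline before pointing it at the live service.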
2. Archive.today
Next up is Archive.today. It’s similar in many ways to the Wayback Machine, even down to the site’s almost “retro” design. Its data servers are based in Europe, but it approaches archiving differently than the Wayback Machine.
For starters, Archive.today isn’t based on crawlers running over the web. Instead, you submit your URLs and consent to inclusion in the archive. Besides, its feature list is more bare-bones than other solutions. There isn’t a robust deletion policy, for example, and the archiving process excludes certain media and file types.
Still, it’s free, and it’s suitable if you want an additional place to store archives. The site even has search functionality to find previously archived sites.
3. Heritrix
We’ve mentioned the Internet Archive and the Wayback Machine almost interchangeably in this post so far. The Wayback Machine is just one service, though, and the Internet Archive offers a few other archiving products aside from it. Heritrix is a free, open-source tool born from a collaboration between the Internet Archive and Nordic libraries.
It’s essentially a web crawler rather than a full-featured archiving tool. However, you can package all the crawled results together. While this hasn’t been the case in the past, the Wayback Machine now uses Heritrix to crawl sites for inclusion on its own site. What’s more, a large number of libraries and institutions use Heritrix to build archives.
Despite its impressive features, installing Heritrix requires some technical know-how. There isn’t a user-friendly interface to install it for you, so you’ll need knowledge of Git, GitHub, and the command line.
As with other similar solutions, Heritrix is entirely free to use, so it’s suitable as a cost-effective self-archiving solution.
4. Web Archiving Integration Layer (WAIL)
If you’re looking at Heritrix to archive a website, but are put off by the technical knowledge required to simply install the software, there’s a potential solution for you. The Web Archiving Integration Layer (WAIL) is a free and open-source cross-platform desktop app that gives you a functional Graphical User Interface (GUI) to use, along with an installer.
The good news is that Heritrix is WAIL’s crawling engine. It means you get to leverage the power of Heritrix while not having to traverse GitHub and the command line. Besides, WAIL uses the OpenWayback engine to “replay” web archives.
As such, you have a full-featured web archiving tool ready to go on your machine. We’ll also show you exactly how WAIL works later on in the article.
5. Stillio
Our penultimate archiving tool is billed as an automated solution that takes snapshots at set intervals. Stillio is a premium service that looks and feels different from other archiving solutions.
The website looks slick and gives you myriad options to create an archive that meets your exact requirements. For example, you’re able to add tags and custom titles to your URLs.
However, Stillio has one huge drawback: it doesn’t support back-end archiving. You’re restricted to screenshots of your website rather than a full archive of data. For many applications, this isn’t enough.
That said, Stillio could be useful in some cases, such as serving as a brand management and tracking tool. For example, you can take screenshots of competitor sites or search engine results. It’s also great for content verification.
Stillio’s pricing starts at $29 per month and rises through four tiers up to $299 per month. It’s a big ask, especially when there are free alternatives with more powerful features. But if it fits your use case perfectly, then it’s worth taking a look!
6. Pagefreezer
Our final solution is another automated tool. Pagefreezer offers many of the same benefits as Stillio, but it also archives social media content, text messages, full sites, and enterprise-level collaboration platforms.
On the surface, Pagefreezer seems like a more robust solution than Stillio and would have greater value in various use cases.
For example, where you’re legally required to archive a site fully, Pagefreezer fits the bill. It allows you to automate the number of snapshots and review them using a site archive browser and comparison tool.
What Is the Web Archive (WARC) File Format?
If you’re researching how to archive a website, you’ll come across the Web Archive (WARC) format. It’s a packaged combo of your site archive’s various files so that it’s portable and self-contained.
The Internet Archive created WARC to preserve web data on a long-term basis. The International Internet Preservation Consortium (IIPC) has published the full specification of the file format. It will store images, metadata, and practically everything your site needs to run on a standalone basis.
While it was originally just a handy file format, WARC is now an ISO standard (ISO 28500) for digital archives. As such, it’s been adopted by governments and other official bodies. In fact, there are several use cases where a WARC file is vital:
- E-discovery: It’s the process during litigation where digital records are researched and presented for inclusion in a trial. For social media records, a WARC file meets the E-discovery legal standard.
- Freedom of Information (FOI): There are many governments and official bodies that use the FOI and Open Records acts to offer a “Right to Know” (RTK) service to state constituents. The WARC format is ideal in cases involving digital records.
WARC is used by many different archiving solutions and crawlers, such as StormCrawler and Apache Nutch. You can also tweak the settings of a command-line tool such as Wget to fetch and package requests as WARC files. We’ll discuss this in more detail shortly.
There are plenty of other tools that can output to WARC files too. For example, the open-source page-saving tool wallabag can do this.
As an alternative, grab-site is a crawler with a web-based dashboard that saves its crawls as WARC files.
Opening a WARC file depends on the tool you’re using. Regardless of the solution you prefer, bear in mind that some of these tools haven’t been updated in a while.
As such, you’ll want to make sure your chosen solution works with your current system and that it’s going to be available to use in the future. You’ll be saving yourself plenty of headaches if you avoid a tool that could wind up discontinued or abandoned while you’re in the middle of an archiving project.
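If you’re curious what’s inside a WARC file, the format is simple enough to inspect with a few lines of Python. The sketch below is a simplified reader, not a replacement for a dedicated WARC library: it assumes well-formed records, ignores folded header lines, and uses function names of our own choosing. The demonstration record is hand-built and leaves out fields (such as the record ID and date) that real records carry.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of file
        if not version.strip():
            continue  # skip the blank lines that separate records
        if not version.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % version)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        yield headers, stream.read(int(headers["Content-Length"]))


def open_warc(path):
    """Open a .warc or .warc.gz file as a binary stream."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


# Offline demonstration using a hand-built record rather than a real crawl.
demo_body = b"hello archive"
demo_record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: https://kinsta.com/\r\n"
    b"Content-Length: " + str(len(demo_body)).encode("ascii") + b"\r\n"
    b"\r\n" + demo_body + b"\r\n\r\n"
)
for headers, payload in iter_warc_records(io.BytesIO(demo_record)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

For a real file, you would pass the stream returned by open_warc() to iter_warc_records() instead of the in-memory demonstration record.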
Tips for Managing Your Offline Archives
Before we get into how to archive a website, let’s take a few minutes to help you organize your existing archives. We’ve touched on the subject, but having a solid approach in place will make your archives more manageable. Your site’s users will also get greater use out of a well-organized archive.
There are three key elements you have to keep in mind:
- Frequency: Decide how often you want to archive a site. Huge, dynamic, complex sites with almost daily changes will need more frequent snapshots than static sites.
- Location: Just like backups, you should save archives in several different places, including the cloud. Follow the 3-2-1 rule for extra assurance. We’d also suggest more than this if you want to capture the full depth of your site.
- Structure: Like your computer’s directories, you should look to use explicit folders, subdivided into the site archives’ names and the date a specific site was archived.
While you could expand your archive administration further, these three tips will start your archiving off on the right foot.
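The frequency and structure tips above translate naturally into a small helper. This is a sketch under our own assumptions: the folder layout (root, then site name, then capture date) and the function names are illustrative rather than a standard.

```python
from datetime import date
from pathlib import PurePosixPath


def snapshot_dir(root, site, captured_on):
    """Return <root>/<site>/<YYYY-MM-DD>, one folder per site and capture date."""
    return PurePosixPath(root) / site / captured_on.isoformat()


def snapshot_due(last_capture, today, interval_days):
    """Return True once `interval_days` have passed since the last capture."""
    return (today - last_capture).days >= interval_days
```

ISO-formatted dates sort chronologically when listed alphabetically, which keeps each site’s snapshots in order without any extra tooling.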
5 Ways to Archive a Website
Below, we’re going to suggest five different ways to archive a website. We’ve ordered the solutions based on their relative difficulty. However, if you spot a solution you think will work for your current needs, feel free to dive in and find out more.
1. Save a Single Page to Your Local Computer
First off, let’s discuss the most straightforward solution. It’s great if you need to archive a single page, and even better, the functionality is already in practically every browser.
To start, open your favorite browser and head to the website you’d like to archive. Once the page has loaded, navigate to your browser’s File menu and find the Save Page As option.
Next, click the option to save the page, at which point the browser will show you a dialog box.
Here, choose a name for your page (though the default is fine). Also, make sure that you’re saving the entire page rather than just the HTML. Doing so preserves the page with as much functionality as possible.
2. Use DevKinsta to Archive Your WordPress Website
We think DevKinsta is an essential tool for creating and deploying WordPress websites. However, it also has another string in its bow: it helps you archive your Kinsta-hosted websites too.
We’ve covered the entire process of pulling an external MyKinsta backup into DevKinsta in one of our knowledgebase articles. To summarize:
- Create and download a backup in MyKinsta.
- Create a new site with DevKinsta.
- Import your content and database.
- Carry out a search-and-replace on your database to change the URL name from your live site to your new local archive.
At this point, you can open your site in DevKinsta and use it as though it were live.
3. Use an Online Archive (Such As the Wayback Machine)
No tutorial would be complete without showing you how the Wayback Machine works. Fortunately, the process is simple. That said, note that this method only lets you archive individual pages (though the subscription Archive-It service does let you archive full sites).
For this approach, head to the Wayback Machine home page and check out the Save Page Now form.
To archive a page, simply add the URL you wish to save to this form, then click Save Page. Depending on how large or complex the page is, you may need to wait a few minutes while the crawler and engine do their thing. It could be that the page looks as though it’s crashed. We were faced with a White Screen of Death (WSoD) for a while in our testing.
However, once the page has been archived, Wayback Machine will redirect you to the new, dedicated page.
Note that you can also use a bookmarklet or a browser extension to archive a page. Extensions are available for most current browsers, including Google Chrome, Firefox, and Safari.
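If you’d rather trigger Save Page Now from a script, the same capture can be requested with a plain HTTP request. The sketch below assumes the public web.archive.org/save/ address accepts the target URL appended to it, which is our understanding of how anonymous captures work. The function names are hypothetical, and request_capture() needs network access and may be rate-limited.

```python
import urllib.request

# Address used by Save Page Now for anonymous captures (assumed unchanged).
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_url(page_url):
    """Return the address that asks the Wayback Machine to capture `page_url`."""
    return SAVE_ENDPOINT + page_url


def request_capture(page_url, timeout=120):
    """Hypothetical helper: request a capture and return the final URL reached."""
    with urllib.request.urlopen(build_save_url(page_url), timeout=timeout) as reply:
        return reply.geturl()
```

As with the form, a capture can take a while, so the generous timeout here is deliberate.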
4. Install the Web Archiving Integration Layer (WAIL)
Your first step with this approach is to download WAIL itself and install it. Fortunately, there’s a dedicated installer for the tool (though because the program is written in Python, it uses the PyInstaller module).
The install process is a breeze. Regardless of your operating system (OS), you can carry out the following:
- Navigate to the WAIL website and download the appropriate installer for your OS.
- Either unzip the file for the Windows version, or mount the DMG image for macOS.
- On the resultant dialog screen for macOS, drag the app icon to your Applications folder. For Windows users, simply drag the unzipped folder to your root C: drive.
- Launch either WAIL.app or WAIL.exe (depending on your OS).
Once WAIL is open, you’ll see its minimal interface.
You are now presented with three options to choose from: view an archive, check its status, or archive a website. The buttons are slightly confusing, as your natural inclination may be to read from left to right. However, on the first launch, you’ll have nothing in your archives.
Instead, enter the URL for the site you want to archive, and click Archive Now! You’ll see WAIL begin to crawl the website. You can check on the status of your crawl on the Advanced > Heritrix tab.
When it’s done, it’ll show you a “Success” message. At this point, you can click the View Archive button on the Basic tab. This will open your archived site in a browser, ready for you to view.
5. Use Wget If You’re Comfortable Using the Command Line
For our final method to archive a website, you’ll need a few things before you start:
- Command line access to your computer
- A suitable command line tool such as Windows Command Prompt, or Terminal on macOS and Linux
- Wget installed on your computer
You’ll likely have the first two already.
On macOS, you can install Wget through Homebrew with the brew install wget command. Note that you also need to install Homebrew, but it only takes seconds. On Linux, Wget is pre-installed on most of the major distros.
If you’re a Windows user, you may have a tougher time installing Wget on your computer. While there are tutorials available across the web, their guidance doesn’t appear consistent between machines. Instead, we recommend you head to the official Wget website and check out some of the available Windows binaries, as these are more likely to work for you.
Regardless, once you’ve installed Wget, using it is straightforward. First, navigate to a directory in a new terminal window. Here, we’re creating the directory too, but this step is optional:
cd documents && mkdir archive && cd archive
Note that Wget will pull all downloads into whatever the working directory is. In this case, we’ve specified a folder for our files.
Next, you’ll want to crawl a site and pull the files. Every action is invoked using the wget command, and you’ll want to use the following format:
wget "https://kinsta.com/" --warc-file="kins"
Hitting the Enter key will begin the download of kinsta.com to an index.html file and create a WARC file named kins.warc.gz. (Wget only adds a sequence number, as in kins-00000.warc.gz, when you also set a maximum WARC size.)
Wget is powerful, and there are many options you can use. For example, you can add the --mirror option so that the WARC file contains a complete mirror of your site. You can also use the --no-warc-compression option to write uncompressed files, though this is obviously going to take up more space per download. Using the built-in compressor is the optimal approach.
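If you plan to run these crawls on a schedule, it can help to wrap the Wget call in a small script. Here’s a minimal Python sketch that assembles the same options discussed above. The function names are our own, and run_wget() assumes Wget is installed and on your PATH.

```python
import shutil
import subprocess


def build_wget_command(url, warc_name, mirror=False, compress=True):
    """Assemble the wget argument list for a WARC-producing download."""
    command = ["wget", url, f"--warc-file={warc_name}"]
    if mirror:
        command.append("--mirror")  # crawl the whole site, not just one page
    if not compress:
        command.append("--no-warc-compression")  # write a plain .warc file
    return command


def run_wget(command, workdir):
    """Run the command inside `workdir`, where wget will place its downloads."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed or not on PATH")
    subprocess.run(command, cwd=workdir, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with URLs that contain characters such as & or ?.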
Summary
Web archiving has grown from a need to document the rapidly changing shape of the internet. It now has multiple valid applications, such as meeting legal record-keeping requirements. Regardless of your need, having a well-structured and organized archive can complement your overall backup strategy.
Fortunately, there are plenty of solutions available to help. Most browsers offer the ability to save a web page on your computer, though solutions such as DevKinsta are also capable tools for the job. However, dedicated archiving tools such as the Wayback Machine, Heritrix, WAIL, and Wget are all particularly robust solutions and offer standardized file formats to work with.
Has this article led you to want to archive a website of your own? Share your thoughts and opinions in the comments section below!