Create Your Own Internet Archive — ArchiveBox

Amoghavarsha
4 min read · Apr 1, 2022


The Internet Archive has been preserving the web since 1996. What does it do? It keeps an archive of things that were once found on the internet, both in their original form and in their later, modified forms. Today the Internet Archive has a collection of nearly 625 billion web pages, 38 million books and texts, 14 million audio recordings, 7 million videos (including 2 million television news programs), 4 million images, 790,000 software programs and more.

Data from the past tells a story: origin, process, transcendence, end, or any other juicy information. It can be helpful for journalists, security researchers, web developers, archaeologists, academics, historians, corporate entities and so on.

ArchiveBox

With all due respect, we don’t need all the data that’s been recorded by the Internet Archive. Usually, people save only the pages that are relevant to them on the Wayback Machine.

A screenshot of Internet Archive.

The Internet Archive doesn’t save all our favorite pages automatically. Most of the time, the pages we are looking for might not have been captured by anyone, due to lack of relevance or traffic, or because the page was deleted before it could be archived! Also, we need an account and an internet connection to see the pages that we’ve archived. ArchiveBox comes in handy for solving these bottlenecks.

ArchiveBox is an open-source internet archiving solution for collecting and saving the web pages of our choice offline. It can be very useful for security researchers who want to keep track of a website and analyze its past and present developments, and for journalists and OSINT researchers who want to archive content privately before it’s deleted and analyze information about a domain, its history and its ownership.

It comes in many forms: we can set it up on Windows, GNU/Linux and macOS, and it can be used as a command line tool, a web app and a desktop app (alpha). We can also run it with Docker. For convenience, we’ll demonstrate it on a GNU/Linux system and use its auto setup script to install it on the machine.

1. Open the terminal and type:

curl -sSL 'https://get.archivebox.io' | sh

We’ll wait for the script to download and configure itself (it may take some time). And that’s it, ArchiveBox is now installed on our system.
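Before moving on, it can help to confirm that the install actually put a command on our PATH. The snippet below is a minimal sketch, not part of the official setup: have_command is a hypothetical helper name, and it assumes the script installed a native archivebox command (if Docker is present on the machine, the setup may take a Docker-based route instead, in which case this check will not find it).

```shell
# Return success if the named command can be found on PATH.
# (have_command is a hypothetical helper, not part of ArchiveBox.)
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Assumption: the setup script installed a native `archivebox` command.
if have_command archivebox; then
    archivebox version
else
    echo "archivebox is not on PATH; the setup may have used Docker instead" >&2
fi
```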

2. Now we’ll change into the directory where ArchiveBox was installed.

Contents of archivebox on our local system.
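The exact command for this step depends on where the setup script placed the folder. As a rough sketch, assuming it created a folder named archivebox in our home directory (a hypothetical path, adjust as needed) and assuming an ArchiveBox folder can be recognized by its index.sqlite3 file and archive subfolder, we could check the folder before entering it. The helper name is_archivebox_dir is made up for this example.

```shell
# Return success if the given folder looks like an ArchiveBox data folder.
# Assumption: such a folder holds index.sqlite3 and an archive/ subfolder.
is_archivebox_dir() {
    [ -f "$1/index.sqlite3" ] && [ -d "$1/archive" ]
}

# Hypothetical install location; change it to wherever the script put it.
DATA_DIR="${DATA_DIR:-$HOME/archivebox}"

if is_archivebox_dir "$DATA_DIR"; then
    cd "$DATA_DIR" && ls
else
    echo "$DATA_DIR does not look like an ArchiveBox folder" >&2
fi
```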

3. Now let’s run our ArchiveBox server locally. Type the following command inside the archivebox directory.

archivebox server 0.0.0.0:8000

4. As we can see in the image, we’ll copy and paste (or click) http://0.0.0.0:8000/ into our favorite browser.

ArchiveBox server running locally on our browser
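A small note on the address: 0.0.0.0 tells the server to listen on every network interface, and some browsers will not open it directly. On the same machine, the loopback address 127.0.0.1 reaches the same server. The sketch below shows the conversion; browser_url is a hypothetical helper written for this post, not an ArchiveBox command.

```shell
# Turn a bind address such as 0.0.0.0:8000 into a URL for the local browser.
browser_url() {
    host="${1%:*}"    # text before the last colon
    port="${1##*:}"   # text after the last colon
    # 0.0.0.0 means "listen on every interface"; connect via loopback instead.
    if [ "$host" = "0.0.0.0" ]; then
        host="127.0.0.1"
    fi
    printf 'http://%s:%s/\n' "$host" "$port"
}

browser_url "0.0.0.0:8000"   # prints http://127.0.0.1:8000/
```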

5. Now we can type any URL in the search bar. Here we have already archived ‘example.com’. If we click on ‘Example Domain’ on the left, we can see all the archived contents of example.com.

Archived contents of example.com
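Pages can also be added from the terminal with the archivebox add command. The following is a minimal sketch for adding several pages at once from a text file. The add_urls helper, the ARCHIVEBOX variable and the urls.txt file name are all assumptions made for this example, and the function should be run from inside the ArchiveBox folder.

```shell
# Command to run; override it (for example ARCHIVEBOX=echo) for a dry run.
ARCHIVEBOX="${ARCHIVEBOX:-archivebox}"

# Add every URL listed in a text file, one URL per line.
# Blank lines and lines starting with # are skipped.
add_urls() {
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # Detach stdin so the command cannot swallow the rest of the file.
        "$ARCHIVEBOX" add "$url" </dev/null
    done < "$1"
}

# Usage, from inside the ArchiveBox folder (urls.txt is a hypothetical file):
#   add_urls urls.txt
```

Setting ARCHIVEBOX=echo first prints the commands instead of running them, which is a cheap way to check the list before archiving anything.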

6. All archived content is stored in the archive directory.

7. We can also import URLs regularly on a schedule.

archivebox schedule --every=day --depth=1 https://example.com/rss.xml
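The schedule command relies on cron to run the import for us. For readers who prefer to manage cron by hand, the sketch below prints a roughly equivalent crontab line. It is a hand-written approximation, not the exact entry ArchiveBox creates; daily_cron_line is a hypothetical helper, the folder path is an assumption, and cron may need the full path to the archivebox command.

```shell
# Print a crontab line that re-imports a feed every day at midnight.
# Hand-written sketch, not the exact entry that `archivebox schedule` creates.
daily_cron_line() {
    data_dir="$1"   # folder where ArchiveBox keeps its data
    feed_url="$2"   # feed or page to import on each run
    printf "0 0 * * * cd '%s' && archivebox add --depth=1 '%s'\n" "$data_dir" "$feed_url"
}

# Hypothetical folder; adjust to the real ArchiveBox location.
daily_cron_line "$HOME/archivebox" "https://example.com/rss.xml"
```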

ArchiveBox is truly an amazing tool for archiving the parts of the internet that are useful to us. It stores the data on our local machine, so we don’t need an internet connection to access it. It makes archiving relatively easy and fast.

That’s it for now. Go check out https://archivebox.io/
