After a long time, I have written a new module on npm.

Background

My father asked me to save his friend's website as PDFs. Since his friend has passed away, there is no way to get access to the website's source code anymore. But my father wanted to keep the website for the family. The reasons we needed to record it in PDF format are:

  • His family is not familiar with computers, so keeping the pages as raw HTML and CSS source code is not desirable.
  • PDFs can be easily read and printed by his family.

The website is not so large; there are only 20~30 pages in total, so I could have taken a screenshot of each page manually. But I wanted to write code that crawls the pages and saves each one as a PDF using Puppeteer, which was recently released by Google. That is how site-snapshot was created.

Usage

Using site-snapshot is simple. First, you describe the pages to be crawled in JSON format. We call this file site.json.

{
  "name": "index",
  "selector": null,
  "baseUrl": "http://www.lewuathe.com",
  "children": [
    {
      "name": "menu",
      "selector": ".element",
      "children": []
    }
  ]
}

site.json is a tree whose elements have the following fields:

  • name : The name of the page.
  • children : The child elements to be crawled next.
  • selector : A jQuery-style selector pointing to the child elements to be crawled.


site-snapshot first takes a snapshot of the index page, then searches for child elements based on the given selector. You can specify which pages should be captured in site.json. In the end, the PDFs are stored in a directory structure that mirrors site.json.

$ tree index
index
├── index.html.pdf
└── menu
    ├── menu-about.pdf
    ├── menu-contact.pdf
    └── menu-writing.pdf

1 directory, 4 files
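
Roughly speaking, the crawl described above can be sketched with Puppeteer like this. This is only a minimal illustration of the idea under my own assumptions, not the actual site-snapshot source; the snapshot function, the file naming, and the waitUntil option are all mine.

const fs = require('fs');
const path = require('path');
const puppeteer = require('puppeteer');

// Recursively snapshot one node of site.json.
// `node` is an entry like { name, selector, children }.
async function snapshot(page, node, url, outDir) {
  fs.mkdirSync(outDir, { recursive: true });
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Name the PDF after the node and the page path, e.g. menu/menu-about.pdf.
  const slug = new URL(url).pathname.replace(/\W+/g, '-').replace(/^-+|-+$/g, '');
  const file = slug ? `${node.name}-${slug}.pdf` : `${node.name}.pdf`;
  await page.pdf({ path: path.join(outDir, file), format: 'A4' });

  for (const child of node.children || []) {
    // Collect the links matched by the child's selector on the current page.
    const links = await page.$$eval(child.selector, els =>
      els.map(el => el.href).filter(href => !!href)
    );
    for (const link of links) {
      await snapshot(page, child, link, path.join(outDir, child.name));
      // Go back to the parent page before following the next link.
      await page.goto(url, { waitUntil: 'networkidle2' });
    }
  }
}

(async () => {
  const site = JSON.parse(fs.readFileSync('site.json', 'utf8'));
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await snapshot(page, site, site.baseUrl, site.name);
  await browser.close();
})();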

You can use site-snapshot through the siteshot CLI.

$ siteshot --help

  Usage: siteshot [options]


  Options:

    -V, --version              output the version number
    -s, --sitefile [sitefile]  The path to site.json file
    -h, --help
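
For example, to crawl the site described in the site.json above (assuming the file is in the current directory):

$ siteshot --sitefile site.json

The PDFs are then written out in the tree structure shown earlier.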

Never-ending websites

20~30 years have passed since the internet came into wide use, and there are an amazing number of websites around the world. Some are actively maintained, others are abandoned. Some can be reached from the first page of Google results, others cannot. Some are popular, others are not. But every website has its own history and maintainers. Now I have found that the history does not end even after the maintainer has died, because the website still has readers. Even if the server is shut down, the website will keep living in PDF, or in your heart.

I hope I can write such a website.