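save_page_as is a script that launches a web browser and saves the specified web page.
It requires a web browser (Google Chrome, Chromium, or Firefox) and xdotool.
It can also append a suffix to the saved file name and set the wait times before saving and before closing.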
$ git clone https://github.com/abiyani/automate-save-page-as
$ cd automate-save-page-as
$ ./save_page_as --help
save_page_as: Open the given url in a browser tab/window, perform 'Save As' operation and close the tab/window.
USAGE:
    save_page_as URL [OPTIONS]
URL                      The url of the web page to be saved.
options:
  -d, --destination      Destination path. If a directory, then file is saved with default name inside the directory, else assumed to be full path of target file. Default = '.'
  -s, --suffix           An optional suffix string for the target file name (ignored if --destination arg is a full path)
  -b, --browser          Browser executable to be used (must be one of 'google-chrome', 'chromium-browser' or 'firefox'). Default = 'google-chrome'.
  --load-wait-time       Number of seconds to wait for the page to be loaded (i.e., seconds to sleep before Ctrl+S is 'pressed'). Default = 4
  --save-wait-time       Number of seconds to wait for the page to be saved (i.e., seconds to sleep before Ctrl+F4 is 'pressed'). Default = 8
  -h, --help             Display this help message and exit.
Here I pass --destination to save the page under /tmp. The page title becomes the file name.
$ ./save_page_as "matoken.org" --destination "/tmp"
INFO: The specified destination ('/tmp') is a directory path, will save file inside it with the default name.
INFO: Saving web page ...
INFO: Done!
$ ls -l /tmp/matoken\'s\ meme.*
-rw-r--r-- 1 matoken matoken   7758 Jun 25 22:42 "/tmp/matoken's meme..html"

"/tmp/matoken's meme._files":
total 272
-rw-r--r-- 1 matoken matoken  71735 Jun 25 22:42 8633952663_5aaf4e26ae_c.jpg
-rw-r--r-- 1 matoken matoken    182 Jun 25 22:42 cm.html
-rw-r--r-- 1 matoken matoken 139471 Jun 25 22:42 f.txt
-rw-r--r-- 1 matoken matoken  46274 Jun 25 22:42 ga.js
-rw-r--r-- 1 matoken matoken    501 Jun 25 22:42 post.css
-rw-r--r-- 1 matoken matoken     49 Jun 25 22:42 saved_resource
Passing something like "_$( date +%F_%T )" to -s, --suffix appends the date and time after the title in the file name, which is handy.
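As a small sketch of how that suffix expands: the command substitution runs in the calling shell before save_page_as starts, so the script only ever sees a literal string. The helper names make_suffix and make_suffix_no_colon are hypothetical, and the colon-free variant is my own assumption about what may be more convenient, not something the setup in this post uses.

```shell
#!/bin/sh
# Hypothetical helper: build a timestamp suffix for --suffix.
# %F is YYYY-MM-DD and %T is HH:MM:SS, so the result contains colons.
make_suffix() {
    printf '_%s' "$(date +%F_%T)"
}

# Variation (an assumption, not used in this post): the same timestamp
# without colons, which some file systems and tools handle more gracefully.
make_suffix_no_colon() {
    printf '_%s' "$(date +%F_%H%M%S)"
}

# Example call (commented out because it opens a browser window):
# ./save_page_as "matoken.org" --destination "/tmp" --suffix "$(make_suffix)"
```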
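--load-wait-time sets how long to wait for the page to finish loading before it is saved. The default is 4 (seconds). Increasing it should help when the connection is slow or the server is slow to respond.
--save-wait-time sets how long to wait after the save starts before the page is closed. The default is 8 (seconds). Increasing it should help when the page has a lot of data and takes a long time to save.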
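Since the help text describes both options as plain sleeps, one run takes at least their sum in seconds, which is worth keeping in mind if the script is run periodically. The helper name min_run_seconds is hypothetical, and the figure is only a lower bound under that assumption, since browser start-up and the key presses take extra time.

```shell
#!/bin/sh
# Hypothetical helper: lower bound (in seconds) for one save_page_as run,
# assuming the two waits are plain sleeps as the help text describes.
min_run_seconds() {
    load_wait=$1
    save_wait=$2
    echo $(( load_wait + save_wait ))
}

min_run_seconds 4 8     # the defaults
min_run_seconds 15 30   # longer waits for a slow line or a heavy page

# Example call with the longer waits (commented out because it opens a browser window):
# ./save_page_as "matoken.org" --load-wait-time 15 --save-wait-time 30
```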
According to the help, -b chromium-browser should select Chromium, but it fails.
$ ./save_page_as 2>&1 | grep Browser
  -b, --browser          Browser executable to be used (must be one of 'google-chrome', 'chromium-browser' or 'firefox'). Default = 'google-chrome'.
$ ./save_page_as example.com -b chromium-browser
INFO: The specified destination ('.') is a directory path, will save file inside it with the default name.
ERROR: Command 'chromium-browser' not found. Make sure it is installed, and in path.
After changing the command name in the script to the current name, chromium, it worked.
$ git diff ./save_page_as
diff --git a/save_page_as b/save_page_as
index abeee4d..3895d58 100755
--- a/save_page_as
+++ b/save_page_as
@@ -25,7 +25,7 @@ function print_usage() {
     printf "options:\n" >&2
     printf " -d, --destination Destination path. If a directory, then file is saved with default name inside the directory, else assumed to be full path of target file. Default = '%s'\n" "${destination}" >&2
     printf " -s, --suffix An optional suffix string for the target file name (ignored if --destination arg is a full path)\n" >&2
-    printf " -b, --browser Browser executable to be used (must be one of 'google-chrome', 'chromium-browser' or 'firefox'). Default = '%s'.\n" "${browser}" >&2
+    printf " -b, --browser Browser executable to be used (must be one of 'google-chrome', 'chromium' or 'firefox'). Default = '%s'.\n" "${browser}" >&2
     printf " --load-wait-time Number of seconds to wait for the page to be loaded (i.e., seconds to sleep before Ctrl+S is 'pressed'). Default = %s\n" "${load_wait_time}" >&2
     printf " --save-wait-time Number of seconds to wait for the page to be saved (i.e., seconds to sleep before Ctrl+F4 is 'pressed'). Default = %s\n" "${save_wait_time}" >&2
     printf " -h, --help Display this help message and exit.\n" >&2
@@ -109,8 +109,8 @@ function validate_input() {
     fi
     destination="$(readlink -f "$destination")" # Ensure absolute path
 
-    if [[ "${browser}" != "google-chrome" && "${browser}" != "chromium-browser" && "${browser}" != "firefox" ]]; then
-        printf "ERROR: Browser (%s) is not supported, must be one of 'google-chrome', 'chromium-browser' or 'firefox'.\n" "${browser}" >&2
+    if [[ "${browser}" != "google-chrome" && "${browser}" != "chromium" && "${browser}" != "firefox" ]]; then
+        printf "ERROR: Browser (%s) is not supported, must be one of 'google-chrome', 'chromium' or 'firefox'.\n" "${browser}" >&2
         exit 1
     fi
 
$ ./save_page_as example.com -b chromium
 :
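Because the executable name differs between systems, a caller could check which name is actually installed before choosing a value for -b. This is only a sketch: first_available is a hypothetical helper, and save_page_as itself still has to accept whichever name is found (the unmodified script only allows 'chromium-browser', the patched one only 'chromium').

```shell
#!/bin/sh
# Hypothetical helper: print the first command name from the arguments
# that exists in PATH, or return non-zero if none of them does.
first_available() {
    for name in "$@"; do
        if command -v "$name" >/dev/null 2>&1; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1
}

# Example: prefer the newer name and fall back to the older one.
# browser="$(first_available chromium chromium-browser)" || echo "no Chromium found" >&2
# ./save_page_as example.com -b "$browser"
```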
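I started a remote X session with X2Go and had crontab call a script like the one below. However, X2Go suspends the session while no client is connected, and Chromium cannot be used in the suspended state, so the save fails. Switching to something like VNC would probably be better.
Running it every 5 minutes for about 10 hours, it got stuck at the save dialog a few times (unresolved). In those cases I saved the page manually.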
#!/bin/sh

date +%F_%T
DISPLAY=:50 /home/matoken/src/automate-save-page-as/save_page_as \
    "https://www.youtube.com/c/OSPNjp/" \
    --destination "/home/matoken/osc21do/" \
    --suffix "_$( date +%F_%T )" \
    --browser "chromium" \
    --load-wait-time 15 \
    --save-wait-time 30
chmod -R o+rx /home/matoken/osc21do
echo
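The hang at the save dialog stays unresolved, but a run that produced no file can at least be made visible in the cron output. The sketch below assumes the suffix string appears in the saved file name, as the --suffix option describes; saved_file_exists is a hypothetical helper and not part of save_page_as.

```shell
#!/bin/sh
# Hypothetical helper: succeed if at least one regular file whose name
# contains the given suffix exists directly inside the directory.
# The "_files" directory saved next to the page is ignored by the -f test.
saved_file_exists() {
    dir=$1
    suffix=$2
    for f in "$dir"/*"$suffix"*; do
        [ -f "$f" ] && return 0
    done
    return 1
}

# Possible use in the cron script (commented out because it opens a browser window):
# suffix="_$( date +%F_%T )"
# DISPLAY=:50 /home/matoken/src/automate-save-page-as/save_page_as \
#     "https://www.youtube.com/c/OSPNjp/" \
#     --destination "/home/matoken/osc21do/" --suffix "$suffix" \
#     --browser "chromium" --load-wait-time 15 --save-wait-time 30
# saved_file_exists "/home/matoken/osc21do" "$suffix" || echo "save failed: $suffix" >&2
```

Computing the suffix once and reusing it matters here, because a second call to date would produce a different timestamp than the one in the file name.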
For content that is not rendered by JavaScript or the like, wget or a similar tool is enough.
For a DOM built by JavaScript, headless Chromium is another option.
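A minimal wget sketch for such static pages might look like this. The helper name fetch_static_page and its arguments are hypothetical; the wget options are standard ones, though which combination suits a given site is an assumption on my part.

```shell
#!/bin/sh
# Hypothetical helper: fetch a static page plus the files it needs.
#   --page-requisites   also download images, stylesheets, and so on
#   --convert-links     rewrite links so the local copy works offline
#   --adjust-extension  add .html where the saved name lacks it
#   --directory-prefix  directory to save everything under
fetch_static_page() {
    url=$1
    out_dir=$2
    wget --page-requisites --convert-links --adjust-extension \
         --directory-prefix "$out_dir" "$url"
}

# Example call (commented out because it accesses the network):
# fetch_static_page "https://example.com/" "/tmp/example"
```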
$ chromium --headless --dump-dom https://example.com/
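To keep that output, the dump can be redirected to a timestamped file, roughly as sketched here. dump_dom_to_file and the dom_ file name prefix are hypothetical, and the executable is assumed to be called chromium as in the command above. Unlike save_page_as, this stores only the serialized DOM, not the images or stylesheets.

```shell
#!/bin/sh
# Hypothetical helper: dump the rendered DOM of a URL into a timestamped
# file inside out_dir and print the path of that file on success.
dump_dom_to_file() {
    url=$1
    out_dir=$2
    out_file="$out_dir/dom_$(date +%F_%H%M%S).html"
    chromium --headless --dump-dom "$url" > "$out_file" && printf '%s\n' "$out_file"
}

# Example call (commented out because it starts a browser):
# dump_dom_to_file "https://example.com/" "/tmp"
```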
Beyond that, Selenium can handle all sorts of other cases.