This repo contains tools and scripts for automating regular crawls of web pages that EDGI is actively monitoring as part of its Web Governance project. The goal is to create archive-ready captures of web content, store them in the Internet Archive and in EDGI's own cloud storage, and import analyzable metadata into Web-Monitoring-DB.
There is currently no standardized PR or contribution process for this project; work is very free-form. EDGI's Code of Conduct still applies.
The scheduled crawls here currently run on GitHub Actions: it is free, convenient, and gives us some really nice workflow management. The basic setup is that there are two jobs:
- `setup` grabs all the actively monitored URLs from the database and writes them out as a set of Browsertrix config files (via `edgi-wm-crawler multi-seeds`). This does some fancy footwork to break things down into a set of crawls that can be run in parallel. It tries to keep each primary domain in a single crawl (except ones that are really big, like `epa.gov`) so that there's minimal duplication of page resources across crawls (i.e. less duplicated stuff between the WARC files each crawl generates) and so the number of concurrent requests to a given domain is kept down. This is also necessary because GH Actions jobs don't have enough disk space to store the results from all the crawls together, but the other benefits are important, too!
- `crawl` runs a crawl for each of the config files generated in step 1. This is a matrix job, so all the crawls run efficiently in parallel. After the crawl finishes, this job also:
  - Saves the results as a GH Actions artifact (even if the crawl fails, so it can be inspected).
  - Uploads the results to S3.
  - Imports the results into web-monitoring-db.
  - Uploads the WARC files to the Internet Archive.
Ideally these follow-on bits would be separate jobs, but I don’t think there’s a way to do that in the current workflow syntax.
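In case it helps to visualize, the dynamic-matrix pattern looks roughly like the sketch below. This is not the actual workflow file; the step names, config paths, and commands are illustrative placeholders.

```yaml
# Simplified sketch of the setup + matrix crawl pattern. Paths, names, and
# commands are placeholders, not the real workflow.
name: Crawl (sketch)
on: workflow_dispatch

jobs:
  setup:
    runs-on: ubuntu-latest
    outputs:
      configs: ${{ steps.list.outputs.configs }}
    steps:
      - name: Generate Browsertrix configs from the monitored URLs
        # The real workflow uses `edgi-wm-crawler multi-seeds` here; actual
        # options and output locations are omitted in this sketch.
        run: edgi-wm-crawler multi-seeds
      - name: Publish the config filenames as a JSON array
        id: list
        run: echo "configs=$(ls configs/*.yaml | jq -Rsc 'split("\n")[:-1]')" >> "$GITHUB_OUTPUT"

  crawl:
    needs: setup
    strategy:
      fail-fast: false  # let the other crawls finish even if one fails
      matrix:
        config: ${{ fromJSON(needs.setup.outputs.configs) }}
    runs-on: ubuntu-latest
    steps:
      - name: Run one crawl
        run: echo "Would run browsertrix-crawler with ${{ matrix.config }}"
      # ...followed by the artifact upload, S3 upload, web-monitoring-db
      # import, and Internet Archive upload steps described above.
```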
For now, this seems to work really well — it gets the crawls done quickly and efficiently, and splits work in a relatively smart way with (I think?) minimal duplication of subresources.
However, we might outgrow GH Actions at some point or have IP address blocking issues (honestly, I'm surprised Actions IP addresses aren't already blocked!). Crawls have historically worked perfectly for us in AWS with a public, Elastic IP. Some possible options:
- Keep it simple: just customize the `webrecorder/browsertrix-crawler` Docker image to include scripts that grab the seeds and do the uploading/importing stuff before/after the crawl. Schedule that on ECS (with EventBridge) or our Kubernetes cluster (as a CronJob; there's a rough sketch after this list). The main downside here is that we go back to only having a few large, probably less efficient crawls. There's not an easy way to manage the workflow of creating a dynamic number of parallel crawls here. The big upside is that this is really simple, and easy to deploy on any infrastructure.
- AWS Batch has array jobs for running work in parallel, but you can't set the number of parallel jobs dynamically. The first job would need to schedule the follow-on jobs. Worth noting: Kubernetes has "indexed jobs," which fill a similar role. They would also require having the first job create the subsequent jobs, which feels like a little more of a bear to me in Kubernetes than in AWS Batch, but whatever.
- AWS Step Functions is a more purpose-built system for nice workflow automation like this, but it's a complex and very AWS-specific thing. We'd want to use the "Map" state to run the crawls after generating configs.
- Apache Airflow is a nice system (with a fancy visual UI) that handles this kind of stuff well. You can dynamically map the results of one task across subsequent parallel tasks with `expand()`. It's a lot more complicated to deploy effectively, but AWS has a managed version.
Step Functions and Airflow are fairly equivalent; the difference is mainly portability and cost. Airflow is portable to other platforms, but it's also more expensive and higher-overhead to run than Step Functions.
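To make the first option above more concrete, here's a rough sketch of what the Kubernetes CronJob flavor could look like. The image name, schedule, and resource numbers are placeholders, and the seed-fetching and upload/import scripts baked into the customized image aren't shown.

```yaml
# Rough sketch only: image name, schedule, and resource numbers are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: edgi-web-crawl
spec:
  schedule: "0 6 * * *"        # e.g. daily at 06:00 UTC
  concurrencyPolicy: Forbid    # don't start a new crawl if the last one is still running
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: crawler
              # Customized webrecorder/browsertrix-crawler image that grabs the
              # seeds before the crawl and does the upload/import work after it.
              image: example.org/edgi/browsertrix-crawler-custom:latest
              resources:
                requests:
                  cpu: "2"
                  memory: 8Gi
                  ephemeral-storage: 100Gi
```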
This repository falls under EDGI’s Code of Conduct.
Copyright (C) 2025 Environmental Data and Governance Initiative (EDGI)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the LICENSE file for details.