Downloading archived web pages from the [Wayback Machine](https://archive.org/web/).

This tool allows you to download content from the Wayback Machine (archive.org).

```pip install .```
- in a virtual env or use `--break-system-packages`
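
For example, a minimal sketch of installing inside a virtual environment (run from the root of the cloned repository, since `pip install .` installs the local package):

```bash
# create and activate a virtual environment, then install from the local source
python3 -m venv .venv
source .venv/bin/activate
pip install .
```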
## Important notes
- Linux recommended: On Windows machines, the path length is limited. This can only be overcome by editing the registry (see the sketch after this list). Files that exceed the path length will not be downloaded.
- If you query an explicit file (e.g. a query string `?query=this` or `login.html`), the `--explicit` argument is recommended, as a wildcard query may lead to an empty result.
- The tool uses a sqlite database to handle snapshots. The database will only persist while the download is running.
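
Regarding the Windows path-length note above: on Windows 10 (version 1607 and later) long paths can usually be enabled via the registry, for example from an elevated command prompt. This is only a sketch of the commonly documented Windows setting, not something specific to this tool:

```
reg add "HKLM\SYSTEM\CurrentControlSet\Control\FileSystem" /v LongPathsEnabled /t REG_DWORD /d 1 /f
```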
<br>
<br>
## Arguments
- `-h`, `--help`: Show the help message and exit.
- **`-s`**, **`--save`**:<br>
Save a page to the Wayback Machine. (beta)
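
A hedged example of how this might be invoked, combining `-s` with `-u` as in the usage examples below (exact behavior may change while the feature is in beta):

```bash
# ask the Wayback Machine to capture a fresh snapshot of the page (beta)
waybackup -u https://example.com -s
```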
#### Optional query parameters
- **`-e`**, **`--explicit`**:<br>
Only download the explicitly given URL. No wildcard subdomains or paths. Use this, for example, to get root-only snapshots. It is recommended for explicit files like `login.html` or `?query=this`.
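
For example (the URL is only for illustration; `-u` and `-a` are used as in the usage examples below):

```bash
# download only snapshots of this exact URL, without wildcard subdomains or paths
waybackup -u "https://example.com/login.html" -a -e
```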
- **`--end`**:<br>
Timestamp to end searching.
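
A sketch, assuming Wayback/CDX-style `YYYYMMDDhhmmss` timestamps (shortened prefixes are commonly accepted by the CDX API, but verify against your version):

```bash
# only consider snapshots captured up to the given timestamp
waybackup -u https://example.com -a --end 20231231
```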
### Optional

#### Behavior Manipulation

- **`-o`**, **`--output`**:<br>
The folder where downloaded files will be saved. Defaults to `waybackup_snapshots` in the current directory.
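
For example (the directory name here is arbitrary):

```bash
# save downloaded files into ./my_backup instead of ./waybackup_snapshots
waybackup -u https://example.com -a -o ./my_backup
```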
<!-- - **`--convert-links`**:<br>
If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
#### Job Handling:
- **`--reset`**:<br>
If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.
- **`--keep`**:<br>
If set, all files will be kept after the job is finished. This includes the `cdx` and `db` files. Without this argument, they will be deleted if the job finishes successfully.

<br>
<br>

## Usage

### Handling Interrupted Jobs

When a job is interrupted (for any reason), `pywaybackup` is designed to resume the job from where it left off, without requiring manual intervention:

- It detects existing `.cdx` and `.db` files in the output directory and resumes downloading from the last successful point.
- It compares the `URL`, `mode`, and <u>**optional query parameters**</u> (including the output directory) to decide whether an existing job can be resumed.

**Default Behavior:**

- On restarting the same job (same URL, <u>**optional query parameters**</u>, and output directory), the tool will:
  - Reuse the existing `.cdx` and `.db` files.
  - Resume downloading snapshots from the last successful point.
  - Skip previously downloaded files to save time and resources.

**Manual Reset with `--reset`:**

- This deletes any existing `.cdx` and `.db` files associated with the job and starts the process from scratch.
- Useful if:
  - The previous data is corrupted.
  - You want to re-query the snapshots without considering previously downloaded data.

**Preserving Job Data with `--keep`:**

- Normally, `.cdx` and `.db` files are deleted after the job finishes successfully.
- Use `--keep` to retain these files for future use (e.g. re-analysis or extending the query later).

> **Note 1:** The resumption process only works if the output directory remains the same as the one used during the initial job.
>
> **Note 2:** `--reset` will NOT delete the already downloaded files for now. You have to remove them by hand.

### Example

1. Start downloading all available snapshots:<br>`waybackup -u https://example.com -a`
2. Interrupt the process with `CTRL+C`.
3. Run the same command again; the tool will detect the existing job data and resume downloading from the last completed point:<br>`waybackup -u https://example.com -a`
   > **Important:** `waybackup -u https://example.com -c` will NOT resume, because a necessary identifier (the mode) changed from `-a` to `-c`.
4. This deletes any existing `.cdx` and `.db` files associated with the job and starts the process from scratch:<br>`waybackup -u https://example.com -a --reset`
5. This ensures all job-related files are kept for future use, such as re-analysis or extending the query later:<br>`waybackup -u https://example.com -a --keep`

- Will create a folder named after the root of your query. Inside this folder, you will find all timestamps and, per timestamp, the path you requested.
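
As an illustration of that layout (domain, timestamps, and file names are invented for this sketch; the top-level directory follows the `--output` default described above):

```
waybackup_snapshots/
└── example.com/
    ├── 20230101000000/
    │   └── index.html
    └── 20240615123000/
        └── index.html
```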
Each snapshot is stored with the following keys/values. These are either stored in a sqlite database while the download is running or saved into a CSV file after the download is finished.

Exceptions will be written into `waybackup_error.log` (each run overwrites the file).

### Known ToDos

- [ ] Currently there is no logic to handle the case where both an `http` and an `https` version of a page are available.

<br>
<br>

## Contributing
I'm always happy about feature requests that improve the usability of this tool.
Feel free to give suggestions and report issues. The project is still far from perfect.