Commit f5ec109

Merge branch 'r/3.1.0'
1 parent 5de971d commit f5ec109

14 files changed: +259 -187 lines changed

README.md (+57 -43)

@@ -3,7 +3,6 @@
 [![PyPI](https://img.shields.io/pypi/v/pywaybackup)](https://pypi.org/project/pywaybackup/)
 [![PyPI - Downloads](https://img.shields.io/pypi/dm/pywaybackup)](https://pypi.org/project/pywaybackup/)
 ![Python Version](https://img.shields.io/badge/Python-3.8-blue)
-![Python_Sqlite3 Version](https://img.shields.io/badge/Python_Sqlite3-3.25-blue)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
 Downloading archived web pages from the [Wayback Machine](https://archive.org/web/).
@@ -29,12 +28,15 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 ```pip install .```
 - in a virtual env or use `--break-system-package`
 
-## Usage infos - important notes
+## Important notes
 
 - Linux recommended: On Windows machines, the path length is limited. This can only be overcome by editing the registry. Files that exceed the path length will not be downloaded.
 - If you query an explicit file (e.g. a query-string `?query=this` or `login.html`), the `--explicit`-argument is recommended as a wildcard query may lead to an empty result.
 - The tool uses a sqlite database to handle snapshots. The database will only persist while the download is running.
 
+<br>
+<br>
+
 ## Arguments
 
 - `-h`, `--help`: Show the help message and exit.
@@ -55,7 +57,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 - **`-s`**, **`--save`**:<br>
 Save a page to the Wayback Machine. (beta)
 
-### Optional query parameters
+#### Optional query parameters
 
 - **`-e`**, **`--explicit`**:<br>
 Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
@@ -76,7 +78,9 @@ Limits the amount of snapshots to query from the CDX server. If an existing CDX
 - **`--end`**:<br>
 Timestamp to end searching.
 
-### Behavior manipulation
+### Optional
+
+#### Behavior Manipulation
 
 - **`-o`**, **`--output`**:<br>
 Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
@@ -105,55 +109,64 @@ Specifies delay between download requests in seconds. Default is no delay (0).
 <!-- - **`--convert-links`**:<br>
 If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
 
-### Special:
+#### Job Handling:
 
 - **`--reset`**:
 If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.
 
 - **`--keep`**:
 If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.
 
-# Usage
+<br>
+<br>
+
+## Usage
 
 ### Handling Interrupted Jobs
-When a job is interrupted (by any reason), `pywaybackup` is designed to resume the job from where it left off. It automatically detects existing job data (based on the URL and <u>**optional query parameters**</u> - including output directory) and resumes the process without requiring manual intervention. Here's how the tool handles different scenarios:
-
-- **Default Behavior:**
-- On restarting the same job (same URL, <u>**optional query parameters**</u>, and output directory), the tool will:
-- Reuse the existing `.cdx` and `.db` files.
-- Resume downloading snapshots from the last successful point.
-- Skip previously downloaded files to save time and resources.
-
-- **Manual Reset with `--reset`:**
-- This command deletes any existing `.cdx` and `.db` files associated with the job and starts the process from scratch.
-- Useful if:
-- The previous data is corrupted.
-- You want to re-query the snapshots without considering previously downloaded data.
-
-- **Preserving Job Data with `--keep`:**
-- Normally, `.cdx` and `.db` files are deleted after the job finishes successfully.
-- Use `--keep` to retain these files for future use (e.g., re-analysis or extending the query later).
-
-> **Note1:** The resumption process only works if the output directory remains the same as the one used during the initial job.
->
-> **Note2:** `--reset` will NOT delete the already downloaded files for now. You have to remove them 'by hand'.
-
-### Example
 
-1. Start downloading all available snapshots:<br>`waybackup -u https://example.com -a`
-2. Interrupt the process `CTRL+C`<br>
-3. The tool will detect the existing job data and resume downloading from the last completed point:<br>`waybackup -u https://example.com -a`
-> **Important:** `waybackup -u https://example.com -c` -> The tool will NOT resume because a necessary identifier-changed
-4. This deletes any existing .cdx and .db files associated with the job and starts the process from scratch:<br>`waybackup -u https://example.com -a --reset`
-5. This ensures all job-related files are kept for future use, such as re-analysis or extending the query later:<br>`waybackup -u https://example.com -a --keep`
+`pywaybackup` resumes interrupted jobs. The tool automatically continues from where it left off.
+
+- Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
+- Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
+- Skips previously downloaded files to save time.
+> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
+
+#### Resetting a Job (`--reset`)
+- Deletes `.cdx` and `.db` files and restarts the process from scratch.
+- Does **not** remove already downloaded files.
+- `waybackup -u https://example.com -a --reset`
+
+#### Keeping Job Data (`--keep`)
+- Normally, `.cdx` and `.db` files are deleted after a successful job.
+- `--keep` preserves them for future re-analysis or extending the query.
+- `waybackup -u https://example.com -a --keep`
 
-## Output path structure
+<br>
+<br>
+
+## Examples
+
+1. Download a specific single snapshot of all available files (starting from root):<br>
+`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
+2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
+`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
+3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
+`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
+4. Download all snapshots of all available files in the given range:<br>
+`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
+
+<br>
+<br>
+
+## Output
+
+### Path Structure
 
 The output path is currently structured as follows by an example for the query:<br>
-`http://example.com/subdir1/subdir2/assets/`:
+`http://example.com/subdir1/subdir2/assets/`
 <br><br>
 For the first and last version (`-f` or `-l`):
-- The requested path will only include all files/folders starting from your query-path.
+- Will only include all files/folders starting from your query-path.
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -165,7 +178,7 @@ your/path/waybackup_snapshots/
 ...
 ```
 For all versions (`-a`):
-- Will currently create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
+- Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -184,7 +197,7 @@ your/path/waybackup_snapshots/
 ...
 ```
 
-## CSV Output
+### CSV
 
 Each snapshot is stored with the following keys/values. These are either stored in a sqlite database while the download is running or saved into a CSV file after the download is finished.
 
@@ -210,11 +223,12 @@ For download queries:
 
 Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
 
-### Known ToDos
-
-- [ ] currently there is no logic to handle if both a http and https version of a page is available
+<br>
+<br>
 
 ## Contributing
 
 I'm always happy for some feature requests to improve the usability of this tool.
 Feel free to give suggestions and report issues. Project is still far from being perfect.
+
+> Please PR from dev into dev.
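
The `--start` and `--end` values in the README examples above are 14-digit timestamps in `YYYYMMDDhhmmss` form. As a minimal sketch, assuming you want to build such values from Python `datetime` objects, a helper could look like the following. The name `to_wayback_timestamp` is hypothetical and not part of `pywaybackup`.

```python
from datetime import datetime


def to_wayback_timestamp(moment: datetime) -> str:
    """Format a datetime as a 14-digit YYYYMMDDhhmmss string."""
    return moment.strftime("%Y%m%d%H%M%S")


# Build the range used in README example 4 (hypothetical helper, see above).
start = to_wayback_timestamp(datetime(2021, 1, 1))
end = to_wayback_timestamp(datetime(2023, 11, 22))
print(f"waybackup -u https://example.com -a --start {start} --end {end}")
```

Passing the same value for `--start` and `--end`, as README examples 1 to 3 do, narrows the query to a single snapshot.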

pyproject.toml (+2 -1)

@@ -7,7 +7,7 @@ packages = ["pywaybackup"]
 
 [project]
 name = "pywaybackup"
-version = "3.0.2"
+version = "3.1.0"
 description = "Query and download archive.org as simple as possible."
 authors = [
     { name = "bitdruid", email = "[email protected]" }
@@ -16,6 +16,7 @@ license = { file = "LICENSE" }
 readme = "README.md"
 requires-python = ">=3.8"
 dependencies = [
+    "pysqlite3-binary==0.5.4",
     "requests==2.31.0",
     "tqdm==4.66.2",
     "python-magic==0.4.27; sys_platform == 'linux'",

pywaybackup/Arguments.py (+5 -1)

@@ -2,9 +2,11 @@
 import sys
 import os
 import argparse
+
 from importlib.metadata import version
 
 from pywaybackup.helper import url_split, sanitize_filename
+from pywaybackup.Exception import Exception as ex
 
 class Arguments:
 
@@ -73,7 +75,7 @@ def init(cls):
 
         if cls.output is None:
             cls.output = os.path.join(os.getcwd(), "waybackup_snapshots")
-        os.makedirs(cls.output, exist_ok=True)
+        os.makedirs(cls.output, exist_ok=True) if not cls.save else None
 
         if cls.log is True:
             cls.log = os.path.join(cls.output, f"waybackup_{sanitize_filename(cls.url)}.log")
@@ -84,6 +86,8 @@ def init(cls):
             cls.mode = "last"
         if cls.first:
             cls.mode = "first"
+        if cls.save:
+            cls.mode = "save"
 
         if cls.filetype:
             cls.filetype = [ft.lower().strip() for ft in cls.filetype.split(",")]

pywaybackup/Exception.py (+16 -19)

@@ -1,34 +1,33 @@
-
 import sys
 import os
-from datetime import datetime
+import re
 import linecache
 import traceback
-
-import re
+from datetime import datetime
 
 from importlib.metadata import version
 
-class Exception:
 
+class Exception:
     new_debug = True
     output = None
     command = None
 
     @classmethod
     def init(cls, output=None, command=None):
-        sys.excepthook = cls.exception_handler # set custom exception handler (uncaught exceptions)
+        sys.excepthook = (
+            cls.exception_handler
+        ) # set custom exception handler (uncaught exceptions)
         cls.output = output
         cls.command = command
 
     @classmethod
     def exception(cls, message: str, e: Exception, tb=None):
         custom_tb = sys.exc_info()[-1] if tb is None else tb
-        original_tb = cls.relativate_path("".join(traceback.format_exception(type(e), e, e.__traceback__)))
-        exception_message = (
-            "-------------------------\n"
-            f"!-- Exception: {message}\n"
+        original_tb = cls.relativate_path(
+            "".join(traceback.format_exception(type(e), e, e.__traceback__))
         )
+        exception_message = f"-------------------------\n!-- Exception: {message}\n"
         if custom_tb is not None:
             while custom_tb.tb_next: # loop to last traceback frame
                 custom_tb = custom_tb.tb_next
@@ -46,10 +45,7 @@ def exception(cls, message: str, e: Exception, tb=None):
             )
         else:
             exception_message += "!-- Traceback is None\n"
-        exception_message += (
-            f"!-- Description: {e}\n"
-            "-------------------------"
-        )
+        exception_message += f"!-- Description: {e}\n-------------------------"
         print(exception_message)
         debug_file = os.path.join(cls.output, "waybackup_error.log")
         print(f"Exception log: {debug_file}")
@@ -85,10 +81,10 @@ def relativate_path(cls, input: str) -> str:
         if os.path.isfile(input): # case single path
             return os.path.relpath(input, os.getcwd())
         input_modified = ""
-        input_lines = input.split('\n')
-        if len(input_lines) == 1: # case single line
+        input_lines = input.split("\n")
+        if len(input_lines) == 1: # case single line
             return input
-        for line in input.split('\n'): # case multiple lines
+        for line in input.split("\n"): # case multiple lines
             match = path_pattern.search(line)
             if match:
                 original_path = match.group(1)
@@ -104,5 +100,6 @@ def exception_handler(exception_type, exception, traceback):
         if issubclass(exception_type, KeyboardInterrupt):
             sys.__excepthook__(exception_type, exception, traceback)
             return
-        Exception.exception("UNCAUGHT EXCEPTION", exception, traceback) # uncaught exceptions also with custom scheme
-
+        Exception.exception(
+            "UNCAUGHT EXCEPTION", exception, traceback
+        ) # uncaught exceptions also with custom scheme
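
The hunks above install a custom `sys.excepthook` that lets `KeyboardInterrupt` fall through to Python's default hook and routes everything else into a custom report. The following is a minimal standalone sketch of that pattern, not the project's implementation: `format_uncaught` is a hypothetical helper, and the report layout only loosely imitates the `!--` lines shown in the diff.

```python
import sys
import traceback


def format_uncaught(exc_type, exc, tb) -> str:
    """Build a compact text report for an uncaught exception (hypothetical helper)."""
    trace = "".join(traceback.format_exception(exc_type, exc, tb))
    return (
        "-------------------------\n"
        "!-- Exception: UNCAUGHT EXCEPTION\n"
        f"{trace}"
        f"!-- Description: {exc}\n"
        "-------------------------"
    )


def exception_handler(exc_type, exc, tb):
    # Leave Ctrl+C to Python's default hook so interrupts look normal.
    if issubclass(exc_type, KeyboardInterrupt):
        sys.__excepthook__(exc_type, exc, tb)
        return
    print(format_uncaught(exc_type, exc, tb), file=sys.stderr)


# From here on, uncaught exceptions go through the custom handler.
sys.excepthook = exception_handler
```

Keeping the formatting in its own function makes the report easy to check without actually crashing the interpreter.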

pywaybackup/SnapshotCollection.py (+16 -5)

@@ -1,11 +1,13 @@
-from pywaybackup.Verbosity import Verbosity as vb
-from pywaybackup.helper import url_split
-from pywaybackup.db import Database
-from tqdm import tqdm
 import json
 import csv
 import os
 
+from tqdm import tqdm
+
+from pywaybackup.Verbosity import Verbosity as vb
+from pywaybackup.helper import url_split
+from pywaybackup.db import Database
+
 class SnapshotCollection:
     """
     Represents the interaction with the snapshot-collection contained in the snapshot database.
@@ -292,12 +294,21 @@ def get_snapshot(connection):
         """
         Get a snapshot-row from the snapshot table with response NULL. (not processed)
         """
+        # mark as locked for other workers // only visual because get_snapshot fetches by NULL
         connection.cursor.execute(
             """
-            SELECT rowid, * FROM snapshot_tbl WHERE response IS NULL LIMIT 1
+            UPDATE snapshot_tbl
+            SET response = 'LOCK'
+            WHERE rowid = (
+                SELECT rowid FROM snapshot_tbl
+                WHERE response IS NULL
+                LIMIT 1
+            )
+            RETURNING rowid, *;
             """
         )
         row = connection.cursor.fetchone()
+        connection.conn.commit()
         return row
 
     @classmethod
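
The `get_snapshot` change above replaces a plain `SELECT` with an `UPDATE ... RETURNING` that marks one unprocessed row as `'LOCK'` and hands it back in the same statement. Here is a minimal sketch of that claim-a-row pattern using the standard-library `sqlite3` module. The table and `response` column names follow the diff; the `url` column, the in-memory database and the function name `claim_snapshot` are assumptions for illustration. `RETURNING` needs SQLite 3.35.0 or newer, so the sketch includes a non-atomic fallback for older builds.

```python
import sqlite3


def claim_snapshot(conn: sqlite3.Connection):
    """Mark one unprocessed row as 'LOCK' and return (rowid, url, response), or None."""
    if sqlite3.sqlite_version_info >= (3, 35, 0):
        # Single statement: pick a NULL row, lock it, and return it.
        rows = conn.execute(
            """
            UPDATE snapshot_tbl
            SET response = 'LOCK'
            WHERE rowid = (
                SELECT rowid FROM snapshot_tbl
                WHERE response IS NULL
                LIMIT 1
            )
            RETURNING rowid, url, response;
            """
        ).fetchall()  # consume the cursor fully before committing
        row = rows[0] if rows else None
    else:
        # Fallback for older SQLite builds: two statements, not atomic across connections.
        found = conn.execute(
            "SELECT rowid, url FROM snapshot_tbl WHERE response IS NULL LIMIT 1"
        ).fetchone()
        row = None
        if found is not None:
            conn.execute(
                "UPDATE snapshot_tbl SET response = 'LOCK' WHERE rowid = ?", (found[0],)
            )
            row = (found[0], found[1], "LOCK")
    conn.commit()
    return row


# Demo with an assumed two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshot_tbl (url TEXT, response TEXT)")
conn.executemany(
    "INSERT INTO snapshot_tbl (url) VALUES (?)",
    [("https://example.com/a",), ("https://example.com/b",)],
)
conn.commit()

first = claim_snapshot(conn)
second = claim_snapshot(conn)
third = claim_snapshot(conn)  # nothing left to claim
print(first, second, third)
```

Because each call flips `response` away from `NULL`, a later call can never return the same row, which is the property the commit's inline comment relies on.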

pywaybackup/Verbosity.py (+7 -5)

@@ -1,4 +1,3 @@
-import sys
 from tqdm import tqdm
 
 class Verbosity:
@@ -63,21 +62,24 @@ def progress(cls, progress: int, maxval: int = None):
         cls.pbar.refresh()
 
     @classmethod
-    def generate_logline(cls, status: str = "", type: str = "", message: str = ""):
+    def generate_logline(cls, status: str, type: str, message: str):
         """
-        STATUS -> TYPE: MESSAGE
+        STATUS TYPE: MESSAGE
         """
 
         if not status and not type:
             return message
 
-        status_length = 11
+        status_length = 10
         type_length = 5
 
         status = status.ljust(status_length)
+        status = f"{status} -> "
+
         type = type.ljust(type_length)
+        type = f"{type}: " if type.strip() else ""
 
-        log_entry = f"{status} -> {type}: {message}"
+        log_entry = f"{status}{type}{message}"
 
         return log_entry
 