All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Note that for version numbers starting with a 0, i.e., `0.x.y`, a bump of `x` should be considered a major (and thus potentially breaking) change. See the semver guidelines for more details.
- Mark python 3.13 as supported.
- New model `standard_v3_3`, with better support for TypeScript and non-ASCII characters in textual files. See the models' CHANGELOG for more information.
- `identify_stream()` now restores the stream's original position after reading from it, preventing side effects on subsequent stream operations. (#1020)
- Bugfix: limit the number of bytes we read when the input consists of just many whitespace characters. (#1015)
- Bugfix: do not alter warnings' simplefilter as this has visible side effects for other modules. (#1017)
- Add `asdict()` utility method to `MagikaResult`.
- Set `prediction.overwrite_reason` to `Overwrite.NONE` if `output.label` is the same as `dl.label`. (#1023)
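The position-restoring behavior noted for `identify_stream()` can be pictured with a small stand-alone sketch. This is not Magika's implementation; `peek_bytes` is a hypothetical helper that only illustrates the save-and-restore pattern on a seekable binary stream.

```python
import io
from typing import BinaryIO


def peek_bytes(stream: BinaryIO, size: int) -> bytes:
    """Read up to `size` bytes from the start of a seekable stream,
    then put the cursor back where the caller left it.

    Hypothetical helper for illustration; not part of the magika package.
    """
    original_position = stream.tell()
    try:
        stream.seek(0)
        return stream.read(size)
    finally:
        # Runs even if read() raises, so the caller never sees a moved cursor.
        stream.seek(original_position)


stream = io.BytesIO(b"#!/usr/bin/env python\nprint('hi')\n")
stream.seek(10)  # the caller is in the middle of its own read
head = peek_bytes(stream, 2)
print(head, stream.tell())
```

Without the `finally` block, the caller's next `read()` would continue from wherever the helper stopped, which is the kind of side effect the changelog entry describes.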
Magika v0.6.1 is a significant update featuring a new model with 2x supported content types, a new command line client in Rust, performance improvements, API enhancements, and a few breaking changes. This changelog entry rolls up all changes from v0.5.1, the last stable release.
Important
There are a few breaking changes! After reading about the new key features and improvements, we suggest consulting the migration guide below and the updated documentation.
- New deep learning model: We introduce a new model, `standard_v3_2`, which supports 2x content types (200+ in total, see full list here), has a similar ~99% average accuracy, and is ~20% faster, with an inference speed of about 2ms on CPUs (YMMV depending on your testing setup). See the models' CHANGELOG for more information.
- New command line client, written in Rust: We developed a new command line client, written in Rust, which is not affected by the one-time bootstrap overhead of the Python interpreter. This new client is packaged, pre-compiled, into the `magika` Python package. It replaces the old client written in Python (the old Python client is still available as a fallback on platforms for which we don't have precompiled Rust binaries).
- New stream-based identification: Added `identify_stream(stream: typing.BinaryIO)` API to infer content types from open binary streams. (#970)
- Improved path handling: `identify_path` and `identify_paths` now accept `Union[str, os.PathLike]` objects. You no longer need to explicitly use `pathlib.Path`. (#935)
- Improved Python API: The new Python APIs offer a number of improvements, including: the inference APIs now return a `MagikaResult`, which is an `absl::StatusOr`-like object that wraps `MagikaPrediction`, with a clear separation between valid predictions and error situations; the output content types (`label`) are not just `str` anymore, but of type `ContentTypeLabel`, making integrations more robust (`ContentTypeLabel` extends `StrEnum`: thus, they are not just `str`, but you can treat them as such). The `MagikaPrediction` object now has additional `is_text` and `extensions` fields (in addition to the existing `label`, `mime_type`, `group`, and `description`).
- New debugging APIs: Added new APIs to ease debugging and introspection, such as `get_output_content_types()`, `get_model_content_types()`, `get_module_version()`, and `get_model_name()`.
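The `StatusOr`-like result and the `str`-based labels described above can be pictured with a small stand-alone sketch. Nothing here is Magika's code: `DemoLabel`, `DemoStatus`, `DemoResult`, and `describe` are hypothetical names, and the status strings are made up. The sketch only shows why a result that carries either a value or an error status, plus labels that are enum members and strings at once, makes calling code more robust.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DemoLabel(str, Enum):
    """Hypothetical str-based label enum (stand-in for a StrEnum)."""

    PYTHON = "python"
    UNDEFINED = "undefined"


class DemoStatus(str, Enum):
    """Hypothetical status codes; the values are illustrative only."""

    OK = "ok"
    EXAMPLE_ERROR = "example_error"


@dataclass(frozen=True)
class DemoResult:
    """Hypothetical StatusOr-like wrapper: a value, or a non-OK status."""

    status: DemoStatus = DemoStatus.OK
    value: Optional[DemoLabel] = None


def describe(result: DemoResult) -> str:
    # Callers branch on the status first, so errors cannot be mistaken
    # for predictions.
    if result.status is DemoStatus.OK and result.value is not None:
        return f"label={result.value.value}"
    return f"error={result.status.value}"


good = DemoResult(value=DemoLabel.PYTHON)
bad = DemoResult(status=DemoStatus.EXAMPLE_ERROR)

print(describe(good))
print(describe(bad))
# A str-based enum member still compares and behaves like a plain string.
print(DemoLabel.PYTHON == "python", DemoLabel.PYTHON.upper())
```

Because the label is an enum member, a typo such as `DemoLabel.PYTHN` fails loudly with an `AttributeError`, whereas a mistyped plain string would silently never match.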
This release introduces several breaking changes. Please review this guide carefully to update your code:
- New `identify_*` API output format: The inference Python APIs now return a `MagikaResult` object, which is similar to `absl::StatusOr`; this provides a cleaner way to handle errors. `dl.ct_label` and `output.ct_label` are renamed to `dl.label` and `output.label`. `label`s are now of type `ContentTypeLabel`, which extends `StrEnum` (thus, they are not just `str`, but you can treat them as such). The `score` field is now at the top level, alongside `dl` and `output`. The `magic` field has been removed as it was often either incorrect or redundant; use `description` instead.
  - Before (v0.5.x and earlier):

        import magika
        m = magika.Magika()
        result = m.identify_path("my_file.py")
        print(result.output.ct_label)  # Assumed success

  - After (v0.6.1):

        import magika
        m = magika.Magika()
        result = m.identify_path("my_file.py")
        if result.ok():
            print(result.output.label)
        else:
            print(f"Error: {result.status}")
- CLI Output Format Change (v0.6.0): The JSON output format of the CLI has changed. These changes are analogous to the changes to the Python APIs. The `score` field is now at the top level, alongside `dl` and `output`, and is no longer nested within `dl` or `output`. The output also includes `is_text` and `extensions` fields. The `magic` metadata has been removed as it was often either incorrect or redundant; use `description` instead. Moreover, similarly to what happens under the hood with the `StatusOr` pattern, `result.status` indicates whether the prediction was successful, and the prediction results are available under the `result.value` key.
  - Before (v0.5.x and earlier): (Illustrative example - adapt to your specific output)

        {
          "path": "code.py",
          "dl": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
          },
          "output": {
            "ct_label": "python",
            "score": 0.9940916895866394,
            "group": "code",
            "mime_type": "text/x-python",
            "magic": "Python script",
            "description": "Python source"
          }
        }

  - After (v0.6.1):

        {
          "path": "code.py",
          "result": {
            "status": "ok",
            "value": {
              "dl": {
                "description": "Python source",
                "extensions": ["py", "pyi"],
                "group": "code",
                "is_text": true,
                "label": "python",
                "mime_type": "text/x-python"
              },
              "output": {
                "description": "Python source",
                "extensions": ["py", "pyi"],
                "group": "code",
                "is_text": true,
                "label": "python",
                "mime_type": "text/x-python"
              },
              "score": 0.9890000224113464
            }
          }
        }
- `dl.label == ContentTypeLabel.UNDEFINED` when the model is not used: There are situations in which the deep learning model is not used, for example when the file is too small or empty. In these cases, `dl.label` is now set to `ContentTypeLabel.UNDEFINED` instead of the full `dl` block being set to `None`.
  - Before (v0.5.x and earlier):

        # ... (assuming successful result)
        if prediction.dl is not None:
            print(prediction.dl.ct_label)

  - After (v0.6.1):

        # ... (assuming successful result)
        if prediction.dl.label != magika.ContentTypeLabel.UNDEFINED:
            print(prediction.dl.label)
- Expanded List of Content Types: The model now supports over 200 content types.
  - Migration: Review the updated list of supported content types and adjust any code that relies on specific content type labels returned by previous versions. Labels have not changed, but a file previously detected as `javascript` may now be detected as `typescript`. Consider using `get_output_content_types()` to dynamically retrieve the supported labels.
- Pure Python Wheel and Rust Client Fallback: If you are installing Magika on a platform without pre-built wheels (e.g., Windows on ARM), you will automatically get the pure-Python wheel. In this case, the package does not include the Rust binary client, but it does include the old Python client as a fallback; you can use this old Python client with `$ magika-python-client`.
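For scripts that parse the CLI's JSON output and must cope with both layouts during a migration, a small adapter can hide the difference. The following is a minimal sketch using only the field names shown in the Before/After examples above. `extract_label_and_score` is a hypothetical helper, the records are trimmed copies of those examples, and the failure record (including its status string) is invented for illustration.

```python
import json
from typing import Any, Dict, Optional


def extract_label_and_score(record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return {"label", "score"} from one CLI JSON record in either layout.

    Returns None when a v0.6.x record reports a status other than "ok".
    Hypothetical helper; not part of the magika package.
    """
    if "result" in record:  # v0.6.x layout
        result = record["result"]
        if result.get("status") != "ok":
            return None
        value = result["value"]
        # The score now sits next to "dl" and "output", not inside them.
        return {"label": value["output"]["label"], "score": value["score"]}
    # v0.5.x layout: label ("ct_label") and score live inside "output".
    output = record["output"]
    return {"label": output["ct_label"], "score": output["score"]}


# Trimmed versions of the two example records.
OLD_RECORD = json.loads(
    '{"path": "code.py", "output": {"ct_label": "python", "score": 0.994}}'
)
NEW_RECORD = json.loads(
    '{"path": "code.py", "result": {"status": "ok", "value": '
    '{"output": {"label": "python"}, "score": 0.989}}}'
)
# Invented failure record; the status string is a placeholder.
FAILED_RECORD = {"path": "missing.py", "result": {"status": "example_error"}}

for rec in (OLD_RECORD, NEW_RECORD, FAILED_RECORD):
    print(rec["path"], extract_label_and_score(rec))
```

Checking for the `result` key first keeps the old-layout branch as a plain fallback, so it can be deleted once no v0.5.x output remains.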
For a detailed list of all changes, including those from the -rc releases, please refer to the individual changelog entries for each release candidate:
- Add support for python 3.12. Magika now supports python >=3.8 and <3.13.
- Fix bugs for features extraction to cover more corner cases.
- Remove MIME types from table of supported content types (relevant for `--list-output-content-types`; see FAQs for context).
- Refactor features extraction around a Seekable abstraction; we now have only one reference implementation.
- Start groundwork for v2 of features extraction.
- Various clean ups and internal refactors.
- New public Python APIs: `identify_paths`, `identify_path`, `identify_bytes`.
- The APIs now return a `MagikaResult` object.
- When the model's prediction has low confidence and we return a generic content type, still print the model's best guess (with a disclaimer).
- Updated description for "unknown" to "Unknown binary data".
- Magika CLI now defaults to "high-confidence" mode. "default" mode is now called "medium-confidence".
- Magika CLI `-p/--output-probability` has been renamed to `-s/--output-score` for consistency.
- Default model is now called `standard_v1`.
- Major refactoring and clean up.
- Various improvements and clean ups.
- Update model to dense_v4_top_20230910.
- Package now contains the model itself.
- Support reading from stdin: `$ cat <path> | magika -` or `$ curl <url> | magika -`.
- Change how we deal with padding, using 256 instead of 0. This boosts precision.
- "symlink" output label has been renamed to "symlinktext" to better reflect its nature.
- New `--prediction-mode` CLI option to indicate which confidence is required for the predictions. We support three modes: `best-guess`, `default`, `high-confidence`.
- Support for directories and symlinks similarly to `file`.
- Adapt `-r`/`--recursive` CLI option to be compatible with the new way magika handles directories.
- Add special handling for small files.
- Magika does not crash anymore when scanning files with permission issues. It now returns "permission_error".
- Do not resolve file paths (i.e., relative paths remain relative).
- Add `--no-dereference` CLI option: by default, symlinks are dereferenced; this option makes magika not dereference symlinks. This is what `file` does.
- Clean up and many bug fixes.
- Removed warnings when using MIME type and compatibility mode.
- By default, magika now outputs a human-readable output.
- Add `-l`/`--label` CLI option to output a stable content type label.
- JSON/JSONL output now shows all metadata about a given content type.
- Add metadata about magic and description for each relevant content type.
- Logs are now printed to stderr, not stdout.
- Add `--generate-report` CLI option to output a JSON report that can be useful for debugging and reporting feedback.
- Be more flexible with the required Python version (now we require "^3.8" instead of "^3.8,<3.11").
- Show a descriptive error in case magika can't find any file to scan (instead of silently exiting).
- If the prediction score is higher than a given threshold (0.95), consider it regardless of the per-content-type threshold.
- Output format is back to being just `<content type>`; group is displayed only when showing metadata.
- Update metadata of some content types.
- Several small bug fixes.
- Input files are now processed in multiple small batches, instead of one big batch.
- Per-content-type threshold based on the 0.005 quantile for recall.
- MIME type and "group" metadata for all content types.
- Introduce basic support for compatibility mode.
- `-c`/`--compatibility-mode` CLI option to enable compatibility mode.
- `--no-colors` CLI option to disable colors.
- `-b`/`--batch-size` CLI option to specify the batch size.
- `--guess`/`--output-highest-probability` CLI option to output the content type with the highest probability regardless of its probability score.
- `--version` CLI option to print Magika's version.
- Output follows the `<group>::<content type>` format.
- Probability score is not shown by default; enable with `-p`.
- Output is colored according to the file content type's group.
- Remove dependency from richlogger, add a much simpler logger.
- First release.
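Two of the older entries above describe how a prediction is accepted: a per-content-type threshold, plus an override that accepts any prediction scoring above 0.95. The sketch below shows one way such a rule could be combined. It is an illustration, not Magika's code: `pick_label`, the threshold values, the default threshold, and the `"unknown"` fallback are all assumptions made for the example.

```python
from typing import Dict

HIGH_SCORE_OVERRIDE = 0.95  # the value quoted in the changelog entry


def pick_label(
    label: str,
    score: float,
    per_type_thresholds: Dict[str, float],
    default_threshold: float = 0.5,   # assumption, not from the changelog
    fallback_label: str = "unknown",  # assumption, not from the changelog
) -> str:
    """Decide whether to trust the model's top label (hypothetical helper)."""
    # Rule 1: a very confident prediction is accepted regardless of the
    # per-content-type threshold.
    if score > HIGH_SCORE_OVERRIDE:
        return label
    # Rule 2: otherwise the score must clear the threshold for that type.
    if score >= per_type_thresholds.get(label, default_threshold):
        return label
    # Rule 3: not confident enough, so return a generic label instead.
    return fallback_label


# Made-up thresholds for illustration.
thresholds = {"python": 0.97, "markdown": 0.60}

print(pick_label("python", 0.96, thresholds))    # rule 1 applies
print(pick_label("python", 0.90, thresholds))    # falls through to rule 3
print(pick_label("markdown", 0.70, thresholds))  # rule 2 applies
```

The override only matters for content types whose own threshold is stricter than 0.95, as with the made-up `python` entry here.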