Update PySBD component to support spaCy v3 #114

Open: wants to merge 1 commit into base: master

23 changes: 12 additions & 11 deletions examples/pysbd_as_spacy_component.py
@@ -3,27 +3,28 @@

 Installation:
 pip install spacy
+
+NOTE: Works with spacy>=3.x.x
 """
-import pysbd
 import spacy
+from spacy.language import Language
+
+from pysbd.utils import PySBDFactory
+
+
+@Language.factory("pysbd", default_config={"language": 'en'})
+def pysbd_component(nlp, name, language: str):
+    return PySBDFactory(nlp, language=language)

-def pysbd_sentence_boundaries(doc):
-    seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
-    sents_char_spans = seg.segment(doc.text)
-    char_spans = [doc.char_span(sent_span.start, sent_span.end) for sent_span in sents_char_spans]
-    start_token_ids = [span[0].idx for span in char_spans if span is not None]
-    for token in doc:
-        token.is_sent_start = True if token.idx in start_token_ids else False
-    return doc

 if __name__ == "__main__":
     text = "My name is Jonas E. Smith. Please turn to p. 55."
     nlp = spacy.blank('en')

     # add as a spacy pipeline component
-    nlp.add_pipe(pysbd_sentence_boundaries)
+    nlp.add_pipe("pysbd", first=True)

     doc = nlp(text)
     print('sent_id', 'sentence', sep='\t|\t')
     for sent_id, sent in enumerate(doc.sents, start=1):
-        print(sent_id, sent.text, sep='\t|\t')
+        print(sent_id, repr(sent.text), sep='\t|\t')

Out of curiosity: why is the repr() necessary here?

nipunsadvilkar (Owner, Author):

It's not necessary at all. I sometimes add it just to see the raw representation of the sentence, especially in this example, to check whether trailing spaces are captured properly 😅

[Screenshot: output with repr() vs. without]
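
To illustrate the point about trailing spaces, here is a minimal sketch. The two strings are hand-written stand-ins for `sent.text` values (an assumption, not actual pysbd output). Printing the plain string hides a trailing space, while `repr()` makes it visible inside the quotes.

```python
# Hand-written stand-ins for sentence texts; the first keeps a trailing space.
sentences = ["My name is Jonas E. Smith. ", "Please turn to p. 55."]

for sent_id, sent in enumerate(sentences, start=1):
    # Plain text: a trailing space cannot be seen in the output.
    print(sent_id, sent, sep="\t|\t")
    # repr(): the quotes delimit the string, so a trailing space shows up.
    print(sent_id, repr(sent), sep="\t|\t")

# A repr that ends with a space followed by the closing quote reveals
# trailing whitespace in the original string.
has_trailing_space = [repr(s).endswith(" '") for s in sentences]
```

This is only a debugging aid; dropping `repr()` does not change how sentences are segmented.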

2 changes: 1 addition & 1 deletion pysbd/about.py
@@ -2,7 +2,7 @@
 # https://python-packaging-user-guide.readthedocs.org/en/latest/single_source_version/

 __title__ = "pysbd"
-__version__ = "0.3.4"
+__version__ = "0.3.5"
 __summary__ = "pysbd (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box across many languages."
 __uri__ = "http://nipunsadvilkar.github.io/"
 __author__ = "Nipun Sadvilkar"
32 changes: 21 additions & 11 deletions pysbd/utils.py
@@ -3,15 +3,16 @@
 import re
 import pysbd

-class Rule(object):

+class Rule(object):
     def __init__(self, pattern, replacement):
         self.pattern = pattern
         self.replacement = replacement

     def __repr__(self): # pragma: no cover
         return '<{} pattern="{}" and replacement="{}">'.format(
-            self.__class__.__name__, self.pattern, self.replacement)
+            self.__class__.__name__, self.pattern, self.replacement
+        )


 class Text(str):
@@ -30,14 +31,14 @@ class Text(str):
         input as it is if rule pattern doesnt match
         else replacing found pattern with replacement chars
     """
+
     def apply(self, *rules):
         for each_r in rules:
             self = re.sub(each_r.pattern, each_r.replacement, self)
         return self


 class TextSpan(object):
-
     def __init__(self, sent, start, end):
         """
         Sentence text and its start & end character offsets within original text
@@ -57,25 +58,34 @@ def __init__(self, sent, start, end):

     def __repr__(self): # pragma: no cover
         return "{0}(sent={1}, start={2}, end={3})".format(
-            self.__class__.__name__, repr(self.sent), self.start, self.end)
+            self.__class__.__name__, repr(self.sent), self.start, self.end
+        )

     def __eq__(self, other):
         if isinstance(self, other.__class__):
-            return self.sent == other.sent and self.start == other.start and self.end == other.end
+            return (
+                self.sent == other.sent
+                and self.start == other.start
+                and self.end == other.end
+            )


 class PySBDFactory(object):
     """pysbd as a spacy component through entrypoints"""

-    def __init__(self, nlp, language='en'):
+    def __init__(self, nlp, language="en"):
         self.nlp = nlp
-        self.seg = pysbd.Segmenter(language=language, clean=False,
-                                   char_span=True)
+        self.seg = pysbd.Segmenter(language=language, clean=False, char_span=True)

     def __call__(self, doc):
         sents_char_spans = self.seg.segment(doc.text_with_ws)
-        start_token_ids = [sent.start for sent in sents_char_spans]
+        sents_char_spans_doc = [
+            doc.char_span(sent_span.start, sent_span.end, alignment_mode="contract")
+            for sent_span in sents_char_spans
+        ]
+        start_token_ids = [
+            span[0].idx for span in sents_char_spans_doc if span is not None
+        ]
         for token in doc:
-            token.is_sent_start = (True if token.idx
-                                   in start_token_ids else False)
+            token.is_sent_start = True if token.idx in start_token_ids else False
         return doc
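
The reworked `__call__` maps each sentence's character span onto tokens and flags the first token of every span as a sentence start. The spaCy-free sketch below illustrates that idea under stated assumptions: a whitespace tokenizer stands in for spaCy's tokenizer, the sentence spans are written by hand rather than produced by pysbd, and `token_offsets` and `sentence_start_flags` are hypothetical helper names. It approximates, rather than reproduces, what `doc.char_span(..., alignment_mode="contract")` does.

```python
import re


def token_offsets(text):
    """Hypothetical whitespace tokenizer: (start, end) offsets per token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def sentence_start_flags(tokens, sent_spans):
    """Return one boolean per token: True if the token opens a sentence.

    For each sentence span, the first token lying fully inside the span
    is taken as the sentence start. Spans covering no whole token are
    skipped, similar to char_span returning None.
    """
    starts = set()
    for s_start, s_end in sent_spans:
        inside = [t for t in tokens if t[0] >= s_start and t[1] <= s_end]
        if inside:
            starts.add(inside[0][0])
    return [tok_start in starts for tok_start, _ in tokens]


text = "My name is Jonas E. Smith. Please turn to p. 55."
# Hand-written character spans (assumption); the first includes the trailing space.
sent_spans = [(0, 27), (27, 48)]

tokens = token_offsets(text)
flags = sentence_start_flags(tokens, sent_spans)
sentence_starts = [text[s:e] for (s, e), flag in zip(tokens, flags) if flag]
```

Comparing token offsets against span boundaries, instead of assuming every span start coincides with a token start, is what lets unaligned spans be skipped safely.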
3 changes: 0 additions & 3 deletions setup.py
@@ -102,8 +102,5 @@ def run(self):
     # $ setup.py publish support.
     cmdclass={
         'upload': UploadCommand,
-    },
-    entry_points={
-        "spacy_factories": ["pysbd = pysbd.utils:PySBDFactory"]
Comment on lines -105 to -107
nipunsadvilkar (Owner, Author), Jun 29, 2022:

@rmitsch I would have to remove this entry point now, because spaCy v3 requires the @Language.factory decorator to register a custom component. Since PySBDFactory lives in pysbd/utils.py, applying the decorator there would mean adding a spacy>=3 requirement to pysbd's setup.py.

I want to keep pysbd lightweight (using only built-in Python modules).
Do you have any thoughts on this? Is there another way around it?

Hm, that's tricky. You could have a look at vendoring @Language.factory. You'd definitely need the registry functionality, which now lives in https://github.com/explosion/catalogue. It's still relatively lightweight, but it would already break your requirement of using only built-in Python modules.

How is spacy_factories used within pysbd?

nipunsadvilkar (Owner, Author):

It's not used within pysbd itself.

The pysbd Python library ships a spaCy component named pysbd out of the box via entry points.

In a Python environment with both spaCy and pysbd installed, nlp.add_pipe("pysbd") works without importing pysbd explicitly.

More info here: https://spacy.io/usage/saving-loading#entry-points-components
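
As a rough illustration of the mechanism described above, installed packages can advertise components under an entry-point group, and any tool can look that group up by name. The sketch below uses only the standard library's `importlib.metadata`. `list_entry_points` is a hypothetical helper, and this is not spaCy's actual loading code.

```python
from importlib import metadata


def list_entry_points(group):
    """Return sorted (name, target) pairs for installed entry points in a group."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions expose a selectable collection.
        selected = eps.select(group=group)
    else:
        # Older versions return a plain dict mapping group -> entry points.
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


# With a pysbd release that still declares the entry point, this list would
# include ("pysbd", "pysbd.utils:PySBDFactory"); otherwise it may be empty.
factories = list_entry_points("spacy_factories")
print(factories)
```

The output depends entirely on which packages are installed in the current environment.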

rmitsch, Jun 30, 2022:

Alternatively: you could offer pysbd and pysbd[spacy], with only the latter supporting usage as a spaCy v3.x component and installing spaCy by default.

nipunsadvilkar (Owner, Author):

Yes, I was thinking of doing this. Will look into it 👍🏼

     }
 )
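
One way the `pysbd[spacy]` idea from the thread above might look is sketched here. Nothing in it is taken from the actual repository: the extra name `spacy`, the factory name `pysbd_optional`, and the helpers `make_pysbd_component` and `register_factory` are all hypothetical. The guarded import is a common pattern for optional dependencies, not something proposed in the review. The core package would stay free of third-party imports, and the factory would be registered only when spaCy is importable.

```python
# Hypothetical packaging fragment. The extra name "spacy" and the pin
# "spacy>=3" follow the discussion, not the actual setup.py.
EXTRAS_REQUIRE = {"spacy": ["spacy>=3"]}
# In setup.py this would be passed as: setup(..., extras_require=EXTRAS_REQUIRE)

try:
    # Only importable when the optional extra (or spaCy itself) is installed.
    from spacy.language import Language
except ImportError:
    Language = None


def make_pysbd_component(nlp, name, language):
    """Factory body; pysbd is imported lazily, only when a pipe is created."""
    from pysbd.utils import PySBDFactory

    return PySBDFactory(nlp, language=language)


def register_factory(factory_name="pysbd_optional"):
    """Register the factory with spaCy if it is present; report the outcome."""
    if Language is None:
        return False
    Language.factory(
        factory_name,
        default_config={"language": "en"},
        func=make_pysbd_component,
    )
    return True


SPACY_AVAILABLE = Language is not None
```

With this layout, `pip install pysbd` would pull in nothing extra, while `pip install "pysbd[spacy]"` would also install spaCy so that `register_factory()` can succeed.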