Documentation ¶
Overview ¶
Package segmenter implements Unicode rules used to segment a paragraph of text according to several criteria. In particular, it provides a way of delimiting line break opportunities.
The API of the package follows the iterator pattern proposed in github.com/npillmayer/uax, but uses a somewhat simpler internal implementation, inspired by Pango.
The reference documentation is at https://unicode.org/reports/tr14 and https://unicode.org/reports/tr29.
Index ¶
- type Grapheme
- type GraphemeIterator
- type Line
- type LineIterator
- type Segmenter
- func (sg *Segmenter) GraphemeIterator() *GraphemeIterator
- func (seg *Segmenter) Init(paragraph []rune)
- func (seg *Segmenter) InitWithBytes(paragraph []byte)
- func (seg *Segmenter) InitWithString(paragraph string)
- func (sg *Segmenter) LineIterator() *LineIterator
- func (sg *Segmenter) WordIterator() *WordIterator
- type Word
- type WordIterator
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Grapheme ¶
type Grapheme struct {
// Text is a subslice of the original input slice, containing the delimited grapheme
Text []rune
// Offset is the start of the grapheme in the input rune slice
Offset int
}
Grapheme is the content of a grapheme delimited by the segmenter.
type GraphemeIterator ¶
type GraphemeIterator struct {
// contains filtered or unexported fields
}
GraphemeIterator provides a convenient way of iterating over the graphemes delimited by a `Segmenter`.
func (*GraphemeIterator) Grapheme ¶
func (gr *GraphemeIterator) Grapheme() Grapheme
Grapheme returns the current `Grapheme`.
func (*GraphemeIterator) Next ¶
func (gr *GraphemeIterator) Next() bool
Next advances the iterator and returns true if there is still a grapheme to process; otherwise it returns false.
type Line ¶
type Line struct {
// Text is a subslice of the original input slice, containing the delimited line
Text []rune
// Offset is the start of the line in the input rune slice
Offset int
// IsMandatoryBreak is true if breaking (at the end of the line)
// is mandatory
IsMandatoryBreak bool
}
Line is the content of a line delimited by the segmenter.
type LineIterator ¶
type LineIterator struct {
// contains filtered or unexported fields
}
LineIterator provides a convenient way of iterating over the lines delimited by a `Segmenter`.
func (*LineIterator) Next ¶
func (li *LineIterator) Next() bool
Next advances the iterator and returns true if there is still a line to process; otherwise it returns false.
type Segmenter ¶
type Segmenter struct {
// contains filtered or unexported fields
}
Segmenter is the entry point of the package.
Usage:
    var seg Segmenter
    seg.Init(...)
    iter := seg.LineIterator()
    for iter.Next() {
        ... // do something with iter.Line()
    }
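The Next/accessor protocol used above can be illustrated without the package itself. The following is a minimal sketch using only the standard library: `span`, `spaceIterator` and `collect` are hypothetical names invented for this illustration, and the splitting rule (end a segment after each space) is a crude stand-in, not the UAX #14 algorithm the package implements. It only shows how `Next` advances a cursor and how `Text` and `Offset` relate to the input rune slice.

```go
package main

import "fmt"

// span mirrors the documented Text and Offset fields.
// Hypothetical stand-in, not the package's Line type.
type span struct {
	Text   []rune // subslice of the input
	Offset int    // start of Text in the input rune slice
}

// spaceIterator is a toy iterator that ends a segment after each space.
type spaceIterator struct {
	src []rune
	pos int // index of the first rune not yet returned
	cur span
}

// Next advances to the next segment and reports whether one was found.
func (it *spaceIterator) Next() bool {
	if it.pos >= len(it.src) {
		return false
	}
	start, end := it.pos, it.pos
	for end < len(it.src) {
		end++
		if it.src[end-1] == ' ' {
			break
		}
	}
	it.cur = span{Text: it.src[start:end], Offset: start}
	it.pos = end
	return true
}

// Span returns the segment found by the last successful call to Next.
func (it *spaceIterator) Span() span { return it.cur }

// collect drains the iterator into a slice.
func collect(paragraph []rune) []span {
	it := &spaceIterator{src: paragraph}
	var out []span
	for it.Next() {
		out = append(out, it.Span())
	}
	return out
}

func main() {
	for _, s := range collect([]rune("the quick fox")) {
		fmt.Printf("%d %q\n", s.Offset, string(s.Text))
	}
}
```

In this sketch each `Text` is a view into the input, so `input[s.Offset : s.Offset+len(s.Text)]` yields the same runes without copying.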
func (*Segmenter) GraphemeIterator ¶
func (sg *Segmenter) GraphemeIterator() *GraphemeIterator
GraphemeIterator returns an iterator over the graphemes delimited in [Init].
func (*Segmenter) Init ¶
func (seg *Segmenter) Init(paragraph []rune)
Init resets the segmenter storage with the given input, and computes the attributes required to segment the text.
func (*Segmenter) InitWithBytes ¶ added in v0.3.4
func (seg *Segmenter) InitWithBytes(paragraph []byte)
InitWithBytes resets the segmenter storage with the given byte slice input, and computes the attributes required to segment the text.
If paragraph includes invalid UTF-8 sequences, they are replaced with U+FFFD.
InitWithBytes is more efficient than [Init] if the input is a byte slice. No allocation for the text is made if the segmenter's internal buffer capacity is already large enough.
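The two behaviours described above (replacement of invalid UTF-8 with U+FFFD, and no allocation when the buffer is already large enough) can be sketched with the standard library alone. This is an illustration of the general technique, not the package's actual implementation; `decodeInto` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeInto decodes b into runes, reusing the storage of buf.
// Hypothetical helper: append only allocates when cap(buf) is too small.
func decodeInto(buf []rune, b []byte) []rune {
	buf = buf[:0]
	for len(b) > 0 {
		// DecodeRune returns (utf8.RuneError, 1) for an invalid sequence;
		// utf8.RuneError is U+FFFD.
		r, size := utf8.DecodeRune(b)
		buf = append(buf, r)
		b = b[size:]
	}
	return buf
}

func main() {
	buf := make([]rune, 0, 16)
	buf = decodeInto(buf, []byte{'a', 0xff, 'b'})
	fmt.Printf("%U\n", buf)
}
```

Reusing one buffer across calls is what makes repeated initialisation cheap in this sketch: after the first large input, later inputs of equal or smaller size decode without a new allocation.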
func (*Segmenter) InitWithString ¶ added in v0.3.4
func (seg *Segmenter) InitWithString(paragraph string)
InitWithString resets the segmenter storage with the given string input, and computes the attributes required to segment the text.
If paragraph includes invalid UTF-8 sequences, they are replaced with U+FFFD.
InitWithString is more efficient than [Init] if the input is a string. No allocation for the text is made if the segmenter's internal buffer capacity is already large enough.
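Because the `Offset` fields documented on this page are expressed in runes, a caller who starts from a string may want to map a segment back to byte positions in that string. The sketch below shows one way to do so with the standard library. It assumes the string is decoded one rune at a time in the same way as Go's `range` loop (one U+FFFD per invalid byte), which this page does not confirm; `byteIndexes` and `substring` are hypothetical helpers.

```go
package main

import "fmt"

// byteIndexes returns the byte index at which each rune of s starts,
// followed by a final entry equal to len(s).
func byteIndexes(s string) []int {
	idx := make([]int, 0, len(s)+1)
	for i := range s { // i is the byte index of each decoded rune
		idx = append(idx, i)
	}
	return append(idx, len(s))
}

// substring maps a rune offset and a rune count back onto s.
func substring(s string, idx []int, offset, length int) string {
	return s[idx[offset]:idx[offset+length]]
}

func main() {
	s := "h\u00e9llo w\u00f6rld"
	idx := byteIndexes(s)
	// Rune offset 6 with 5 runes selects the second word.
	fmt.Println(substring(s, idx, 6, 5))
}
```

The index table costs one pass over the string and can be shared by all segments of the same paragraph.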
func (*Segmenter) LineIterator ¶
func (sg *Segmenter) LineIterator() *LineIterator
LineIterator returns an iterator over the lines delimited in [Init].
func (*Segmenter) WordIterator ¶ added in v0.1.2
func (sg *Segmenter) WordIterator() *WordIterator
WordIterator returns an iterator over the words delimited in [Init].
type Word ¶ added in v0.1.2
type Word struct {
// Text is a subslice of the original input slice, containing the delimited word
Text []rune
// Offset is the start of the word in the input rune slice
Offset int
}
Word is the content of a word delimited by the segmenter.
More precisely, a word is formed by runes with the [Alphabetic] property, or with a General_Category of Number, delimited by the Word Boundary Unicode Property.
See also https://unicode.org/reports/tr29/#Word_Boundary_Rules, http://unicode.org/reports/tr44/#Alphabetic and http://unicode.org/reports/tr44/#General_Category_Values
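The rune-level part of this definition can be approximated with Go's standard `unicode` tables. The sketch below is one plausible reading, not the package's own check: it derives Alphabetic as described in UAX #44 (letters, letter numbers, and the Other_Alphabetic, Other_Lowercase and Other_Uppercase contributory properties) and treats any rune in General_Category Number as a word rune. `isAlphabetic` and `isWordRune` are hypothetical names, and the results depend on the Unicode version shipped with the Go toolchain.

```go
package main

import (
	"fmt"
	"unicode"
)

// isAlphabetic approximates the derived Alphabetic property.
func isAlphabetic(r rune) bool {
	return unicode.IsLetter(r) || // Lu, Ll, Lt, Lm, Lo
		unicode.Is(unicode.Nl, r) || // letter numbers, e.g. Roman numerals
		unicode.In(r, unicode.Other_Alphabetic,
			unicode.Other_Lowercase, unicode.Other_Uppercase)
}

// isWordRune reports whether r could be part of a word under the
// definition above: Alphabetic, or General_Category Number.
func isWordRune(r rune) bool {
	return isAlphabetic(r) || unicode.IsNumber(r)
}

func main() {
	for _, r := range "a5 ,\u00e9" {
		fmt.Printf("%q %v\n", r, isWordRune(r))
	}
}
```

This predicate only classifies individual runes; deciding where a word starts and ends still requires the word boundary rules of UAX #29.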
type WordIterator ¶ added in v0.1.2
type WordIterator struct {
// contains filtered or unexported fields
}
func (*WordIterator) Next ¶ added in v0.1.2
func (gr *WordIterator) Next() bool
Next advances the iterator and returns true if there is still a word to process; otherwise it returns false.
func (*WordIterator) Word ¶ added in v0.1.2
func (gr *WordIterator) Word() Word
Word returns the current `Word`.