segmenter

package

Version: v0.3.4 (latest) Published: Feb 25, 2026 License: BSD-3-Clause, Unlicense Imports: 1 Imported by: 12

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/go-text/typesetting

Documentation ¶

Overview ¶

Package segmenter implements Unicode rules used to segment a paragraph of text according to several criteria. In particular, it provides a way of delimiting line break opportunities.

The API of the package follows the iterator pattern proposed in github.com/npillmayer/uax, but uses a somewhat simpler internal implementation, inspired by Pango.

The reference documentation is at https://unicode.org/reports/tr14 and https://unicode.org/reports/tr29.

Index ¶

type Grapheme
type GraphemeIterator
- func (gr *GraphemeIterator) Grapheme() Grapheme
- func (gr *GraphemeIterator) Next() bool
type Line
type LineIterator
- func (li *LineIterator) Line() Line
- func (li *LineIterator) Next() bool
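type Segmenter
- func (sg *Segmenter) GraphemeIterator() *GraphemeIterator
- func (seg *Segmenter) Init(paragraph []rune)
- func (seg *Segmenter) InitWithBytes(paragraph []byte)
- func (seg *Segmenter) InitWithString(paragraph string)
- func (sg *Segmenter) LineIterator() *LineIterator
- func (sg *Segmenter) WordIterator() *WordIterator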
type Word
type WordIterator
- func (gr *WordIterator) Next() bool
- func (gr *WordIterator) Word() Word

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Grapheme ¶

type Grapheme struct {
	// Text is a subslice of the original input slice, containing the delimited grapheme
	Text []rune
	// Offset is the start of the grapheme in the input rune slice
	Offset int
}

Grapheme is the content of a grapheme delimited by the segmenter.
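The Text and Offset fields describe the same span in two ways. The sketch below illustrates that relationship with a local stand-in type (`segment` and `sliceOf` are hypothetical names, not part of this package), assuming the input was supplied as a rune slice so that Text aliases it.

```go
package main

import "fmt"

// segment is a local stand-in mirroring the Text/Offset pair used by
// Grapheme, Line and Word. It is not the package's own type.
type segment struct {
	Text   []rune
	Offset int
}

// sliceOf recomputes the span covered by s from Offset and len(s.Text).
func sliceOf(input []rune, s segment) []rune {
	return input[s.Offset : s.Offset+len(s.Text)]
}

func main() {
	input := []rune("héllo")
	s := segment{Text: input[1:3], Offset: 1}

	// Both views describe the same runes.
	fmt.Println(string(sliceOf(input, s)) == string(s.Text)) // true

	// Text is a subslice, not a copy: writes to the input show through it.
	input[1] = 'e'
	fmt.Println(string(s.Text)) // el
}
```

Because the text is shared rather than copied, a caller that needs to keep a segment after modifying or reusing the input would have to copy Text first.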

type GraphemeIterator ¶

type GraphemeIterator struct {
	// contains filtered or unexported fields
}

GraphemeIterator provides a convenient way of iterating over the graphemes delimited by a `Segmenter`.

func (*GraphemeIterator) Grapheme ¶

func (gr *GraphemeIterator) Grapheme() Grapheme

Grapheme returns the current `Grapheme`.

func (*GraphemeIterator) Next ¶

func (gr *GraphemeIterator) Next() bool

Next advances the iterator and returns true if there is still a grapheme to process; otherwise it returns false.

type Line ¶

type Line struct {
	// Text is a subslice of the original input slice, containing the delimited line
	Text []rune
	// Offset is the start of the line in the input rune slice
	Offset int
	// IsMandatoryBreak is true if breaking (at the end of the line)
	// is mandatory
	IsMandatoryBreak bool
}

Line is the content of a line delimited by the segmenter.
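To make the role of IsMandatoryBreak concrete, here is a deliberately simplified sketch. The `line` type and `splitLines` function are hypothetical stand-ins: they treat a run of spaces as an optional break opportunity and '\n' as a mandatory break, and they mark the end of the text as mandatory following UAX #14 rule LB3. The real segmenter applies the full UAX #14 rule set, so its output may differ.

```go
package main

import "fmt"

// line is a local stand-in with the same fields as the package's Line.
type line struct {
	Text             []rune
	Offset           int
	IsMandatoryBreak bool
}

// splitLines is a toy splitter, not the UAX #14 algorithm:
//   - a break is allowed after the last space of a run of spaces,
//   - a break is mandatory after '\n' and at the end of the text.
func splitLines(input []rune) []line {
	var out []line
	start := 0
	for i, r := range input {
		atEnd := i == len(input)-1
		mandatory := r == '\n' || atEnd
		optional := r == ' ' && !atEnd && input[i+1] != ' ' && input[i+1] != '\n'
		if mandatory || optional {
			out = append(out, line{
				Text:             input[start : i+1],
				Offset:           start,
				IsMandatoryBreak: mandatory,
			})
			start = i + 1
		}
	}
	return out
}

func main() {
	for _, l := range splitLines([]rune("ab cd\nef")) {
		fmt.Printf("%q offset=%d mandatory=%v\n", string(l.Text), l.Offset, l.IsMandatoryBreak)
	}
}
```

A layout engine would typically be free to merge consecutive segments whose break is optional, but would have to start a new line after any segment whose break is mandatory.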

type LineIterator ¶

type LineIterator struct {
	// contains filtered or unexported fields
}

LineIterator provides a convenient way of iterating over the lines delimited by a `Segmenter`.

func (*LineIterator) Line ¶

func (li *LineIterator) Line() Line

Line returns the current `Line`.

func (*LineIterator) Next ¶

func (li *LineIterator) Next() bool

Next advances the iterator and returns true if there is still a line to process; otherwise it returns false.

type Segmenter ¶

type Segmenter struct {
	// contains filtered or unexported fields
}

Segmenter is the entry point of the package.

Usage:

var seg Segmenter
seg.Init([]rune("Some text to segment."))
iter := seg.LineIterator()
for iter.Next() {
  line := iter.Line()
  ... // do something with line.Text, line.Offset and line.IsMandatoryBreak
}
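The usage pattern above separates a one-time analysis step from a cheap iteration step. The following sketch shows one way such a Next/accessor iterator can be structured, using only the standard library. All names (`chunk`, `chunkIterator`, `newChunkIterator`) are hypothetical, and the break rule (after each run of spaces) is a placeholder for the Unicode rules the package actually applies.

```go
package main

import "fmt"

// chunk is a stand-in for the Line, Word and Grapheme result types.
type chunk struct {
	Text   []rune
	Offset int
}

// chunkIterator walks break positions that were computed up front.
type chunkIterator struct {
	src    []rune
	breaks []int // exclusive end index of each chunk, ascending
	pos    int   // index in breaks of the next chunk to return
	cur    chunk
}

// newChunkIterator plays the role of Init plus an iterator constructor:
// it scans the text once and records where chunks end.
func newChunkIterator(src []rune) *chunkIterator {
	it := &chunkIterator{src: src}
	for i, r := range src {
		last := i == len(src)-1
		if last || (r == ' ' && src[i+1] != ' ') {
			it.breaks = append(it.breaks, i+1)
		}
	}
	return it
}

// Next advances to the next chunk and reports whether one was available.
func (it *chunkIterator) Next() bool {
	if it.pos >= len(it.breaks) {
		return false
	}
	start := 0
	if it.pos > 0 {
		start = it.breaks[it.pos-1]
	}
	it.cur = chunk{Text: it.src[start:it.breaks[it.pos]], Offset: start}
	it.pos++
	return true
}

// Chunk returns the chunk selected by the last successful call to Next.
func (it *chunkIterator) Chunk() chunk { return it.cur }

func main() {
	it := newChunkIterator([]rune("foo bar"))
	for it.Next() {
		c := it.Chunk()
		fmt.Printf("%d %q\n", c.Offset, string(c.Text))
	}
}
```

Keeping the result in the iterator and exposing it through an accessor means the loop condition stays a plain boolean, which is what makes the `for iter.Next() { ... }` form possible.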

func (*Segmenter) GraphemeIterator ¶

func (sg *Segmenter) GraphemeIterator() *GraphemeIterator

GraphemeIterator returns an iterator over the graphemes delimited in [Init].

func (*Segmenter) Init ¶

func (seg *Segmenter) Init(paragraph []rune)

Init resets the segmenter storage with the given input, and computes the attributes required to segment the text.

func (*Segmenter) InitWithBytes ¶ added in v0.3.4

func (seg *Segmenter) InitWithBytes(paragraph []byte)

InitWithBytes resets the segmenter storage with the given byte slice input, and computes the attributes required to segment the text.

If paragraph includes invalid UTF-8 sequences, they are replaced with U+FFFD.

InitWithBytes is more efficient than [Init] if the input is a byte slice. No allocation is made for the text if the segmenter's internal buffer capacity is already large enough.
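The two properties described here, replacement of invalid UTF-8 by U+FFFD and reuse of an existing buffer, can be illustrated with the standard library alone. The helper below (`decodeInto`, a hypothetical name) is a sketch of the general technique, not the package's implementation.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeInto decodes p into buf, reusing buf's backing array when its
// capacity is sufficient. utf8.DecodeRune returns utf8.RuneError (U+FFFD)
// with size 1 for an invalid byte, so bad input is replaced, not dropped.
func decodeInto(buf []rune, p []byte) []rune {
	buf = buf[:0] // keep the capacity, discard the old content
	for len(p) > 0 {
		r, size := utf8.DecodeRune(p)
		buf = append(buf, r)
		p = p[size:]
	}
	return buf
}

func main() {
	buf := make([]rune, 0, 8)
	out := decodeInto(buf, []byte{'a', 0xff, 'b'})

	fmt.Println(len(out))                 // 3
	fmt.Println(out[1] == utf8.RuneError) // true
	// The result shares buf's backing array: no new allocation was needed.
	fmt.Println(&out[0] == &buf[:1][0]) // true
}
```

With this approach, repeated calls on inputs of similar size settle into a steady state with no further allocation for the text.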

func (*Segmenter) InitWithString ¶ added in v0.3.4

func (seg *Segmenter) InitWithString(paragraph string)

InitWithString resets the segmenter storage with the given string input, and computes the attributes required to segment the text.

If paragraph includes invalid UTF-8 sequences, they are replaced with U+FFFD.

InitWithString is more efficient than [Init] if the input is a string. No allocation is made for the text if the segmenter's internal buffer capacity is already large enough.
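For string input, the saving comes from avoiding the intermediate `[]rune(paragraph)` conversion that calling [Init] would require. A minimal sketch of the idea, with a hypothetical helper `appendString` that is not part of this package: ranging over a string already yields U+FFFD for each invalid byte, so the runes can be appended straight into a reusable buffer.

```go
package main

import "fmt"

// appendString refills buf with the runes of s. A range loop over a string
// decodes UTF-8 and yields U+FFFD for every invalid byte.
func appendString(buf []rune, s string) []rune {
	buf = buf[:0]
	for _, r := range s {
		buf = append(buf, r)
	}
	return buf
}

func main() {
	s := "a\xffb"

	buf := make([]rune, 0, 16)
	buf = appendString(buf, s)

	// Same runes as the allocating conversion []rune(s).
	fmt.Println(string(buf) == string([]rune(s))) // true
	fmt.Println(buf[1] == '\uFFFD')               // true
}
```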

func (*Segmenter) LineIterator ¶

func (sg *Segmenter) LineIterator() *LineIterator

LineIterator returns an iterator over the lines delimited in [Init].

func (*Segmenter) WordIterator ¶ added in v0.1.2

func (sg *Segmenter) WordIterator() *WordIterator

WordIterator returns an iterator over the words delimited in [Init].

type Word ¶ added in v0.1.2

type Word struct {
	// Text is a subslice of the original input slice, containing the delimited word
	Text []rune
	// Offset is the start of the word in the input rune slice
	Offset int
}

Word is the content of a word delimited by the segmenter.

More precisely, a word is formed by runes with the [Alphabetic] property, or with a General_Category of Number, delimited by the Word Boundary Unicode Property.
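The rune-level criterion can be approximated with the standard `unicode` package. The sketch below is an assumption-laden simplification: `isWordRune` approximates "Alphabetic or General_Category Number" with letter and number categories plus the Other_Alphabetic, Other_Lowercase and Other_Uppercase tables, and `words` simply collects maximal runs of such runes. It does not implement the UAX #29 word boundary rules, so results can differ from the package on text such as contractions.

```go
package main

import (
	"fmt"
	"unicode"
)

// word is a local stand-in with the same fields as the package's Word.
type word struct {
	Text   []rune
	Offset int
}

// isWordRune approximates "Alphabetic property or General_Category Number".
func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsNumber(r) ||
		unicode.In(r, unicode.Other_Alphabetic, unicode.Other_Lowercase, unicode.Other_Uppercase)
}

// words returns the maximal runs of word runes in input.
// This is a toy delimiter, not the UAX #29 algorithm.
func words(input []rune) []word {
	var out []word
	start := -1 // start of the current run, or -1 when outside a run
	for i, r := range input {
		switch {
		case isWordRune(r) && start < 0:
			start = i
		case !isWordRune(r) && start >= 0:
			out = append(out, word{Text: input[start:i], Offset: start})
			start = -1
		}
	}
	if start >= 0 {
		out = append(out, word{Text: input[start:], Offset: start})
	}
	return out
}

func main() {
	for _, w := range words([]rune("Hello, world 42!")) {
		fmt.Printf("%d %q\n", w.Offset, string(w.Text))
	}
}
```

Note that punctuation and whitespace never appear in the output: under this definition they separate words rather than form words of their own.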

type WordIterator ¶ added in v0.1.2

type WordIterator struct {
	// contains filtered or unexported fields
}

func (*WordIterator) Next ¶ added in v0.1.2

func (gr *WordIterator) Next() bool

Next advances the iterator and returns true if there is still a word to process; otherwise it returns false.

func (*WordIterator) Word ¶ added in v0.1.2

func (gr *WordIterator) Word() Word

Word returns the current `Word`.
