epub

package module
v0.1.1
Published: Oct 9, 2025 License: MIT Imports: 11 Imported by: 0

README

go-epub


Introduction

go-epub is a pure Go toolkit for reading EPUB files (EPUB2 and EPUB3). The project focuses on building a reusable, read-only API surface that can be used in larger systems such as bookshelf services, content analysis pipelines, or document conversion tools. The current version concentrates on reading, while the exposed abstractions allow future write support without breaking changes.

Current Version: v0.1.0

Highlights
  • 📦 Parse the EPUB container to discover OPF packages, table of contents files, and chapters.
  • 🧱 A single Book abstraction providing access to metadata, TOC, and chapter content.
  • 🔍 Convenience helpers for every Dublin Core metadata key with graceful fallbacks when data is missing.
  • 📚 Support for both EPUB2 (NCX) and EPUB3 (navigation documents) TOC formats.
  • 🖼️ Chapter parsing extracts plain text paragraphs and referenced image paths for downstream processing.

Quick Start

# Fetch dependencies
go mod tidy

# Run the example workflow (using samples in testEpubs)
go run ./cmd/demo

The sample in cmd/demo/main.go demonstrates:

  • Loading an EPUB via epub.ReadBook.
  • Accessing Dublin Core metadata (e.g., book.Title(), book.Creator()).
  • Traversing the TOC using book.FlattenTOC() and reading chapter text with book.ChapterByIndex.

API Overview

package main

import (
        "fmt"
        "log"

        epub "github.com/ArcadiaLin/go-epub"
)

func main() {
        book, err := epub.ReadBook("path/to/book.epub")
        if err != nil {
                log.Fatal(err)
        }

        // Dublin Core metadata helpers
        if title, err := book.Title(); err == nil {
                fmt.Println("Title:", title)
        }

        // Generic metadata access
        if values, err := book.MetadataByKey("language"); err == nil {
                fmt.Println("Languages:", values)
        }

        // Iterate over the full metadata map (includes <meta> extensions)
        for key, values := range book.AllMetadata() {
                fmt.Println(key, values)
        }

        // Work with chapters
        fmt.Println("Total chapters:", book.ChapterCount())
        if firstChapter, err := book.ChapterByIndex(0); err == nil {
                fmt.Println(firstChapter.Text())
        }

        // Concatenate the whole book into a single text blob
        fmt.Println(book.AllChaptersText())
}

TOC and Node Utilities
  • book.FlattenTOC() returns a linear TOC view for UI rendering.
  • TOC.FindByHref(href) resolves a node by resource path.
  • HtmlNode / XmlNode include helper methods like Attr, FindAll, and FindNodes for custom extensions.

Design Notes

  • Book acts as the unified entry point, internally managing Container, OPF, TOC, and Chapters.
  • All operations are side-effect-free, consistent with a read-only design philosophy.
  • The extensible API surface (e.g., Chapter.Clone, Metadata.GetAll) enables caching or write support in the future.

To integrate this library, simply import the epub package and call ReadBook. The toolkit uses robust XML/HTML parsing logic for stable behavior across various EPUB implementations.

Documentation

Index

Constants

const (
	TOCTypeEPUB2   = "EPUB2"
	TOCTypeEPUB3   = "EPUB3"
	TOCTypeUnknown = "UNKNOWN"
)

Variables

var (
	// ErrMetadataUndefined indicates that a requested metadata field is not
	// present in the document.
	ErrMetadataUndefined = errors.New("metadata not defined")
	// ErrChapterNotFound indicates that the requested chapter could not be
	// located either by ID or by index.
	ErrChapterNotFound = errors.New("chapter not found")
)

Functions

This section is empty.

Types

type Book

type Book struct {
	Container *Container `json:"container,omitempty"`
	Opf       *Opf       `json:"opf,omitempty"`
	TOC       *TOC       `json:"toc,omitempty"`
	Chapters  []Chapter  `json:"chapters,omitempty"`
}

func NewBook

func NewBook() *Book

NewBook creates an empty book structure that can later be populated by the parser. The slices are initialised to avoid nil handling by callers.

func ReadBook

func ReadBook(epubPath string) (*Book, error)

ReadBook parses the EPUB file located at epubPath and populates a Book structure with metadata, table of contents and chapter information.

func (*Book) AllChaptersText

func (b *Book) AllChaptersText() string

AllChaptersText concatenates every chapter's text content in reading order.

func (*Book) AllMetadata

func (b *Book) AllMetadata() map[string][]string

AllMetadata returns all metadata entries, including extension fields defined in the OPF <meta> tags.

func (*Book) ChapterByID

func (b *Book) ChapterByID(id string) (*Chapter, error)

ChapterByID returns the chapter that matches the provided ID.

func (*Book) ChapterByIndex

func (b *Book) ChapterByIndex(index int) (*Chapter, error)

ChapterByIndex returns the chapter by its ordinal index (0-based).

func (*Book) ChapterCount

func (b *Book) ChapterCount() int

ChapterCount returns the number of parsed chapters.

func (*Book) ChapterTextByID

func (b *Book) ChapterTextByID(id string) (string, error)

ChapterTextByID returns the joined text content of the chapter with the provided ID.

func (*Book) ChapterTextByIndex

func (b *Book) ChapterTextByIndex(index int) (string, error)

ChapterTextByIndex returns the joined text content of the chapter at the provided index.

func (*Book) Contributor

func (b *Book) Contributor() (string, error)

func (*Book) Coverage

func (b *Book) Coverage() (string, error)

func (*Book) Creator

func (b *Book) Creator() (string, error)

func (*Book) Date

func (b *Book) Date() (string, error)

func (*Book) Description

func (b *Book) Description() (string, error)

func (*Book) FlattenTOC

func (b *Book) FlattenTOC() []TOC

FlattenTOC returns the table of contents entries as a slice, skipping the synthetic root node if present.

func (*Book) Format

func (b *Book) Format() (string, error)

func (*Book) Identifier

func (b *Book) Identifier() (string, error)

func (*Book) Language

func (b *Book) Language() (string, error)

func (*Book) MetadataByKey

func (b *Book) MetadataByKey(key string) ([]string, error)

MetadataByKey provides direct access to Dublin Core metadata using a dynamic key. It is a convenience wrapper around MetadataValues and should be used by callers that need to iterate over keys.

func (*Book) MetadataValue

func (b *Book) MetadataValue(key string) (string, error)

MetadataValue returns the first value for the provided Dublin Core key.

func (*Book) MetadataValues

func (b *Book) MetadataValues(key string) ([]string, error)

MetadataValues returns all values for the given Dublin Core metadata key.

func (*Book) Publisher

func (b *Book) Publisher() (string, error)

func (*Book) Relation

func (b *Book) Relation() (string, error)

func (*Book) Rights

func (b *Book) Rights() (string, error)

func (*Book) Source

func (b *Book) Source() (string, error)

func (*Book) Subject

func (b *Book) Subject() (string, error)

func (*Book) Title

func (b *Book) Title() (string, error)

Title is one of the helpers that expose a method for each Dublin Core metadata field (Contributor, Coverage, Creator, Date, and so on).

func (*Book) Type

func (b *Book) Type() (string, error)

type Chapter

type Chapter struct {
	ID         string
	Path       string
	Title      string
	Paragraphs []string
	Images     []string
}

func ParseChapter

func ParseChapter(id, href string, f *zip.File) (*Chapter, error)

func (*Chapter) Clone

func (c *Chapter) Clone() Chapter

Clone returns a deep copy of the chapter. This is handy when callers want to modify the returned value without affecting the book cache.
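
What a deep copy means for a struct with slice fields can be sketched as follows. The Chapter type here mirrors the documented fields, and cloneChapter is an illustrative helper written for this example; the library's own Clone may be implemented differently.

```go
package main

import "fmt"

// Chapter mirrors the fields of the documented Chapter type.
type Chapter struct {
	ID         string
	Path       string
	Title      string
	Paragraphs []string
	Images     []string
}

// cloneChapter is an illustrative deep copy: both slices are re-allocated
// so that edits to the copy never reach the original.
func cloneChapter(c Chapter) Chapter {
	out := c
	out.Paragraphs = append([]string(nil), c.Paragraphs...)
	out.Images = append([]string(nil), c.Images...)
	return out
}

func main() {
	orig := Chapter{ID: "ch1", Paragraphs: []string{"first"}}
	cp := cloneChapter(orig)
	cp.Paragraphs[0] = "edited"
	fmt.Println(orig.Paragraphs[0], cp.Paragraphs[0]) // prints "first edited"
}
```

A plain struct assignment would share the backing arrays of Paragraphs and Images, which is why the slices are copied explicitly.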

func (*Chapter) HasImages

func (c *Chapter) HasImages() bool

HasImages reports whether the chapter contains any referenced images.

func (*Chapter) Text

func (c *Chapter) Text() string

Text joins all extracted paragraphs into a single string separated by blank lines. The returned value is suitable for plain-text readers.
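
A minimal sketch of the joining step, assuming that "separated by blank lines" means exactly one empty line between paragraphs. joinParagraphs is a name chosen for this example only.

```go
package main

import (
	"fmt"
	"strings"
)

// joinParagraphs puts one blank line between consecutive paragraphs.
// The exact separator used by the library is an assumption here.
func joinParagraphs(paragraphs []string) string {
	return strings.Join(paragraphs, "\n\n")
}

func main() {
	fmt.Println(joinParagraphs([]string{"First paragraph.", "Second paragraph."}))
}
```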

type Container

type Container struct {
	Rootfiles []EmptyXmlNode
}

func ParseContainer

func ParseContainer(content []byte) (*Container, error)

ParseContainer parses the META-INF/container.xml document into a Container structure.

func (*Container) FindOpfFile

func (c *Container) FindOpfFile() (string, error)
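
In an EPUB, META-INF/container.xml lists rootfile elements whose full-path attribute points at the OPF package document and whose media-type is application/oebps-package+xml. The sketch below shows one plausible way to pick the OPF path from Container.Rootfiles. The EmptyXmlNode type mirrors the documented one, while findOpfPath and its error message are hypothetical and not taken from the library.

```go
package main

import (
	"errors"
	"fmt"
)

// EmptyXmlNode mirrors the documented type: a childless element.
type EmptyXmlNode struct {
	Name  string
	Attrs map[string]string
}

// Attr follows the documented signature and reports whether the
// attribute exists.
func (n EmptyXmlNode) Attr(name string) (string, bool) {
	v, ok := n.Attrs[name]
	return v, ok
}

// findOpfPath is a hypothetical helper: it returns the full-path of the
// first rootfile declared as an OPF package document.
func findOpfPath(rootfiles []EmptyXmlNode) (string, error) {
	for _, rf := range rootfiles {
		mediaType, _ := rf.Attr("media-type")
		if p, ok := rf.Attr("full-path"); ok && mediaType == "application/oebps-package+xml" {
			return p, nil
		}
	}
	return "", errors.New("no OPF rootfile found")
}

func main() {
	rootfiles := []EmptyXmlNode{{
		Name: "rootfile",
		Attrs: map[string]string{
			"full-path":  "OEBPS/content.opf",
			"media-type": "application/oebps-package+xml",
		},
	}}
	p, err := findOpfPath(rootfiles)
	fmt.Println(p, err) // prints "OEBPS/content.opf <nil>"
}
```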

type EmptyXmlNode

type EmptyXmlNode struct {
	Name  string
	Attrs map[string]string
}

func (EmptyXmlNode) Attr

func (n EmptyXmlNode) Attr(name string) (string, bool)

Attr returns the attribute value for the provided name if it exists.

func (EmptyXmlNode) HasAttr

func (n EmptyXmlNode) HasAttr(name string) bool

HasAttr reports whether the given attribute is defined on the node.

type HtmlNode

type HtmlNode struct {
	Type     NodeType
	Name     string
	Attrs    map[string]string
	Content  string
	Children []*HtmlNode
}

HtmlNode is the generic node structure for HTML documents.

func ParseHTML

func ParseHTML(r io.Reader) (*HtmlNode, error)

ParseHTML parses HTML/XHTML and returns an HtmlNode tree.

func (*HtmlNode) Attr

func (hn *HtmlNode) Attr(name string) (string, bool)

Attr returns the attribute value for the provided name if it exists.

func (*HtmlNode) FindAll

func (hn *HtmlNode) FindAll(name string) []*HtmlNode

FindAll collects all descendant nodes whose element name matches the provided name. The search is case sensitive in order to avoid unexpected matches when working with XHTML documents.

func (*HtmlNode) FindNode

func (hn *HtmlNode) FindNode(name string) *HtmlNode

func (*HtmlNode) NodeText

func (hn *HtmlNode) NodeText() string

NodeText recursively extracts all text, stripping tags and redundant whitespace.
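
Recursive text extraction over such a tree can be sketched as below. The node types mirror the documented HtmlNode and NodeType; collectText and nodeText are illustrative helpers, and the whitespace handling (a space between nodes, then collapsing runs of whitespace) is an assumption rather than the library's exact behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

type NodeType int

const (
	ElementNode NodeType = iota
	TextNode
)

// HtmlNode mirrors the documented structure.
type HtmlNode struct {
	Type     NodeType
	Name     string
	Attrs    map[string]string
	Content  string
	Children []*HtmlNode
}

// collectText appends the content of every text node under n, depth first.
func collectText(n *HtmlNode, sb *strings.Builder) {
	if n == nil {
		return
	}
	if n.Type == TextNode {
		sb.WriteString(n.Content)
		sb.WriteString(" ")
	}
	for _, c := range n.Children {
		collectText(c, sb)
	}
}

// nodeText gathers all text and collapses whitespace runs to single spaces.
func nodeText(n *HtmlNode) string {
	var sb strings.Builder
	collectText(n, &sb)
	return strings.Join(strings.Fields(sb.String()), " ")
}

func main() {
	p := &HtmlNode{Type: ElementNode, Name: "p", Children: []*HtmlNode{
		{Type: TextNode, Content: "  Hello "},
		{Type: ElementNode, Name: "em", Children: []*HtmlNode{
			{Type: TextNode, Content: "EPUB"},
		}},
		{Type: TextNode, Content: "\n world "},
	}}
	fmt.Println(nodeText(p)) // prints "Hello EPUB world"
}
```

Inserting a space after every text node keeps the sketch short, but it would also split a word that is interrupted by inline markup.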

type Manifest

type Manifest struct {
	Items []EmptyXmlNode
}

Manifest models the OPF manifest section.

func (*Manifest) HrefLookup

func (mf *Manifest) HrefLookup(opfPath string) map[string]string

HrefLookup builds an id to href lookup map.

func (*Manifest) ItemByID

func (mf *Manifest) ItemByID(id string) (EmptyXmlNode, bool)

ItemByID returns the manifest entry that matches the given ID.

func (*Manifest) MediaTypeByID

func (mf *Manifest) MediaTypeByID(id string) (string, bool)

MediaTypeByID returns the media-type attribute for the manifest entry.

type MetaEntry

type MetaEntry struct {
	Value string            // Text content
	Attrs map[string]string // Attributes (refines, property, scheme, id, etc.)
}

MetaEntry stores metadata element values and attributes.

type Metadata

type Metadata struct {
	Data map[string]map[string][]MetaEntry
}

Metadata maps each namespace to its tags and each tag to its entries.

func (*Metadata) First

func (md *Metadata) First(key string) (string, bool)

First returns the first metadata value associated with the key.

func (*Metadata) Get

func (md *Metadata) Get(key string) []string

Get returns all normalized metadata values for the provided Dublin Core key.

func (*Metadata) GetAll

func (md *Metadata) GetAll() map[string][]string

GetAll flattens metadata into a simple key/value representation.
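
The flattening from the nested namespace/tag layout to a plain map can be pictured with the sketch below. The types mirror the documented Metadata and MetaEntry, but flatten is a hypothetical helper: keying the result by bare tag name and dropping the namespace and attributes are assumptions, and the library may name keys differently.

```go
package main

import "fmt"

// MetaEntry and Metadata mirror the documented types.
type MetaEntry struct {
	Value string
	Attrs map[string]string
}

type Metadata struct {
	Data map[string]map[string][]MetaEntry
}

// flatten drops the namespace level and keeps tag -> values.
// If two namespaces share a tag, the order of the merged values depends
// on Go's map iteration order and is therefore not fixed.
func flatten(md Metadata) map[string][]string {
	out := make(map[string][]string)
	for _, tags := range md.Data {
		for tag, entries := range tags {
			for _, e := range entries {
				out[tag] = append(out[tag], e.Value)
			}
		}
	}
	return out
}

func main() {
	md := Metadata{Data: map[string]map[string][]MetaEntry{
		"dc": {
			"title":   {{Value: "Example Title"}},
			"creator": {{Value: "Jane Doe"}, {Value: "John Roe"}},
		},
	}}
	flat := flatten(md)
	fmt.Println(flat["title"], flat["creator"])
}
```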

func (*Metadata) Normalize

func (md *Metadata) Normalize() map[string][]string

Normalize returns a view of the metadata with keys normalized for compatibility.

type NodeType

type NodeType int
const (
	ElementNode NodeType = iota
	TextNode
)

type Opf

type Opf struct {
	XmlNode  *XmlNode
	Metadata *Metadata
	Manifest *Manifest
	Spine    *Spine
}

func ParseOpf

func ParseOpf(content []byte) (*Opf, error)

ParseOpf parses the OPF document and eagerly populates its major sections.

func (*Opf) ChapterPaths

func (opf *Opf) ChapterPaths(opfPath string) []string

ChapterPaths returns the ordered list of chapter document hrefs resolved against the OPF path.
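
Manifest hrefs are relative to the OPF file, so resolving them means joining each href with the OPF's directory. A minimal sketch of that step follows; resolveHref is a name chosen for this example and is not part of the package.

```go
package main

import (
	"fmt"
	"path"
)

// resolveHref joins a manifest href with the directory of the OPF file.
// ZIP entry names use forward slashes, so package path is used rather
// than path/filepath.
func resolveHref(opfPath, href string) string {
	return path.Join(path.Dir(opfPath), href)
}

func main() {
	fmt.Println(resolveHref("OEBPS/content.opf", "text/ch1.xhtml")) // prints "OEBPS/text/ch1.xhtml"
	fmt.Println(resolveHref("content.opf", "ch1.xhtml"))            // prints "ch1.xhtml"
}
```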

func (*Opf) FindTOCFile

func (opf *Opf) FindTOCFile(opfPath string) (tocType string, tocFile string)

func (*Opf) ParseManifest

func (opf *Opf) ParseManifest() error

ParseManifest parses the manifest section of the OPF document.

func (*Opf) ParseMetadata

func (opf *Opf) ParseMetadata() error

ParseMetadata extracts the metadata node from the OPF root.

func (*Opf) ParseSpine

func (opf *Opf) ParseSpine() error

ParseSpine parses the spine section of the OPF document.

type Spine

type Spine struct {
	Itemrefs []EmptyXmlNode
	Attrs    map[string]string
}

Spine models the OPF spine section.

func (*Spine) ExtractChapterIDs

func (spine *Spine) ExtractChapterIDs() []string

ExtractChapterIDs returns the ordered chapter IDs from the spine.

func (*Spine) Len

func (spine *Spine) Len() int

Len returns the number of spine itemrefs.

type TOC

type TOC struct {
	Title    string
	Href     string
	Children []TOC
}

func ParseTOC

func ParseTOC(tocType, tocFile, opfDir string, files map[string]*zip.File) ([]TOC, error)

func (TOC) FindByHref

func (t TOC) FindByHref(href string) *TOC

FindByHref returns the first TOC entry whose href matches the provided value. The comparison is performed after cleaning the href to make it resilient to small differences in path formatting.
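
The idea of comparing cleaned hrefs can be sketched as follows. The TOC type mirrors the documented fields; findByHref is an illustrative helper, and treating "cleaning" as path.Clean is an assumption (fragment identifiers such as #section are not handled here).

```go
package main

import (
	"fmt"
	"path"
)

// TOC mirrors the documented structure.
type TOC struct {
	Title    string
	Href     string
	Children []TOC
}

// findByHref returns the first entry, in depth-first order, whose cleaned
// href equals the cleaned target. Entries without an href are skipped.
func findByHref(t *TOC, href string) *TOC {
	want := path.Clean(href)
	if t.Href != "" && path.Clean(t.Href) == want {
		return t
	}
	for i := range t.Children {
		if found := findByHref(&t.Children[i], href); found != nil {
			return found
		}
	}
	return nil
}

func main() {
	root := TOC{Children: []TOC{
		{Title: "Chapter 1", Href: "text/ch1.xhtml"},
		{Title: "Chapter 2", Href: "text/ch2.xhtml"},
	}}
	if n := findByHref(&root, "./text//ch2.xhtml"); n != nil {
		fmt.Println(n.Title) // prints "Chapter 2"
	}
}
```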

func (TOC) Flatten

func (t TOC) Flatten() []TOC

Flatten flattens the table of contents into a slice preserving the natural reading order.

func (TOC) Walk

func (t TOC) Walk(fn func(TOC) error) error

Walk traverses the table-of-contents tree in depth-first order and invokes fn for every entry. Returning a non-nil error from the callback aborts the walk and propagates the error to the caller.

type XmlNode

type XmlNode struct {
	XMLName  xml.Name
	Attrs    []xml.Attr `xml:",any,attr"`
	Content  string     `xml:",chardata"`
	XmlNodes []XmlNode  `xml:",any"`
}

func ParseXML

func ParseXML(r io.Reader) (*XmlNode, error)

ParseXML decodes the XML stream into an XmlNode tree.

func (*XmlNode) Attr

func (xn *XmlNode) Attr(name string) (string, bool)

Attr returns the attribute value for the provided name if it exists.

func (*XmlNode) FindNode

func (xn *XmlNode) FindNode(name string) *XmlNode

func (*XmlNode) FindNodes

func (xn *XmlNode) FindNodes(name string) []*XmlNode

FindNodes returns all descendants with the provided local name.

func (*XmlNode) NodeText

func (xn *XmlNode) NodeText() string

Directories

Path Synopsis
cmd/demo command
