epub

package module

Version: v0.1.1 (latest) | Published: Oct 9, 2025 | License: MIT | Imports: 11 | Imported by: 0

Repository

github.com/ArcadiaLin/go-epub

README ¶

go-epub

Introduction

go-epub is a pure Go toolkit for reading EPUB files (EPUB2 and EPUB3). The project focuses on building a reusable, read-only API surface that can be used in larger systems such as bookshelf services, content analysis pipelines, or document conversion tools. The current version concentrates on reading, while the exposed abstractions allow future write support without breaking changes.

Current Version: v0.1.1

Highlights

📦 Parse the EPUB container to discover OPF packages, table of contents files, and chapters.
🧱 A single Book abstraction providing access to metadata, TOC, and chapter content.
🔍 Convenience helpers for every Dublin Core metadata key with graceful fallbacks when data is missing.
📚 Support for both EPUB2 (NCX) and EPUB3 (navigation documents) TOC formats.
🖼️ Chapter parsing extracts plain text paragraphs and referenced image paths for downstream processing.

Quick Start

# Fetch dependencies
go mod tidy

# Run the example workflow (using samples in testEpubs)
go run ./cmd/demo

The sample in cmd/demo/main.go demonstrates:

Loading an EPUB via epub.ReadBook.
Accessing Dublin Core metadata (e.g., book.Title(), book.Creator()).
Traversing the TOC using book.FlattenTOC() and reading chapter text with book.ChapterByIndex.

API Overview

package main

import (
	"fmt"
	"log"

	epub "github.com/ArcadiaLin/go-epub"
)

func main() {
	book, err := epub.ReadBook("path/to/book.epub")
	if err != nil {
		log.Fatal(err)
	}

	// Dublin Core metadata helpers
	if title, err := book.Title(); err == nil {
		fmt.Println("Title:", title)
	}

	// Generic metadata access
	if values, err := book.MetadataByKey("language"); err == nil {
		fmt.Println("Languages:", values)
	}

	// Iterate over the full metadata map (includes <meta> extensions)
	for key, values := range book.AllMetadata() {
		fmt.Println(key, values)
	}

	// Work with chapters
	fmt.Println("Total chapters:", book.ChapterCount())
	if firstChapter, err := book.ChapterByIndex(0); err == nil {
		fmt.Println(firstChapter.Text())
	}

	// Concatenate the whole book into a single text blob
	fmt.Println(book.AllChaptersText())
}

TOC and Node Utilities

book.FlattenTOC() returns a linear TOC view for UI rendering.
TOC.FindByHref(href) resolves a node by resource path.
HtmlNode / XmlNode include helper methods like Attr, FindAll, and FindNodes for custom extensions.

Design Notes

Book acts as the unified entry point, internally managing Container, OPF, TOC, and Chapters.
All operations are side-effect-free, consistent with a read-only design philosophy.
The extensible API surface (e.g., Chapter.Clone, Metadata.GetAll) enables caching or write support in the future.

To integrate this library, simply import the epub package and call ReadBook. The toolkit uses robust XML/HTML parsing logic for stable behavior across various EPUB implementations.

Documentation ¶

Index ¶

Constants
Variables
type Book
- func NewBook() *Book
- func ReadBook(epubPath string) (*Book, error)
- func (b *Book) AllChaptersText() string
- func (b *Book) AllMetadata() map[string][]string
- func (b *Book) ChapterByID(id string) (*Chapter, error)
- func (b *Book) ChapterByIndex(index int) (*Chapter, error)
- func (b *Book) ChapterCount() int
- func (b *Book) ChapterTextByID(id string) (string, error)
- func (b *Book) ChapterTextByIndex(index int) (string, error)
- func (b *Book) Contributor() (string, error)
- func (b *Book) Coverage() (string, error)
- func (b *Book) Creator() (string, error)
- func (b *Book) Date() (string, error)
- func (b *Book) Description() (string, error)
- func (b *Book) FlattenTOC() []TOC
- func (b *Book) Format() (string, error)
- func (b *Book) Identifier() (string, error)
- func (b *Book) Language() (string, error)
- func (b *Book) MetadataByKey(key string) ([]string, error)
- func (b *Book) MetadataValue(key string) (string, error)
- func (b *Book) MetadataValues(key string) ([]string, error)
- func (b *Book) Publisher() (string, error)
- func (b *Book) Relation() (string, error)
- func (b *Book) Rights() (string, error)
- func (b *Book) Source() (string, error)
- func (b *Book) Subject() (string, error)
- func (b *Book) Title() (string, error)
- func (b *Book) Type() (string, error)
type Chapter
- func ParseChapter(id, href string, f *zip.File) (*Chapter, error)
- func (c *Chapter) Clone() Chapter
- func (c *Chapter) HasImages() bool
- func (c *Chapter) Text() string
type Container
- func ParseContainer(content []byte) (*Container, error)
- func (c *Container) FindOpfFile() (string, error)
type EmptyXmlNode
- func (n EmptyXmlNode) Attr(name string) (string, bool)
- func (n EmptyXmlNode) HasAttr(name string) bool
type HtmlNode
- func ParseHTML(r io.Reader) (*HtmlNode, error)
- func (hn *HtmlNode) Attr(name string) (string, bool)
- func (hn *HtmlNode) FindAll(name string) []*HtmlNode
- func (hn *HtmlNode) FindNode(name string) *HtmlNode
- func (hn *HtmlNode) NodeText() string
type Manifest
- func (mf *Manifest) HrefLookup(opfPath string) map[string]string
- func (mf *Manifest) ItemByID(id string) (EmptyXmlNode, bool)
- func (mf *Manifest) MediaTypeByID(id string) (string, bool)
type MetaEntry
type Metadata
- func (md *Metadata) First(key string) (string, bool)
- func (md *Metadata) Get(key string) []string
- func (md *Metadata) GetAll() map[string][]string
- func (md *Metadata) Normalize() map[string][]string
type NodeType
type Opf
- func ParseOpf(content []byte) (*Opf, error)
- func (opf *Opf) ChapterPaths(opfPath string) []string
- func (opf *Opf) FindTOCFile(opfPath string) (tocType string, tocFile string)
- func (opf *Opf) ParseManifest() error
- func (opf *Opf) ParseMetadata() error
- func (opf *Opf) ParseSpine() error
type Spine
- func (spine *Spine) ExtractChapterIDs() []string
- func (spine *Spine) Len() int
type TOC
- func ParseTOC(tocType, tocFile, opfDir string, files map[string]*zip.File) ([]TOC, error)
- func (t TOC) FindByHref(href string) *TOC
- func (t TOC) Flatten() []TOC
- func (t TOC) Walk(fn func(TOC) error) error
type XmlNode
- func ParseXML(r io.Reader) (*XmlNode, error)
- func (xn *XmlNode) Attr(name string) (string, bool)
- func (xn *XmlNode) FindNode(name string) *XmlNode
- func (xn *XmlNode) FindNodes(name string) []*XmlNode
- func (xn *XmlNode) NodeText() string

Constants ¶

const (
	TOCTypeEPUB2   = "EPUB2"
	TOCTypeEPUB3   = "EPUB3"
	TOCTypeUnknown = "UNKNOWN"
)

Variables ¶

var (
	// ErrMetadataUndefined indicates that a requested metadata field is not
	// present in the document.
	ErrMetadataUndefined = errors.New("metadata not defined")
	// ErrChapterNotFound indicates that the requested chapter could not be
	// located either by ID or by index.
	ErrChapterNotFound = errors.New("chapter not found")
)

Functions ¶

This section is empty.

Types ¶

type Book ¶

type Book struct {
	Container *Container `json:"container,omitempty"`
	Opf       *Opf       `json:"opf,omitempty"`
	TOC       *TOC       `json:"toc,omitempty"`
	Chapters  []Chapter  `json:"chapters,omitempty"`
}

func NewBook ¶

func NewBook() *Book

NewBook creates an empty book structure that can later be populated by the parser. The slices are initialised to avoid nil handling by callers.

func ReadBook ¶

func ReadBook(epubPath string) (*Book, error)

ReadBook parses the EPUB file located at epubPath and populates a Book structure with metadata, table of contents and chapter information.

func (*Book) AllChaptersText ¶

func (b *Book) AllChaptersText() string

AllChaptersText concatenates every chapter's text content in reading order.

func (*Book) AllMetadata ¶

func (b *Book) AllMetadata() map[string][]string

AllMetadata returns all metadata entries, including extension fields defined in the OPF <meta> tags.

func (*Book) ChapterByID ¶

func (b *Book) ChapterByID(id string) (*Chapter, error)

ChapterByID returns the chapter that matches the provided ID.

func (*Book) ChapterByIndex ¶

func (b *Book) ChapterByIndex(index int) (*Chapter, error)

ChapterByIndex returns the chapter by its ordinal index (0-based).

func (*Book) ChapterCount ¶

func (b *Book) ChapterCount() int

ChapterCount returns the number of parsed chapters.

func (*Book) ChapterTextByID ¶

func (b *Book) ChapterTextByID(id string) (string, error)

ChapterTextByID returns the joined text content of the chapter with the provided ID.

func (*Book) ChapterTextByIndex ¶

func (b *Book) ChapterTextByIndex(index int) (string, error)

ChapterTextByIndex returns the joined text content of the chapter at the provided index.

func (*Book) Contributor ¶

func (b *Book) Contributor() (string, error)

func (*Book) Coverage ¶

func (b *Book) Coverage() (string, error)

func (*Book) Creator ¶

func (b *Book) Creator() (string, error)

func (*Book) Date ¶

func (b *Book) Date() (string, error)

func (*Book) Description ¶

func (b *Book) Description() (string, error)

func (*Book) FlattenTOC ¶

func (b *Book) FlattenTOC() []TOC

FlattenTOC returns the table of contents entries as a slice, skipping the synthetic root node if present.

func (*Book) Format ¶

func (b *Book) Format() (string, error)

func (*Book) Identifier ¶

func (b *Book) Identifier() (string, error)

func (*Book) Language ¶

func (b *Book) Language() (string, error)

func (*Book) MetadataByKey ¶

func (b *Book) MetadataByKey(key string) ([]string, error)

MetadataByKey provides direct access to Dublin Core metadata using a dynamic key. It is a convenience wrapper around MetadataValues and should be used by callers that need to iterate over keys.

func (*Book) MetadataValue ¶

func (b *Book) MetadataValue(key string) (string, error)

MetadataValue returns the first value for the provided Dublin Core key.

func (*Book) MetadataValues ¶

func (b *Book) MetadataValues(key string) ([]string, error)

MetadataValues returns all values for the given Dublin Core metadata key.

func (*Book) Publisher ¶

func (b *Book) Publisher() (string, error)

func (*Book) Relation ¶

func (b *Book) Relation() (string, error)

func (*Book) Rights ¶

func (b *Book) Rights() (string, error)

func (*Book) Source ¶

func (b *Book) Source() (string, error)

func (*Book) Subject ¶

func (b *Book) Subject() (string, error)

func (*Book) Title ¶

func (b *Book) Title() (string, error)

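Title is one of the Dublin Core helpers. Book exposes one such method per Dublin Core metadata field: Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, and Type.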

func (*Book) Type ¶

func (b *Book) Type() (string, error)

type Chapter ¶

type Chapter struct {
	ID         string
	Path       string
	Title      string
	Paragraphs []string
	Images     []string
}

func ParseChapter ¶

func ParseChapter(id, href string, f *zip.File) (*Chapter, error)

func (*Chapter) Clone ¶

func (c *Chapter) Clone() Chapter

Clone returns a deep copy of the chapter. This is handy when callers want to modify the returned value without affecting the book cache.
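
To illustrate why a deep copy matters for a struct that holds slices, here is a minimal sketch. The Chapter type below is a local mirror of the documented fields, and deepCopy is a hypothetical stand-in; the package's actual Clone implementation is not shown on this page.

```go
package main

import "fmt"

// Chapter mirrors the fields of the documented struct.
type Chapter struct {
	ID         string
	Path       string
	Title      string
	Paragraphs []string
	Images     []string
}

// deepCopy is a hypothetical stand-in for Clone. A plain struct assignment
// would copy the slice headers only, so both values would share the same
// backing arrays; copying each slice breaks that link.
func deepCopy(c Chapter) Chapter {
	out := c
	out.Paragraphs = append([]string(nil), c.Paragraphs...)
	out.Images = append([]string(nil), c.Images...)
	return out
}

func main() {
	orig := Chapter{ID: "ch1", Paragraphs: []string{"first"}}
	cp := deepCopy(orig)
	cp.Paragraphs[0] = "edited"
	fmt.Println(orig.Paragraphs[0], cp.Paragraphs[0]) // prints "first edited"
}
```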

func (*Chapter) HasImages ¶

func (c *Chapter) HasImages() bool

HasImages reports whether the chapter contains any referenced images.

func (*Chapter) Text ¶

func (c *Chapter) Text() string

Text joins all extracted paragraphs into a single string separated by blank lines. The returned value is suitable for plain-text readers.
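
A minimal sketch of the joining behavior described above, assuming the separator is exactly one blank line ("\n\n"). The function name joinParagraphs is hypothetical and the exact separator used by the package is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// joinParagraphs separates paragraphs with one blank line.
// The "\n\n" separator is an assumption based on the description of Text.
func joinParagraphs(paragraphs []string) string {
	return strings.Join(paragraphs, "\n\n")
}

func main() {
	fmt.Println(joinParagraphs([]string{"First paragraph.", "Second paragraph."}))
}
```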

type Container ¶

type Container struct {
	Rootfiles []EmptyXmlNode
}

func ParseContainer ¶

func ParseContainer(content []byte) (*Container, error)

ParseContainer parses the META-INF/container.xml document into a Container structure.

func (*Container) FindOpfFile ¶

func (c *Container) FindOpfFile() (string, error)
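
For readers unfamiliar with the EPUB container format, the sketch below shows one way to locate the OPF package path in META-INF/container.xml using only encoding/xml. It is not the package's implementation: the struct names and findOpfPath are hypothetical, and the sample document is a hand-written example.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
)

// rootfile captures the two attributes of interest on each <rootfile>.
type rootfile struct {
	FullPath  string `xml:"full-path,attr"`
	MediaType string `xml:"media-type,attr"`
}

// container models <container><rootfiles><rootfile .../></rootfiles></container>.
type container struct {
	Rootfiles []rootfile `xml:"rootfiles>rootfile"`
}

// findOpfPath returns the full-path of the first rootfile whose media type
// marks it as an OPF package document.
func findOpfPath(content []byte) (string, error) {
	var c container
	if err := xml.Unmarshal(content, &c); err != nil {
		return "", err
	}
	for _, rf := range c.Rootfiles {
		if rf.MediaType == "application/oebps-package+xml" {
			return rf.FullPath, nil
		}
	}
	return "", errors.New("no OPF rootfile found")
}

// sample is a hand-written container.xml used only for this example.
const sample = `<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>`

func main() {
	p, err := findOpfPath([]byte(sample))
	fmt.Println(p, err)
}
```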

type EmptyXmlNode ¶

type EmptyXmlNode struct {
	Name  string
	Attrs map[string]string
}

func (EmptyXmlNode) Attr ¶

func (n EmptyXmlNode) Attr(name string) (string, bool)

Attr returns the attribute value for the provided name if it exists.

func (EmptyXmlNode) HasAttr ¶

func (n EmptyXmlNode) HasAttr(name string) bool

HasAttr reports whether the given attribute is defined on the node.

type HtmlNode ¶

type HtmlNode struct {
	Type     NodeType
	Name     string
	Attrs    map[string]string
	Content  string
	Children []*HtmlNode
}

HtmlNode is the generic node structure for HTML documents.

func ParseHTML ¶

func ParseHTML(r io.Reader) (*HtmlNode, error)

ParseHTML parses HTML/XHTML and returns an HtmlNode tree.

func (*HtmlNode) Attr ¶

func (hn *HtmlNode) Attr(name string) (string, bool)

Attr returns the attribute value for the provided name if it exists.

func (*HtmlNode) FindAll ¶

func (hn *HtmlNode) FindAll(name string) []*HtmlNode

FindAll collects all descendant nodes whose element name matches the provided name. The search is case sensitive in order to avoid unexpected matches when working with XHTML documents.
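
The case-sensitive descendant search can be sketched as a short recursion. The types below are local mirrors of the documented HtmlNode and NodeType, and findAll is a hypothetical free function; the traversal order used by the package is an assumption (document order here).

```go
package main

import "fmt"

type NodeType int

const (
	ElementNode NodeType = iota
	TextNode
)

// HtmlNode mirrors the documented struct.
type HtmlNode struct {
	Type     NodeType
	Name     string
	Attrs    map[string]string
	Content  string
	Children []*HtmlNode
}

// findAll returns every descendant element whose name equals name exactly.
// The comparison is case sensitive, so "p" does not match "P".
func findAll(n *HtmlNode, name string) []*HtmlNode {
	if n == nil {
		return nil
	}
	var out []*HtmlNode
	for _, child := range n.Children {
		if child.Type == ElementNode && child.Name == name {
			out = append(out, child)
		}
		out = append(out, findAll(child, name)...)
	}
	return out
}

func main() {
	body := &HtmlNode{Name: "body", Children: []*HtmlNode{
		{Name: "p"},
		{Name: "div", Children: []*HtmlNode{{Name: "p"}, {Name: "P"}}},
	}}
	fmt.Println(len(findAll(body, "p"))) // prints 2
}
```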

func (*HtmlNode) FindNode ¶

func (hn *HtmlNode) FindNode(name string) *HtmlNode

func (*HtmlNode) NodeText ¶

func (hn *HtmlNode) NodeText() string

NodeText recursively extracts all text, stripping tags and redundant whitespace.

type Manifest ¶

type Manifest struct {
	Items []EmptyXmlNode
}

Manifest models the OPF manifest section.

func (*Manifest) HrefLookup ¶

func (mf *Manifest) HrefLookup(opfPath string) map[string]string

HrefLookup builds an id to href lookup map.

func (*Manifest) ItemByID ¶

func (mf *Manifest) ItemByID(id string) (EmptyXmlNode, bool)

ItemByID returns the manifest entry that matches the given ID.

func (*Manifest) MediaTypeByID ¶

func (mf *Manifest) MediaTypeByID(id string) (string, bool)

MediaTypeByID returns the media-type attribute for the manifest entry.

type MetaEntry ¶

type MetaEntry struct {
	Value string            // Text content
	Attrs map[string]string // Attributes (refines, property, scheme, id, etc.)
}

MetaEntry stores metadata element values and attributes.

type Metadata ¶

type Metadata struct {
	Data map[string]map[string][]MetaEntry
}

Metadata stores entries in a unified namespace -> tag -> entries map.

func (*Metadata) First ¶

func (md *Metadata) First(key string) (string, bool)

First returns the first metadata value associated with the key.

func (*Metadata) Get ¶

func (md *Metadata) Get(key string) []string

Get returns all normalized metadata values for the provided Dublin Core key.

func (*Metadata) GetAll ¶

func (md *Metadata) GetAll() map[string][]string

GetAll flattens metadata into a simple key/value representation.

func (*Metadata) Normalize ¶

func (md *Metadata) Normalize() map[string][]string

Normalize returns a view of the metadata with keys normalized for compatibility.

type NodeType ¶

type NodeType int

const (
	ElementNode NodeType = iota
	TextNode
)

type Opf ¶

type Opf struct {
	XmlNode  *XmlNode
	Metadata *Metadata
	Manifest *Manifest
	Spine    *Spine
}

func ParseOpf ¶

func ParseOpf(content []byte) (*Opf, error)

ParseOpf parses the OPF document and eagerly populates its major sections.

func (*Opf) ChapterPaths ¶

func (opf *Opf) ChapterPaths(opfPath string) []string

ChapterPaths returns the ordered list of chapter document hrefs resolved against the OPF path.
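
Manifest hrefs are relative to the OPF file, so resolving them means joining each href onto the OPF file's directory. A minimal sketch of that step, assuming forward-slash archive paths (hence the path package rather than path/filepath); resolveHref is a hypothetical helper, not the package's code.

```go
package main

import (
	"fmt"
	"path"
)

// resolveHref joins an href from the manifest onto the directory that
// contains the OPF file and normalizes the result.
func resolveHref(opfPath, href string) string {
	return path.Join(path.Dir(opfPath), href)
}

func main() {
	fmt.Println(resolveHref("OEBPS/content.opf", "text/ch1.xhtml"))  // OEBPS/text/ch1.xhtml
	fmt.Println(resolveHref("OEBPS/content.opf", "../images/a.png")) // images/a.png
	fmt.Println(resolveHref("content.opf", "ch1.xhtml"))             // ch1.xhtml
}
```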

func (*Opf) FindTOCFile ¶

func (opf *Opf) FindTOCFile(opfPath string) (tocType string, tocFile string)

func (*Opf) ParseManifest ¶

func (opf *Opf) ParseManifest() error

ParseManifest parses the manifest section of the OPF document.

func (*Opf) ParseMetadata ¶

func (opf *Opf) ParseMetadata() error

ParseMetadata extracts the metadata node from the OPF root.

func (*Opf) ParseSpine ¶

func (opf *Opf) ParseSpine() error

ParseSpine parses the spine section of the OPF document.

type Spine ¶

type Spine struct {
	Itemrefs []EmptyXmlNode
	Attrs    map[string]string
}

Spine models the OPF spine section.

func (*Spine) ExtractChapterIDs ¶

func (spine *Spine) ExtractChapterIDs() []string

ExtractChapterIDs returns the ordered chapter IDs from the spine.
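
In an OPF spine, each itemref points at a manifest item through its idref attribute, and the order of the itemrefs defines the reading order. The sketch below collects those IDs from local mirrors of the documented Spine and EmptyXmlNode structs. The function chapterIDs is hypothetical, and skipping itemrefs that lack an idref is an assumption.

```go
package main

import "fmt"

// EmptyXmlNode and Spine mirror the documented structs.
type EmptyXmlNode struct {
	Name  string
	Attrs map[string]string
}

type Spine struct {
	Itemrefs []EmptyXmlNode
	Attrs    map[string]string
}

// chapterIDs returns the idref of every itemref in spine order,
// skipping entries without a usable idref.
func chapterIDs(s Spine) []string {
	ids := make([]string, 0, len(s.Itemrefs))
	for _, ref := range s.Itemrefs {
		if id, ok := ref.Attrs["idref"]; ok && id != "" {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	spine := Spine{Itemrefs: []EmptyXmlNode{
		{Name: "itemref", Attrs: map[string]string{"idref": "cover"}},
		{Name: "itemref", Attrs: map[string]string{"idref": "ch1"}},
	}}
	fmt.Println(chapterIDs(spine)) // prints [cover ch1]
}
```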

func (*Spine) Len ¶

func (spine *Spine) Len() int

Len returns the number of spine itemrefs.

type TOC ¶

type TOC struct {
	Title    string
	Href     string
	Children []TOC
}

func ParseTOC ¶

func ParseTOC(tocType, tocFile, opfDir string, files map[string]*zip.File) ([]TOC, error)

func (TOC) FindByHref ¶

func (t TOC) FindByHref(href string) *TOC

FindByHref returns the first TOC entry whose href matches the provided value. The comparison is performed after cleaning the href to make it resilient to small differences in path formatting.
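
One way to make href comparison tolerant of formatting differences is to normalize both sides with path.Clean before comparing. The sketch below does that on a local mirror of the TOC struct. It is an assumption that the package cleans hrefs in this way, findByHref is a hypothetical function, and fragment identifiers such as "#section" are not handled here.

```go
package main

import (
	"fmt"
	"path"
)

// TOC mirrors the documented struct.
type TOC struct {
	Title    string
	Href     string
	Children []TOC
}

// findByHref returns the first entry, in depth-first order, whose cleaned
// href equals the cleaned target, or nil when nothing matches.
func findByHref(t *TOC, href string) *TOC {
	want := path.Clean(href)
	if t.Href != "" && path.Clean(t.Href) == want {
		return t
	}
	for i := range t.Children {
		if hit := findByHref(&t.Children[i], href); hit != nil {
			return hit
		}
	}
	return nil
}

func main() {
	root := TOC{Children: []TOC{
		{Title: "One", Href: "text/ch1.xhtml"},
		{Title: "Two", Href: "text//ch2.xhtml"},
	}}
	if hit := findByHref(&root, "./text/ch2.xhtml"); hit != nil {
		fmt.Println(hit.Title) // prints "Two"
	}
}
```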

func (TOC) Flatten ¶

func (t TOC) Flatten() []TOC

Flatten flattens the table of contents into a slice preserving the natural reading order.

func (TOC) Walk ¶

func (t TOC) Walk(fn func(TOC) error) error

Walk traverses the table-of-contents tree in depth-first order and invokes fn for every entry. Returning a non-nil error from the callback aborts the walk and propagates the error to the caller.

type XmlNode ¶

type XmlNode struct {
	XMLName  xml.Name
	Attrs    []xml.Attr `xml:",any,attr"`
	Content  string     `xml:",chardata"`
	XmlNodes []XmlNode  `xml:",any"`
}

func ParseXML ¶

func ParseXML(r io.Reader) (*XmlNode, error)

ParseXML decodes the XML stream into an XmlNode tree.

func (*XmlNode) Attr ¶

func (xn *XmlNode) Attr(name string) (string, bool)

Attr returns the attribute value for the provided name if it exists.

func (*XmlNode) FindNode ¶

func (xn *XmlNode) FindNode(name string) *XmlNode

func (*XmlNode) FindNodes ¶

func (xn *XmlNode) FindNodes(name string) []*XmlNode

FindNodes returns all descendants with the provided local name.

func (*XmlNode) NodeText ¶

func (xn *XmlNode) NodeText() string

Directories ¶

Path	Synopsis
cmd/demo	command
