Documentation
Overview
Package swan implements the Goose HTML Content / Article Extractor algorithm.
Currently, swan will try to extract the following content types:
Comics: if something looks like a web comic, it will be extracted as just an image. This is a WIP.
Everything else: it will look for article text and try to extract any header image that goes with it.
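Typical use is to fetch a page, hand its HTML and final URL to FromHTML, and then read fields off the returned Article. The sketch below shows that flow end to end; the import path github.com/thatguystone/swan is an assumption here, so adjust it to wherever the package lives in your build.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"

	"github.com/thatguystone/swan" // assumed import path; adjust as needed
)

func main() {
	// Fetch the page. resp.Request.URL reflects any redirects that
	// occurred, which is the URL the extractor wants.
	resp, err := http.Get("http://example.com/article/1")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	html, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Pass the final URL so relative image links can be resolved.
	a, err := swan.FromHTML(resp.Request.URL.String(), html)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("Title:", a.Meta.Title)
	if a.TopNode != nil {
		fmt.Println(a.CleanedText)
	}
}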
Constants
const (
	// Version of the library
	Version = "1.0"
)
Variables
This section is empty.
Functions
This section is empty.
Types
type Article
type Article struct {
	// Final URL after all redirects
	URL string
	// Newline-separated and cleaned content
	CleanedText string
	// Node from which CleanedText was created. Call .Html() on this to get
	// printable HTML.
	TopNode *goquery.Selection
	// A header image to use for the article. Nil if no image could be
	// detected.
	Img *Image
	// All metadata associated with the original document
	Meta struct {
		Authors     []string
		Canonical   string
		Description string
		Domain      string
		Favicon     string
		Keywords    string
		Links       []string
		Lang        string
		OpenGraph   map[string]string
		PublishDate string
		Tags        []string
		Title       string
	}
	// Full document backing this article
	Doc *goquery.Document
	// contains filtered or unexported fields
}
Article is a fully extracted and cleaned document.
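The metadata block is plain Go data, so once extraction succeeds it can be read directly. A minimal sketch in the package's own example style (unqualified names), assuming a is an *Article returned by one of the constructors below:

// printMeta dumps an Article's metadata; every field may be empty if the
// source document did not declare it.
func printMeta(a *Article) {
	fmt.Println("Canonical:", a.Meta.Canonical)
	fmt.Println("Language:", a.Meta.Lang)
	fmt.Println("Published:", a.Meta.PublishDate)
	for _, author := range a.Meta.Authors {
		fmt.Println("Author:", author)
	}
	for k, v := range a.Meta.OpenGraph {
		fmt.Printf("og:%s = %q\n", k, v)
	}

	// The header image is optional; always check for nil before using it.
	if a.Img == nil {
		fmt.Println("no header image was detected")
	}
}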
func FromDoc
FromDoc does its best to extract an article from a single document.
Pass in the URL the document came from so that images can be resolved correctly.
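If the page has already been parsed with goquery, that document can be handed over directly instead of re-parsing raw HTML. The sketch below assumes FromDoc takes the source URL followed by the *goquery.Document, mirroring FromHTML; check the actual signature before relying on it.

rawHTML := `<html><head><title>Doc Title</title></head><body><p>body text</p></body></html>`

// Parse with goquery first, then hand the document to the extractor.
doc, err := goquery.NewDocumentFromReader(strings.NewReader(rawHTML))
if err != nil {
	panic(err)
}

// The (url, document) argument order here is an assumption.
a, err := FromDoc("http://example.com/article/1", doc)
if err != nil {
	panic(err)
}
fmt.Println("Title:", a.Meta.Title)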
func FromHTML
func FromHTML(url string, html []byte) (*Article, error)
FromHTML does its best to extract an article from a single HTML page.
Pass in the URL the document came from so that images can be resolved correctly.
Example
htmlIn := `<html>
<head>
<title> Example Title </title>
<meta property="og:site_name" content="Example Name"/>
</head>
<body>
<p>some article body with a bunch of text in it</p>
</body>
</html>`
a, err := FromHTML("http://example.com/article/1", []byte(htmlIn))
if err != nil {
	panic(err)
}
if a.TopNode == nil {
	panic("no article could be extracted, " +
		"but a.Doc and a.Meta are still cleaned " +
		"and can be messed with ")
}
// Get the document title
fmt.Printf("Title: %s\n", a.Meta.Title)
// Hit any open graph tags
fmt.Printf("Site Name: %s\n", a.Meta.OpenGraph["site_name"])
// Print out any cleaned-up HTML that was found
html, _ := a.TopNode.Html()
fmt.Printf("HTML: %s\n", strings.TrimSpace(html))
// Print out any cleaned-up text that was found
fmt.Printf("Plain: %s\n", a.CleanedText)
Output:
Title: Example Title
Site Name: Example Name
HTML: <p>some article body with a bunch of text in it</p>
Plain: some article body with a bunch of text in it