Documentation ¶
Overview ¶
Package webextractor provides a ready-to-use and extensible web crawling and scraping framework for extracting structured data from the web.
Index ¶
- func DefaultHTTPTransport() *http.Transport
- func New() (*colibri.Colibri, error)
- type Client
- type CookieJar
- type Response
- func (resp *Response) Body() io.ReadCloser
- func (resp *Response) Header() http.Header
- func (resp *Response) MarshalJSON() ([]byte, error)
- func (resp *Response) Redirects() []*url.URL
- func (resp *Response) Serializable() map[string]any
- func (resp *Response) StatusCode() int
- func (resp *Response) URL() *url.URL
- type RobotsData
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DefaultHTTPTransport ¶
func DefaultHTTPTransport() *http.Transport
func New ¶
func New() (*colibri.Colibri, error)
Types ¶
type Client ¶
type Client struct {
// Jar specifies the cookie jar.
Jar *CookieJar
// contains filtered or unexported fields
}
Client represents an HTTP client. See the colibri.Client interface.
type CookieJar ¶
type CookieJar struct {
// contains filtered or unexported fields
}
CookieJar is a concurrency-safe “cookiejar.Jar” wrapper.
func NewCookieJar ¶
func NewCookieJar() *CookieJar
type Response ¶
Response represents an HTTP response. See the colibri.Response interface.
func (*Response) Body ¶
func (resp *Response) Body() io.ReadCloser
func (*Response) Header ¶
func (resp *Response) Header() http.Header
func (*Response) MarshalJSON ¶
func (resp *Response) MarshalJSON() ([]byte, error)
func (*Response) Redirects ¶
func (resp *Response) Redirects() []*url.URL
func (*Response) Serializable ¶
func (resp *Response) Serializable() map[string]any
func (*Response) StatusCode ¶
func (resp *Response) StatusCode() int
func (*Response) URL ¶
func (resp *Response) URL() *url.URL
type RobotsData ¶
type RobotsData struct {
// contains filtered or unexported fields
}
RobotsData gets, stores and parses robots.txt restrictions. See the colibri.RobotsTxt interface.
func NewRobotsData ¶
func NewRobotsData() *RobotsData
func (*RobotsData) Clear ¶
func (robots *RobotsData) Clear()
Clear removes stored robots.txt restrictions.
func (*RobotsData) IsAllowed ¶
func (robots *RobotsData) IsAllowed(c *colibri.Colibri, u *url.URL, rules *colibri.Rules) (colibri.Response, error)
IsAllowed verifies that the User-Agent can access the URL. It gets and stores the robots.txt restrictions of the URL's host, reusing them for subsequent URLs with the same host.
Directories ¶
| Path | Synopsis |
|---|---|
| colibri | Package colibri is the extensible core for web crawling and scraping, designed to facilitate the extraction of structured data from the web. |
| parsers | Package parsers provides implementations of the colibri.Parser interface to parse different web content formats. |