flyscrape

Version: v0.9.0 (latest) · Published: Nov 24, 2024 · License: MPL-2.0 · Imports: 28 · Imported by: 0

Repository

github.com/philippta/flyscrape

README ¶

Flyscrape is a command-line web scraping tool designed for those without
advanced programming skills, enabling precise extraction of website data.

Installation · Documentation · Releases

Features

Standalone: Flyscrape comes as a single binary executable.
jQuery-like: Extract data from HTML pages with a familiar API.
Scriptable: Use JavaScript to write your data extraction logic.
System Cookies: Give Flyscrape access to your browser's cookie store.
Browser Mode: Render JavaScript-heavy pages using a headless browser.

Overview

Example
Installation
Usage
Configuration
Query API
Flyscrape API
- Document Parsing
- File Downloads
Issues and Suggestions

Example

This example scrapes the first few pages from Hacker News, specifically the New, Show and Ask sections.

export const config = {
    urls: [
        "https://news.ycombinator.com/new",
        "https://news.ycombinator.com/show",
        "https://news.ycombinator.com/ask",
    ],

    // Cache requests for later runs.
    cache: "file",

    // Enable JavaScript rendering.
    browser: true,
    headless: false,

    // Follow pagination 5 times.
    depth: 5,
    follow: ["a.morelink[href]"],
}

export default function ({ doc, absoluteURL }) {
    const title = doc.find("title");
    const posts = doc.find(".athing");

    return {
        title: title.text(),
        posts: posts.map((post) => {
            const link = post.find(".titleline > a");

            return {
                title: link.text(),
                url: link.attr("href"),
            };
        }),
    }
}

$ flyscrape run hackernews.js
[
  {
    "url": "https://news.ycombinator.com/new",
    "data": {
      "title": "New Links | Hacker News",
      "posts": [
        {
          "title": "Show HN: flyscrape - An standalone and scriptable web scraper",
          "url": "https://flyscrape.com/"
        },
        ...
      ]
    }
  }
]

Check out the examples folder for more detailed examples.
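
The example script receives absoluteURL but returns link.attr("href") unchanged, so any relative link stays relative in the output. The sketch below shows what resolving such links involves, using the standard URL class instead of flyscrape's helper. The makeResolver and resolve names are hypothetical, the relative link item?id=1 is made up, and whether absoluteURL behaves identically in every edge case is an assumption.

```javascript
// Hypothetical stand-in for flyscrape's absoluteURL helper, built on the
// standard URL class. `pageURL` is the URL of the page being scraped.
function makeResolver(pageURL) {
    return function resolve(href) {
        // new URL(reference, base) resolves relative references against the
        // base and returns already-absolute URLs unchanged.
        return new URL(href, pageURL).href;
    };
}

const resolve = makeResolver("https://news.ycombinator.com/ask");

console.log(resolve("item?id=1"));              // https://news.ycombinator.com/item?id=1
console.log(resolve("https://flyscrape.com/")); // https://flyscrape.com/
```

Inside a flyscrape script the corresponding change would be to return absoluteURL(link.attr("href")) for the url field.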

Installation

Homebrew

On macOS, flyscrape is available via Homebrew:

brew install flyscrape

Pre-compiled binary

flyscrape is available for macOS, Linux and Windows as a downloadable binary from the releases page.

Compile from source

To compile flyscrape from source, follow these steps:

1. Install Go: Make sure you have Go installed on your system. If not, you can download it from https://go.dev/.
2. Install flyscrape: Open a terminal and run the following command:
```
go install github.com/philippta/flyscrape/cmd/flyscrape@latest
```

Usage

Usage:

    flyscrape run SCRIPT [config flags]

Examples:

    # Run the script.
    $ flyscrape run example.js

    # Set the URL as argument.
    $ flyscrape run example.js --url "http://other.com"

    # Enable proxy support.
    $ flyscrape run example.js --proxies "http://someproxy:8043"

    # Follow paginated links.
    $ flyscrape run example.js --depth 5 --follow ".next-button > a"

    # Set the output format to ndjson.
    $ flyscrape run example.js --output.format ndjson

    # Write the output to a file.
    $ flyscrape run example.js --output.file results.json
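
The --output.format flag switches between json and ndjson. As a rough illustration of how the two formats differ in general, the sketch below serializes the same records both ways in plain JavaScript. The sample records are made up, and flyscrape's exact indentation and field order are not reproduced here.

```javascript
// Made-up sample results, shaped like the { url, data } records in the
// Hacker News example output.
const results = [
    { url: "https://example.com/a", data: { title: "A" } },
    { url: "https://example.com/b", data: { title: "B" } },
];

// json: a single JSON array holding every record.
const asJSON = JSON.stringify(results, null, 2);

// ndjson: one compact JSON object per line, so each line parses on its own.
const asNDJSON = results.map((r) => JSON.stringify(r)).join("\n");

console.log(asJSON);
console.log(asNDJSON);
```

Because every ndjson line is a complete JSON value, it suits streaming and line-oriented tools, whereas the json array can only be parsed once it is complete.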

Configuration

Below is an example scraping script that showcases the capabilities of flyscrape. For full documentation of all configuration options, visit the documentation page.

export const config = {
    // Specify the URL to start scraping from.
    url: "https://example.com/",

    // Specify multiple URLs to start scraping from.       (default = [])
    urls: [                          
        "https://anothersite.com/",
        "https://yetanother.com/",
    ],

    // Enable rendering with headless browser.             (default = false)
    browser: true,

    // Specify if browser should be headless or not.       (default = true)
    headless: false,

    // Specify how deep links should be followed.          (default = 0, no follow)
    depth: 5,                        

    // Specify the CSS selectors to follow.                (default = ["a[href]"])
    follow: [".next > a", ".related a"],                      
 
    // Specify the allowed domains. ['*'] for all.         (default = domain from url)
    allowedDomains: ["example.com", "anothersite.com"],              
 
    // Specify the blocked domains.                        (default = none)
    blockedDomains: ["somesite.com"],              

    // Specify the allowed URLs as regex.                  (default = all allowed)
    allowedURLs: ["/posts", "/articles/\\d+"],
 
    // Specify the blocked URLs as regex.                  (default = none)
    blockedURLs: ["/admin"],                 
   
    // Specify the rate in requests per minute.            (default = no rate limit)
    rate: 60,                       

    // Specify the number of concurrent requests.          (default = no limit)
    concurrency: 1,                       

    // Specify a single HTTP(S) proxy URL.                 (default = no proxy)
    // Note: Not compatible with browser mode.
    proxy: "http://someproxy.com:8043",

    // Specify multiple HTTP(S) proxy URLs.                (default = no proxy)
    // Note: Not compatible with browser mode.
    proxies: [
      "http://someproxy.com:8043",
      "http://someotherproxy.com:8043",
    ],                     

    // Enable file-based request caching.                  (default = no cache)
    cache: "file",                   

    // Specify the HTTP request header.                    (default = none)
    headers: {                       
        "Authorization": "Bearer ...",
        "User-Agent": "Mozilla ...",
    },

    // Use the cookie store of your local browser.         (default = off)
    // Options: "chrome" | "edge" | "firefox"
    cookies: "chrome",

    // Specify the output options.
    output: {
        // Specify the output file.                        (default = stdout)
        file: "results.json",
        
        // Specify the output format.                      (default = json)
        // Options: "json" | "ndjson"
        format: "json",
    },
};

export default function ({ doc, url, absoluteURL }) {
    // doc              - Contains the parsed HTML document
    // url              - Contains the scraped URL
    // absoluteURL(...) - Transforms relative URLs into absolute URLs
}
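
allowedURLs and blockedURLs take regular expressions written as JavaScript strings, so a backslash has to be doubled to survive string parsing: in a string literal, "\d" is simply "d". The sketch below shows this and illustrates the intended matching with JavaScript's RegExp. flyscrape evaluates these patterns on the Go side, so the hypothetical isAllowed helper and its rule that a blocked match overrides an allowed one are assumptions for illustration only.

```javascript
// In a string literal "\d" collapses to "d"; "\\d" keeps the backslash.
console.log("/articles/\d+");   // /articles/d+
console.log("/articles/\\d+");  // /articles/\d+

const allowedURLs = ["/posts", "/articles/\\d+"];
const blockedURLs = ["/admin"];

// Hypothetical helper: a URL passes if no blocked pattern matches and at
// least one allowed pattern matches (or no allowed patterns are given).
function isAllowed(url) {
    const matches = (patterns) => patterns.some((p) => new RegExp(p).test(url));
    if (matches(blockedURLs)) return false;
    return allowedURLs.length === 0 || matches(allowedURLs);
}

console.log(isAllowed("https://example.com/articles/42"));  // true
console.log(isAllowed("https://example.com/articles/new")); // false
console.log(isAllowed("https://example.com/admin/posts"));  // false
```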

Query API

// <div class="element" foo="bar">Hey</div>
const el = doc.find(".element")
el.text()                                 // "Hey"
el.html()                                 // `<div class="element" foo="bar">Hey</div>`
el.name()                                 // "div"
el.attr("foo")                            // "bar"
el.hasAttr("foo")                         // true
el.hasClass("element")                    // true

// <ul>
//   <li class="a">Item 1</li>
//   <li>Item 2</li>
//   <li>Item 3</li>
// </ul>
const list = doc.find("ul")
list.children()                           // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]

const items = list.find("li")
items.length()                            // 3
items.first()                             // <li class="a">Item 1</li>
items.last()                              // <li>Item 3</li>
items.get(1)                              // <li>Item 2</li>
items.get(1).prev()                       // <li class="a">Item 1</li>
items.get(1).next()                       // <li>Item 3</li>
items.get(1).parent()                     // <ul>...</ul>
items.get(1).siblings()                   // [<li class="a">Item 1</li>, <li>Item 2</li>, <li>Item 3</li>]
items.map(item => item.text())            // ["Item 1", "Item 2", "Item 3"]
items.filter(item => item.hasClass("a"))  // [<li class="a">Item 1</li>]

// <div>
//   <h2 id="aleph">Aleph</h2>
//   <p>Aleph</p>
//   <h2 id="beta">Beta</h2>
//   <p>Beta</p>
//   <h2 id="gamma">Gamma</h2>
//   <p>Gamma</p>
// </div>
const header = doc.find("div h2")

header.get(1).prev()                     // <p>Aleph</p>
header.get(1).prevAll()                  // [<p>Aleph</p>, <h2 id="aleph">Aleph</h2>]
header.get(1).prevUntil('div,h1,h2,h3')  // <h2 id="aleph">Aleph</h2>
header.get(1).next()                     // <p>Beta</p>
header.get(1).nextAll()                  // [<p>Beta</p>, <h2 id="gamma">Gamma</h2>, <p>Gamma</p>]
header.get(1).nextUntil('div,h1,h2,h3')  // <p>Beta</p>
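
The traversal helpers compose naturally inside a scrape function. As a sketch that uses only calls listed in this section (find, map, attr, text, next), the hypothetical sections function below pairs each heading of the last sample document with the element that follows it. It assumes that map hands each element to the callback with the same methods, as the earlier map example suggests.

```javascript
// Hypothetical helper: pair every <h2> inside a <div> with its next sibling.
// `doc` is the document object flyscrape passes to the scrape function.
function sections(doc) {
    return doc.find("div h2").map((h) => ({
        id: h.attr("id"),       // e.g. the heading's anchor id
        title: h.text(),        // heading text
        body: h.next().text(),  // text of the element right after the heading
    }));
}
```

In a script this could be called from the default export, for example by returning { sections: sections(doc) }.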

Flyscrape API

Document Parsing

import { parse } from "flyscrape";

const doc = parse(`<div class="foo">bar</div>`);
const text = doc.find(".foo").text();

File Downloads

import { download } from "flyscrape/http";

download("http://example.com/image.jpg")              // downloads as "image.jpg"
download("http://example.com/image.jpg", "other.jpg") // downloads as "other.jpg"
download("http://example.com/image.jpg", "dir/")      // downloads as "dir/image.jpg"

// If the server offers a filename via the Content-Disposition header and no
// destination filename is provided, Flyscrape will honor the suggested filename.
// E.g. `Content-Disposition: attachment; filename="archive.zip"`
download("http://example.com/generate_archive.php", "dir/") // downloads as "dir/archive.zip"
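
The destination rules in the comments above can be summarized in a few lines. The sketch below is not flyscrape's implementation: targetPath is a hypothetical helper that only computes the resulting path from the documented cases and performs no network or file-system access.

```javascript
// Hypothetical helper mirroring the documented download() cases.
//   url       - the URL passed to download()
//   dest      - optional second argument (file name, or directory ending in "/")
//   suggested - optional file name taken from a Content-Disposition header
function targetPath(url, dest = "", suggested = "") {
    const isDir = dest === "" || dest.endsWith("/");
    if (!isDir) return dest; // an explicit file name wins

    // Otherwise prefer the server's suggestion, then the last URL path segment.
    const fromURL = new URL(url).pathname.split("/").pop();
    return dest + (suggested || fromURL);
}

console.log(targetPath("http://example.com/image.jpg"));              // image.jpg
console.log(targetPath("http://example.com/image.jpg", "other.jpg")); // other.jpg
console.log(targetPath("http://example.com/image.jpg", "dir/"));      // dir/image.jpg
console.log(targetPath("http://example.com/generate_archive.php", "dir/", "archive.zip")); // dir/archive.zip
```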

Issues and Suggestions

If you encounter any issues or have suggestions for improvement, please submit an issue.

Documentation ¶

Index ¶

Constants
Variables
func Dev(file string, overrides map[string]any) error
func Document(sel *goquery.Selection) map[string]any
func DocumentFromString(s string) (map[string]any, error)
func MockResponse(statusCode int, html string) (*http.Response, error)
func RegisterModule(mod Module)
func Run(file string, overrides map[string]any) error
func Watch(path string, fn func(string) error) error
type Config
type Context
type Exports
- func Compile(src string, imports Imports) (Exports, error)
- func (e Exports) Config() []byte
- func (e Exports) Scrape(p ScrapeParams) (any, error)
type Finalizer
type Imports
- func NewJSLibrary(client *http.Client) (imports Imports, wait func())
type Module
- func LoadModules(cfg Config) []Module
type ModuleInfo
type Provisioner
type Request
type RequestBuilder
type RequestValidator
type Response
type ResponseReceiver
type RoundTripFunc
- func MockTransport(statusCode int, html string) RoundTripFunc
- func (f RoundTripFunc) RoundTrip(r *http.Request) (*http.Response, error)
type ScrapeFunc
type ScrapeParams
type Scraper
- func NewScraper() *Scraper
- func (s *Scraper) MarkUnvisited(url string)
- func (s *Scraper) MarkVisited(url string)
- func (s *Scraper) Run()
- func (s *Scraper) ScriptName() string
- func (s *Scraper) Visit(url string)
type TransformError
- func (err TransformError) Error() string
type TransportAdapter

Constants ¶

const HeaderBypassCache = "X-Flyscrape-Bypass-Cache"

Variables ¶

var ScriptTemplate []byte

var StopWatch = errors.New("stop watch")

var Version string

Functions ¶

func Dev ¶ added in v0.4.0

func Dev(file string, overrides map[string]any) error

func Document ¶ added in v0.4.0

func Document(sel *goquery.Selection) map[string]any

func DocumentFromString ¶ added in v0.4.0

func DocumentFromString(s string) (map[string]any, error)

func MockResponse ¶ added in v0.2.0

func MockResponse(statusCode int, html string) (*http.Response, error)

func RegisterModule ¶ added in v0.2.0

func RegisterModule(mod Module)

func Run ¶ added in v0.4.0

func Run(file string, overrides map[string]any) error

func Watch ¶

func Watch(path string, fn func(string) error) error

Types ¶

type Config ¶ added in v0.2.0

type Config []byte

type Context ¶ added in v0.2.0

type Context interface {
	ScriptName() string
	Visit(url string)
	MarkVisited(url string)
	MarkUnvisited(url string)
}

type Exports ¶ added in v0.4.0

type Exports map[string]any

func Compile ¶

func Compile(src string, imports Imports) (Exports, error)

func (Exports) Config ¶ added in v0.4.0

func (e Exports) Config() []byte

func (Exports) Scrape ¶ added in v0.4.0

func (e Exports) Scrape(p ScrapeParams) (any, error)

type Finalizer ¶ added in v0.2.0

type Finalizer interface {
	Finalize()
}

type Imports ¶ added in v0.4.0

type Imports map[string]map[string]any

func NewJSLibrary ¶ added in v0.4.0

func NewJSLibrary(client *http.Client) (imports Imports, wait func())

type Module ¶ added in v0.2.0

type Module interface {
	ModuleInfo() ModuleInfo
}

func LoadModules ¶ added in v0.2.0

func LoadModules(cfg Config) []Module

type ModuleInfo ¶ added in v0.2.0

type ModuleInfo struct {
	ID  string
	New func() Module
}

type Provisioner ¶ added in v0.2.0

type Provisioner interface {
	Provision(Context)
}

type Request ¶ added in v0.2.0

type Request struct {
	Method  string
	URL     string
	Headers http.Header
	Cookies http.CookieJar
	Depth   int
}

type RequestBuilder ¶ added in v0.2.0

type RequestBuilder interface {
	BuildRequest(*Request)
}

type RequestValidator ¶ added in v0.2.0

type RequestValidator interface {
	ValidateRequest(*Request) bool
}

type Response ¶ added in v0.2.0

type Response struct {
	StatusCode int
	Headers    http.Header
	Body       []byte
	Data       any
	Error      error
	Request    *Request

	Visit func(url string)
}

type ResponseReceiver ¶ added in v0.2.0

type ResponseReceiver interface {
	ReceiveResponse(*Response)
}

type RoundTripFunc ¶ added in v0.2.0

type RoundTripFunc func(*http.Request) (*http.Response, error)

func MockTransport ¶ added in v0.2.0

func MockTransport(statusCode int, html string) RoundTripFunc

func (RoundTripFunc) RoundTrip ¶ added in v0.2.0

func (f RoundTripFunc) RoundTrip(r *http.Request) (*http.Response, error)

type ScrapeFunc ¶

type ScrapeFunc func(ScrapeParams) (any, error)

type ScrapeParams ¶

type ScrapeParams struct {
	HTML string
	URL  string
}

type Scraper ¶

type Scraper struct {
	ScrapeFunc ScrapeFunc
	Script     string
	Modules    []Module
	Client     *http.Client
	// contains filtered or unexported fields
}

func NewScraper ¶ added in v0.2.0

func NewScraper() *Scraper

func (*Scraper) MarkUnvisited ¶ added in v0.2.0

func (s *Scraper) MarkUnvisited(url string)

func (*Scraper) MarkVisited ¶ added in v0.2.0

func (s *Scraper) MarkVisited(url string)

func (*Scraper) Run ¶ added in v0.2.0

func (s *Scraper) Run()

func (*Scraper) ScriptName ¶ added in v0.2.0

func (s *Scraper) ScriptName() string

func (*Scraper) Visit ¶ added in v0.2.0

func (s *Scraper) Visit(url string)

type TransformError ¶

type TransformError struct {
	Line   int
	Column int
	Text   string
}

func (TransformError) Error ¶

func (err TransformError) Error() string

type TransportAdapter ¶ added in v0.2.0

type TransportAdapter interface {
	AdaptTransport(http.RoundTripper) http.RoundTripper
}

Directories ¶

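Path	Synopsis
cmd/flyscrape	command
modules/browser
modules/cache
modules/cookies
modules/depth
modules/domainfilter
modules/followlinks
modules/headers
modules/hook
modules/output/json
modules/output/ndjson
modules/proxy
modules/ratelimit
modules/retry
modules/starturl
modules/urlfilter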
