triosim

package module
v0.0.0-...-09c061b Latest
Published: Apr 24, 2025 License: MIT Imports: 7 Imported by: 0

README

TrioSim

A lightweight simulator for large-scale DNN workloads on multi-GPU systems. TrioSim supports various parallelism strategies including data parallelism, tensor parallelism, and pipeline parallelism.


1. Tracer

TrioSim ships processed sample traces for immediate use, located in the ./sample_trace/ directory. To run a quick test, skip the trace-collection steps in Section 1 and go directly to Section 2 to begin simulation.

1.1 Trace Collection
Environment Used
  • Python: 3.10.12
  • CUDA: 12.1
  • torch: 2.1.0+cu121
  • torchvision: 0.16.0+cu121
  • torchaudio: 2.1.0+cu121
Dataset

The code uses the ILSVRC2012_img_val dataset. For a quick start, a subset of 256 images is included under ./tracer.

Usage

To collect traces from PyTorch models, we use the PyTorch Profiler to gather layer- and operator-level timing information, and the Execution Graph Observer to collect detailed input, output, and other tensor metadata. The batch size is set via a command-line argument. You can also customize the number of iterations (num_iters) and the models to trace (listmodel) directly in the code.

Here is an example of collecting a trace with a batch size of 16:

python tracer/datacollect.py 16

This will generate two types of files:

  • profiler_xx.json: contains timing information for each operator
  • graph_xx.json: contains detailed tensor information

For a quick start, generated traces are available in:

  • tracer/data/graph/graph_xx.json
  • tracer/data/profiler/profiler_xx.json
1.2 Trace Data Processing

The TARGET_OP_PREFIXES variable allows users to define which layers are included in tensor parallelism. By default, it targets 'convolution', 'linear', and 'embedding' layers.

Run the following command to convert the collected traces into TrioSim format:

python tracer/dataprocess.py

The processed traces, tensor.csv and trace.csv, will be available under:

./tracer/data/middledata/trace/XXmodel

2. TrioSim

2.1 Go Installation

Install Go by following the official Go installation guide.

2.2 Configuration

The simulator can be configured using the following command-line flags:

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `-trace-dir` | string | `"../sample_trace/trace2-h100-bs128/vgg13/"` | Directory containing trace files |
| `-batch-size` | int | 128 | Original trace batch size |
| `-batch-size-sim` | int | -1 | Simulation batch size (defaults to `batch-size`) |
| `-bandwidth` | float | 696 | GPU-to-remote-memory bandwidth (GB/s) |
| `-ptp-bandwidth` | float | 65 | GPU-to-GPU bandwidth (GB/s) |
| `-GPUnumber` | int | 8 | Number of GPUs |
| `-micro-batch-size` | int | -1 | Micro-batch size for pipeline parallelism |
| `-case` | int | 0 | Simulation mode: 0=training, 1=standard data parallel, 2=distributed data parallel, 3=tensor parallel, 4=pipeline parallel |
| `-capacity` | int | 40 | Memory capacity of each device (bytes = 1 << capacity) |
| `-numCols` | int | -1 | Number of columns in the optical network mesh |
| `-numRows` | int | 1 | Number of rows in the optical network mesh |
| `-interconnects` | int | 0 | Interconnect type: 0=electrical, 1=optical |
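To show how the flags above fit together, here is a minimal sketch of parsing them with Go's standard flag package. The `SimConfig` struct and `parseConfig` helper are illustrative assumptions, not the actual code in main.go; the sketch only demonstrates the documented defaults, the `batch-size-sim` fallback, and the `1 << capacity` memory computation.

```go
package main

import (
	"flag"
	"fmt"
)

// SimConfig mirrors a subset of the documented command-line flags.
// This is a hypothetical structure for illustration; the real main.go
// may organize its configuration differently.
type SimConfig struct {
	TraceDir     string
	BatchSize    int
	BatchSizeSim int
	Capacity     int
}

// parseConfig parses the documented flags from args, applying the
// documented defaults.
func parseConfig(args []string) (*SimConfig, error) {
	fs := flag.NewFlagSet("triosim", flag.ContinueOnError)
	cfg := &SimConfig{}
	fs.StringVar(&cfg.TraceDir, "trace-dir",
		"../sample_trace/trace2-h100-bs128/vgg13/", "Directory containing trace files")
	fs.IntVar(&cfg.BatchSize, "batch-size", 128, "Original trace batch size")
	fs.IntVar(&cfg.BatchSizeSim, "batch-size-sim", -1, "Simulation batch size")
	fs.IntVar(&cfg.Capacity, "capacity", 40, "Memory capacity exponent (bytes = 1 << capacity)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	// -batch-size-sim defaults to -batch-size when left at -1.
	if cfg.BatchSizeSim < 0 {
		cfg.BatchSizeSim = cfg.BatchSize
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig([]string{"-batch-size", "64"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.BatchSizeSim)          // falls back to the trace batch size
	fmt.Println(uint64(1) << cfg.Capacity) // default capacity 40 -> 1 TiB per device
}
```

Note that Go's flag package accepts both `-flag` and `--flag` spellings, which is why the run commands in this README work with either.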
2.3 Running the Simulator
Basic Usage
  1. Navigate to the triosim directory:
cd triosim
  2. Run the simulator with your desired configuration:
go run main.go \
  -batch-size 128 \
  -batch-size-sim 128 \
  -trace-dir ../sample_trace/trace2-h100-bs128/vgg13 \
  -GPUnumber 4 \
  -case 1

Documentation

Overview

Package triosim provides a simulator that replays DNN execution traces.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Layer

type Layer struct {
	ID           int //operator id
	Name         string
	Inputs       []Tensor
	Outputs      []Tensor
	InputSize    []int
	OutputSize   []int
	TimeInSec    float64
	GPUID        int
	Stage        string
	SetBatchSize bool
	TPflag       int
}

A Layer represents a layer in the neural network.

type Tensor

type Tensor struct {
	Index        int
	ID           string //tensor id
	Size         int
	Category     TensorType
	ChunkID      int
	GPUID        int
	MemoryStatus TensorMemoryStatus
}

A Tensor represents a tensor used in the neural network. We do not carry the data, since the execution time should be data-independent.

func (*Tensor) Bytes

func (t *Tensor) Bytes() uint64

Bytes returns the number of bytes of the tensor.

type TensorMemoryStatus

type TensorMemoryStatus int

TensorMemoryStatus represents the memory status of a tensor.

const (
	TensorMemoryStatusUnknown TensorMemoryStatus = iota
	TensorMemoryStatusAllocated
	TensorMemoryStatusAvailable
	TensorMemoryStatusToBeUsed
	TensorMemoryStatusUsed
)

TensorMemoryStatus constants
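The package does not document a String method for these constants. As a sketch of how the iota values line up, here is a hypothetical Stringer (not part of the triosim API) that would make simulator logs readable:

```go
package main

import "fmt"

// TensorMemoryStatus mirrors the constants declared above.
type TensorMemoryStatus int

const (
	TensorMemoryStatusUnknown TensorMemoryStatus = iota
	TensorMemoryStatusAllocated
	TensorMemoryStatusAvailable
	TensorMemoryStatusToBeUsed
	TensorMemoryStatusUsed
)

// String is a hypothetical helper, not documented in the package.
func (s TensorMemoryStatus) String() string {
	switch s {
	case TensorMemoryStatusAllocated:
		return "Allocated"
	case TensorMemoryStatusAvailable:
		return "Available"
	case TensorMemoryStatusToBeUsed:
		return "ToBeUsed"
	case TensorMemoryStatusUsed:
		return "Used"
	default:
		return "Unknown"
	}
}

func main() {
	// fmt picks up the String method automatically.
	fmt.Println(TensorMemoryStatusAvailable) // prints "Available"
}
```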

type TensorMsg

type TensorMsg struct {
	sim.MsgMeta
	TensorPkg     []Tensor
	DstRegionName string
	GPUID         int
	Purpose       string
	RoundID       int
}

A TensorMsg represents the transfer of a tensor package.

func (*TensorMsg) Meta

func (m *TensorMsg) Meta() *sim.MsgMeta

Meta returns the meta data of the message.

type TensorType

type TensorType int

A TensorType represents the type of data a tensor stores.

const (
	Input TensorType = iota
	Output
	Weight
	RunningMean
	RunningVar
	Bias
	Activation
	Gradient
	Other
)

TensorType constants

type Trace

type Trace []*Layer

Trace represents a trace of the execution of the neural network.
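Since a Trace is just a slice of layers, iterating over it is straightforward. The sketch below uses pared-down local copies of Layer and Trace (only the fields it touches) and a hypothetical TotalTime helper that is not part of the package API; it shows the kind of aggregate a consumer of a loaded trace might compute.

```go
package main

import "fmt"

// Layer is a pared-down copy of the triosim Layer, keeping only the
// fields this sketch uses.
type Layer struct {
	ID        int
	Name      string
	TimeInSec float64
	GPUID     int
}

// Trace represents a trace of the execution of the neural network.
type Trace []*Layer

// TotalTime sums per-layer execution time. This is a hypothetical
// helper for illustration, not part of the package API.
func (tr Trace) TotalTime() float64 {
	var total float64
	for _, l := range tr {
		total += l.TimeInSec
	}
	return total
}

func main() {
	tr := Trace{
		{ID: 0, Name: "conv1", TimeInSec: 0.004, GPUID: 0},
		{ID: 1, Name: "relu1", TimeInSec: 0.001, GPUID: 0},
	}
	fmt.Printf("%.3f s\n", tr.TotalTime())
}
```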

type TraceLoader

type TraceLoader struct {
	// The directory where the trace files are located.
	Dir string
}

A TraceLoader loads a trace from a set of files.

func (*TraceLoader) Load

func (l *TraceLoader) Load(bsRatio float64) (Trace, error)

Load loads a trace from a set of files.

Directories

Path Synopsis
Package networkmodel provides a performance model for the network that connects devices.
test command
Package timemodel provides a performance model for the time of execution of operators and layers.
Package traceplayer provides a trace player that plays a trace and simulates the execution of the trace.
