[Colloquium] Shambayati/MS Presentation/Nov 12, 2014

Wed Oct 29 13:11:55 CDT 2014

This is an announcement of Amirali Shambayati's MS Presentation.

------------------------------------------------------------------------------
Date:  Wednesday, November 12, 2014

Time:  10:00 AM

Place:  Ryerson 255

M.S. Candidate:  Amirali Shambayati

M.S. Paper Title: Data Layout Transformation Micro-Engine: A
Specialized Architecture to Manage Data Movements for Performance and
Energy Efficiency

Abstract:
Data movement across the memory hierarchy to the processor's
registers dissipates a significant portion of total system energy, and
this portion is growing fast as computation becomes more efficient
through accelerator-based computing. Since data movement is no longer
free in either energy or performance, a systematic approach to
managing data movement among system components is critically needed.

We propose the Data Layout Transformation (DLT) micro-engine, which
explicitly controls data movement between main memory, the cache
hierarchy, vector registers, and local memory. Although DLT can serve
as a data movement orchestrator alongside a general-purpose processor,
its role becomes more vital when it is used by a computation
accelerator to mitigate data movement costs.

We apply DLT in the 10x10 heterogeneous architecture paradigm
[Chien11], which aims to tame heterogeneity for general-purpose
computing. 10x10 consists of a set of programmable accelerators
(micro-engines) that sit in the middle of the spectrum between
general-purpose computing and aggressive special-purpose computing.
DLT provides a set of gather-scatter instructions that move data
efficiently among on-chip memory, off-chip memory, and vector
registers, where it is consumed by other micro-engines. Gather and
scatter in DLT are non-blocking instructions; that is, data movement
can be overlapped with computation and synchronized via DLT's fence
instructions.
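
As an illustration only (not taken from the paper), the non-blocking
gather/scatter plus fence pattern can be modeled in software. The
sketch below is a toy Python model under stated assumptions: the class
DltModel and its methods gather, scatter, and fence are hypothetical
names, not the DLT instruction set, and a single worker thread stands
in for the micro-engine.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class DltModel:
    """Toy software model of a data-movement engine (hypothetical names)."""

    def __init__(self, memory):
        self.memory = memory                      # flat list standing in for main memory
        self._pool = ThreadPoolExecutor(max_workers=1)  # one worker keeps moves in issue order
        self._pending = []                        # moves issued but not yet fenced

    def gather(self, dest, base, stride, count):
        """Queue a strided read from memory into dest; returns without waiting."""
        def move():
            dest[:count] = [self.memory[base + i * stride] for i in range(count)]
        self._pending.append(self._pool.submit(move))

    def scatter(self, src, base, stride, count):
        """Queue a strided write from src into memory; returns without waiting."""
        def move():
            for i in range(count):
                self.memory[base + i * stride] = src[i]
        self._pending.append(self._pool.submit(move))

    def fence(self):
        """Block until every queued move has completed."""
        wait(self._pending)
        for future in self._pending:
            future.result()                       # re-raise any error from a move
        self._pending.clear()

    def close(self):
        self._pool.shutdown()


# A 4x4 row-major matrix stored as a flat list: element (r, c) is at index 4*r + c.
memory = list(range(16))
dlt = DltModel(memory)

column = [0] * 4                                  # stands in for a vector register
dlt.gather(column, base=1, stride=4, count=4)     # start fetching column 1
unrelated = sum(range(10))                        # independent work overlaps the move
dlt.fence()                                       # column is safe to read only after this

doubled = [2 * x for x in column]
dlt.scatter(doubled, base=1, stride=4, count=4)   # write the results back to column 1
dlt.fence()
dlt.close()
```

The strided gather turns a column of a row-major matrix into a
contiguous vector, which is the kind of layout change the abstract
describes; the fence marks the point where the consumer may rely on
the moved data.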

We evaluate the performance and energy efficiency benefits of DLT
integrated with other 10x10 micro-engines. Since off-chip memory
bandwidth and energy efficiency are important factors in data movement
cost, the evaluation covers two different memory systems. We study
DLT in the context of DDR3 and Hybrid Memory Cube (HMC) memory systems
using a variety of embedded workloads (Fast Fourier Transform,
Discrete Wavelet Transform, Convolution, Merge Sort, Matrix
Multiplication). Our results indicate that DLT can decrease total
instruction and cycle counts by 7x to 115x (44x on average). In terms
of total system energy, DLT can reduce the energy required by 5x to
47x (17x on average). Specifically, DLT reduces memory system energy
by 2x to 46x (18x on average) for DDR3 and by 4x to 32x (12x on
average) for HMC.

Finally, we discuss related work and suggest several directions for
future work in this area.

Amirali's advisor is Prof. Andrew Chien.

Log in to the Computer Science Department website for details:
 https://cs.uchicago.edu/phd/ms_announcements#amirali

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Margaret P. Jaffey            margaret at cs.uchicago.edu
Department of Computer Science
Student Support Rep (Ry 156)               (773) 702-6011
The University of Chicago      http://cs.uchicago.edu
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=