<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

</head>

<body>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">

<div class="PlainText">This is an announcement of Adrian Lehmann's MS Presentation<br>

===============================================<br>

Candidate: Adrian Lehmann<br>

<br>

Date: Friday, March 31, 2023<br>

<br>

Time: 2 pm CDT<br>

<br>

Location: JCL 390<br>

<br>

M.S. Paper Title: Automatically parallelizing Diderot programs on CUDA targets<br>

<br>

Abstract: Diderot is a domain-specific language for scientific visualization.

<br>

Its programs largely follow the bulk-synchronous parallel (BSP) model. In this pattern, multiple strands (often also called threads) each run one update step in isolation, followed by a single global reduction step (similar to MapReduce).<br>

The existing compiler translates Diderot programs, along with their domain-specific operations, into C++.<br>

The compiler supports targeting both sequential and parallel CPU execution models.<br>

However, given the programming model&#39;s parallel nature, adding GPU support to Diderot&#39;s compiler is a natural step.<br>

<br>

Our work fills this gap. <br>

We add support for automatically parallelizing Diderot applications by extending the compiler to generate CUDA code.

<br>

We propose three strategies for scheduling CUDA threads: one that closely follows the BSP model, one that runs strands to completion (assuming no reduction steps are needed), and one that builds on a work queue.<br>

We also propose a permutation mechanism for stochastic load distribution to mitigate strand divergence.<br>

In addition, we create variants that use CUDA unified memory, a single address space in which pages migrate automatically between system and GPU memory.<br>

<br>

In benchmarks, we see speedups of 60-500x, with the queue-based approach outperforming the others.<br>

The relative performance of our approaches also varies between benchmarks.<br>

We observe that permutation performance is highly dependent on the benchmark structure and the homogeneity of strand execution.<br>

Finally, we conclude that in our tests, CUDA unified memory incurs a significant performance penalty for benchmarks with fewer strands while greatly simplifying the generated code.<br>

<br>

Advisor: John Reppy<br>

<br>

Committee Members: John Reppy, Hank Hoffmann, and Ravi Chugh<br>

<br>

<br>

</div>

</span></font></div>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">

<div class="PlainText"><br>


</div>

</span></font></div>

</body>

</html>