[Theory] SOON in Room 529: 9/24 Thesis Defense: Shengjie Lin, TTIC
Mary Marre via Theory
theory at mailman.cs.uchicago.edu
Wed Sep 24 14:12:25 CDT 2025
*When*: Wednesday, September 24th, from *2:30 - 3:30 pm CT*
*Where*: Speaker is remote, but the *talk can be viewed in person in Room 529*
*Virtually*: via *Zoom*
<https://uchicagogroup.zoom.us/j/95222100221?pwd=Mdam0MtHKq34h0ipKA1bNx3DivTMww.1>
*Who*: Shengjie Lin, TTIC
*Title*: Scalable 3D Scene Understanding and Embodied Reasoning for Robotics
*Abstract*: For robots to become truly useful partners in our world, they
must first build a rich, actionable understanding of it.
This thesis introduces a complete framework to make that happen, enabling
an agent to perceive its environment, understand how objects function, and
act on complex human instructions.
The work first tackles perception at scale. We present a novel method for
building large, high-fidelity 3D maps by seamlessly stitching together
multiple, independently captured scene representations. This provides the
robot with foundational spatial awareness. Building on this static map,
we then address object dynamics. From just two sparse observations (such as a
cabinet door seen open and closed), our system can reconstruct an articulated
object and infer its kinematics, creating a world model that is not just
descriptive but functional.
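As a purely illustrative aside (an assumption on our part, not the method presented in the thesis), the two-observation kinematics idea can be sketched in a few lines: given the pose of a rigid part, such as a cabinet door, in two articulation states, the axis-angle decomposition of the relative transform yields a revolute joint's axis, a point on that axis, and the opening angle. The function name and toy poses below are hypothetical, and a non-trivial rotation between the two states is assumed.

import numpy as np
from scipy.spatial.transform import Rotation

def revolute_joint_from_two_poses(T_a, T_b):
    # Relative motion of the part between the two observed states.
    T_rel = T_b @ np.linalg.inv(T_a)
    R, t = T_rel[:3, :3], T_rel[:3, 3]

    # Axis-angle decomposition of the rotation gives the joint axis direction.
    rotvec = Rotation.from_matrix(R).as_rotvec()
    angle = np.linalg.norm(rotvec)
    axis = rotvec / angle

    # A point p on the axis is left fixed by the motion: (I - R) p = t.
    # (I - R) is rank-deficient along the axis, so solve in the least-squares sense.
    point, *_ = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)
    return axis, point, angle

# Toy example: a door hinged at x = 1 m, swung open by 90 degrees about z.
T_closed = np.eye(4)
T_open = np.eye(4)
T_open[:3, :3] = Rotation.from_euler("z", 90, degrees=True).as_matrix()
T_open[:3, 3] = [1.0, -1.0, 0.0]   # translation induced by rotating about the offset hinge
axis, point, angle = revolute_joint_from_two_poses(T_closed, T_open)
print(axis, point, np.degrees(angle))   # ~[0 0 1], ~[1 0 0], ~90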
With this rich world model in place, we use a Large Language Model (LLM) as
the robot's cognitive engine. This system empowers the agent to robustly
ground complex, free-form human commands within its 3D map. Crucially, it
also maintains an explicit "memory" of the world's state, allowing it to
track changes as it acts and successfully perform long-horizon, multi-step
tasks.
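As another hedged illustration (again an assumption, not the announced system), an explicit world-state memory can be as simple as a structured record of objects that is serialized into the planner's prompt and updated after every executed action, so that later steps of a long-horizon task condition on the current state rather than the initial observation. The class and field names below are invented for the example.

import json

class WorldStateMemory:
    def __init__(self, objects):
        # e.g. {"cabinet_door_3": {"room": "kitchen", "state": "closed"}, ...}
        self.objects = objects
        self.history = []

    def as_prompt_context(self):
        # Serialize the current world state for inclusion in an LLM planning prompt.
        return json.dumps(self.objects, indent=2)

    def apply(self, action, obj, **changes):
        # Record an executed action and its effect, keeping the state current.
        self.objects[obj].update(changes)
        self.history.append(f"{action}({obj}) -> {changes}")

memory = WorldStateMemory({
    "cabinet_door_3": {"room": "kitchen", "state": "closed"},
    "mug_1": {"room": "kitchen", "inside": "cabinet_3"},
})
memory.apply("open", "cabinet_door_3", state="open")
memory.apply("pick", "mug_1", inside=None, held_by="gripper")
print(memory.as_prompt_context())   # what the planner sees before choosing the next step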
Together, these contributions form a complete pipeline from pixels to
actions, paving the way for more capable and collaborative robots that can
operate effectively in human-centric environments.
*Thesis Committee*: Chair: Matthew Walter <mwalter at ttic.edu>; Members: Greg Shakhnarovich, Vitor Guizilini
Mary C. Marre
Faculty Administrative Support
*Toyota Technological Institute*
*6045 S. Kenwood Avenue, Rm 517*
*Chicago, IL 60637*
*773-834-1757*
*mmarre at ttic.edu*