Preparing Data at Scale for AuroraGPT

July 9, 2024
1:00 PM – 2:00 PM
Online

Speaker: Robert Underwood (MCS)

CS Seminar

Abstract: In this talk, I’ll share the recent progress of the AuroraGPT Data Team, how we contribute to the project of building a science-focused LLM with AuroraGPT, how we collaborate with the other teams, and what topics we see as open questions. As the data team, we are responsible for identifying, preparing, and deduplicating scientific data and text. We’ll talk about the systems and data quality challenges that our team tackles in turning terabytes of scientific data and text into high-quality text and data for training.