AI LLM Data Engineer
Physical Presence: Hybrid – 3 days per week in office (Downtown Toronto)
Job Description:
We are looking for an AI LLM Data Engineer to join our Data Science team. In this role, you will collaborate closely with data scientists and AI researchers to manage, preprocess, and optimize the data pipelines behind Large Language Models (LLMs). The ideal candidate has experience building and maintaining data infrastructure designed for AI applications, a strong understanding of LLM architectures, and the data engineering and preprocessing skills to keep pipelines scalable, efficient, and robust.
Key Responsibilities:
- Design, develop, and maintain data pipelines for training and fine-tuning large language models (LLMs) to support AI-driven applications and research.
- Implement robust ingestion mechanisms for structured, semi-structured, and unstructured datasets, including text, image, and audio data, ensuring high-quality inputs for LLMs.
- Collaborate with data scientists to streamline extraction, transformation, and loading (ETL) processes, optimizing data flow for AI model training, evaluation, and deployment.
- Develop and implement preprocessing pipelines that clean, tokenize, and structure large datasets efficiently so they are ready for model input (see the sketch after this list).
- Create processes to handle missing data and build scalable solutions that turn raw data into structured formats suitable for LLM training.
- Work with varied data types (text, metadata, embeddings, etc.) and ensure they meet the model's input requirements.
- Optimize LLM training data so that large-scale datasets are processed and stored efficiently.
- Collaborate with AI researchers to fine-tune data ingestion strategies, adjusting pipelines to accommodate evolving model needs and architectures.
- Apply advanced feature engineering techniques to improve the model's ability to learn from large, diverse datasets.
- Partner with cross-functional teams, including AI researchers, data scientists, and machine learning engineers, to understand their needs and provide the data support required for model development and deployment.
- Build and maintain documentation for data pipelines, transformations, and data handling processes to ensure reproducibility and transparency.
- Develop automated data quality checks so that datasets meet quality standards before they are fed into LLM training (see the sketch after this list).
- Establish robust monitoring to track pipeline performance and data quality during model training and inference.
- Use tools and frameworks to log and audit data transformations and pipeline activity, identifying and addressing data issues promptly.
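
For a concrete flavor of these responsibilities, the following is a minimal, illustrative Python sketch of a preprocessing and quality-check step. It uses a whitespace tokenizer as a stand-in for a real subword tokenizer (e.g., BPE), and every function name and threshold is a hypothetical example rather than an existing pipeline:

# Minimal, hypothetical sketch of a preprocessing + quality-check step
# for LLM training data; names and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class Record:
    doc_id: str
    text: str

def clean(text: str) -> str:
    # Normalize whitespace (a stand-in for fuller cleaning rules).
    return " ".join(text.split())

def tokenize(text: str) -> list[str]:
    # Whitespace tokenizer standing in for a real subword tokenizer.
    return text.split()

def passes_quality_checks(tokens: list[str], min_tokens: int = 10) -> bool:
    # Hypothetical rule: drop records too short to be useful for training.
    return len(tokens) >= min_tokens

def preprocess(records: list[Record]) -> list[dict]:
    # Clean, tokenize, and filter records, emitting structured rows.
    rows = []
    for rec in records:
        tokens = tokenize(clean(rec.text))
        if passes_quality_checks(tokens):
            rows.append({"doc_id": rec.doc_id, "tokens": tokens})
    return rows

sample = [Record("a", "  The quick   brown fox jumps over the lazy dog by the river.  "),
          Record("b", "too short")]
print(preprocess(sample))  # only record "a" survives the length check

In practice the cleaning rules, tokenizer, and thresholds would be driven by the target model and dataset; the point of the sketch is only the clean/tokenize/check/structure flow.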
Required Skills and Qualifications:
- 3-5 years of experience in data engineering, with a focus on building and maintaining scalable data pipelines for AI applications.
- Expertise in data engineering technologies such as Apache Kafka, Apache Airflow, Spark, Hadoop, or similar distributed systems for processing large datasets.
- Strong experience with ETL processes, particularly for the high-volume data used in machine learning and AI model training.
- Experience working with databases (SQL, NoSQL, and distributed data stores such as Cassandra and MongoDB).
- AI/LLM expertise:
  - Proven experience working with Large Language Models (LLMs) or similar AI technologies, particularly their large-scale data processing needs.
  - Understanding of transformer architectures (e.g., GPT, BERT, T5) and how to design data pipelines that support training and evaluation of these models.
  - Familiarity with Natural Language Processing (NLP) tasks such as tokenization, named entity recognition, and semantic analysis.
  - Experience with data strategies for pretraining, fine-tuning, and evaluating LLMs.
- Proficiency in Python and related libraries such as Pandas, NumPy, and Dask for data manipulation and large-scale processing (see the sketch after this list).
- Experience with machine learning frameworks such as TensorFlow or PyTorch for training or fine-tuning LLMs.
- Familiarity with cloud platforms such as AWS, GCP, or Azure for storing, processing, and scaling large datasets.
- Strong knowledge of data privacy regulations, including GDPR and CCPA, and how they affect the management of large datasets used in AI model training.
- Experience with data governance, version control, and data lineage, particularly for AI model data pipelines.
- Ability to collaborate closely with data scientists, AI researchers, and engineers.
- Strong written and verbal communication skills.
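
As a rough illustration of the Pandas/Dask proficiency above, here is a short, hypothetical sketch that cleans a sharded text corpus in parallel with Dask; the file patterns and the "text" column are assumptions made for the example:

# Illustrative only: parallel text cleaning over a sharded corpus with Dask.
# File patterns and column names are hypothetical.
import dask.dataframe as dd

def normalize_whitespace(df):
    # Runs once per partition; df is a plain pandas DataFrame here.
    df["text"] = df["text"].str.split().str.join(" ")
    return df

# Lazily read the shards, roughly one partition per file.
ddf = dd.read_csv("corpus-shard-*.csv")

# Clean each partition in parallel, then drop very short rows.
cleaned = ddf.map_partitions(normalize_whitespace)
cleaned = cleaned[cleaned["text"].str.len() > 20]

# Writing the output triggers execution of the lazy task graph.
cleaned.to_csv("cleaned-shard-*.csv", index=False)

The same partition-wise pattern scales from a laptop to a cluster, which is why Dask (or Spark) appears alongside Pandas in the list above.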
Nice-to-have skills:
- Experience with deep learning frameworks and an understanding of distributed computing environments.
- Exposure to cloud-native architectures for scalable AI workflows, including containerized environments using Docker and Kubernetes.
- Knowledge of data augmentation techniques for enhancing training datasets and improving model generalization.
- A background in software engineering best practices, including version control, automated testing, and continuous integration/continuous delivery (CI/CD); a small illustration follows this list.
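
To give a flavor of the automated-testing practice in the last point, here is a tiny, hypothetical pytest file of the kind a CI/CD pipeline would run on every commit (the normalize_whitespace helper is an invented example, not project code):

# test_cleaning.py -- illustrative unit tests a CI pipeline might run.

def normalize_whitespace(text: str) -> str:
    # Hypothetical helper under test: collapse runs of whitespace.
    return " ".join(text.split())

def test_collapses_internal_whitespace():
    assert normalize_whitespace("a \t b\n\nc") == "a b c"

def test_strips_leading_and_trailing_space():
    assert normalize_whitespace("  hello  ") == "hello"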