Cloud Infrastructure Engineer
Division
Korea
Job group
Tech/Product
Experience Level
Experienced (3+ years)
Job Types
Full-time
Locations
Seoul Office: 561 Seolleung-ro, Gangnam-gu, Seoul

RLWRLD is a leading Physical AI company developing a Robotics Foundation Model (RFM) that enables robots to perceive, reason, and act in the real world like humans.


Building on deep research capabilities in AI and robotics, and a strong data-collaboration network with industrial partners in Japan, Korea, and beyond, RLWRLD is rapidly advancing our RFM to enable precise manipulation by high-degree-of-freedom robotic hands. The company also collaborates with world-class research groups and partners in robotics and sensor solutions to develop AI models that can be practically deployed across industries such as manufacturing, logistics, and services.


Having raised approximately KRW 60 billion in cumulative seed funding from leading domestic and global venture capital firms and major corporations, RLWRLD continues to attract exceptional talent who are eager to drive innovation across AI, robotics technology, and business.







About the Product Organization


At RLWRLD, our Product Organization is responsible for developing all core products — spanning planning, development, and research.


We are building foundational technologies such as:

  • Robotics Foundation Model (RFM)
  • APIs/SDKs to deliver RFM functionality
  • Data pipeline & teleoperation tools
  • Training systems for model learning
  • Benchmark systems to test performance
  • Robot control systems
  • Infra stack (GPU orchestration, compute management)


Our team includes both research and software engineers, working fluidly across AI model development and software infrastructure. We collaborate closely with academic researchers, robotics hardware partners, and internal business developers to deliver cutting-edge robotics solutions.



Position Overview

We are looking for engineers who can push large-scale GPU clusters and data pipelines to their limits in order to support continuously advancing robot intelligence.


In this role, you will be responsible for reliably operating an HPC environment composed of hundreds of A100/H100 GPUs and petabyte-scale storage, while eliminating bottlenecks in the training data flow to maximize research productivity.


We are seeking engineers who can accelerate the training of Physical AI models through best-in-class infrastructure engineering.




Key Responsibilities

  • Large-Scale GPU Cluster Management
      • Multi-Cloud / Hybrid Cloud Operations: Build and manage large-scale GPU clusters (e.g., SLURM, Kubernetes) across diverse cloud environments such as AWS, Kakao Cloud, Azure, and SK Lambda
      • High-Availability Architecture Design: Design systems with login-node HA, auto-scaling, and GPU-failure auto-drain mechanisms to proactively prevent outages and enable automatic recovery
      • Resource Optimization: Define job-scheduling policies (MaxTime, Priority, Fair Share) and maximize GPU utilization across the cluster
  • ML Pipeline Optimization & Acceleration
      • I/O Bottleneck Analysis and Mitigation: Profile and tune storage systems (NFS, Lustre, GPUDirect Storage) and DataLoader stages to resolve bottlenecks during large-scale dataset training
      • Distributed Training Support: Optimize networking (InfiniBand, RoCE) and improve NCCL communication efficiency in multi-node training environments
      • DevOps / MLOps: Automate infrastructure changes (IaC) and training-environment deployments using tools such as GitHub Actions
  • Monitoring & Reliability Engineering
      • Proactive Incident Response: Build alerting systems that monitor GPU temperature, memory errors (ECC), and network latency in real time to detect and mitigate issues before failures occur
      • Observability: Develop dashboards for cluster resource status and per-user utilization using tools such as Grafana and Prometheus




Required Qualifications

  • 3+ years of experience in infrastructure engineering
  • Cluster Management Experience: Hands-on experience operating large-scale compute clusters using SLURM, Kubernetes (EKS / AKS / GKE), or similar systems
  • Cloud Platform Proficiency: Experience designing networking (VPC, subnets), security (IAM), and storage architectures in public cloud environments such as AWS, GCP, or Azure
  • Linux & Scripting: Strong skills in Linux system/kernel tuning and in building automation tools using shell scripting and Python




Preferred Qualifications

  • High-Performance Computing (HPC) & AI
      • Advanced GPU Cluster Operations: Experience configuring and troubleshooting high-speed interconnects such as InfiniBand, NVLink, and RDMA
      • ML Framework Understanding: Deep understanding of data-loading mechanisms in PyTorch or TensorFlow, with experience optimizing them at the system level
      • Parallel File Systems: Experience operating high-performance distributed file systems such as Lustre, GPFS, or WekaIO
  • MLOps & Workflow Orchestration
      • Workflow Management: Experience designing and operating complex data-preprocessing and training workflows using Prefect, Airflow, or Kubeflow Pipelines
      • Experiment Tracking: Experience integrating infrastructure with experiment-management tools such as Weights & Biases or MLflow
      • Model Serving: Experience deploying and optimizing models using Triton Inference Server, TorchServe, or similar systems
  • Problem Solving & Optimization
      • Bottleneck Profiling: Experience diagnosing end-to-end system performance issues (Storage → CPU/RAM → GPU) using tools such as Nsight Systems or PyTorch Profiler
      • Cost Optimization: Experience reducing cloud costs through strategies such as spot-instance usage and reserved-instance planning




Working Conditions

  • Work Location: 561 Seolleung-ro, Gangnam-gu, Seoul (RUBINA Building, Yeoksam-dong)
  • Employment Type: Full-time
  • Probationary Period:
      • A three-month probationary period will apply upon employment.
      • During this period, your performance and work attitude will be evaluated.
      • Depending on the evaluation results, the probationary period may be extended or the employment offer may be withdrawn.



How to Apply

  • Application Materials:
      • Resume in English or Korean
      • (Optional) Portfolio, research materials, or project documents showcasing your capabilities
  • Application Deadline: Rolling basis



Hiring Process

  • Document Screening → 1st Interview → 2nd Interview → 3rd Interview → Final Offer
  • Candidates who pass the document screening will be contacted individually.
  • An additional coffee chat or coding test may be conducted if necessary.



Work Environment & Support

  • Flexible Work Schedule: Adjust your working hours autonomously to match your personal rhythm.
  • Equipment & Software Support: We provide job-specific equipment and essential software required for your role.
  • Office Amenities: Enjoy our in-office snack bar and coffee machines.
  • Holiday & Birthday Gifts: Small gifts are provided for holidays and birthdays.
  • Health Checkup Support: We support your well-being through regular health checkups.