Cloud Infrastructure Engineer
Division
Korea
Job group
Tech/Product
Experience Level
Experienced (3+ years)
Job Types
Full-time
Locations
Seoul Office: 561 Seolleung-ro, Gangnam-gu, Seoul

RLWRLD is a leading Physical AI company developing a Robotics Foundation Model (RFM) that enables robots to perceive, reason, and act in the real world like humans.


Building on deep research capabilities in AI and robotics, and a strong data-collaboration network with industrial partners in Japan, Korea, and beyond, RLWRLD is rapidly advancing our RFM to enable precise manipulation by high-degree-of-freedom robotic hands. The company also collaborates with world-class research groups and partners in robotics and sensor solutions to develop AI models that can be practically deployed across industries such as manufacturing, logistics, and services.


Having raised approximately KRW 60 billion in cumulative seed funding from leading domestic and global venture capital firms and major corporations, RLWRLD continues to attract exceptional talent who are eager to drive innovation across AI, robotics technology, and business.







About the Product Organization


At RLWRLD, our Product Organization is responsible for developing all core products — spanning planning, development, and research.


We are building foundational technologies such as:

  • Robotics Foundation Model (RFM)
  • APIs/SDKs to deliver RFM functionality
  • Data pipeline & teleoperation tools
  • Training systems for model learning
  • Benchmark systems to test performance
  • Robot control systems
  • Infra stack (GPU orchestration, compute management)


Our team includes both research and software engineers, working fluidly across AI model development and software infrastructure. We collaborate closely with academic researchers, robotics hardware partners, and internal business developers to deliver cutting-edge robotics solutions.



Position Overview

We are looking for engineers who can push large-scale GPU clusters and data pipelines to their limits in order to support continuously advancing robot intelligence.


In this role, you will be responsible for reliably operating an HPC environment composed of hundreds of A100/H100 GPUs and petabyte-scale storage, while eliminating bottlenecks in the training data flow to maximize research productivity.


We are seeking engineers who can accelerate the training of Physical AI models through best-in-class infrastructure engineering.




Key Responsibilities

  • Large-Scale GPU Cluster Management
      • Multi-Cloud / Hybrid Cloud Operations: Build and manage large-scale GPU clusters (e.g., SLURM, Kubernetes) across diverse cloud environments such as AWS, Kakao Cloud, Azure, and SK Lambda
      • High-Availability Architecture Design: Design systems with login-node HA, auto-scaling, and GPU-failure auto-drain mechanisms to proactively prevent outages and enable automatic recovery
      • Resource Optimization: Define job-scheduling policies (MaxTime, Priority, Fair Share) and maximize GPU utilization across the cluster
  • ML Pipeline Optimization & Acceleration
      • I/O Bottleneck Analysis and Mitigation: Profile and tune storage systems (NFS, Lustre, GPUDirect Storage) and DataLoader stages to resolve bottlenecks during large-scale dataset training
      • Distributed Training Support: Optimize networking (InfiniBand, RoCE) and improve NCCL communication efficiency in multi-node training environments
      • DevOps / MLOps: Automate infrastructure changes (IaC) and training-environment deployments using tools such as GitHub Actions
  • Monitoring & Reliability Engineering
      • Proactive Incident Response: Build alerting systems that monitor GPU temperature, memory errors (ECC), and network latency in real time to detect and mitigate issues before failures occur
      • Observability: Develop dashboards for cluster resource status and per-user utilization using tools such as Grafana and Prometheus




Required Qualifications

  • 3+ years of experience in infrastructure engineering
  • Cluster Management Experience: Hands-on experience operating large-scale compute clusters using SLURM, Kubernetes (EKS / AKS / GKE), or similar systems
  • Cloud Platform Proficiency: Experience designing networking (VPC, subnets), security (IAM), and storage architectures in public cloud environments such as AWS, GCP, or Azure
  • Linux & Scripting: Strong skills in Linux system/kernel tuning and in building automation tools using shell scripting and Python




Preferred Qualifications

  • High-Performance Computing (HPC) & AI
      • Advanced GPU Cluster Operations: Experience configuring and troubleshooting high-speed interconnects such as InfiniBand, NVLink, and RDMA
      • ML Framework Understanding: Deep understanding of data-loading mechanisms in PyTorch or TensorFlow, with experience optimizing them at the system level
      • Parallel File Systems: Experience operating high-performance distributed file systems such as Lustre, GPFS, or WekaIO
  • MLOps & Workflow Orchestration
      • Workflow Management: Experience designing and operating complex data-preprocessing and training workflows using Prefect, Airflow, or Kubeflow Pipelines
      • Experiment Tracking: Experience integrating infrastructure with experiment-management tools such as Weights & Biases or MLflow
      • Model Serving: Experience deploying and optimizing models using Triton Inference Server, TorchServe, or similar systems
  • Problem Solving & Optimization
      • Bottleneck Profiling: Experience diagnosing end-to-end system performance issues (Storage → CPU/RAM → GPU) using tools such as Nsight Systems or PyTorch Profiler
      • Cost Optimization: Experience reducing cloud costs through strategies such as spot-instance usage and reserved-instance planning




Working Conditions

  • Work Location: 561 Seolleung-ro, Gangnam-gu, Seoul (RUBINA Building, Yeoksam-dong)
  • Employment Type: Full-time
  • Probationary Period:
      • A three-month probationary period will apply upon employment.
      • During this period, your performance and work attitude will be evaluated.
      • Depending on the evaluation results, the probationary period may be extended or the employment offer may be withdrawn.



How to Apply

  • Application Materials:
      • Resume in English or Korean
      • (Optional) Portfolio, research materials, or project documents showcasing your capabilities
  • Application Deadline: Rolling basis



Hiring Process

  • Document Screening → 1st Interview → 2nd Interview → 3rd Interview → Final Offer
  • Candidates who pass the document screening will be contacted individually.
  • An additional coffee chat or coding test may be conducted if necessary.



Work Environment & Support

  • Flexible Work Schedule: Adjust your working hours autonomously to match your personal rhythm.
  • Equipment & Software Support: We provide job-specific equipment and essential software required for your role.
  • Office Amenities: Enjoy our in-office snack bar and coffee machines.
  • Holiday & Birthday Gifts: Small gifts are provided for holidays and birthdays.
  • Health Checkup Support: We support your well-being through regular health checkups.