
Manager, Solutions Architecture - Continuous Bringup and Optimization

NVIDIA • Tokyo, Japan • Remote

No Relocation

Posted: April 24, 2026

Job Description

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people.

NVIDIA is looking for a Manager of Solutions Architecture to lead the NVIDIA Infrastructure Specialist Team for Continuous Bringup and Optimization. Academic and commercial groups around the world are using NVIDIA products to redefine deep learning and data analytics and to power data centers. We are building many of the largest and fastest AI/HPC systems in the world! We are looking for someone who can thrive on a dynamic, customer-focused team that requires excellent interpersonal skills. This role will interact with customers, partners, and internal teams to analyze, define, and implement large-scale projects. The scope of these efforts combines networking, system design, and automation, along with being the face to the customer!

What you'll be doing:

Lead a team dedicated to consulting on, optimizing, and improving the resiliency of customer AI factory infrastructures, ensuring high service quality and operational excellence.
Drive hands-on infrastructure analysis and tuning of complex GPU-accelerated systems, AI workloads, and data center environments, identifying areas for efficiency gains and operational improvements.
Work closely with internal teams (Engineering, Product, Sales) and customer collaborators to align infrastructure strategies with business goals, enabling smooth, scalable AI deployments.
Act as a technical authority on NVIDIA GPU, CPU, and networking technologies, supporting customer discussions and architecture reviews.
Establish and evolve optimization and monitoring methodologies, using analytics and tooling to detect bottlenecks, reduce downtime, and ensure system health at scale.
Participate in customer-facing engagements, including roadmap sessions, post-deployment reviews, and incident retrospectives, helping to craft the customer experience and influence NVIDIA’s infrastructure strategy.

What we need to see:

4+ years of team leadership and 8+ years overall in service operations in large data centers, with a focus on infrastructure performance.
Bachelor’s, Master’s, or PhD in Computer Science, Engineering, or a related field, with demonstrated technical leadership in data center, server, and network operations.
Proficiency in both Japanese and English, with the ability to communicate technical topics clearly across multicultural teams and with customers.
Deep expertise in data center architecture and operations, including servers, GPUs, NICs, networking topologies, storage systems, and Linux-based environments.
Strong analytical, problem-solving, and decision-making skills, with the ability to identify root causes, drive continuous improvement, and deliver resilient technical solutions.
Strong communication, time management, and organizational skills, along with experience leading complex projects, guiding technical teams, and meeting key metrics.

Ways to stand out from the crowd:

Deep familiarity with AI infrastructure and workflows, including training/inference pipelines, MLOps/DevOps tools, containerization (Docker, Kubernetes), and large-scale system deployments.
Knowledge of data center infrastructure operations, including safety, security, environmental controls, and standard operating procedures.
Strong interpersonal and collaboration skills, with the ability to lead discussions, influence outcomes, and build positive relationships with both internal and external collaborators.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking individuals in the world working for us. If you're creative and autonomous, we want to hear from you!
