The data behind intelligence that acts

We produce high-quality training data for agentic AI, embodied robotics, and foundation models: collected in real environments, with human expertise, at production scale.

50+
Dataset Products
200+
Domain Experts on Demand
30+
University Partnerships
8
Languages in Production
Trusted by teams building with
NVIDIA NeMo · HuggingFace · Terminal-Bench · LeRobot · GR00T
Data for every stage of the AI stack
From pre-training corpora to post-training alignment, from digital agents to physical robots, we cover the full spectrum of modern AI data needs.

Embodied AI & Physical AI

Egocentric video demonstrations, UMI gripper data, teleoperation recordings, hand pose + SLAM trajectories. Real humans in real environments: hotels, factories, kitchens, warehouses.

Egocentric Video · UMI / Teleop · Hand Pose · SLAM · LeRobot Format

Agentic Trajectories

Complete observe-think-act decision traces from AI agents solving real tasks in Docker-isolated terminals. Multi-step, auto-verified, SFT-ready.

Terminal Agents · Browser Agents · RL Environments · pytest Verified

World Models

High-fidelity gameplay recordings with frame-aligned inputs, cinematographic footage, and dynamic video corpora. Fuel for video generation and world simulation models.

AAA Gameplay · 60fps + Keylog · Cinematographic · HUD-free
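
Frame alignment means each logged input can be attributed to a specific video frame. The following is a minimal Python sketch of that mapping, assuming input events are logged as (timestamp in milliseconds, key, action) tuples measured from the first frame. The event layout and the function name are assumptions made for illustration, not the delivered schema.

```python
from collections import defaultdict

FPS = 60  # frame rate named in the product description


def align_events(events, fps=FPS):
    """Group (t_ms, key, action) events by the index of the frame they fall in."""
    frames = defaultdict(list)
    for t_ms, key, action in events:
        # Integer arithmetic avoids floating-point rounding at frame boundaries.
        frame_idx = t_ms * fps // 1000
        frames[frame_idx].append((key, action))
    return dict(frames)


# Hypothetical key log: timestamps are milliseconds since the first frame.
key_log = [(0, "W", "down"), (17, "A", "down"), (1000, "W", "up")]
aligned = align_events(key_log)
print(aligned)
# {0: [('W', 'down')], 1: [('A', 'down')], 60: [('W', 'up')]}
```

At 60 fps each frame covers roughly 16.7 ms, so an event at 17 ms lands in frame 1 and an event at exactly one second lands in frame 60.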

Eval & Benchmarks

Expert-designed evaluation suites: agent skill testing, GPU kernel benchmarks, multimodal science exams, structured image QA, engineering drawing analysis.

Agent Skills · KernelBench · Multimodal Science · Rubric Scoring

Code & Visual Generation

Structured prompt → code → rendered output triples with rubric evaluation. SVG, CSS animations, 3D scenes, interactive apps. Multi-dimensional quality scoring.

SVG / HTML / CSS · 3D Animations · Rubric Scored · 500+ Pairs
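
One way to read "multi-dimensional quality scoring" is as a weighted average over rubric dimensions. The Python sketch below shows that idea only. The dimension names, the 1-5 scale, and the weights are hypothetical and are not the rubric shipped with any dataset.

```python
def rubric_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension rubric scores."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in weights) / total_weight


# Hypothetical dimensions, scored on a 1-5 scale.
weights = {"correctness": 3, "visual_fidelity": 2, "code_quality": 1}
scores = {"correctness": 5, "visual_fidelity": 4, "code_quality": 3}
overall = rubric_score(scores, weights)
print(round(overall, 2))  # 4.33
```

Keeping the per-dimension scores alongside the aggregate lets a training team filter on a single dimension instead of relying on the overall number.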

STEM, Multilingual & Domain

Math and science problems with chain-of-thought (CoT) solutions, K12 curricula in 5+ languages, financial agent data, medical records, traditional medicine databases, and SFT corpora in Southeast Asian languages.

EN / ZH / ID / VI / MS · STEM · Finance · Medical
From specification to delivery
We don't resell public datasets. Every data product is built from scratch through a rigorous, repeatable pipeline designed for model-moving quality.
01

Co-Design

We work with your ML team to define task specifications, annotation schemas, quality rubrics, and acceptance criteria, all aligned to your model architecture and training loop.

02

Collect

Domain experts and trained operators collect data in real environments (factory floors, hotel rooms, kitchens, Docker terminals), not synthetic simulations.

03

Validate

Automated QA (pytest, rendering checks, format validators) plus human expert review. Only data that passes both layers ships. Correction data available on request.
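
As a rough illustration of the two-layer gate described above, the Python sketch below ships a sample only when every automated check passes and an expert reviewer approves it. The `Sample` class, its field names, and the check names are hypothetical and are not taken from the production pipeline.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record; field names are illustrative assumptions.
    sample_id: str
    auto_checks: dict[str, bool]  # e.g. {"pytest": True, "format": True}
    expert_approved: bool


def ships(sample: Sample) -> bool:
    """Return True only if both validation layers pass."""
    # Layer 1: at least one automated check ran, and all of them passed.
    auto_ok = bool(sample.auto_checks) and all(sample.auto_checks.values())
    # Layer 2: a human expert signed off.
    return auto_ok and sample.expert_approved


def partition(samples: list[Sample]) -> tuple[list[Sample], list[Sample]]:
    """Split samples into (shipped, held_back)."""
    shipped = [s for s in samples if ships(s)]
    held_back = [s for s in samples if not ships(s)]
    return shipped, held_back


batch = [
    Sample("a", {"pytest": True, "format": True}, expert_approved=True),
    Sample("b", {"pytest": True, "format": False}, expert_approved=True),
    Sample("c", {"pytest": True, "format": True}, expert_approved=False),
]
shipped, held_back = partition(batch)
print([s.sample_id for s in shipped])    # ['a']
print([s.sample_id for s in held_back])  # ['b', 'c']
```

A sample with no recorded automated checks is treated as failing, so missing QA results cannot pass the gate by accident.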

04

Deliver

Your native format: Zarr, Parquet, Chat JSON, HuggingFace datasets, LeRobot, RLDS. S3-compatible API, bulk export, or custom integration.

Built different
Most data companies sell you labels. We build you pipelines.

Real environments, not simulated ones

Our operators work in actual hotels, factories, warehouses, and kitchens. The data captures real-world physics, lighting, occlusion, and human variability that simulation can't replicate.


Rubric-driven quality, not crowd consensus

Every dataset ships with explicit quality rubrics and automated verification. No crowd voting. No ambiguous majority labels. Deterministic, reproducible quality metrics.


Asia-Pacific execution, global delivery

Hong Kong HQ with deep operations in Mainland China and a US entity (Isotope LLC). We bridge China's data execution capability with Western client standards and compliance.


Format-native, not format-converted

We build data in your training format from day one, rather than collecting in one format and converting later. LeRobot, GR00T, OpenAI Chat, Terminal-Bench, and HuggingFace are all first-class.

Proven at scale
1,000+

Agent tasks designed & verified

Terminal-Bench standard tasks across 6 domains (algorithms, ML engineering, debugging, system admin, Git, data ops), each with a Dockerfile, pytest tests, and auto-grading.

40+

Real-world scenes captured

Egocentric demonstrations across hospitality, logistics, manufacturing, and custom SOP environments. Multi-camera, multi-annotation, production-validated pipelines.

8

Languages in production

English, Chinese, Indonesian, Vietnamese, Malay, and more. K12 curricula, SFT corpora, and TTS voice libraries, each built to local education and linguistic standards.

Browse demo datasets on HuggingFace
huggingface.co/obaydata
Built by specialists, not crowd workers
Our data is produced by vetted domain experts across academia and industry, not anonymous crowdsourcing platforms. Every dataset reflects deep domain knowledge.
30+
University partners
CS, robotics, medical, finance, and linguistics departments across China, Hong Kong, and Southeast Asia
200+
Domain experts on demand
Robotics engineers, ML researchers, physicians, financial analysts, linguists, and SWE professionals
15+
Industry verticals
Robotics, healthcare, finance, manufacturing, logistics, education, telecom, legal, and more
100%
Expert-reviewed output
Every dataset goes through domain expert review before delivery, with no auto-approved crowd output
Robotics & Embodied AI
Mechanical engineers, computer vision researchers, manipulation specialists. Trained on UMI, teleop, and egocentric capture protocols.
Software Engineering & Agents
Senior SWEs with production experience in debugging, ML ops, system admin. Design and verify Terminal-Bench tasks.
Medical, Finance & STEM
Licensed physicians, CFA holders, PhD-level scientists. Produce and validate domain-specific datasets with expert-grade accuracy.
Enterprise-ready data operations
We operate with the governance, security, and legal infrastructure that US enterprise clients require.
US Legal Entity
Isotope LLC registered in the United States. Contracts, invoicing, and IP assignment under US law.
Data Security
PII redaction, access-controlled storage, encrypted transfer. NDA-protected workflows standard on all projects.
IP Clean & Licensed
All data produced under work-for-hire or licensed agreements. Full IP transfer to client. No open-source contamination.
Cross-Border Compliant
HK entity bridges US clients with APAC operations. Data handling compliant with GDPR principles and US export regulations.
✓ SOC 2 Type II aligned practices
✓ Standard MSA & DPA available
✓ S3-compatible secure delivery
✓ Audit trail on all annotations
Deep expertise across modalities
Every vertical is backed by production infrastructure, domain specialists, and proven delivery.

Egocentric Video Data

Wearable multi-camera capture (head + dual wrist) with optional Manus haptic gloves. Real human demonstrations in real environments.

  • 1080p video: head + 2× wrist perspectives
  • Hand pose reconstruction (world & camera coords)
  • SLAM camera trajectory estimation
  • Atomic skill annotations with frame timestamps
  • Correction data and speed-controlled capture

Embodied AI Data (UMI & Teleop)

Multi-modality robotic data: handheld UMI grippers, VR/exoskeleton teleoperation, multi-view RGB-D, force/tactile sensing.

  • UMI: wrist-view + 6-DoF gripper, no robot required
  • Teleop: joint angles, end-effector, contact dynamics
  • Format: MCap, RLDS, HDF5, Zarr (LeRobot compatible)
  • Sub-200ms latency, millisecond sensor sync

Agentic Trajectories

Complete observe-think-act traces in real Docker-isolated terminals. Ideal for SFT of coding and agentic models.

  • Terminal-Bench standard: Dockerfile + task.yaml + pytest
  • Real execution: every command runs in a live container
  • Structured JSON: state_analysis → explanation → commands
  • Domains: algorithms, ML, debug, Git, sysadmin, data ops
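
To make the structured JSON step concrete, here is a minimal Python sketch that parses one trajectory step and checks for the three fields named above. The field names come from the list; the field types (two strings and a list of command strings), the example content, and the `parse_step` helper are assumptions for illustration, not the published schema.

```python
import json

# Field names come from the trace format named above; the types are assumptions.
REQUIRED_FIELDS = {"state_analysis": str, "explanation": str, "commands": list}


def parse_step(raw: str) -> dict:
    """Parse one observe-think-act step and check its basic shape."""
    step = json.loads(raw)
    if not isinstance(step, dict):
        raise ValueError("step must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(step.get(name), expected):
            raise ValueError(f"missing or mistyped field: {name}")
    if not all(isinstance(cmd, str) for cmd in step["commands"]):
        raise ValueError("commands must be a list of strings")
    return step


# Hypothetical step content, for illustration only.
raw_step = json.dumps({
    "state_analysis": "Tests fail because the config file is missing.",
    "explanation": "Create the config, then re-run the test suite.",
    "commands": ["touch config.yaml", "pytest -q"],
})
step = parse_step(raw_step)
print(step["commands"])  # ['touch config.yaml', 'pytest -q']
```

A shape check like this can run before SFT formatting, so malformed steps are caught before they reach a training set.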

We don't just sell datasets.
We build your data.

Every model has unique data needs. We co-design collection SOPs, annotation schemas, and QA rubrics with your team, then execute at scale in real environments. From pilot batch to production volume, we iterate until the data moves your metrics.

  • 1

    Co-Design

    Define tasks, schemas, and quality rubrics with your ML team

  • 2

    Pilot Batch

    Small-scale production for validation and iteration

  • 3

    Review & Adjust

    Refine SOP based on your model's feedback signals

  • 4

    Scale Production

    Full deployment with automated QA and continuous delivery

  • 5

    Format & Deliver

    Zarr, Parquet, Chat JSON, or HuggingFace, whichever is your native format

New Oriental Bay Ltd.
Real data from real environments.
Because intelligence is learned, not generated.

New Oriental Bay is a full-stack AI training data company built to serve the next generation of intelligent systems, from large language models and agentic AI to embodied robotics and reinforcement learning environments. We believe the quality ceiling of any AI system is set by its training data, and we exist to raise that ceiling.

From our Hong Kong headquarters, we coordinate data operations across Mainland China and the US, combining deep local execution capability with global delivery standards. Our teams include domain experts in robotics, software engineering, medical science, finance, and education, so that every dataset reflects real-world complexity, not synthetic shortcuts.

Whether you need egocentric human demonstrations captured on factory floors, agent trajectories recorded in live Docker terminals, RL environment interaction logs, or multi-language STEM problem sets verified by subject-matter experts, we design, collect, annotate, validate, and deliver. Customization is not an add-on. It's how we work.

Contact
Office: Rm 13/F, Hollywood Plaza, Nathan Road, Mong Kok, Hong Kong
Web: obaydata.com