We produce high-quality training data for agentic AI, embodied robotics, and foundation models: from real environments, with human expertise, at production scale.
Egocentric video demonstrations, UMI gripper data, teleoperation recordings, hand pose + SLAM trajectories. Real humans in real environments: hotels, factories, kitchens, warehouses.
Complete observe-think-act decision traces from AI agents solving real tasks in Docker-isolated terminals. Multi-step, auto-verified, SFT-ready.
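As a sketch of what "SFT-ready" can mean here, one observe-think-act trace might flatten into chat messages like this. The field names (`observe`, `think`, `act`, `verified`) are illustrative assumptions, not a published schema:

```python
import json

# Hypothetical shape of one observe-think-act decision trace; field names
# are illustrative assumptions, not a published schema.
trace = {
    "task": "Fix the failing unit test in /app/repo",
    "steps": [
        {"observe": "$ pytest -x\n1 failed: test_parse_date",
         "think": "The parser assumes ISO dates; the fixture uses DD/MM/YYYY.",
         "act": "sed -i 's/%Y-%m-%d/%d\\/%m\\/%Y/' /app/repo/parser.py"},
    ],
    "verified": True,  # set by the automated grader, e.g. a pytest re-run
}

def to_chat_messages(trace):
    """Flatten a decision trace into SFT-ready chat messages."""
    messages = [{"role": "user", "content": trace["task"]}]
    for step in trace["steps"]:
        messages.append({"role": "user", "content": step["observe"]})
        messages.append({"role": "assistant",
                         "content": f"{step['think']}\n\n```bash\n{step['act']}\n```"})
    return messages

print(json.dumps(to_chat_messages(trace), indent=2))
```

Keeping the terminal observation and the model turn as separate messages preserves the multi-step structure for standard chat-format fine-tuning loaders.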
High-fidelity gameplay recordings with frame-aligned inputs, cinematographic footage, and dynamic video corpora. Fuel for video generation and world simulation models.
Expert-designed evaluation suites: agent skill testing, GPU kernel benchmarks, multimodal science exams, structured image QA, engineering drawing analysis.
Structured prompt → code → rendered output triples with rubric evaluation. SVG, CSS animations, 3D scenes, interactive apps. Multi-dimensional quality scoring.
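A minimal sketch of one such triple and its multi-dimensional score, assuming a weighted-average rubric; the dimension names, weights, and file path are hypothetical:

```python
# Hypothetical prompt -> code -> rendered-output triple; field names,
# rubric dimensions, and weights are illustrative assumptions.
triple = {
    "prompt": "A red circle that pulses once per second",
    "code": ('<svg><circle r="20" fill="red">'
             '<animate attributeName="r" values="20;24;20" dur="1s" '
             'repeatCount="indefinite"/></circle></svg>'),
    "render": "renders/0001.webm",  # path to the captured rendering
}

RUBRIC = {"prompt_fidelity": 0.4, "code_quality": 0.3,
          "visual_polish": 0.2, "accessibility": 0.1}

def overall_score(scores, rubric=RUBRIC):
    """Weighted average of per-dimension scores (each on a 0-10 scale)."""
    assert set(scores) == set(rubric), "score every rubric dimension"
    return sum(rubric[d] * scores[d] for d in rubric)

print(overall_score({"prompt_fidelity": 9, "code_quality": 8,
                     "visual_polish": 7, "accessibility": 6}))
```

Separating per-dimension scores from the aggregation rule keeps the scoring deterministic and reproducible, and lets clients re-weight dimensions without re-labeling.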
Math/science problems with CoT solutions, K-12 content in 5+ languages, financial agent data, medical records, traditional medicine databases, and SFT corpora in Southeast Asian languages.
We work with your ML team to define task specifications, annotation schemas, quality rubrics, and acceptance criteria, all aligned to your model architecture and training loop.
Domain experts and trained operators collect data in real environments (factory floors, hotel rooms, kitchens, Docker terminals), not in synthetic simulations.
Automated QA (pytest, rendering checks, format validators) plus human expert review. Only data that passes both layers ships. Correction data available on request.
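The two-layer gate above can be sketched as a simple predicate: a record ships only if it clears the automated checks and the human review. The specific checks and field names here are illustrative assumptions:

```python
# Minimal sketch of two-layer QA gating: an automated format check plus a
# human-review flag. Required fields and checks are illustrative assumptions.
def format_check(record):
    """Automated layer: reject records with missing or empty required fields."""
    required = ("prompt", "response", "verified")
    return all(record.get(k) not in (None, "") for k in required)

def ships(record, human_approved):
    """Only data that passes both the automated and human layers is delivered."""
    return format_check(record) and human_approved

good = {"prompt": "2+2?", "response": "4", "verified": True}
bad = {"prompt": "2+2?", "response": ""}

print(ships(good, human_approved=True))   # clears both layers
print(ships(bad, human_approved=True))    # rejected by the automated layer
```

In practice the automated layer would also run format validators, rendering checks, or a pytest suite per record type; the conjunction is the point, since neither layer alone gates delivery.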
Your native format: Zarr, Parquet, Chat JSON, HuggingFace datasets, LeRobot, RLDS. Delivered via S3-compatible API, bulk export, or custom integration.
Our operators work in actual hotels, factories, warehouses, and kitchens. The data captures real-world physics, lighting, occlusion, and human variability that simulation can't replicate.
Every dataset ships with explicit quality rubrics and automated verification. No crowd voting. No ambiguous majority labels. Deterministic, reproducible quality metrics.
Hong Kong HQ with deep operations in Mainland China and a US entity (Isotope LLC). We bridge China's data execution capability with Western client standards and compliance.
We build data in your training format from day one, rather than collecting in one format and converting later. LeRobot, GR00T, OpenAI Chat, Terminal-Bench, HuggingFace: all first-class.
Terminal-Bench standard tasks across six domains (algorithms, ML engineering, debugging, system administration, Git, data ops), each with a Dockerfile, pytest suite, and auto-grading.
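A sketch of what one such task bundle might contain, assuming the usual Dockerfile-plus-tests convention; the task id, file names, and file contents are hypothetical:

```python
# Hypothetical contents of one Terminal-Bench-style task bundle; the id,
# file names, and file bodies are illustrative, not an actual task.
task = {
    "id": "git/rebase-conflict-01",
    "domain": "git",
    "files": {
        "Dockerfile": "FROM python:3.11-slim\nCOPY repo/ /app/repo/\nWORKDIR /app",
        "task.md": "Resolve the rebase conflict on branch feature/login.",
        "tests/test_outcome.py": (
            "import subprocess\n"
            "def test_rebase_clean():\n"
            "    out = subprocess.run(['git', '-C', '/app/repo', 'status',\n"
            "                          '--porcelain'],\n"
            "                         capture_output=True, text=True)\n"
            "    assert out.stdout == ''\n"
        ),
    },
}

def auto_grade(pytest_returncode):
    """pytest exit code 0 means every check passed; anything else fails."""
    return pytest_returncode == 0
```

Grading off the pytest exit code inside the container keeps verification deterministic: the agent's terminal session either leaves the environment in the asserted end state or it does not.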
Egocentric demonstrations across hospitality, logistics, manufacturing, and custom SOP environments. Multi-camera, multi-annotation, production-validated pipelines.
English, Chinese, Indonesian, Vietnamese, Malay, and more. K-12 curricula, SFT corpora, and TTS voice libraries, each built to local education and linguistic standards.
Wearable multi-camera capture (head + dual wrist) with optional Manus haptic gloves. Real human demonstrations in real environments.
Multi-modality robotic data: handheld UMI grippers, VR/exoskeleton teleoperation, multi-view RGB-D, force/tactile sensing.
Complete observe-think-act traces in real Docker-isolated terminals. Ideal for SFT of coding and agentic models.
Every model has unique data needs. We co-design collection SOPs, annotation schemas, and QA rubrics with your team, then execute at scale in real environments. From pilot batch to production volume, we iterate until the data moves your metrics.
Define tasks, schemas, and quality rubrics with your ML team
Small-scale production for validation and iteration
Refine SOP based on your model's feedback signals
Full deployment with automated QA and continuous delivery
Zarr, Parquet, Chat JSON, HuggingFace: your native format
New Oriental Bay is a full-stack AI training data company built to serve the next generation of intelligent systems, from large language models and agentic AI to embodied robotics and reinforcement learning environments. We believe the quality ceiling of any AI system is set by its training data, and we exist to raise that ceiling.
From our Hong Kong headquarters, we coordinate data operations across Mainland China and the US, combining deep local execution capability with global delivery standards. Our teams include domain experts in robotics, software engineering, medical science, finance, and education โ ensuring every dataset reflects real-world complexity, not synthetic shortcuts.
Whether you need egocentric human demonstrations captured on factory floors, agent trajectories recorded in live Docker terminals, RL environment interaction logs, or multi-language STEM problem sets verified by subject-matter experts, we design, collect, annotate, validate, and deliver. Customization is not an add-on. It's how we work.