Google Veo 3.1 API Implementation Guide for Video Generation with Native Audio

This article is a follow-up to AI Daily News

Base article: AI Daily News - October 16, 2025 (archived)

Goals

  • Connect to Veo 3.1 API via Vertex AI to generate 1080p videos with native audio
  • Implement style control using multiple reference images (Ingredients to Video)
  • Build cost optimization (Fast/Standard selection) and failure retry strategies

Architecture Overview

[Prompt + Reference Images]
    ↓
[Vertex AI Client] → [Veo 3.1 API]
    ↓
[Async Job Monitoring Loop]
    ↓
[1080p Video + Audio Track Retrieval]

Prerequisites

  • Google Cloud Project (Vertex AI API enabled)
  • Python 3.9 or higher
  • google-cloud-aiplatform SDK
  • Budget planning (Fast: $0.15/sec, Standard: $0.40/sec)

Implementation Steps

Step 1: Environment Setup

# Install required packages
# (quote the version specifier so the shell does not treat ">" as a redirect)
pip install "google-cloud-aiplatform>=1.65.0" pillow requests

# Authentication setup (using service account key)
export GOOGLE_APPLICATION_CREDENTIALS="path/to/service-account.json"

Step 2: Basic Video Generation Request

from google.cloud import aiplatform

# Project settings (replace with your own values)
project_id = "your-project-id"
location = "us-central1"

# Initialize the SDK
aiplatform.init(project=project_id, location=location)

# Create the prediction client against the regional API endpoint
client = aiplatform.gapic.PredictionServiceClient(
    client_options={"api_endpoint": f"{location}-aiplatform.googleapis.com"}
)

# Configure request
prompt = "Waves crashing on a beach at sunset, with seagull calls and ocean sounds"
request = {
    "instances": [{
        "prompt": prompt,
        "duration": 30,  # seconds (max 60)
        "resolution": "1080p",
        "audio": True,  # Enable native audio generation
        "quality": "fast"  # or "standard"
    }]
}

# Submit async job
# NOTE: the endpoint ID, the instance field names, and the "job_id" metadata key
# follow this guide's sketch; confirm them against the current Vertex AI Veo reference.
endpoint = f"projects/{project_id}/locations/{location}/endpoints/veo-3-1"
response = client.predict(endpoint=endpoint, instances=request["instances"])
job_id = response.metadata["job_id"]
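
Validating a request before submitting it avoids paying for a job that was bound to fail. The following is a minimal sketch, assuming the field names and the 60-second ceiling used in this step; `build_instance` is a hypothetical helper, not part of any Google SDK.

```python
from typing import Any, Dict

MAX_DURATION_S = 60  # upper bound quoted in this guide's request comment
VALID_QUALITIES = ("fast", "standard")


def build_instance(prompt: str, duration: int = 30, quality: str = "fast",
                   audio: bool = True, resolution: str = "1080p") -> Dict[str, Any]:
    """Return one request instance, raising ValueError on invalid input."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not 1 <= duration <= MAX_DURATION_S:
        raise ValueError(f"duration must be between 1 and {MAX_DURATION_S} seconds")
    if quality not in VALID_QUALITIES:
        raise ValueError(f"quality must be one of {VALID_QUALITIES}")
    return {
        "prompt": prompt.strip(),
        "duration": duration,
        "resolution": resolution,
        "audio": audio,
        "quality": quality,
    }


# Wrap the validated instance in the same envelope used in this step
request = {"instances": [build_instance("Waves crashing on a beach at sunset", 30, "fast")]}
```

Raising early keeps invalid durations and misspelled quality modes out of the quota budget.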

Step 3: Style Control with Multiple Images (Ingredients to Video)

import base64
from PIL import Image
from io import BytesIO

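def encode_image(image_path: str) -> str:
    """Re-encode an image as JPEG and return it as a Base64 string"""
    with Image.open(image_path) as img:
        buffer = BytesIO()
        # JPEG has no alpha channel, so convert RGBA/palette images to RGB first
        img.convert("RGB").save(buffer, format="JPEG")
        return base64.b64encode(buffer.getvalue()).decode("ascii")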

# Prepare reference images
reference_images = [
    {"image": encode_image("character.jpg"), "type": "character"},
    {"image": encode_image("style.jpg"), "type": "style"},
    {"image": encode_image("object.jpg"), "type": "object"}
]

# Ingredients to Video request
advanced_request = {
    "instances": [{
        "prompt": "Protagonist running through city streets at night, with footsteps and ambient city noise",
        "duration": 45,
        "resolution": "1080p",
        "audio": True,
        "ingredients": reference_images,  # Unify style with multiple images
        "quality": "standard"  # High quality mode
    }]
}
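
Reference images above 5MB are rejected (see REFERENCE_IMAGE_TOO_LARGE under Failure Patterns), so it helps to check each image before encoding it. The sketch below uses only the standard library and computes the target dimensions for a downscale that fits the recommended 1920x1080. `fit_within` and `needs_resize` are hypothetical helper names, and reading "5MB" as 5 * 1024 * 1024 bytes is an assumption.

```python
from typing import Tuple

MAX_IMAGE_BYTES = 5 * 1024 * 1024  # assumption: "5MB" read as 5 MiB
MAX_WIDTH, MAX_HEIGHT = 1920, 1080  # recommended size from this guide


def fit_within(width: int, height: int,
               max_width: int = MAX_WIDTH, max_height: int = MAX_HEIGHT) -> Tuple[int, int]:
    """Scale (width, height) down to fit the bounds, keeping aspect ratio.

    Images that already fit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def needs_resize(file_size_bytes: int, width: int, height: int) -> bool:
    """True if the image exceeds the byte limit or the recommended dimensions."""
    return (file_size_bytes > MAX_IMAGE_BYTES
            or width > MAX_WIDTH
            or height > MAX_HEIGHT)
```

The resulting tuple can be passed to Pillow's `Image.resize` before calling `encode_image`.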

Step 4: Job Monitoring and Download

import time
import requests

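def monitor_job(job_id: str, timeout: int = 600, poll_interval: int = 10) -> dict:
    """Poll the job until it finishes or `timeout` seconds elapse (default 10 minutes)"""
    deadline = time.monotonic() + timeout  # monotonic clock is immune to system clock changes

    while time.monotonic() < deadline:
        # NOTE: get_job and the output field names follow this guide's sketch;
        # confirm them against the current Vertex AI reference.
        status = client.get_job(name=job_id)

        if status.state == "SUCCEEDED":
            return {
                "video_url": status.output["video_uri"],
                "audio_url": status.output["audio_uri"],
                "duration": status.output["actual_duration"]
            }
        elif status.state == "FAILED":
            raise RuntimeError(f"Job failed: {status.error}")

        time.sleep(poll_interval)  # Poll every 10 seconds by default

    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")

# Download (stream to disk so the whole video is not held in memory)
result = monitor_job(job_id)
with requests.get(result["video_url"], stream=True, timeout=60) as resp:
    resp.raise_for_status()  # Fail loudly instead of saving an error page as output.mp4
    with open("output.mp4", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)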
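
The polling loop in this step can be exercised without calling the API if the status lookup, the clock, and the sleep function are passed in as parameters. This is a sketch under that assumption: `wait_for_job` is a hypothetical name, and statuses are modelled as plain dictionaries with a `state` key rather than any real SDK type.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(fetch_status: Callable[[], Dict[str, Any]],
                 timeout_s: float = 600.0,
                 poll_interval_s: float = 10.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> Dict[str, Any]:
    """Call fetch_status() until it reports SUCCEEDED or FAILED, or time runs out."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        status = fetch_status()
        state = status.get("state")
        if state == "SUCCEEDED":
            return status
        if state == "FAILED":
            raise RuntimeError(f"Job failed: {status.get('error')}")
        sleep(poll_interval_s)
    raise TimeoutError(f"Job did not finish within {timeout_s}s")


if __name__ == "__main__":
    # Simulated job: two RUNNING polls, then success (no real API call, no real sleep)
    fake = iter([{"state": "RUNNING"}, {"state": "RUNNING"},
                 {"state": "SUCCEEDED", "video_uri": "gs://example-bucket/output.mp4"}])
    print(wait_for_job(lambda: next(fake), sleep=lambda _: None)["video_uri"])
```

In production, `fetch_status` would wrap whatever status call the API actually exposes, while tests can feed it a scripted sequence as shown.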

Cost Optimization Comparison

Mode      Price/sec  30s Video Cost  Generation Time  Recommended Use
Fast      $0.15      $4.50           2-4 min          Prototypes, bulk generation
Standard  $0.40      $12.00          5-10 min         Final deliverables, high quality

Actual measurements (October 2025 performance):

  • Fast: Average 3m 12s (30s video)
  • Standard: Average 7m 48s (30s video)
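
The per-second prices above turn into a budget estimate with a single multiplication. The sketch below reproduces the 30s figures from the comparison; `estimate_cost` is a hypothetical helper, and `Decimal` is used so that currency values do not pick up floating-point rounding noise.

```python
from decimal import Decimal

# Per-second prices quoted in this guide (USD)
PRICE_PER_SECOND = {"fast": Decimal("0.15"), "standard": Decimal("0.40")}


def estimate_cost(duration_s: int, quality: str, clips: int = 1) -> Decimal:
    """Estimated cost in USD for `clips` videos of `duration_s` seconds each."""
    if quality not in PRICE_PER_SECOND:
        raise ValueError(f"unknown quality mode: {quality!r}")
    if duration_s < 0 or clips < 0:
        raise ValueError("duration_s and clips must be non-negative")
    return PRICE_PER_SECOND[quality] * duration_s * clips


if __name__ == "__main__":
    for mode in ("fast", "standard"):
        print(f"{mode}: ${estimate_cost(30, mode)} per 30s clip")
```

Prototyping 20 clips of 30 seconds in Fast mode and rendering only the final one in Standard would come to `estimate_cost(30, "fast", clips=20) + estimate_cost(30, "standard")`.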

Failure Patterns and Workarounds

Symptom                    Cause                           Workaround
QUOTA_EXCEEDED             10 requests/minute limit        Implement exponential backoff retry
INVALID_AUDIO_PROMPT       Ambiguous audio instructions    Specify concrete sound sources ("ocean waves", "footsteps")
REFERENCE_IMAGE_TOO_LARGE  Reference image > 5MB           Pre-resize (recommended 1920x1080)
JOB_TIMEOUT                60s video + Standard = 15+ min  Generate shorter segments in Fast mode, then merge them with Extend
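
The JOB_TIMEOUT workaround relies on cutting a long video into shorter segments before generation. A minimal sketch of that planning step follows; `plan_segments` is a hypothetical helper and the 30-second default is only an assumption. It decides segment lengths and does not call the Extend feature itself.

```python
from typing import List


def plan_segments(total_s: int, segment_s: int = 30) -> List[int]:
    """Split total_s seconds into segments of at most segment_s seconds each."""
    if total_s <= 0 or segment_s <= 0:
        raise ValueError("total_s and segment_s must be positive")
    full, remainder = divmod(total_s, segment_s)
    segments = [segment_s] * full
    if remainder:
        segments.append(remainder)  # final, shorter segment
    return segments


if __name__ == "__main__":
    print(plan_segments(75))
```

Each planned segment can then be submitted as its own Fast-mode job, which keeps every job well under the polling timeout.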

Retry Implementation Example

import time
from typing import Optional

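def create_video_with_retry(request: dict, max_retries: int = 3) -> Optional[str]:
    """Submit the request, retrying quota errors with exponential backoff.

    Returns the job ID, or None if every attempt hit the quota limit.
    """
    for attempt in range(max_retries):
        try:
            response = client.predict(endpoint=endpoint, instances=request["instances"])
            return response.metadata["job_id"]
        except Exception as e:
            # Only quota errors are retried; anything else propagates immediately
            if "QUOTA_EXCEEDED" not in str(e):
                raise
            if attempt == max_retries - 1:
                break  # No point sleeping after the final attempt
            wait_time = 2 ** attempt * 10  # 10s, 20s, 40s, ...
            print(f"Quota exceeded, retrying in {wait_time}s...")
            time.sleep(wait_time)
    return None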
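
Backoff reacts after the quota has already been hit. A client-side throttle can keep submissions under the 10 requests/minute limit in the first place. The following is a sketch of a sliding-window limiter, assuming the limit is counted over a rolling 60 seconds; `RateLimiter` is a hypothetical class, and the clock and sleep functions are injectable so it can be tested without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class RateLimiter:
    """Allow at most max_calls within any rolling window of period_s seconds."""

    def __init__(self, max_calls: int = 10, period_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._max_calls = max_calls
        self._period_s = period_s
        self._clock = clock
        self._sleep = sleep
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def acquire(self) -> float:
        """Block until a call is allowed; return the total seconds waited."""
        waited = 0.0
        while True:
            now = self._clock()
            # Drop timestamps that have left the rolling window
            while self._calls and self._calls[0] <= now - self._period_s:
                self._calls.popleft()
            if len(self._calls) < self._max_calls:
                self._calls.append(now)
                return waited
            # Sleep until the oldest call in the window expires
            wait = self._calls[0] + self._period_s - now
            self._sleep(wait)
            waited += wait
```

Calling `limiter.acquire()` immediately before each submission spaces out bursts, and the exponential backoff remains as a fallback for quota errors that still occur.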

Automation & Extension Ideas

  1. Batch Generation Pipeline: Process prompt CSV files via GitHub Actions
  2. Quality Validation Hook: Auto-check audio levels and video quality post-generation
  3. Multi-language Audio Support: Translate prompts + localize audio instructions
  4. Extend Feature Utilization: Generate in 30s chunks, seamlessly merge for 2+ min videos
  5. Cost Monitoring Dashboard: Daily API usage aggregation in BigQuery
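
As a starting point for the batch pipeline idea, the sketch below turns a prompt CSV into request payloads. The column names (`prompt`, `duration`, `quality`) and the defaults are assumptions for illustration, and `load_requests` is a hypothetical helper; the payload fields mirror the requests used earlier in this guide.

```python
import csv
import io
from typing import Any, Dict, List


def load_requests(csv_text: str) -> List[Dict[str, Any]]:
    """Parse CSV text with prompt,duration,quality columns into request payloads."""
    requests_out: List[Dict[str, Any]] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        prompt = (row.get("prompt") or "").strip()
        if not prompt:
            continue  # skip rows without a prompt
        requests_out.append({
            "instances": [{
                "prompt": prompt,
                "duration": int(row.get("duration") or 30),  # default: 30 seconds
                "resolution": "1080p",
                "audio": True,
                "quality": (row.get("quality") or "fast").strip().lower(),
            }]
        })
    return requests_out


if __name__ == "__main__":
    sample = "\n".join([
        "prompt,duration,quality",
        '"Waves crashing on a beach at sunset, with seagull calls",30,fast',
        "Protagonist running through city streets at night,45,standard",
    ])
    for req in load_requests(sample):
        print(req["instances"][0]["prompt"])
```

Prompts that contain commas must be quoted in the CSV, as in the first sample row.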

Next Steps