Google Veo 3.1 API Implementation Guide for Video Generation with Native Audio¶
This article is a follow-up to AI Daily News
Base article: AI Daily News - October 16, 2025 (archived)
Goals¶
- Connect to Veo 3.1 API via Vertex AI to generate 1080p videos with native audio
- Implement style control using multiple reference images (Ingredients to Video)
- Build cost optimization (Fast/Standard selection) and failure retry strategies
Architecture Overview¶
[Prompt + Reference Images]
↓
[Vertex AI Client] → [Veo 3.1 API]
↓
[Async Job Monitoring Loop]
↓
[1080p Video + Audio Track Retrieval]
Prerequisites¶
- Google Cloud Project (Vertex AI API enabled)
- Python 3.9 or higher
- google-cloud-aiplatform SDK
- Budget planning (Fast: $0.15/sec, Standard: $0.40/sec)
Implementation Steps¶
Step 1: Environment Setup¶
# Install required packages
# (quote the version specifier so the shell does not treat ">" as a redirect)
pip install "google-cloud-aiplatform>=1.65.0" pillow requests
# Authentication setup (using service account key)
export GOOGLE_APPLICATION_CREDENTIALS="path/to/service-account.json"
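Before making the first API call, it can help to confirm that the credentials variable set above actually points at a readable key file. The sketch below uses only the standard library; the helper name `check_credentials` is illustrative and not part of any Google SDK.

```python
import json
import os
from pathlib import Path


def check_credentials(env=None) -> Path:
    """Return the service account key path, or raise if it is missing or malformed.

    Hypothetical helper: `env` defaults to os.environ and can be replaced in tests.
    """
    env = os.environ if env is None else env
    raw = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not raw:
        raise EnvironmentError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Key file not found: {path}")
    # A service account key is a JSON document whose "type" is "service_account".
    with path.open(encoding="utf-8") as fh:
        key = json.load(fh)
    if key.get("type") != "service_account":
        raise ValueError(f"{path} does not look like a service account key")
    return path
```

Calling `check_credentials()` at the top of a script turns a confusing authentication failure later on into an immediate, readable error.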
Step 2: Basic Video Generation Request¶
from google.cloud import aiplatform

# Initialize project
project_id = "your-project-id"
location = "us-central1"
aiplatform.init(project=project_id, location=location)

# Create the prediction client against the regional API endpoint
client = aiplatform.gapic.PredictionServiceClient(
    client_options={"api_endpoint": f"{location}-aiplatform.googleapis.com"}
)

# Configure request
prompt = "Waves crashing on a beach at sunset, with seagull calls and ocean sounds"
request = {
    "instances": [{
        "prompt": prompt,
        "duration": 30,        # seconds (max 60)
        "resolution": "1080p",
        "audio": True,         # Enable native audio generation
        "quality": "fast"      # or "standard"
    }]
}

# Submit async job
endpoint = f"projects/{project_id}/locations/{location}/endpoints/veo-3-1"
response = client.predict(endpoint=endpoint, instances=request["instances"])
job_id = response.metadata["job_id"]
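Invalid parameter combinations otherwise surface only after a network round trip, so it may be worth validating the instance locally first. The sketch below builds the same instance dictionary as the request above and checks it against the limits this guide states (60-second maximum, `fast` or `standard` quality). `build_instance` is a hypothetical helper, and the field names simply mirror the request shown in this step.

```python
MAX_DURATION_SECONDS = 60                 # limit stated in this guide
ALLOWED_QUALITIES = {"fast", "standard"}  # modes described in this guide


def build_instance(prompt: str, duration: int, quality: str = "fast",
                   resolution: str = "1080p", audio: bool = True) -> dict:
    """Return one request instance, raising ValueError on out-of-range input."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not isinstance(duration, int) or not 1 <= duration <= MAX_DURATION_SECONDS:
        raise ValueError(f"duration must be an integer from 1 to {MAX_DURATION_SECONDS}")
    if quality not in ALLOWED_QUALITIES:
        raise ValueError(f"quality must be one of {sorted(ALLOWED_QUALITIES)}")
    return {
        "prompt": prompt,
        "duration": duration,
        "resolution": resolution,
        "audio": audio,
        "quality": quality,
    }
```

The returned dictionary can be wrapped as `{"instances": [build_instance(...)]}` to match the request shape used in this step.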
Step 3: Style Control with Multiple Images (Ingredients to Video)¶
import base64
from io import BytesIO

from PIL import Image


def encode_image(image_path: str) -> str:
    """Encode an image file as a Base64 JPEG string."""
    with Image.open(image_path) as img:
        buffer = BytesIO()
        # JPEG has no alpha channel, so convert RGBA or palette images first
        img.convert("RGB").save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode()


# Prepare reference images
reference_images = [
    {"image": encode_image("character.jpg"), "type": "character"},
    {"image": encode_image("style.jpg"), "type": "style"},
    {"image": encode_image("object.jpg"), "type": "object"}
]

# Ingredients to Video request
advanced_request = {
    "instances": [{
        "prompt": "Protagonist running through city streets at night, with footsteps and ambient city noise",
        "duration": 45,
        "resolution": "1080p",
        "audio": True,
        "ingredients": reference_images,  # Unify style with multiple images
        "quality": "standard"             # High quality mode
    }]
}
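The failure table later in this guide recommends pre-resizing reference images to 1920x1080. A minimal sketch of the size calculation follows; it preserves the aspect ratio and never upscales. `fit_within` is an illustrative helper, and its result could be passed to Pillow's `Image.resize` before calling `encode_image`.

```python
def fit_within(width: int, height: int,
               max_width: int = 1920, max_height: int = 1080) -> tuple:
    """Return (width, height) scaled down to fit the bounding box.

    Images already inside the box are returned unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Use the tighter of the two constraints; cap at 1.0 so small images are not enlarged.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(3840, 2160))  # (1920, 1080)
print(fit_within(4000, 3000))  # (1440, 1080)
print(fit_within(800, 600))    # (800, 600)
```

Resizing reduces pixel count but does not guarantee a file under 5 MB, so checking the encoded size afterwards is still worthwhile.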
Step 4: Job Monitoring and Download¶
import time

import requests


def monitor_job(job_id: str, timeout: int = 600) -> dict:
    """Monitor job until completion (max 10 minutes)"""
    start_time = time.time()
    while time.time() - start_time < timeout:
        status = client.get_job(name=job_id)
        if status.state == "SUCCEEDED":
            return {
                "video_url": status.output["video_uri"],
                "audio_url": status.output["audio_uri"],
                "duration": status.output["actual_duration"]
            }
        elif status.state == "FAILED":
            raise RuntimeError(f"Job failed: {status.error}")
        time.sleep(10)  # Poll every 10 seconds
    raise TimeoutError("Job timeout exceeded")


# Download
result = monitor_job(job_id)
download = requests.get(result["video_url"], timeout=60)
download.raise_for_status()  # Fail loudly instead of saving an error page as output.mp4
with open("output.mp4", "wb") as f:
    f.write(download.content)
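The polling loop above is tied to one client call, which makes it hard to test without a live job. One possible refactoring, sketched here under that assumption, separates the loop from the status lookup: `fetch_state` is any zero-argument callable you supply, and the `SUCCEEDED` and `FAILED` strings mirror those used in `monitor_job`. The helper name `poll_until_done` is hypothetical.

```python
import time
from typing import Callable


def poll_until_done(fetch_state: Callable[[], str],
                    timeout: float = 600.0,
                    interval: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> str:
    """Call fetch_state() every `interval` seconds until a terminal state.

    Returns "SUCCEEDED" or "FAILED"; raises TimeoutError if the next poll
    would fall after the deadline.
    """
    deadline = clock() + timeout
    while True:
        state = fetch_state()
        if state in ("SUCCEEDED", "FAILED"):
            return state
        if clock() + interval > deadline:
            raise TimeoutError(f"No terminal state within {timeout} seconds")
        sleep(interval)
```

Using a monotonic clock avoids surprises when the system time is adjusted mid-run, and the injectable `sleep` and `clock` let the loop be exercised in tests without real waiting.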
Cost Optimization Comparison¶
| Mode | Price/sec | 30s Video Cost | Generation Time | Recommended Use |
|---|---|---|---|---|
| Fast | $0.15 | $4.50 | 2-4 min | Prototypes, bulk generation |
| Standard | $0.40 | $12.00 | 5-10 min | Final deliverables, high quality |
Actual measurements (October 2025 performance):
- Fast: Average 3m 12s (30s video)
- Standard: Average 7m 48s (30s video)
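The cost column is simply price per second multiplied by clip length. A small sketch that reproduces the table figures and extends them to batches is shown below; the prices are the ones quoted in this guide, `estimate_cost` is an illustrative helper, and `Decimal` is used to avoid floating-point rounding in currency values.

```python
from decimal import Decimal

# Per-second prices as quoted in the comparison table above.
PRICE_PER_SECOND = {"fast": Decimal("0.15"), "standard": Decimal("0.40")}


def estimate_cost(quality: str, seconds: int, clips: int = 1) -> Decimal:
    """Return the estimated cost in USD for `clips` videos of `seconds` each."""
    if quality not in PRICE_PER_SECOND:
        raise ValueError(f"unknown quality: {quality!r}")
    if seconds < 0 or clips < 0:
        raise ValueError("seconds and clips must be non-negative")
    return PRICE_PER_SECOND[quality] * seconds * clips


print(estimate_cost("fast", 30))            # 4.50
print(estimate_cost("standard", 30))        # 12.00
print(estimate_cost("fast", 30, clips=20))  # 90.00
```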
Failure Patterns and Workarounds¶
| Symptom | Cause | Workaround |
|---|---|---|
| QUOTA_EXCEEDED | 10 requests/minute limit | Implement exponential backoff retry |
| INVALID_AUDIO_PROMPT | Ambiguous audio instructions | Specify concrete sound sources ("ocean waves", "footsteps") |
| REFERENCE_IMAGE_TOO_LARGE | Reference image > 5MB | Pre-resize (recommended 1920x1080) |
| JOB_TIMEOUT | 60s video in Standard mode takes 15+ min | Generate shorter segments in Fast mode, then join them with Extend |
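Retrying after a quota error is reactive; a client-side throttle can keep a batch job under the 10 requests/minute limit in the first place. The sketch below is one way to do that with a sliding window. The class name `SlidingWindowLimiter` is hypothetical, and the default limit is taken from the table above rather than from any SDK setting.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window."""

    def __init__(self, max_calls: int = 10, period: float = 60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def acquire(self) -> None:
        """Block until a call is allowed, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._calls and now - self._calls[0] >= self._period:
                self._calls.popleft()
            if len(self._calls) < self._max_calls:
                self._calls.append(now)
                return
            # Wait until the oldest recorded call expires.
            self._sleep(self._period - (now - self._calls[0]))
```

Calling `limiter.acquire()` immediately before each submission keeps the request rate bounded; the backoff retry remains useful as a fallback when other clients share the same quota.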
Retry Implementation Example¶
import time
from typing import Optional


def create_video_with_retry(request: dict, max_retries: int = 3) -> Optional[str]:
    """Retry with exponential backoff"""
    for attempt in range(max_retries):
        try:
            response = client.predict(endpoint=endpoint, instances=request["instances"])
            return response.metadata["job_id"]
        except Exception as e:
            if "QUOTA_EXCEEDED" in str(e):
                wait_time = 2 ** attempt * 10  # 10s, 20s, 40s
                print(f"Quota exceeded, retrying in {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
    return None
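When several workers hit the quota at the same moment, fixed delays can make them retry in lockstep. A common variation adds random jitter to each delay. The sketch below is generic rather than tied to the client above: `submit` is any zero-argument callable you supply, `retry_with_jitter` is a hypothetical name, and the `QUOTA_EXCEEDED` string match mirrors the example above. Unlike that example, it re-raises the last error instead of returning `None`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_jitter(submit: Callable[[], T],
                      max_retries: int = 3,
                      base_delay: float = 10.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Call submit(), retrying quota errors with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return submit()
        except Exception as exc:
            is_last_attempt = attempt == max_retries - 1
            if "QUOTA_EXCEEDED" not in str(exc) or is_last_attempt:
                raise
            # Full jitter: wait a random time in [0, base_delay * 2**attempt).
            sleep(rng() * base_delay * 2 ** attempt)
    raise ValueError("max_retries must be at least 1")
```

A call such as `retry_with_jitter(lambda: client.predict(endpoint=endpoint, instances=request["instances"]))` would wrap the submission from Step 2.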
Automation & Extension Ideas¶
- Batch Generation Pipeline: Process prompt CSV files via GitHub Actions
- Quality Validation Hook: Auto-check audio levels and video quality post-generation
- Multi-language Audio Support: Translate prompts + localize audio instructions
- Extend Feature Utilization: Generate in 30s chunks, seamlessly merge for 2+ min videos
- Cost Monitoring Dashboard: Daily API usage aggregation in BigQuery
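The Extend idea in the list above starts with splitting a target length into fixed-size segments. A minimal sketch of that planning step follows. `plan_chunks` is a hypothetical helper that only computes durations; how segments are actually extended and joined depends on the Extend feature and is not shown here.

```python
def plan_chunks(total_seconds: int, chunk_seconds: int = 30) -> list:
    """Split a target duration into chunk lengths, with a shorter final chunk if needed."""
    if total_seconds <= 0 or chunk_seconds <= 0:
        raise ValueError("durations must be positive")
    full_chunks, remainder = divmod(total_seconds, chunk_seconds)
    return [chunk_seconds] * full_chunks + ([remainder] if remainder else [])


print(plan_chunks(120))  # [30, 30, 30, 30]
print(plan_chunks(130))  # [30, 30, 30, 30, 10]
```

Each planned duration can then be used as the `duration` of one generation request, which also makes per-segment cost easy to total.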
Next Steps¶
- GitHub Actions Automation Implementation Guide - Batch processing automation
- Video Quality Validation Workflow - Automated checks with FFmpeg
- Gemini API Integration Pattern - Seamless connection from text to video generation