TimShih

Generative AI

Pose Tracking

VisionControl.AI

"Solving the unpredictability of Generative AI in commercial photography."

This project introduces a pose-guided AI pipeline designed to reduce the high cost of commercial photography for small businesses. By integrating real-time pose tracking with Stable Diffusion, I replaced the 'black box' nature of prompt-only AI generation with a controlled, professional-grade production workflow.

Year

Spring 2024

Role

AI Product Designer

Duration

14 Weeks

Problem Framing & Technical Pivot

During my internship, I observed that the financial and time costs of traditional photoshoots were prohibitive for small businesses. My role involved researching AI-driven solutions to streamline this process. Initially, I used LLMs to craft prompts for Text-to-Image generation, but found the results too inconsistent for professional use.

Financial Barriers

Small brands lack the budget and time for professional models, studios, and lengthy post-production.

Unpredictable Output

Text-to-image is a 'blind box': AI can generate images, but it lacks the precision to reproduce specific poses.

Workflow Gaps

The missing link: traditional methods are too slow, while basic AI is too random for commercial use.

To achieve higher precision, I pivoted to ControlNet, which conditions Stable Diffusion on reference images such as pose skeletons. These references provide structural guidance so that the AI-generated poses closely match our requirements.

Opportunity

How might we create brand imagery by giving small businesses precise, intuitive control over AI-generated poses?

Direction

To address this, I developed a hardware-software integrated pipeline using real-time cameras and API-driven automation.

Process

ML5.js

I came across ML5.js and its ability to track body poses through a webcam, visualized as red skeletal lines. I then began integrating the ML5.js code into a JavaScript project to make the interactive elements work.

JavaScript

Because ML5.js tracks motion in real time, I needed a way to freeze a single frame. I solved this by creating a trigger: when the button is pressed, a JavaScript function captures the current frame, keeps only the red skeletal lines, and saves the result as a PNG file.
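One way the "keep only the red lines" step could work is a color mask over the canvas pixels captured at the moment of the button press. This sketch assumes the skeleton is drawn in near-pure red over the webcam image and that the pixels come from the canvas `getImageData()` call; the thresholds and both function names are hypothetical, not taken from the project.

```javascript
// Minimal sketch: keep only "red" skeleton pixels from an RGBA frame buffer.
// ASSUMPTION: the skeleton is drawn in near-pure red over the webcam image,
// and `rgba` is the Uint8ClampedArray from CanvasRenderingContext2D.getImageData().
// The threshold defaults are illustrative, not tuned values.
function isolateRedLines(rgba, minRed = 200, maxOther = 80) {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    const r = rgba[i], g = rgba[i + 1], b = rgba[i + 2];
    const isRed = r >= minRed && g <= maxOther && b <= maxOther;
    // Red skeleton pixels are copied through; everything else becomes black.
    out[i] = isRed ? r : 0;
    out[i + 1] = isRed ? g : 0;
    out[i + 2] = isRed ? b : 0;
    out[i + 3] = 255; // fully opaque, so the PNG has a solid black background
  }
  return out;
}

// Browser-only helper (hypothetical name): write the mask back to a canvas
// and download it as a PNG. It is only defined here, never called outside a browser.
function downloadMaskAsPng(canvas, mask, filename = "pose.png") {
  const ctx = canvas.getContext("2d");
  ctx.putImageData(new ImageData(mask, canvas.width, canvas.height), 0, 0);
  const link = document.createElement("a");
  link.download = filename;
  link.href = canvas.toDataURL("image/png");
  link.click();
}
```

In a button's click handler, reading `ctx.getImageData(0, 0, canvas.width, canvas.height).data` takes a snapshot of that single moment, which is what freezes the frame; passing it through `isolateRedLines` and then `downloadMaskAsPng` yields a skeleton-only PNG. Drawing the skeleton directly onto a separate blank canvas would avoid color thresholding altogether and is an equally reasonable design.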

Stable Diffusion

After obtaining the skeletal maps, I used ControlNet within Stable Diffusion to generate images based on those poses. The technical setup involved a significant learning curve, as I had no prior experience with server setup or API integration. I sought guidance from my instructor, which allowed me to work through these technical hurdles.
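To make the API step concrete, here is a minimal sketch assuming the AUTOMATIC1111 Stable Diffusion WebUI (launched with its `--api` flag) and the sd-webui-controlnet extension. The project's actual server, model, and settings are not stated, so the model name, parameter values, and both helper names below are assumptions. The ControlNet unit fields in particular have changed between extension releases, so check the server's own `/docs` page before relying on them.

```javascript
// Minimal sketch: send a skeleton PNG to Stable Diffusion with ControlNet.
// ASSUMPTIONS: AUTOMATIC1111 WebUI started with --api plus the sd-webui-controlnet
// extension. The model name is a placeholder for whichever OpenPose ControlNet
// model is installed, and recent extension releases call the unit's image field
// `image` (older releases documented it as `input_image`).
function buildTxt2ImgPayload(prompt, poseBase64, options = {}) {
  return {
    prompt,
    negative_prompt: options.negativePrompt ?? "",
    steps: options.steps ?? 25,
    width: options.width ?? 512,
    height: options.height ?? 768,
    cfg_scale: options.cfgScale ?? 7,
    alwayson_scripts: {
      controlnet: {
        args: [
          {
            image: poseBase64,
            // "none" skips preprocessing: the PNG is already a skeleton map.
            module: "none",
            model: options.controlnetModel ?? "control_v11p_sd15_openpose", // placeholder
            weight: options.controlWeight ?? 1.0,
          },
        ],
      },
    },
  };
}

// Posts the payload and returns the generated images as base64 strings.
// `baseUrl` is hypothetical, e.g. "http://127.0.0.1:7860" for a local WebUI.
async function generateFromPose(baseUrl, payload) {
  const res = await fetch(`${baseUrl}/sdapi/v1/txt2img`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`txt2img failed: HTTP ${res.status}`);
  const data = await res.json();
  return data.images; // array of base64-encoded images
}
```

Separating payload construction from the network call keeps the request easy to inspect and test without a running server. The skeleton PNG saved in the previous step would be base64-encoded (without the `data:image/png;base64,` prefix) and passed as `poseBase64`.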

Afterthoughts

Over these 14 weeks, I delivered a proof of concept that validates my vision. However, there is still significant room for evolution. My future roadmap includes integrating a wider range of models and LoRAs, streamlining the user experience (UX) to make it more intuitive, and incorporating LLMs to help users articulate and refine their character prompts more effectively.

This course has been incredibly inspiring. While being replaced by AI is a common concern today, I believe it is actually a call to action. By identifying and reimagining how specific fields operate through AI-driven systems, we don't just replace old methods; we pave the way for the next wave of innovation.