“Data is the new oil,” claimed Clive Humby, the British data science entrepreneur, nearly twenty years ago.

And if you’re running an AI company, you know it is even more critical today. Data has always been a strategic resource for driving your company forward and making good decisions.

But today, data is at the core of your product and your business. No data = no business in the AI world.

But who cares when you have powerful pre-trained LLMs working for you under the hood?

The problem lies in one word: specialisation.

When everyone has access to the same models, with the same training data, and is working on the same problem as your company, where is your edge? You don’t have one.

In the end, you have to get or build a dataset to fine-tune your models, and that can be complex, slow, or expensive. But it doesn’t have to be.

Within one of our startups, Jack Burton, we had a few constraints coming from our design partners and from our vision:

  • We worked on voice transcripts of customers chatting on the phone, but most models have been trained on formal written text from the Internet
  • A large portion of those customers experience speech disfluencies, which makes it harder to extract information from the conversation
  • Given the sensitive nature of the data, we couldn’t use it to fine-tune our models
  • We move fast, and we didn’t want to spend too much time or money building the dataset

So we generated a huge dataset using the best resources we had: the team and the tools we already use. And the total cost? £0.

Generating data with Zoom and Screen Recorder

We didn’t organise fake calls or hire actors; we simply recorded our daily standup meetings on Zoom. These were natural conversations between teammates, not scripted dialogues.

That gave us real, everyday speech: people interrupting each other, hesitating, switching topics—exactly the kind of messy input our AI would need to handle in production.

Since our team is distributed across several countries, we also captured a rich variety of English accents and speaking styles. That diversity turned out to be a hidden advantage, especially for training the model to handle disfluency and non-standard phrasing.

In just a few days, we had hours of realistic, relevant, and zero-cost training data.

Of course, you can do this with whatever software you already use internally; Zoom simply makes it easier because the recording feature is built in. And you don’t have to limit yourself to standup meetings, as long as you seek permission to record!

Converting with FFmpeg

Once we had all our Zoom recordings, the next step was making them usable. These were full video files: heavy, in inconsistent formats, and not directly usable by transcription tools.

So we used FFmpeg, the Swiss army knife of media processing, to batch-convert all our recordings into clean, lightweight audio files.

We stripped out the video, normalised the audio levels, and exported everything as .wav files. No frills, no UI, just a few command-line scripts, and everything was ready to go.
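As a rough sketch of what those scripts looked like in spirit, here is a small Python wrapper around FFmpeg. The folder names and the 16 kHz mono settings are our illustrative choices, not requirements:

```python
# batch_convert.py - strip video, normalise loudness, export 16 kHz mono WAVs.
# Sketch only: folder names and audio settings are illustrative.
import subprocess
from pathlib import Path

SRC = Path("zoom_recordings")   # raw Zoom .mp4 exports (illustrative folder name)
DST = Path("audio")             # where the .wav files will go
DST.mkdir(exist_ok=True)

for video in sorted(SRC.glob("*.mp4")):
    wav = DST / (video.stem + ".wav")
    subprocess.run(
        [
            "ffmpeg",
            "-y",               # overwrite any existing output
            "-i", str(video),   # input recording
            "-vn",              # strip the video stream
            "-af", "loudnorm",  # normalise audio levels (EBU R128 loudness)
            "-ac", "1",         # downmix to mono
            "-ar", "16000",     # 16 kHz sample rate, plenty for speech models
            str(wav),
        ],
        check=True,
    )
    print(f"converted {video.name} -> {wav.name}")
```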

Transforming data with Deepgram

Next, we used Deepgram to transcribe the audio.

It gave us detailed, timestamped transcripts, including hesitations, filler words, false starts, etc. Most transcription tools try to clean this up. We didn’t want that.

We used Deepgram because it is fast, accurate, and cheap for the usage we had.

And we used their playground, which made it really easy to get the data out without writing any code.

We needed the mess, because that’s what real speech looks like, and it’s what our AI would hear when parsing live customer calls.

Deepgram helped us turn useless data into useful data by keeping the raw, unpolished quality of natural conversations. It didn’t miss a word, even when you couldn’t hear it properly. Impressive.
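We stuck to the playground for this project, but the same step can be scripted against Deepgram’s REST API. Here is a sketch under stated assumptions: the query parameters (punctuate, filler_words, utterances) and the exact response shape should be checked against Deepgram’s current documentation, and DEEPGRAM_API_KEY is your own key:

```python
# transcribe.py - send each WAV to Deepgram and keep the raw, messy transcript.
# Sketch only: query parameters are assumptions to verify against Deepgram's docs;
# we actually used their playground for this step.
import json
import os
from pathlib import Path

import requests

API_KEY = os.environ["DEEPGRAM_API_KEY"]   # your own Deepgram key
URL = "https://api.deepgram.com/v1/listen"
PARAMS = {
    "punctuate": "true",
    "filler_words": "true",   # keep the "um"s and "uh"s instead of cleaning them up
    "utterances": "true",     # timestamped utterance segments
}

for wav in sorted(Path("audio").glob("*.wav")):
    with wav.open("rb") as audio:
        response = requests.post(
            URL,
            params=PARAMS,
            headers={"Authorization": f"Token {API_KEY}", "Content-Type": "audio/wav"},
            data=audio,
        )
    response.raise_for_status()
    out = wav.with_suffix(".json")
    out.write_text(json.dumps(response.json(), indent=2))
    print(f"transcribed {wav.name} -> {out.name}")
```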

Labelling with Label Studio

Once we had transcripts, we turned to Label Studio to annotate them. Our goal wasn’t to label traditional entities like names or dates—it was to teach the model how humans really speak. 

So we focused on marking fillers, disfluencies, and non-linear sentence structures.

Label Studio is easy to use, open-source, and you can self-host it within minutes.
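To give a flavour of the handoff between the two tools, here is a sketch of packaging the Deepgram output into Label Studio’s JSON task format. The path into the Deepgram response is an assumption to verify against your own files; the {"data": {...}} shape is Label Studio’s standard import format:

```python
# make_tasks.py - package transcripts as Label Studio import tasks.
import json
from pathlib import Path

tasks = []
for transcript_file in sorted(Path("audio").glob("*.json")):
    payload = json.loads(transcript_file.read_text())
    # Deepgram pre-recorded responses nest the text roughly like this;
    # check the exact path against your own output.
    text = payload["results"]["channels"][0]["alternatives"][0]["transcript"]
    tasks.append({"data": {"text": text, "source": transcript_file.stem}})

Path("label_studio_tasks.json").write_text(json.dumps(tasks, indent=2))
print(f"wrote {len(tasks)} tasks to label_studio_tasks.json")
```

Inside the project, the labelling config then just needs a set of span labels (for us, things like fillers and disfluencies) pointing at that text field.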

Because our internal team knows the product and the type of noise we’re trying to filter out, they were the perfect people to label this data. 

Instead of outsourcing to generic annotators, we kept the process in-house—faster, cheaper, and way more accurate.

Note: technically, using engineers for this can be more costly, but finding the right external people and training them would have taken longer.

Also, we used Label Studio's prediction feature to pre-label the data automatically, so our team only had to correct some labels. Each training iteration made the model more accurate, reducing the human effort each cycle.
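For reference, a pre-label rides along with its task as a "predictions" entry. This is a sketch with made-up text, offsets, and label names; the "from_name"/"to_name" values have to match whatever names your own labelling config uses:

```python
# preannotate.py - example of a Label Studio task carrying a pre-label.
# Sketch only: the text, offsets, and "Filler" label are made up for illustration.
import json

task = {
    "data": {"text": "so um we were thinking maybe we could"},
    "predictions": [
        {
            "model_version": "disfluency-v0",   # hypothetical model tag
            "result": [
                {
                    "from_name": "label",   # name of the <Labels> block in the config
                    "to_name": "text",      # name of the <Text> block in the config
                    "type": "labels",
                    "value": {"start": 3, "end": 5, "labels": ["Filler"]},  # the "um"
                }
            ],
        }
    ],
}

print(json.dumps([task], indent=2))   # importable as a JSON task list
```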

In the end, we had a labelled dataset perfectly tailored to fine-tuning our models, for a total cost of £0.

Need more data? Look around you

There is a good chance that you can already train your models, not with the data you already collect, but with the data you and your team generate every day.

All this data is lying around you, unexploited until now, so don’t let it fly away. Be clever, use the tools that are already in your workflow, be it Zoom, Slack, Meet, Discord, Jira, Trello... collect, transform, label and use!

Free data = free money to train your AIs.


At The49, we don’t wait for perfect conditions; we build with what we’ve got. Want help getting your AI idea off the ground? Let’s talk. Or for more on Jack Burton, visit jackburton.ai.