Creating Training Data For Text Classification In Google Cloud Vertex AI
In the coming posts I will be doing a few deep dives on Google Vertex AI. This post focuses on data engineering and on following a data-centric approach to AI. The Datasets component is the first step in the Vertex AI workflow.
In a previous post on Vertex AI I gave an overview of Vertex AI within the context of LLMs and Generative AI. In this post I consider the practicalities around engineering training data.
Data-centric AI is the discipline of systematically engineering the data used to build an AI system.
~ DCAI
A data-centric approach to training data for AI, and in this case text classification, demands a continuous life-cycle as described in the image below.
The cycle starts with the ability to explore training data via a latent space. A latent space can be described as an environment where data is compressed in such a way that patterns, clusters and other insights emerge from it.
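As an illustration of what such exploration can look like, here is a minimal sketch that embeds utterances and clusters them so candidate classes emerge. It assumes the sentence-transformers and scikit-learn packages are installed; the model name, example texts and cluster count are all illustrative choices, not part of Vertex AI.

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    # Example utterances; in practice these come from your raw corpus.
    texts = [
        "My eldest son just got word he has a new job.",
        "We went to see a movie together last night.",
        "I finally finished my bachelors degree this year.",
        "My wife surprised me with a home-cooked dinner.",
    ]

    # Compress each utterance into a dense vector -- the latent space.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(texts)

    # Cluster the vectors; each cluster is a candidate class to inspect.
    clusters = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
    for cluster, text in zip(clusters, texts):
        print(cluster, text)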
Following exploration, a human-in-the-loop process with weak supervision is required for identifying classes and applying those class labels to the data.
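A minimal sketch of the weak-supervision idea: noisy heuristic labelling functions propose labels, and a human only reviews the disagreements. The label names and keyword rules below are illustrative assumptions.

    ACHIEVEMENT, AFFECTION = "achievement", "affection"

    def lf_achievement(text):
        # Fires on career / study related words, otherwise abstains.
        return ACHIEVEMENT if any(w in text.lower() for w in ("job", "degree", "promotion")) else None

    def lf_affection(text):
        # Fires on family related words, otherwise abstains.
        return AFFECTION if any(w in text.lower() for w in ("son", "daughter", "wife")) else None

    def weak_label(text):
        votes = [v for v in (lf(text) for lf in (lf_achievement, lf_affection)) if v]
        # Unanimous votes become provisional labels; conflicts go to a human.
        return votes[0] if len(set(votes)) == 1 else "NEEDS_HUMAN_REVIEW"

    print(weak_label("I finally got my degree."))           # achievement
    print(weak_label("My eldest son just got a new job."))  # NEEDS_HUMAN_REVIEW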
A big vulnerability, and a current void within Vertex AI, is exactly this data-centric process. The data presented to Vertex AI needs to be engineered and structured for training in advance.
The Vertex AI formatting requirements for JSON and CSV data files are highly complex and take effort to produce, as seen in the JSON formatting below:
{"textGcsUri":"gs://cloud-ai-platform-cbb21882-3e0b-4f11-88b9-21bd2fb3a35e/dataset-3591873590502359040/preprocessed_example/4214918245292965888/75173735366921/text.txt",
"languageCode":"",
"classificationAnnotation":{"displayName":"achievement",
"annotationResourceLabels":{"aiplatform.googleapis.com/annotation_set_name":"2418892595758366720"}},"dataItemResourceLabels":{}}
The textGcsUri path points to the text below, which is labelled as achievement:
My eldest son who is 27 just got word he has a new job after finishing his bachelors degree. This made me very happy!
The JSON portion above covers only one labelled record; below is a text-file view of an import file containing thousands of such records.
The data format is also tightly coupled to the Google Cloud Storage bucket structure, which adds complexity.
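If I read the import schema correctly, the JSONL file can also carry each text inline via a textContent field instead of a textGcsUri pointer, which avoids uploading one text file per record. A minimal sketch of generating such an import file, where the file name and example records are illustrative:

    import json

    # Example records; texts and labels are illustrative.
    examples = [
        ("My eldest son just got word he has a new job. This made me very happy!", "achievement"),
        ("My wife surprised me with a home-cooked dinner.", "affection"),
    ]

    # One JSON object per line, as the Vertex AI import expects.
    with open("import.jsonl", "w") as f:
        for text, label in examples:
            record = {
                "textContent": text,
                "classificationAnnotation": {"displayName": label},
            }
            f.write(json.dumps(record) + "\n")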
Vertex AI is a no-code studio environment to build, deploy and scale machine learning (ML) models, with managed ML tools available for a myriad of use cases.
As seen below, once the data is imported, the text is visible with each label assigned to it. Basic functionality like filtering, searching and editing the training data is available.
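For those who prefer code over the console, the dataset creation and import can also be scripted with the google-cloud-aiplatform Python SDK. A minimal sketch, where the project, region, display name and bucket URI are hypothetical placeholders:

    from google.cloud import aiplatform

    # Project and region are hypothetical placeholders.
    aiplatform.init(project="my-project", location="us-central1")

    # Create a text dataset and import the JSONL file from Cloud Storage.
    dataset = aiplatform.TextDataset.create(
        display_name="happiness-moments",
        gcs_source="gs://my-bucket/import.jsonl",
        import_schema_uri=aiplatform.schema.dataset.ioformat.text.single_label_classification,
    )
    print(dataset.resource_name)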
Something I found curious is the fact that via Vertex AI it is possible to request human labellers to add labels to data.
According to Google, Vertex AI data labelling tasks allow you to work with human labellers to generate highly accurate labels for your collection of data.
Prices for the service are computed based on the type of labelling task.
For a text classification task, units are determined by text length (every 50 words is a price unit) and the number of human labellers.
For example, one piece of text with 100 words and 3 human labellers counts for 100 / 50 * 3 = 6 units. The price for single-label and multi-label classification is the same.
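As a quick sanity check of that arithmetic, here is the unit calculation in a few lines of Python, assuming a partial 50-word block rounds up to a full unit (the rounding rule is my assumption, not spelled out above):

    import math

    def labelling_units(word_count: int, labellers: int) -> int:
        # One unit per started block of 50 words, per labeller.
        return math.ceil(word_count / 50) * labellers

    print(labelling_units(100, 3))  # 6 units, matching Google's example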
In Closing
A hallmark of so-called traditional NLU engines is the ease with which training data can be entered. This is definitely not the case with Vertex AI.
Functionality for a continuous process of data exploration, curation, structuring (engineering) and ingestion is not defined or enabled.
Shipping data off to independent human labellers seems counterintuitive and I would rather opt for an automated process with human supervision.
The multi-modal nature of Vertex AI bodes well for the future of Foundation Models with the inclusion of text, tabular data, images and video.
⭐️ Please follow me on LinkedIn for updates on Conversational AI ⭐️
I’m currently the Chief Evangelist @ HumanFirst. I explore and write about all things at the intersection of AI and language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces and more.