Labeling and Data Augmentation with SageMaker

This skill explores the use of AWS SageMaker for labeling and data augmentation in machine learning projects. It covers the historical context of human-assisted AI, the transition to automated processes, and the use of tools like SageMaker Ground Truth and Data Wrangler for image classification tasks. The skill emphasizes the importance of data augmentation techniques to enhance model accuracy and discusses the practical application of these tools in creating efficient machine learning workflows.

59m 7 Videos 8 Questions

Skill 10 of 20 in AWS Machine Learning

Skill Introduction

Data Prep for Image Classification

Image classification models are used for many tasks today, but how do they work?

Well, instead of a gremlin hiding in some server rack somewhere, image classification is just clever math (maths for you UK folks). That is indeed what all machine learning is: machines are good at numbers, and if we can transform things into numbers, then we can probably build an ML model for them.
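
To make the "transform things into numbers" idea concrete, here is a minimal Python sketch. It assumes a tiny black-and-white drawing of a house where 1 stands for a black pixel and 0 for a white one; the 5x5 grid and the `to_pixels` helper are illustrative inventions, not part of the course materials.

```python
# A hypothetical 5x5 black-and-white sketch of a house.
# '#' marks a black pixel, '.' marks a white pixel.
sketch = [
    "..#..",
    ".###.",
    "#####",
    "#...#",
    "#####",
]


def to_pixels(rows):
    """Convert the drawing into a grid of numbers: 1 = black, 0 = white."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


pixels = to_pixels(sketch)

# Many models consume a flat list of values, so flatten the grid row by row.
flat = [value for row in pixels for value in row]

for row in pixels:
    print(row)
```

Real images use far more pixels and usually a range of values per color channel rather than a single 0 or 1, but the principle is the same: the model only ever sees the numbers.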

Knowledge Check

What is the purpose of labeling in the context of creating an image classification model?

A. To assign a name to the patterns recognized by the model, helping it identify similar patterns in new data.
B. To increase the resolution of images for better model accuracy.
C. To convert images into a different file format for processing.
D. To enhance the color contrast of images for clearer visualization.

Verify your team's readiness — Request a Demo to verify practice assessments, completion reporting, and CSV / SCORM exports on the Team plan.

Creating Our SageMaker Domain

Déjà vu all over again. We'll get mighty used to creating SageMaker domains in this course, and that's a good thing.

AWS is a very entrepreneurial company, and different teams are encouraged to work in small, nimble groups. This is one of the reasons why some products might seem to have similar functionality to other products, or some services don't fully integrate with other AWS services upon release. This is the double-edged sword of rapid software development and, in my opinion, allows AWS to compete so well in the marketplace. This condition is just something you have to get accustomed to when working within the AWS ecosystem.

Knowledge Check

In the prior video, why are we using the Custom path rather than the Quick Setup option?

A. To restrict public internet access for our Domain
B. To use IAM Identity Center as our single sign-on
C. To restrict permissions to just S3 buckets in the US-EAST-1 region
D. To avoid AWS upcharges for using the Quick Setup option
E. To allow other AWS accounts to access our Domain

Using Ground Truth to Label Our Dataset

What is a "goat?" Well, to a machine learning model, it is a collection of 1's and 0's that we humans have told it that we call a "goat."

Ground Truth labeling jobs aren't just restricted to labeling for a machine learning model. During any major system conversion project, such as a new Enterprise Resource Planning (ERP) software implementation, there is usually a massive data cleansing effort for master data. Master data is the data about our products, customers, services, orders, shipments, and so forth. (In contrast, transactional data is the data that is recorded through the interaction of master data such as inventory records, shipment receipts, etc.)

We could easily use Ground Truth labeling jobs to create an orderly way to review any master data for accuracy and cleanliness before we convert it to the new system. We might even be able to outsource this task to external resources if the data review does not require any specialized company-specific knowledge.

Knowledge Check

Which of the following statements are true about the process of labeling data using SageMaker Ground Truth? (Choose THREE)

A. SageMaker Ground Truth can automate the creation of a manifest file for labeling jobs.
B. Mechanical Turk can be used as a human-based workforce for labeling tasks.
C. Labeling jobs can be assigned to private teams within an organization.
D. SageMaker Ground Truth only supports image data for labeling.
E. Labeling jobs must always be completed manually without any automation.

SageMaker Data Wrangler

Just the name Data Wrangler makes me want to don my cowboy hat, jump on my horse, and ride off into the sunset. Oh, who am I kidding... I'll probably just watch Blazing Saddles again.

Data augmentation can take other forms as well. We've already augmented some of our other data sets by combining and supplementing data from other sources. The idea is that you want your data set to support your end objective as fully as possible. That also means that if you have attributes in your data that do not impact your objective, you are usually best served by removing those attributes, as they can just create noise.

However, there might be some non-obvious correlations to our objective in that supposedly irrelevant data. There have been many cases where scientists have fed data into a training process and received surprising and unexpected results. This is where experimentation with ML models could come into play.

Knowledge Check

True or False: One objective of data augmentation is to expose the training process to more variations than we might currently have.

A. Data augmentation involves creating new training data by applying transformations such as rotation, resizing, and adding noise to existing images.
B. TRUE

Reviewing Our Augmented Data and Teardown

Let's take a look at our output and reset everything back to square one.

Later on, we're going to use some specialized Python libraries that will help us augment our image data much faster and at greater volume than we did using Data Wrangler. As I mentioned, Data Wrangler is designed to be a user-friendly, casual data handling tool that is fine for basic stuff but a bit limiting for doing things at a larger scale.
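
As a rough illustration of what image augmentation does under the hood, the sketch below applies three common transformations (horizontal flip, 90-degree rotation, and a brightness change) to a tiny grayscale image stored as a list of pixel rows. The function names and the sample image are hypothetical, and this plain-Python version is only a stand-in for the specialized libraries mentioned above, not the code the course uses.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate the grid a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def adjust_brightness(image, factor):
    """Scale every pixel by `factor`, clamping to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


# A made-up 2x2 grayscale image (0 = black, 255 = white).
original = [
    [0, 100],
    [200, 255],
]

# Each transformation yields a new training example with the same label.
augmented = [
    flip_horizontal(original),
    rotate_90_clockwise(original),
    adjust_brightness(original, 1.5),
]
```

One labeled image becomes four, which is the point: the training process gets to see more variation than the original data set contained.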

Knowledge Check

In what order might we have to delete the SageMaker AI domain components, first to last?

This interactive assessment is available in the full learning experience.

Validation: Labeling and Data Augmenting with SageMaker

Let's test our knowledge so far with some exam-style questions.

Knowledge Check

A machine learning engineer is using Amazon SageMaker Ground Truth to label a dataset of images for a binary classification task. The dataset is large, and the team wants to reduce labeling costs while maintaining high accuracy. Which approach should the engineer take?

A. Use fully manual labeling with a private workforce to ensure maximum accuracy
B. Enable automated data labeling with active learning to reduce the number of images requiring human review
C. Use Amazon SageMaker Data Wrangler to preprocess the images before labeling
D. Deploy a pre-trained model in SageMaker to label the entire dataset automatically

Knowledge Check

An ML engineer is using Amazon SageMaker Data Wrangler to augment a dataset of product images for a computer vision model. The model is underperforming due to limited variation in lighting conditions. Which Data Wrangler transformation should the engineer apply?

A. Adjust brightness and contrast of the images
B. Apply text normalization to image metadata
C. Remove duplicate images from the dataset
D. Convert images to grayscale to standardize colors
E. Launch a project to collect a new set of ground truth data

Knowledge Check

A machine learning team is using Amazon SageMaker Data Wrangler to augment a small dataset of audio clips for a speech recognition model. The model struggles with background noise variations. The team also needs to integrate the augmented data with a SageMaker Ground Truth labeling job for transcript verification. Which is the BEST approach the team can take to address both augmentation and labeling needs?

A. Use a custom transform in Data Wrangler to add synthetic background noise to the audio clips and export the dataset to an S3 bucket for a Ground Truth labeling job
B. Apply text augmentation in Data Wrangler to generate synthetic transcripts and use Ground Truth for audio labeling
C. Use Ground Truth to label the audio clips first, then import the labeled data into Data Wrangler for noise augmentation
D. Configure Ground Truth to perform audio augmentation and labeling in a single workflow

Question Walkthrough

Transcript

Skill Introduction

0:00Welcome back, beautiful learners, to another skill. In 1769, Wolfgang von Kempelen wanted

0:06desperately to impress Queen Maria Theresa, and he vowed to do just that. Six

0:12months later,

0:13Kempelen proudly unveiled a contraption he dubbed the Turk. It was a machine

0:18capable

0:19of playing other humans in chess and defeating them pretty soundly. Well, the

0:23Queen was indeed

0:25impressed, as were many others who dared to challenge this mysterious humanoid

0:29device to a match.

0:30Throughout the late 1700s and the first half of the 1800s, this Mechanical Turk

0:36toured around

0:37the globe, defeating notable figures like Benjamin Franklin and Napoleon Bonaparte. The device was

0:43sold several times to new owners looking to bring this modern marvel to

0:48audiences around the world

0:50for a fee, of course. Some people regarded it as a technological breakthrough

0:54while others insisted

0:56that there had to be some supernatural powers at work. There were skeptics

1:00though. Edgar Allan

1:01Poe and Charles Babbage publicly denounced the device as a hoax, and they were

1:06right. It wasn't

1:08until 1857 that one of the sons of a former owner of the device came clean and

1:13admitted the truth.

1:14The mechanical Turk was in fact powered by humans. The owner would hire chess

1:20masters to hide inside

1:21the cabinet and play chess with human opponents using a series of pulleys and

1:26levers. Well,

1:27that was way back in the 1800s, surely the same stunt wouldn't fly today. Oh

1:33yes, oh yes it does.

1:35All the tech giants have at one point or another made use of click workers to

1:40either overtly or

1:41covertly supplement their respective AI capacities. Amazon does this through a

1:48service

1:48interestingly called Amazon Mechanical Turk. Back in 2015, Facebook did this

1:54covertly when they tried

1:56to launch a personal assistant called M touting cutting edge AI, but humans

2:01were really doing the

2:03work. And then there is the Nate shopping app in which its founder claimed that

2:09it had industry

2:10leading AI capabilities. And that helped him secure $40 million of investor

2:15money. Only problem was

2:17the AI capabilities were in fact call center workers. And as of this recording

2:22that has landed

2:23its founder in a very sticky situation. For all the marvelous technology that

2:28we have out there,

2:29there are still things that require humans. And in this skill, we're going to

2:33dig into a

2:33little of both with one of our data sets, a little bit of human and a little

2:37bit of machine. Let's get

2:38started.

Data Prep for Image Classification

0:00Okay, in our last skill, we talked about SageMaker, and I mentioned that we're

0:05kind of caught

0:05between two minds right now.

0:07We have the SageMaker next gen, and then we have SageMaker, which is kind of

0:12the traditional

0:13SageMaker that has been around for a while, and AWS has named that SageMaker AI.

0:19And it's kind of unfortunate, but we find ourselves in this in-between world,

0:24and we

0:24have to know a little bit about both of these tools.

0:27Because over time, eventually, everything is going to gravitate more towards

0:32the SageMaker

0:33next gen.

0:34But in the meantime, these tools over here are still very popular, and they're

0:39used quite

0:40a bit.

0:41So there is going to be a huge transition over the course of probably a few

0:45years or so before

0:47everything is finally over here.

0:49Although what Amazon wants you to do is drop everything over in the old world

0:53and lean

0:54into the new world, well, we are going to talk about one of those tools that,

0:57as far

0:57as I can tell, is not yet available in the new world.

1:01So if we look at SageMaker AI, we looked at SageMaker Studio, we looked at Canvas.

1:07We also have some tools here inside Canvas, where we can access it via Canvas,

1:12and that

1:12tool is called Data Wrangler.

1:15Now what is Data Wrangler?

1:16Well, it is yet another ETL tool that allows us to do some different stuff.

1:22It's a little bit more simplistic than Glue or DataBrew, and it's more geared

1:28towards

1:29getting data in shape for true machine learning processes.

1:34So we're going to play around a little bit with this.

1:37We're not going to go too deep at all, because in all likelihood, if you're

1:42doing major machine

1:43learning workloads, you're probably not going to use Data Wrangler because it's

1:47kind of

1:47like the Fisher-Price edition of being able to manipulate and augment your data.

1:53But it's pretty decent for what it is.

1:56And the real thing is that, yes, it is on the exam blueprint.

1:59So I have a responsibility to show you how it works and be sure that you know

2:04how to

2:05get hands on with this thing.

2:07Now something else we are going to use here is within the umbrella of SageMaker

2:12Ground

2:12Truth, and we are going to do some labeling.

2:17Now what is labeling?

2:18Well, I'm glad you asked.

2:20Let's get to it.

2:21So let's say that we are going to create an image classification model.

2:26Now there are several different ways to create image classification models.

2:30And one of the ways to do that is to create something called a convolutional

2:34neural network

2:35or a CNN.

2:37And so let's say I sketched this little drawing here, I made a little house,

2:43and we fed that

2:44into a computer, and guess what, that computer is going to digitize my drawing

2:49into all these

2:50little pixels and all these little blocks.

2:53So the next step we can take with that is we can say, hey, let's assign a

2:58weight or some

2:59sort of numeric value to whether this block is white or black.

3:05And so in this case over here, we have zero as white and we have one as black.

3:09This is a very simple color palette here.

3:12We're just using black and white and we can use these weights right here.

3:16Well, they're not really weights, they're just values.

3:18So now we have a numeric representation of what this little sketch looks like.

3:24And so what we can do is we can use a machine learning process and we can train

3:29a convolutional

3:30neural network to begin to understand these patterns and look for these

3:36patterns in other

3:37new drawings or new images that we may feed in.

3:42That process is called training.

3:44So we are giving the CNN model a bunch of data to look at and we're saying, hey, find

3:50more or get better at finding more of what this thing is.

3:55Part of finding what this thing is is we have to give this thing a name and

3:59that's called

4:00a label.

4:01So we would say, hey, this is a house.

4:03So what we'd want to do is have that model learn that this particular pattern

4:08of ones

4:09and zeros.

4:10You can kind of forget about the black and white here.

4:13Computers don't see that.

4:14They just see ones and zeros here.

4:16This particular pattern of ones and zeros equals a house.

4:21And so therefore in the future, if we feed this thing into our model, our model

4:26is going

4:26to say, hey, that sure looks really close to what you told me was a house

4:31before.

4:32So I'm going to say, hey, this is a house.

4:34And I think maybe I have 98% confidence that what you're giving me here, I've

4:40seen something

4:41like that before and I'm going to call it a house.

4:45That is how image classification works.

4:48It's not smoke and mirrors.

4:49It's just all mathematics behind the scenes.

4:52So if we were just saying, hey, try to detect a house in this image is this

4:56image a house,

4:57we could train our model on a bunch of different houses here and we can say,

5:02yes, it is a house

5:03or if it doesn't match, if it has no similarities to something that we've fed

5:09into our model

5:10before, the model is going to say, no, I don't think that that is a house at

5:14all.

5:15So this is something called binary classification.

5:18And it's called binary classification because there's only two choices here,

5:22you're either

5:22a house or you're not a house.

5:25And we also have something else.

5:27And that is multi-class classification.

5:30Most image classification models nowadays are able to do multi-class classification.

5:35So in other words, you'll send in an image and maybe it's a house, maybe it's a

5:39cat,

5:40maybe it's a goat and that model will be able to say with a degree of certainty whether

5:46it thinks it's a house, a cat, a goat, a car or whatever else that particular

5:50model was

5:51trained on.

5:52And then beyond multi-class classification, in some cases, some models can do multi-label classification.

5:59And let's say that this is a house, but maybe it's a white house.

6:03And it's going to say, hey, that is a color white.

6:06I also know that is a house.

6:08Maybe it's a brick white house or something like that.

6:11So they can get pretty specific.
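
The three flavors just described (binary, multi-class, multi-label) differ mainly in how raw model scores get turned into answers. The sketch below is one common way to do that, not necessarily what any particular model in this course does: softmax picks a single winner among classes, while an independent sigmoid per tag lets several labels apply at once. All scores and label names here are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (one winner)."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Turn one raw score into an independent yes/no probability."""
    return 1 / (1 + math.exp(-score))


# Multi-class: exactly one of these classes is the answer.
classes = ["house", "cat", "goat", "car"]
class_scores = [4.0, 0.5, 1.0, -1.0]  # hypothetical model outputs
probs = softmax(class_scores)
best = classes[probs.index(max(probs))]

# Multi-label: any number of these tags can apply to the same image.
tags = ["house", "white", "brick"]
tag_scores = [3.0, 2.0, -1.5]  # hypothetical model outputs
predicted_tags = [t for t, s in zip(tags, tag_scores) if sigmoid(s) >= 0.5]

print(best, f"{max(probs):.0%}")
print(predicted_tags)
```

Binary classification is the simplest case: a single sigmoid score answering "house or not a house."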

6:12So let's say we have our picture of a goat here and we train our model on this

6:17little

6:17guy right here.

6:19And then for goat, it has a mental image, it has a mathematical image, not a

6:24mental image.

6:25We have a mental image.

6:26Our model has a mathematical image, a bunch of ones and zeros as to what

6:31represents a

6:32picture of what we call a goat.

6:35The model doesn't know what it's looking at.

6:37It doesn't know that it's looking at an animal unless we have trained it and

6:41said, hey,

6:42not only is this a goat, this is also an animal.

6:45So it's going to start building those relationships.

6:48So what happens if we pass in a different picture?

6:52This is a different goat.

6:54It's still a goat, but it's from a different perspective.

6:57The goat has different colors, it has different horn sizes here.

7:01Now if we just trained our model just on this image right here and we sit and

7:05we pass

7:06in this image right here, it's not going to be able to identify that very

7:10accurately

7:11as a goat because it has never seen an example of a goat that is not looking at

7:17us straight

7:17on with little horns and ears here.

7:20It has never seen this perspective of an image of a goat.

7:24So what we need to do is we need to send in a bunch of pictures here so we can

7:29start generating

7:30a bunch of little maps and connections that represent what we call a goat.

7:36So we would take this picture and this picture and this picture and this

7:39picture and this

7:39picture and we would label all those, goat, goat, goat, goat, and we would send

7:44them into

7:44our model and the model will start building similarities and relationships and

7:50ratios

7:51and begin to build out this map of here's what a goat looks like in many

7:56different scenarios.

7:58So the more information we send in during the training process, the more

8:02accurate our

8:03model is going to be at making the decisions that we want it to make.

8:09Let's look at our Lego data set here.

8:12If you happen to have looked at the features.csv file that came with the

8:16set, you'll

8:17see that it does have a feature in there.

8:19What is a feature?

8:20Well, a feature is just an attribute of one of our items here.

8:24And one of the features that it's calling out is helmet.

8:26Well, there's only one in here that has a helmet and I suppose you could say

8:32this is

8:32maybe a helmet, but this is not a helmet that's a hood, that's a hood, I don't

8:36know what

8:36that is.

8:37But anyway, we just have one person or one Lego wearing a helmet.

8:42So I didn't really want to just use helmet or no helmet because we don't have a

8:46whole

8:47lot of data.

8:48We don't have a whole lot of variety in our data set.

8:51So what I decided to do is I looked at the data set and I see that some of

8:56these figures,

8:57we can see their teeth, their mouth is open and for others, their mouth is

9:02closed.

9:03So what I decided to do is let's do a binary classifier here of not mouth open

9:09or mouth

9:10open.

9:11So we have some examples right there, granny with her mouth open versus some of

9:14our other

9:15characters over here, some sort of empire official right here, I believe, but

9:19you can

9:20see they do not have their mouth open.

9:23We can use these two sets kind of divided amongst themselves to begin to train

9:29our model.

9:30We can train our model and send in these pictures and say, hey, these are Legos

9:34with

9:35their mouth open and these are Legos with their mouth not open.

9:39Now what I anticipate our model is going to do is to kind of key off this

9:44little white

9:45space right here because that is something that is consistent across these

9:50different

9:51images.

9:52And over here, you see no white space right here.

9:54So it's my best guess that the model is probably going to be using that

10:00difference to really

10:01define whether the mouth is open or not open.

10:05Now is that good or bad?

10:07Well, we don't know because if we feed in a new Lego with the mouth open and

10:11let's say

10:12that it just has a black entry in here, like they're screaming or something and

10:16they have

10:16no teeth visible, well, is that going to rate as a mouth open or a mouth not

10:22open?

10:23We will have to see that.

10:24As I said before, we need a lot of pictures and a lot of samples to send into

10:29our model

10:30training process.

10:31How many do we need?

10:33That's a good question.

10:34We can send in fewer images or more images.

10:37Now if we have fewer images to work with, we are going to get better results

10:41doing something

10:41called transfer learning with a pre-trained model and I'll explain that in a

10:45second.

10:46If we have a bunch of images, if we're dealing with something that is very specialized, maybe we

10:50specialized, maybe we

10:52can build our CNN from scratch.

10:55We don't have to rely on a pre-trained model.

10:58What is a pre-trained model?

10:59Well, a pre-trained model is just a model that has already been trained on

11:04some stuff.

11:05In our case, we would probably want to pick a model out there that has already

11:09been trained

11:10how to look at pictures and there are plenty of open source models out there

11:14that we can

11:15use as a starting point to get us started and then we can use this transfer

11:20learning

11:21process.

11:23What that just means is that we're going to start with this pre-trained model and

11:26we're

11:26going to feed in our own data to make it more specialized on the topic that we're trying

11:33to build our ultimate model to be able to understand.

11:37The difference here is we can either start with dry pasta in a box or if we

11:43want to do

11:44the artisan path here, we can create our own pasta.

11:48We're going to get to the same point.

11:50This is going to be a little bit quicker.

11:52This is going to take a little bit longer.

11:54This is going to be more specialized.

11:56Maybe we don't have a pre-trained model that is very good for what we're trying

11:59to do,

12:00but there are lots of pre-trained models out there for image recognition and

12:04image classification.

12:05So we are going to pursue this path right here.
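
The transfer learning idea described above, starting from a pre-trained model and specializing it with our own data, can be sketched in miniature. This is a toy under stated assumptions, not how a real CNN is fine-tuned: `FROZEN_FILTERS` stands in for a pre-trained network whose weights are never updated, and only a small new "head" (a logistic regression) is trained on our handful of 3x3 images. Every name and data value below is hypothetical.

```python
import math

# Stand-in for a pre-trained model: fixed filters that are never updated.
FROZEN_FILTERS = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0],  # responds to dark pixels in the top row
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # responds to dark pixels in the bottom row
]


def extract_features(pixels):
    """Run a flattened 3x3 image through the frozen filters."""
    return [sum(w * p for w, p in zip(f, pixels)) / 3 for f in FROZEN_FILTERS]


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def train_head(samples, labels, epochs=500, lr=0.5):
    """Train only the new head; the frozen filters stay untouched."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for pixels, y in zip(samples, labels):
            feats = extract_features(pixels)
            pred = sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)
            error = pred - y
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias


def predict(pixels, weights, bias):
    feats = extract_features(pixels)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias) >= 0.5


# Our own small data set: label 1 = dark top row, label 0 = dark bottom row.
SAMPLES = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 0, 0, 1, 1],
]
LABELS = [1, 1, 0, 0]

weights, bias = train_head(SAMPLES, LABELS)
```

Because the feature extractor already "knows how to look at pictures," four examples are enough for the head to learn the task here. That is the dry-pasta-in-a-box shortcut; building the filters from scratch would be the artisan route.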

Creating Our SageMaker Domain

0:00Alright, so we're out at our console and our first order of business is to

0:04create or recreate our SageMaker AI domain.

0:08Now, if you happen to leave yours running, then you can just use it.

0:12But I deleted mine and I showed you how to delete yours as well because it's

0:17not exactly so straightforward.

0:19But now you know how to delete stuff, you can recreate it and just kind of do

0:24your own thing.

0:25So I'm going to create a domain here and we are going to use the setup for

0:30organizations.

0:31As I mentioned before, it allows us to use IAM Identity Center.

0:35If we just use setup for a single user, it only offers IAM authentication,

0:40which kind of complicates things because we are on the IAM Identity Center

0:45train, which is where we want to be.

0:47So I'm just going to call this SageMaker. Come up with your own ultra-creative

0:51name if you would like. We are using Identity Center here.

0:55And we only have one user group in our Identity Center. That's administrators

1:01and I'm going to click on next.

1:03Now, I'm going to go through here and I'm just going to uncheck the S3 bucket

1:07access.

1:08And I'm going to check S3 full access because I am doing this in a sandbox

1:13account, which you should be too.

1:16We don't really care very much about access here. We just want full access.

1:21If by chance you did not delete the role from last time, if you remember when

1:26we created a SageMaker domain last time,

1:28we added an additional role there. So if you have already created that role,

1:35and this is my role down here that was created,

1:38it'll probably say something like SageMaker execution role. Then whatever date

1:42you ended up creating it, you can feel free to use that.

1:46Now, if you remember, what we did is we added EC2 full access and we are going

1:52to need that access on this role.

1:55So you can do this two different ways. If you happened to delete that original

1:58role that was created, what I'd ask you to do here is just select full access

2:03for S3 and then go back out there and add whatever the name of the role that

2:09gets created, go back out there and add EC2 full access.

2:13I will show that to you here. So I'm going to IAM. We'll open that in a new tab,

2:20and we're going to find that role.

2:22And let's see, what is that role name here? SageMaker execution role with a

2:29bunch of numbers.

2:31There it is SageMaker execution role. And if you drill into that, if we scroll

2:36down here, we see EC2 full access.

2:40Now, we're going to need this access when we start working with Data Wrangler.

2:44So be sure that that is in place.
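
If you would rather check and attach that policy from code than click through the IAM console, a sketch along these lines should work with boto3. The role name is whatever your domain created (the one in the usage comment is a placeholder), the helper name `ensure_ec2_full_access` is made up, and the sketch assumes your credentials are allowed to modify IAM roles. It also ignores pagination, which is fine for a role with only a few attached policies.

```python
# AWS managed policy that grants full EC2 access.
EC2_FULL_ACCESS_ARN = "arn:aws:iam::aws:policy/AmazonEC2FullAccess"


def ensure_ec2_full_access(role_name, iam_client=None):
    """Attach AmazonEC2FullAccess to the role if it is not already attached.

    Returns True if the policy was attached, False if it was already there.
    """
    if iam_client is None:
        import boto3  # imported lazily so the function can be tested offline

        iam_client = boto3.client("iam")

    attached = iam_client.list_attached_role_policies(RoleName=role_name)
    current_arns = {p["PolicyArn"] for p in attached["AttachedPolicies"]}
    if EC2_FULL_ACCESS_ARN in current_arns:
        return False

    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=EC2_FULL_ACCESS_ARN)
    return True


# Usage (placeholder role name, substitute your own):
# ensure_ec2_full_access("AmazonSageMaker-ExecutionRole-EXAMPLE")
```

As in the video, full access policies like this belong in a sandbox account only.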

2:46Otherwise, you will get errors and you'll get frustration. I'm going to use my

2:51existing role here. That should work fine.

2:54And I'm just going to leave all the defaults here. Click on next.

2:59Leave everything on. Click on next.

3:02And I want to use public internet access. Now we talked about this at length

3:07last time in the other skill.

3:09This means that our resources inside this deployment, this domain deployment

3:15will have internet access.

3:16There are cases where you do not want that to be the case. In my case, I don't

3:20really care.

3:21I'm going to choose the private subnets here. Private one, two, three,

3:27and I believe that private subnets are recommended because otherwise you may not get

3:31some

3:32functionality. And security group, we have to choose a security group. I'm just

3:36going to choose the default one because it's pretty wide open.

3:39I don't really have anything that I need to change here. Click on next. And

3:43this is just a confirmation screen.

3:46And I'm going to click on submit. There we go. Submit. And now I'm just waiting

3:50a few seconds until our domain is created.

3:53So it's going to take a little while to stand it up and get everything squared

3:57away. And then after that,

3:59if you remember, what we need to do next is come in here and add a user profile.

4:03So I'm just going to pause the video until it is time to do that.

4:06All right. So our domain is now in service SageMaker domain. I'm going to go

4:12back into it, go to user profile, assign users, and check my little name right

4:19there, assign users.

4:20All right. So I have a user profile assigned. So once the user is added there,

4:27I'm going to launch an app.

4:28I'm going to launch Canvas. So this is going to create our application and

4:34allow us to launch this in the future.

4:37All right. So I launch Canvas. Now you may get this message. This is the first

4:43time I've ever seen this message, but I'm looking here: selected Kendra indexes.

4:46We're not using Kendra, so we can probably disregard it. It may bite me in the

4:51future. But anyway, we want to go down here to Data Wrangler and make sure we

4:54can access it.

4:55You can see the UI is pretty colorful and it's relatively simply laid out. We're

5:00going to revisit this when we start playing with Data Wrangler in a few

5:04videos.

5:04But for now, we want to turn our attention back to SageMaker AI and we're going

5:09to create a labeling job.

Using Ground Truth to Label Our Dataset

0:00Okay, so remember back when I was talking about this process where we need

0:04several images and we're going to give them names

0:06here of what they are, then we're going to send them into our model training

0:10process.

0:11And that is going to help our model develop a better understanding of what a

0:16goat is.

0:17And so to do that we have to, number one, find a bunch of pictures of goats and,

0:22number two,

0:22we have to tell the model that, hey, this thing that you're looking at right now

0:26is a goat.

0:27And I shouldn't say looking at. This thing that you're analyzing and you're

0:32stripping down to a bunch of ones and zeros,

0:34that representation is something that we refer to as goat.

0:38So we need to label our data set. These are labels.

0:43Sometimes your data comes pre-labeled.

0:45Sometimes your data does not come pre-labeled. In our case,

0:50we have our data set here and we want to divide them into not mouth open and

0:56mouth open

0:57Well, we could do that pretty easily manually just by looking at each of the 33

1:03images there

1:04But no, no, no, that's no fun

1:06Let's say that we are dealing with maybe several tens of thousands of images

1:11that we have to have somebody look at

1:14Because maybe the machine learning capabilities are not capable of determining

1:19whether a mouth is open or not

1:21I'm certain that there are models out there that can easily do this.

1:25But just kind of play along with me. I'm trying to take this on a very basic

1:29step-by-step process here

1:31So let's say that we have to label these two sets of data

1:35Well, we need to do a labeling job and we are going to use SageMaker Ground

1:40Truth to create a labeling job

1:43for our data set. So let's head out to the console and do that right now.

1:47Alright, so our SageMaker domain is working. If you were able to successfully

1:53start up Canvas,

1:54then you should be good there. We're gonna turn our attention to Ground

1:58Truth and create a labeling job.

2:00So we're gonna scroll on here to where it says Ground Truth and click on

2:04labeling jobs

2:05So we're being asked to create a labeling job

2:09What I want to do is I want to create a job that represents a

2:14labeling activity. So if we go out here to S3, if you remember, we uploaded our

2:20Lego images out there

2:22So let's go in there, Lego faces, and

2:24it has this feature CSV, but we're not gonna use that. If we go into this

2:30double-zero folder,

2:30You have all these images right here and those are the ones that I was showing

2:36in the presentation

2:37So these are the images that we want to label right now. We don't know what

2:41they are

2:42We don't know whether they have their mouths open or their mouths shut. So we

2:45need to create a labeling job now

2:47Of course, we're dealing with kind of small stakes here. We're only dealing

2:50with 33 images

2:52But just imagine you're dealing with a bunch of images

2:56Let's say millions or hundreds of thousands of images and you need all those

3:01labels

3:01You need somebody to look at those things and say whether or not. Yeah, that's

3:06a cat

3:06No, that's not a cat, or something like that. Chances are you'd probably do that

3:09with automation without having to put it in front of humans

3:11But there are complex tasks that sometimes require human intervention

3:17We are going to create a labeling job now

3:19This job is going to be exceedingly simple

3:21But I just want to get you through the process of setting one of these things

3:25up

3:25So if you need to do this in the future or you see an opportunity to do this

3:29you kind of know what you're doing

3:31So I'm gonna give this job a name. I'm just gonna call this mouth

3:34open or not

3:38And we have a couple different options here. We can do automated data setup or

3:43manual data setup

3:44Now automated data setup is going to read our S3 bucket and it's gonna do us a

3:49favor

3:49It's gonna create what's called a manifest file

3:52And if we did not do that we would have to manually create a manifest file

3:56Basically, this is just an inventory of the images or whatever data we want to

4:00have labeled. In our case,

4:03we can do that automatically and it's gonna save us some work
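For anyone who ends up doing the manual setup instead, here is a minimal sketch of what building that inventory could look like. It assumes the Ground Truth input manifest is a JSON Lines file with one `source-ref` entry per image; the bucket name, object keys, and the `build_manifest` helper are hypothetical and only there for illustration.

```python
import json

def build_manifest(bucket, keys):
    """Return JSON Lines text with one {"source-ref": ...} object per image."""
    image_exts = (".jpg", ".jpeg", ".png")
    lines = []
    for key in sorted(keys):
        # Skip anything that is not an image, such as a CSV sitting in the bucket.
        if key.lower().endswith(image_exts):
            lines.append(json.dumps({"source-ref": f"s3://{bucket}/{key}"}))
    return "\n".join(lines) + "\n"

# Hypothetical bucket and object keys, for illustration only.
manifest = build_manifest(
    "example-lego-faces",
    ["00/face_002.jpg", "00/face_001.jpg", "00/features.csv"],
)
print(manifest)
```

The automated data setup in the console produces this kind of file for us, so the sketch is only meant to show that the manifest is nothing more than a list of pointers to the objects to be labeled.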

4:06so I'm just gonna check that little box right there and check that double zero

4:10directory choose and

4:12For the output the same location as the input data set. Yep, that's fine

4:18The data type is, we were talking about images, and the IAM role:

4:24I want to create a brand new IAM role now

4:27I can specify this specific bucket that I want to be able to access or provide

4:32access to and that's probably a good thing

4:35So I'm gonna go up here Scott Plutcher Lego faces. Let's try this

4:39See if that flies

4:42There we go and then I'm going to click on complete data setup

4:48now what it's trying to do right now is go out there to that directory and

4:53Create our manifest file for us. So if we go out there to that directory, let's

4:58do a refresh

4:58Do we see something in here? There it is right there data manifest

5:03So we have a manifest and a JSON file. So that is going to be used when we're

5:08creating our labeling job

5:10So now we have our data prepared. The next step is we come down here to the

5:15task type

5:16And there are several different task types here: text, video,

5:20point cloud, generative AI,

5:22custom, all that other stuff. We can choose image because that's what we're doing,

5:27and we are doing an image

5:29classification. Now you remember we just have a single label here, or we could

5:33have multi labels. In other words,

5:35this is a human and this is also a vehicle. In our case, we're just doing image

5:40classification with a single label

5:42We're interested in whether or not that image has their mouth open or not
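To make the single-label versus multi-label distinction concrete, here is a small sketch of how the two kinds of answers are often turned into numbers. The class lists and both helper functions are assumptions made up for this illustration; they are not part of Ground Truth.

```python
CLASSES = ["mouth open", "mouth not open"]

def encode_single_label(label, classes=CLASSES):
    """One-hot vector: exactly one class is set for each image."""
    if label not in classes:
        raise ValueError(f"unknown label: {label!r}")
    return [1 if c == label else 0 for c in classes]

def encode_multi_label(labels, classes):
    """Multi-hot vector: any number of classes may be set for one image."""
    unknown = set(labels) - set(classes)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if c in labels else 0 for c in classes]

# Single label: the image is either one class or the other.
single = encode_single_label("mouth open")

# Multi-label: one image can be a human and a vehicle at the same time.
multi = encode_multi_label({"human", "vehicle"}, ["human", "vehicle", "animal"])

print(single, multi)
```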

5:46So we can scroll down here click on next and now we are asked what type of

5:52workers do we want to assign this to?

5:55So this is pretty interesting. So we have Mechanical Turk, which I mentioned

5:59before, and that is a human-based

6:02workforce, and we can actually send this job out to that

6:06Mechanical Turk army, and then humans would look at our pictures and follow our

6:11instructions and decide whether that mouth is open or not. Or

6:15we can do it private, which is what we're going to do. That just means maybe we

6:19have a team ourselves

6:20Maybe we've built our own team inside our own department or inside our own

6:24company

6:25We can do that too or we can do vendor managed

6:29Maybe the task that we're being asked to do, or we're asking our workers to do,

6:34is very, very specialized

6:36Maybe it requires specialized engineering knowledge or specialized legal

6:39knowledge or something like that

6:41Well, we can probably find a company that would provide those specialized

6:46skills to be able to properly label our data

6:50I'm just going to give this a team name and I'm just going to say my team and

6:55Here is how I can invite my private annotators. I can put their email address

7:01in here. So I'm going to put my email address

7:03Okay, now task timeout. This is the maximum time that a worker can work

7:10on a single task

7:11What this means is that if a worker starts working they're going to give

7:16certain answers and then after a period of five minutes

7:19they're going to say thank you very much come back later and then it has to

7:23time out basically the task has to

7:26Reset and then you would come back in there and it'd give you five more minutes

7:30to do stuff

7:31And the idea here

7:32I think the idea here is that you don't want people working on tasks for hours

7:36at a time because maybe their

7:38accuracy would deteriorate. So for our case, I am the only worker, and

7:45I'm pretty sure that I'm going to be able to get through 33 images and be

7:50able to classify them properly

7:52So I'm going to change this to 30 minutes
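If the same job were created through the API instead of the console, that timeout would be given in seconds rather than minutes. A minimal sketch, assuming the `TaskTimeLimitInSeconds` and `NumberOfHumanWorkersPerDataObject` field names from the `HumanTaskConfig` part of the SageMaker `CreateLabelingJob` request; the helper function is hypothetical and builds only that fragment, not a full request.

```python
def human_task_timing(minutes_per_task, workers_per_object=1):
    """Build the timing-related fragment of a HumanTaskConfig dictionary."""
    if minutes_per_task <= 0:
        raise ValueError("minutes_per_task must be positive")
    if workers_per_object < 1:
        raise ValueError("workers_per_object must be at least 1")
    return {
        # The console shows minutes; the API field is expressed in seconds.
        "TaskTimeLimitInSeconds": int(minutes_per_task * 60),
        # How many different workers should label each image.
        "NumberOfHumanWorkersPerDataObject": workers_per_object,
    }

timing = human_task_timing(30)
print(timing)
```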

7:55And I should be able to do that. Here's some organization information

7:59We can enter a description of our organization. We'll just say my org and then

8:04contact email

8:05Who do they contact if they have questions?

8:08Or issues

8:10Chances are, if you were doing this at scale, you'd probably put

8:14some sort of group message or some sort of automated

8:16support desk email address here. I mentioned that we can also automate data

8:22labeling

8:23We could probably easily use this option right here to determine whether the

8:28mouth is open or is not open on our Lego

8:30Images because I'm sure that SageMaker has a model that has already been at

8:35least marginally trained on that task

8:37But we are going to do this the old-fashioned way. We are going to do it the

8:41manual way

8:41So I'm going to keep on scrolling down here. Now here is what the worker is

8:47going to see

8:48So what it's done is it's pulled a sample image from our collection and we have

8:52to enter a brief description here

8:54Let's say, select

8:56whether the mouth is open or not open. We will say mouth open, and

9:03we will say mouth... we'll just say closed. Well, no, we'll do mouth not open,

9:08because this is a binary classification: we're only interested in whether the

9:13mouth is open or it is not open
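Behind the console form, those two choices end up in a small label category file. A minimal sketch, assuming the Ground Truth label category configuration layout of a `document-version` string plus a `labels` list; the `label_category_config` helper is hypothetical.

```python
import json

def label_category_config(labels):
    """Build a label category configuration for a classification job."""
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique")
    return {
        "document-version": "2018-11-28",
        "labels": [{"label": name} for name in labels],
    }

# The two choices from the walkthrough: a binary classification.
config = label_category_config(["mouth open", "mouth not open"])
print(json.dumps(config, indent=2))
```

Using "mouth not open" rather than "closed" keeps the second class defined as the plain negation of the first, which is the point being made about binary classification.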

9:14I could say closed, but that may get a little confusing.

9:17So we have this little instruction here. We can add some more instructions over

9:23here

9:23I can even preview it but last time I tried this it didn't quite work. I'm not

9:27exactly sure why

9:28Well, actually it worked this time. That's kind of cool. Here is a preview.

9:32Here's what our workers are going to see

9:34They can see the instructions

9:35You can see we could have added some stuff over here

9:38Select whether the mouth is open or not open and we have these two choices here

9:43And the worker has to look at this image and say mouth not open and then they

9:47would submit it

9:48Here's a little submit button. Now this didn't actually do anything

9:52This was just a preview and what this is going to send back to ground truth is

9:56it's going to send back this record right here

9:59It's going to say hey for this picture. The label is mouth not open

10:03Let us try this. Let's go ahead and create

10:07And it's going to take a little bit of time to create so i'm just going to

10:10pause the video until it is done creating this

10:13Labeling job. Well, well

10:15I won't pause the video because it just created it, but it does take a little

10:19bit of time here. Let's see

10:21Yeah, I'll just pause the video. Probably in the meantime, you can go check

10:25your inbox

10:25And ideally you'll have an invitation to be part of this labeling job team. So

10:32let me do that

10:33All right, so here is what was in my inbox a little invitation. You are being

10:37invited

10:37Here's a link here. It gives us a username and below this

10:41It gives me a little temporary password that I can use so I am going to click

10:45on this link here

10:47Okay, so here is the link that I clicked on going to sign in with my email

10:51address here

10:52And paste in the temporary password and it should prompt me for a new password.

10:58I'm going to enter my new password here

11:00Here we go

11:04Send

11:06And get out of here

11:07What it is doing is landing me on this little jobs list here

11:13So if I had multiple jobs that I was assigned to... if we go back here to our

11:17labeling job

11:18Go down to Ground Truth, labeling workforces

11:21If we go under private, you'll see here is our workforce. Now

11:26we just have one worker on our workforce. Now you can invite new workers

11:31And they would be able to provide input to multiple jobs. You can assign

11:37multiple jobs

11:38And here's the listing here. It says we have a job out here image

11:41classification single label

11:44And it's available and that's when it was created. So I'm going to click on

11:48start working

11:49It gives us instructions here and it's asking select whether the mouth is open

11:55or not open

11:56Notice the time here. It's given us a little countdown timer and we can either

12:01decline this task

12:02We can say hey, we don't want to do this. We can release it or we can just skip

12:06it

12:06I'm going to select not open, hit submit

12:11Mouth is open, submit. And so I'm going to go through all 33 of these and see how

12:17many I can do

12:18I'm just going to pause it here and hopefully I can get through all of them

12:21before it times me out

12:23If I can't get through 33 in 30 minutes, then shame on me

12:28So I finished up all the tasks that it gave me. Now,

12:31it did not give me all 33 of those. It gave me a decent chunk there

12:37So I do believe that we have done most of them. Yep. There it is

12:4220 of 33 objects have been classified, and that sounds about right. It gave me

12:47about 20

12:48And if we go into it, we have our labeled objects right here

12:54And the labels have been stored or they are being stored out in our output

13:00location here. So we can open this up

13:04And we can see annotations worker response iteration one

13:08And then here's all the image IDs there and we can see this little file

13:15And let's download that

13:18Take a peek

13:20Here is the file that I downloaded

13:24It's given a lot of information here for that particular image. It's saying

13:29mouth not open

13:29Gives the timestamp and when it was stored and all that stuff

13:33So that is very useful because we can use that in the future to be able to

13:37decide whether that image

13:39is labeled as mouth not open or mouth open. Let's go ahead and close

13:44this

13:44Let me go back out to my worker tool here and let's refresh it

13:50All right, there we go. So we got another swing at it here

13:54We're going to start working and this is probably going to allow us to

13:58Finish off the data set. Now sometimes they will give you the same image

14:03multiple times

14:04And that's just good practice because it wants to get multiple opinions on the

14:08label of this

14:09So you can kind of triangulate. What's the proper label? So I'm just going to

14:13finish off

14:14Labeling whatever it's going to give me and we're going to see if that is

14:18enough to complete our data set
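The idea of getting several opinions on one image and triangulating can be sketched with a plain majority vote. This is a stand-in for illustration only: Ground Truth uses its own annotation consolidation logic, and the `consolidate` function and the sample responses below are made up.

```python
from collections import Counter

def consolidate(responses):
    """Majority vote per image.

    responses: list of (image_id, label) pairs, possibly several per image.
    Returns {image_id: (winning_label, share_of_votes)}.
    On a tie, the label that was seen first wins.
    """
    by_image = {}
    for image_id, label in responses:
        by_image.setdefault(image_id, []).append(label)

    result = {}
    for image_id, labels in by_image.items():
        winner, votes = Counter(labels).most_common(1)[0]
        result[image_id] = (winner, votes / len(labels))
    return result

# Hypothetical worker responses: one image was shown three times.
responses = [
    ("face_001.jpg", "mouth open"),
    ("face_001.jpg", "mouth not open"),
    ("face_001.jpg", "mouth open"),
    ("face_002.jpg", "mouth not open"),
]
consolidated = consolidate(responses)
print(consolidated)
```

The share of votes gives a rough sense of how much the workers agreed, which is useful for spotting images that may need another look.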

14:20All right, I think that was the last of them it dropped me off in this screen

14:25right here. I've refreshed a few times

14:28Got no more tasks. I'm going to go back over here to the labeling job

14:31And let's see what we're sitting at

14:34Refresh sometimes it takes a little time to update this

14:38So i'm just going to pause the video and see if it's going to update that to

14:42the number of objects that we had that we're trying to label

14:45All right, so it has updated, saying 32 of 33. I don't know where the last one

14:51went

14:51But anyway, this is just demonstrating how to use this tool

14:55In fact, we are not really going to use this data right now

14:58But what I want to do is end this job. So i'm going to go up here to stop

15:04That is going to stop the labeling job

15:07Basically, it's going to close this opportunity for people to do labeling

15:12And just the fact that I closed it it decided to update and give me all 33

15:17So that's what I was hoping to see

15:19So just refresh here. It's going to take a little while to stop it

15:23But anyway, that's how we can set up a labeling job and then we can use that

15:27data to label our images

15:30but

15:32In our case, i'm going to kind of cheat

15:34So I am going to go back out here to lego faces and I have manually labeled

15:41these

15:41I just basically sorted them

15:43So what i'm going to do and you can do the same thing

15:46Let's just presume that we conducted our labeling job and we went out there to

15:52the data in here and we looked at all the mouth open and not open and we

15:57Use some formulas to assign different values and we ended up with something

16:03that looks like this

16:05So here we have folders here mouth open

16:09folders with mouth not open

16:11So I just manually did that i'm going to click on upload and get all that stuff

16:15in there

16:15So what this is going to do is drop these images into those respective folders

16:19there
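Rather than sorting by hand, the labeling output could drive that folder layout. A minimal sketch, assuming each line of the job's output manifest carries a `source-ref` and a `<label attribute>-metadata` object with a `class-name`; the attribute name `mouth-label`, the bucket, the destination prefix, and the `plan_label_folders` helper are all hypothetical. It only plans the destination keys and does not copy anything in S3.

```python
import json
from posixpath import basename

def plan_label_folders(manifest_text, label_attribute, dest_prefix):
    """Map each labeled image to a destination key under a per-class folder."""
    plan = {}
    for line in manifest_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        class_name = record[f"{label_attribute}-metadata"]["class-name"]
        folder = class_name.replace(" ", "_")  # folder naming is our own choice
        source = record["source-ref"]
        plan[source] = f"{dest_prefix}/{folder}/{basename(source)}"
    return plan

# Hypothetical output manifest lines, built here so the sketch is self-contained.
sample_lines = [
    json.dumps({
        "source-ref": "s3://example-lego-faces/00/face_001.jpg",
        "mouth-label": 0,
        "mouth-label-metadata": {"class-name": "mouth open"},
    }),
    json.dumps({
        "source-ref": "s3://example-lego-faces/00/face_002.jpg",
        "mouth-label": 1,
        "mouth-label-metadata": {"class-name": "mouth not open"},
    }),
]
plan = plan_label_folders("\n".join(sample_lines), "mouth-label", "labeled")
for source, dest in plan.items():
    print(source, "->", dest)
```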

16:20And then that's something that we can use next when we're messing about with

16:25Data Wrangler

SageMaker Data Wrangler

0:00In a prior video, I mentioned that we're going to have better success if we get

0:03a bunch of images

0:04of our classifications. So here we have our classifications, not mouth open,

0:09mouth open.

0:10But if we go back here to our data set, we only have 33 samples here. And of

0:16those 33 samples,

0:17there are a few, only a few that actually have their mouth open. So this is not

0:22going to be a

0:23very useful data set. So what are we to do? Well, one of the things we could do

0:29is go out and do a

0:30whole bunch of searching and see if we can find more Lego faces here that

0:34represent the two classes

0:36that we want to classify. Another thing that we can do is something called data

0:41augmentation.
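Before augmenting, it helps to know how lopsided the classes are and how many extra variants each image would need. A small sketch under made-up numbers: the 5 versus 28 split below is hypothetical (the lesson only says a few of the 33 have their mouth open), and `copies_needed` is an illustrative helper.

```python
import math
from collections import Counter

def copies_needed(labels, target_per_class):
    """Augmented variants per original image so each class reaches the target.

    Each original image plus its variants counts toward the class total.
    """
    counts = Counter(labels)
    return {
        cls: max(0, math.ceil(target_per_class / n) - 1)
        for cls, n in counts.items()
    }

# Hypothetical split of the 33 images.
labels = ["mouth open"] * 5 + ["mouth not open"] * 28
print(Counter(labels))

plan = copies_needed(labels, target_per_class=100)
print(plan)
```

With these numbers the small class needs far more variants per image than the large one, which is exactly why augmentation matters most for the class that has only a few examples.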

0:42So let's say we have our image right here. Well, if we feed this into a model

0:47training process,

0:49this image is going to be different than this image. It's the same image, of

0:54course, but it's

0:55just rotated, maybe 30 degrees or something like that. And the training process

0:59isn't going to know

1:00that this is the same image. In fact, it's going to look at it exactly the same

1:04way. It's going to

1:05say, okay, what is this? What does this represent? And we can say this is a

1:09mouth open image. This

1:12is a different image than this, and this, it doesn't have any color, it's black

1:16and white.

1:17And our model is going to interpret that with a bunch of ones and zeros and say,

1:21hey, okay,

1:22I understand that you're telling me that this is a mouth open. So here's what I

1:26see significant

1:27about this picture. Here's what I see significant about this picture, so forth

1:32and so on.

1:33This is yet another different image to the model. We've just flipped it and

1:38zoomed it in a little

1:39bit. And in this image, we put some noise on it, some speckles. And in this

1:43image, we've turned up

1:45the contrast, made it look a little bit different, flipped it on its side there.

1:49So all these images,

1:50when it comes to image recognition training, as far as the model is concerned,

1:55are brand new

1:56images, these are totally separate images. So do you see where I'm going here?

2:00What we can do is

2:01use a technique called data augmentation, where we can take our relatively

2:06small data set,

2:08and then we can run it through some processes such that we can increase the

2:13population of training

2:15data that we have available to send into our model training process. And

2:19therefore, hopefully,

2:21our model is going to develop more accuracy. We are going to use Data Wrangler

2:28to do that process.
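To see why each of those variations counts as a new image, here is a toy sketch that treats a grayscale image as a nested list of pixel values from 0 to 255. It is only an illustration under that assumption: real work would use an image library, the rotation is 90 degrees rather than the 30 degrees mentioned above to keep the code short, and all function names are made up.

```python
import random

def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def adjust_contrast(img, factor):
    """Stretch pixel values away from mid-gray (128), clipped to 0..255."""
    return [[max(0, min(255, round(128 + factor * (p - 128)))) for p in row]
            for row in img]

def add_noise(img, amount, seed=0):
    """Add random speckle of up to +/- amount to every pixel, clipped to 0..255."""
    rng = random.Random(seed)
    return [[max(0, min(255, p + rng.randint(-amount, amount))) for p in row]
            for row in img]

image = [[0, 64],
         [128, 255]]

variants = {
    "flip": flip_horizontal(image),
    "rotate": rotate_90(image),
    "contrast": adjust_contrast(image, 2),
    "noise": add_noise(image, 20),
}
for name, pixels in variants.items():
    print(name, pixels)
```

One source image has turned into four more training samples, and because the numbers differ, the training process sees each one as a separate example carrying the same label.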

2:30Next up, we're going to play around with Data Wrangler, and let's get Canvas cranking

2:34here.

2:35We can open Canvas from here. If you didn't see that, I just clicked on Canvas.

2:40And then it selected

2:41the user profile and I clicked on open Canvas. And there's Canvas. I think we

2:46can disregard

2:47those Kendra permissions. We're not using Kendra. So I'm going to click on Data

2:52Wrangler. Now from

2:53here, I can import the data. So I'm going to go up here, import image. And then

2:59we have different

3:00options here. I'm going to import it from S3. And so what I can do is find our

3:07data. And I want to

3:08import let's work on mouth open first. So I'm going to click on that. And we

3:14should have our data

3:17being imported right now, no errors, that's a good thing. And so this kind of

3:22looks similar to

3:23Glue or DataBrew or other ETL tools. But this is our source data. Now I can

3:30click on this little

3:30tab right here and it's going to show me the data itself. So here's all the

3:34pictures. And you can see

3:36it has all the mouths open. So what we can do from here is we can add some

3:40transformations

3:41because our goal here is to augment this data with some different shapes and

3:47kind of flavors

3:48of data. So we can go back here and click on add a transformation. Let's say

3:53that we want to

3:55we'll do a resize, or no, we'll do rotate. Well, we'll do resize first and rotate.

4:00So resize crop,

4:02how many pixels? I'll say 100 and preview. So let's see what this is going to

4:09look like here.

4:10Well, it looks like it pushed in a little bit. Let me see if I can resize a

4:16little bit more. Let's

4:17do 200. See if that looks a bit different. I want something that's kind of

4:23markedly different.

4:24There we go. That's better. So basically we just zoomed into the face here. Now

4:29according to the

4:30model, when we're training the model, this is going to be a totally separate

4:33picture, an image,

4:34than the other images that we're basing this on. So from here, we can click on

4:39add. And if we

4:40want to do other transformations, we can do that too. Let's say I can corrupt

4:45the image. And I use a

4:47noise function there, severity, let's really corrupt it and see what happens

4:51here.

4:52All right, yeah, that's pretty corrupted there. Add some noise. Sure, we like

4:59that. All right,

5:00so this is how our data is shaping up right now. So we've zoomed in, we've

5:05resized the images,

5:06we corrupted the images. All right, so one of the other things we can do is if

5:10we don't find

5:11something here that is to our liking, we can create our own, we can come down

5:15here and do a

5:16custom transform. And here we can actually paste in our own PySpark or pandas

5:21or Python code to do

5:23whatever we want it to do. But I'm not going to mess with that, get rid of that,

5:28get rid of that,

5:28let's go back to the data flow here. So what we have is we're resizing the

5:32image, we're corrupting

5:34the image. Now I can choose to export this data now. And I'm going to export it,

5:40let's say,

5:41Lego faces, mouth open, mouth open there, apply, just wanted to get that into

5:51this entry here.

5:52And I'm going to add another directory underneath that, we're going to say this

5:59is resize. Oops, resize

6:00noise. So there we go. And I'm going to add an export. And what that's going to

6:10do is that's

6:10going to basically start this export job process. Entire data set, auto job

6:15configuration, sure.

6:17Now, one of the other things that we could have done, let me show you back here

6:21is from here,

6:22we can define an export as well. And I think I'll go ahead and do that just for

6:27demonstration

6:27purposes, choose that, choose that, get that. There we go, apply, get that in

6:36there, mouth open,

6:39and we'll just say resize. So what this is allowing us to do is create

6:44different dropping off points.

6:47So we can have a whole cascade of these things. So what this is going to do is

6:51this is going to

6:52export just resized images. And then this is going to export resized and

6:58corrupt images. Now we can

7:00add more transformations here as well, and their own respective exports. So we

7:05can do this again

7:06and again and again. And what that allows us to do is compound the amount of

7:11data that we're

7:12producing. So for every run, we can create more and more data. So we can

7:16say, hey, run this

7:18once every 15 minutes or so. And then before long, we're going to have a decent

7:22stack of data.
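The cascade of transforms with a drop-off point after each step can be sketched in a few lines. This is not how Data Wrangler is implemented; it is a toy model where images are flat lists of pixel values, the two transforms are simple stand-ins, and `run_flow` plus the export folder names are assumptions for illustration.

```python
def run_flow(images, steps):
    """Apply steps in order and export the result after every step.

    images: {file_name: pixel_list}
    steps:  list of (step_name, transform_function)
    Returns {export_folder: {file_name: pixel_list}}, one folder per stage.
    """
    exports = {}
    current = dict(images)
    applied = []
    for name, transform in steps:
        current = {fname: transform(data) for fname, data in current.items()}
        applied.append(name)
        # The folder name records every step applied so far.
        exports["_".join(applied)] = dict(current)
    return exports

def center_crop(pixels):
    """Stand-in for the resize/crop step: drop the first and last value."""
    return pixels[1:-1]

def add_offset(pixels):
    """Stand-in for the noise step: nudge every value by a fixed amount."""
    return [min(255, p + 10) for p in pixels]

source = {"face_001.jpg": [0, 50, 100, 250], "face_002.jpg": [5, 60, 110, 200]}
exports = run_flow(source, [("resize", center_crop), ("noise", add_offset)])

for folder, files in exports.items():
    print(folder, files)
```

Two source images and two export points give four augmented files, and every extra step with its own export adds another full set.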

7:24But let me go ahead and run this right now. Click on export. And we're going to

7:30say process

7:31entire data sets. We also want to click on that as well. Because that's going

7:35to allow us to export

7:36this data set. And then this also as well. And I'm just going to scroll down

7:42here, scroll down here,

7:43scroll down here, everything else looks good. All right, so our processing job

7:48has started.

7:49So now what we can do is go check out the job. So to find the job, we have to

7:55go back here to

7:55SageMaker, and we go down here under processing right here, and we check out

8:01processing jobs.

8:02And there's our job. So our job is running, and we're going to let it finish.

8:07It usually takes or

8:08it probably won't take five minutes or something like that. And then in the

8:11next video, we're going

8:12to go out and check out what it output it.

Reviewing Our Augmented Data and Teardown

0:00Alright, so our processing job is done. Let's hop out to S3 here, see if we can

0:06see what it did.

0:07Just scroll down to Lego Faces, mouth open, and if we scroll all the way down.

0:16Sorry, I have this

0:17really large, probably on your screen, the layout's much better. But if we go

0:22down here to resize noise

0:24and resize, if we go underneath there, we have all our files here, and these

0:30are the resized files.

0:31And we should see the same thing here, if we go under resize noise, these are

0:38the resized noise

0:40edition of the files. One of the things that we would have to deal with here is

0:44you can see

0:45that it kept the same file name. Ideally, when we're going to be doing this

0:48training, we're putting

0:49all this stuff in the same folder, and we're going to be pointing our CNN model

0:54, our pre-trained model,

0:56rather, at that collection of images. But here's the limitation of Data

1:02Wrangler: you can

1:03see here, it's nice and friendly. But if you're doing this at scale, you're not

1:08going to use this UI.
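On the file name point from a moment ago: if the exported folders are later merged into one training folder, identical names would overwrite each other. One possible way around that, sketched with hypothetical names, is to tag each file with the export it came from before merging.

```python
from posixpath import splitext

def merged_name(file_name, variant):
    """Insert the variant name before the extension: a.jpg -> a__variant.jpg."""
    stem, ext = splitext(file_name)
    return f"{stem}__{variant}{ext}"

def merge_exports(exports):
    """Flatten {variant: [file names]} into one list of unique file names."""
    merged = []
    seen = set()
    for variant, names in exports.items():
        for name in names:
            new_name = merged_name(name, variant)
            if new_name in seen:
                raise ValueError(f"name collision: {new_name}")
            seen.add(new_name)
            merged.append(new_name)
    return merged

# Both export folders kept the original file name.
merged = merge_exports({
    "resize": ["face_001.jpg"],
    "resize_noise": ["face_001.jpg"],
})
print(merged)
```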

1:09It doesn't really have all the features that you would really want when you're

1:13trying to augment

1:14and transform your data. Sure, it does transforms here, but it's still pretty

1:19limited. What most

1:21data scientists will do is they will probably use like a Jupyter notebook or a

1:25Python script or

1:26something like that. In the future, when the time comes, and we need to augment

1:30this data,

1:31we are going to use a Jupyter notebook. And we are going to use some special

1:35Python functions that

1:36are purpose-built for augmenting our image data. And they will be able to crank

1:41out hundreds of

1:42images, really as many images as we need in a few minutes. So I just wanted to

1:47take you through

1:47this tool just so you know what it's all about, what it can do, kind of what

1:52it's used for,

1:53just so you have some sense of when somebody says Data Wrangler versus AWS Glue

2:00versus Glue

2:01DataBrew, you have an idea about the differences of those tools. Now let's tear

2:07things down.

2:09So really, we only created our domain. So we only have to tear down our domain.

2:15Now,

2:15one of the things if you remember, is first of all, we have to get rid of the

2:19spaces and any

2:20applications that are running in that. Come up here to delete, yes, delete, see

2:25if we can delete

2:25this. As you saw in the last video, sometimes this gets a little squirrely. And

2:29of course,

2:30this is no exception here. So I'm going to go back over to my session here, I'm

2:35just going to close

2:36out, close out of here, we'll go over here to resources, see if we can find

2:42that's our canvas

2:44running, we're going to stop that. And I think it stopped it, or it is in the

2:48process of stopping.

2:49There it is deleting, stopped. All right, now I go back to my space, click here,

2:57delete.

2:57Now hopefully we should be able to delete our space. All right, space was

3:04successfully

3:05deleted. Let's go over to our user profiles, drill in here, delete our user,

3:11delete.

3:13And if everything's cool, now we should have an active delete domain button,

3:22yes, delete my domain, delete, delete. All right, see you later, until next time,

3:29SageMaker AI domain.
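The order of that teardown (apps, then spaces, then user profiles, then the domain) can be written down as a short script. A minimal sketch, assuming a client object that offers the `delete_app`, `delete_space`, `delete_user_profile`, and `delete_domain` operations of the SageMaker API; the domain ID, names, and the recording stand-in client are hypothetical. The real operations are asynchronous, so an actual script would also have to wait for each stage to finish before starting the next, which this sketch leaves out.

```python
def teardown_domain(client, domain_id, apps, spaces, user_profiles):
    """Delete domain resources in dependency order: apps, spaces, users, domain."""
    for app in apps:
        # Each app dict names its owner (UserProfileName or SpaceName),
        # plus AppType and AppName.
        client.delete_app(DomainId=domain_id, **app)
    for space in spaces:
        client.delete_space(DomainId=domain_id, SpaceName=space)
    for profile in user_profiles:
        client.delete_user_profile(DomainId=domain_id, UserProfileName=profile)
    client.delete_domain(
        DomainId=domain_id,
        RetentionPolicy={"HomeEfsFileSystem": "Delete"},
    )

class RecordingClient:
    """Stand-in client that records calls instead of contacting AWS."""
    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
        return record

client = RecordingClient()
teardown_domain(
    client,
    domain_id="d-example123",  # hypothetical domain ID
    apps=[{"UserProfileName": "default-user", "AppType": "Canvas", "AppName": "default"}],
    spaces=["my-space"],
    user_profiles=["default-user"],
)
for name, kwargs in client.calls:
    print(name, kwargs)
```

Swapping the recording client for a real SageMaker client would follow the same order, with waiting added between the steps.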

3:31That's really all we have to do. We are of course going to spin up domains

3:36again. So we

3:36are going to get very good at spinning these things up. The reason I shut them

3:40down is because

3:41I don't know when the next time you're going to start up the next video. And I

3:44certainly don't

3:45want you to accidentally leave stuff running that's going to cost you money. So

3:49that's why we're

3:50deleting things and spinning them up with every unique skill.

Validation: Labeling and Data Augmenting with SageMaker

0:00Let's take a look at these questions. First question, I'm going to let you read this.

0:04Hopefully you already have. Which approach should the engineer take? So let's start down here at

0:09the bottom. Use SageMaker Data Wrangler to pre-process the images before labeling. Well,

0:15yeah, we could probably do that. But if we're talking about a really large data set, chances

0:20are you're probably not going to use Data Wrangler for that. And it doesn't really say what we're

0:25going to be doing with this pre-processing. And we're not really asked up here to pre-process

0:30anything. Let's try the next one. Deploy a pre-trained model in SageMaker to label the

0:35entire data set automatically. Well, we could try that, but this kind of presumes that our data set

0:45is easy to label or there's a pre-trained model out there that we can use to label it. So I'm

0:51going to put a question mark there. Let's see if there's a better one. Use fully manual labeling

0:55with a private workforce to ensure maximum accuracy. Well, yes, we could absolutely do

1:01this as well. But one of the things we're asked about here is trying to reduce costs. So this is

1:07probably going to be the most costly way of labeling that data. Let's keep on going. Enable

1:12automated data labeling with active learning to reduce the number of images requiring human

1:17review. Yes, that is something we can do with Ground Truth. If you remember when we were setting

1:21up our Ground Truth job, it had that little checkbox there that we could say, yeah, try some

1:26automated labeling. So that is an option to reduce cost. I like that answer best of all. All right,

1:34next question. We have a data set, but they have trained a model using that data set and it's

1:39underperforming due to limited variation in lighting conditions. And this is not an uncommon

1:47problem. Once you train your model, you actually do some testing and you figure out, hey, it's not

1:53great in all situations. So what do you have to do? Well, you have to go back to your ground truth

1:58data and say, hey, do I have to change this or maybe adjust it in some way so that I can give

2:04the model that experience that it's not really getting correct? So which Data Wrangler transform

2:11should the engineer apply? Let's try from the bottom here. Remove duplicate images from the data

2:15set. No, that's not going to help us one bit. Convert images to grayscale to standardize the

2:21colors. Well, you may be thinking, well, yeah, if we just scale everything to gray, then we have no

2:27colors and that would be more appropriate. So we could just scale the input image to gray,

2:33but I don't really like this one. This is not probably going to help us with lighting conditions.

2:39Adjust the brightness and contrast of the images. Yeah, yeah, that could help if we were to take our

2:46existing data set and then add some variation by adjusting the brightness and contrast. Maybe

2:52that'll help the model train more accurately for those real world situations where we do have

2:58variations in lighting conditions. I'm going to put a little check mark beside this one because

3:03this is a pretty strong answer. Launch a project to collect new set of ground truth data. Well,

3:10we could do that, but I don't think that's called for. We can try some other things like we can try

3:15this adjusting the brightness and contrast before we undertake this new project. Plus it says which

3:21Data Wrangler transform should we use? And this is not a Data Wrangler transform. Last option,

3:29apply text normalization to image metadata. Well, that's not going to help us one bit. It appears

3:34we're using this for computer vision, so text normalization is kind of irrelevant. So I am going

3:41to choose adjusting the brightness and contrast as my final answer. All right, final question. You can read all this stuff, which

3:46is the best approach the team can take to address both augmentation and labeling needs. Let's start

3:53from the bottom here. Apply text augmentation and data wrangler to generate synthetic transcripts

3:59and use ground truth for audio labeling. Well, that's kind of backwards of what we're being asked

4:06here. I suppose we could use text augmentation to generate synthetic transcripts, but audio labeling,

4:15what is that going to do? What we're being asked here is transcript verification. So ideally we are

4:21sending into ground truth the transcript and then it's supposed to be judged as to whether it's

4:27accurate or not. Let's keep on going. Configure ground truth to perform audio augmentation

4:32and labeling in a single workflow. Well, ground truth does not really help us do audio augmentation

4:41or any sort of data augmentation. It just helps us facilitate data labeling. So we can eliminate

4:47this one. This is not a good option. Use ground truth to label the audio clips first. Okay,

4:52 yeah, we can do that. Then import the labeled data into Data Wrangler for noise augmentation. Well,

4:59 I suppose we could do this. Let's keep on going to see if there's a better option here. Use a

5:04 custom transform in Data Wrangler to add synthetic background noise to the audio clips

5:10 and export the dataset to an S3 bucket for a Ground Truth labeling job. Yes, this is probably

5:16 more accurate, or at least closer to what I would do than the previous option, because in Data Wrangler,

5:24 although it does not handle synthetic background noise for audio clips out of the box, we can absolutely do it

5:30 with a custom transform. And exporting that out to an S3 bucket? Well, that's the starting point

5:36 for Ground Truth labeling jobs. So of all these options here, this is probably my best choice. And

5:43 remember, maybe none of these are perfect options, but the question asks for the best, and I would say that's

5:49 probably the best of these options. So I hope this has been informative for you, and I'd

5:54 like to thank you for viewing.
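
The custom-transform idea from the final question can be sketched roughly as follows, in plain Python rather than inside Data Wrangler itself. This is only an illustration under stated assumptions: the audio clip is assumed to be already decoded into a list of float samples, the noise is simple white Gaussian noise mixed in at a target signal-to-noise ratio, and the function names, the 20 dB target, and the `example-bucket` URI are all hypothetical. The `source-ref` key is the field a Ground Truth input manifest uses to point at an object in S3. How the function would be wired into a Data Wrangler custom transform is not shown here.

```python
import json
import math
import random


def add_background_noise(samples, snr_db, seed=None):
    """Return a copy of `samples` with white Gaussian noise mixed in.

    `snr_db` is the desired signal-to-noise ratio in decibels
    (hypothetical helper, not a Data Wrangler built-in).
    """
    if not samples:
        return []
    rng = random.Random(seed)
    # Average power of the clean signal.
    signal_power = sum(s * s for s in samples) / len(samples)
    # Noise power that gives the requested signal-to-noise ratio.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise_std = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, noise_std) for s in samples]


def manifest_line(s3_uri):
    """Build one JSON Lines entry for a Ground Truth input manifest."""
    return json.dumps({"source-ref": s3_uri})


# Stand-in for a decoded clip: one second of a 440 Hz tone at 16 kHz.
sample_rate = 16000
clean = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

# Augment the clip, then describe where the exported copy would live.
noisy = add_background_noise(clean, snr_db=20, seed=0)
line = manifest_line("s3://example-bucket/augmented/clip-0001.wav")  # hypothetical URI
print(line)
```

The order matters in the same way it does in the walkthrough: augmentation happens first, the augmented clips land in S3, and the manifest pointing at those objects is what a labeling job would start from.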

