Effective Management Of Data Sources In Machine Learning

  • May 29, 2023
  • liwaiwai.com

Machine learning and artificial intelligence have become buzzwords as tools based on these technologies, like ChatGPT, became available to the general public.

In this article, I would like to take a peek into the ML backstage, especially at the data it ingests: the essential fuel for any ML model, no matter what it is applied to.


There are two major aspects of data when it comes to selecting and processing it: quality and quantity. There is also a third factor affecting the practical side of ML: economic feasibility. I would like to share my personal experience of working with different data sources and finding a case-tailored balance between data quality, quantity, and available resources.

Types of Data Sources

Two primary data types exist, each requiring distinct processing approaches prior to their utilization in machine learning. Firstly, there is user-generated data, which entails meticulous logging of user activity and behavior. In domains like online advertising, actions such as clicks and conversions are collected, logged, and transformed into labeled data.

Secondly, there is data that lacks inherent labeling but is still needed for training ML models. To address this, manual data labeling by human annotators becomes crucial. A prime example is the MNIST dataset, a renowned computer vision dataset compiled and labeled by humans.

Another strategy for handling non-labeled data emerges when resource constraints hinder staffing sufficient annotators for labeling training data. In such cases, the proxy-label method can be employed. For instance, users can report inappropriate content, which can then be labeled accordingly. However, this approach may introduce noise and occasionally require the attention of the annotators’ team.

Now, let’s delve into the approaches for effectively managing the most challenging data type: non-labeled data.

Gathering Datasets with Human Annotators

Building datasets employing human annotators is a commonly used technique in machine learning. Its two main limitations are cost and time consumption. 


Costs can turn into a major challenge if a dataset is large and processing it means hiring many people. Also, the nature of data may require a high level of expertise and, consequently, employing expensive specialists. For instance, annotating medical images or legal documents can only be done by highly trained staff. 

Time can be an issue for several reasons. First, it is obvious that large datasets take a long while (or an impractical number of people) to process. Sometimes it can also be difficult to recruit annotators willing to engage in long-term projects. Second, annotators should be adequately trained to deliver the desired level of quality. Such training can be time-consuming, especially when high expertise levels are required.

Let me share the lessons I learned while working on an image quality classifier for an online shop. Our solution was built to automatically detect blurry images, cropped images, and images with an unprofessional background. In order to accomplish this, we needed to gather a training sample. Here is what made the process more effective:

Annotate data in batches 
As always, we had a limited budget for human raters, so splitting the annotation process into batches proved extremely useful. Initially, our raters weren’t producing high-quality results, and when we discovered this, we had only spent about 5% of the budget. To address the issue, we collaborated with the team of annotators to refine the guidelines and the process in order to reduce the number of errors (more on this in the next section).

Detecting the data quality issue early on allowed us to allocate the remaining 95% of the budget to a well-trained team of human annotators.
Another advantage of sending data in batches is that it enabled us to implement active learning. If we had sent the entire dataset to the annotators all at once, we wouldn’t have been able to utilize this strategy.

Sample batches with active learning
Random sampling is generally a good initial approach, but with this particular project there were two problems:

  1. The negative class was overrepresented, because good images were the majority in the database;
  2. With random sampling, we could not easily add “difficult” images (those with borderline predictions of about 0.5) to the training sample.

The solution was to sample with Active Learning.

We followed three steps:

  1. Randomly sampled some images;
  2. Got predictions for every image using the current model;
  3. Sorted images by predictions in descending order, and sent images with the highest probabilities OR borderline images to human annotators.
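
As a rough illustration, here is a minimal Python sketch of this selection loop, assuming a scikit-learn-style classifier exposing predict_proba and images stored as a NumPy feature array; the pool size, batch size, and borderline band are hypothetical values, not the ones we used.

```python
import numpy as np

def select_annotation_batch(images, model, batch_size=200,
                            pool_size=5000, band=(0.4, 0.6)):
    # 1. Randomly sample a candidate pool from the unlabeled data.
    pool = np.random.choice(len(images), size=pool_size, replace=False)

    # 2. Score every candidate with the current model:
    #    probability of the positive ("bad image") class.
    probs = model.predict_proba(images[pool])[:, 1]

    # 3. Keep the highest-probability candidates plus the borderline
    #    ones (predictions of about 0.5), then send them to annotators.
    top = pool[np.argsort(probs)[::-1][: batch_size // 2]]
    borderline = pool[(probs >= band[0]) & (probs <= band[1])][: batch_size // 2]
    return np.unique(np.concatenate([top, borderline]))
```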

Switching from random sampling to an active learning strategy enabled us to increase the percentage of the positive class:

  • 2x for images with an unprofessionally made background;
  • 4x for partially displayed images;
  • +33% for blurry images.

Track annotators’ quality
It’s quite common when working with human annotators to collect several responses per item and assign the final label using the supermajority rule.

For example, let’s assume we collect three responses per image; if at least two answers indicate that the image is blurry, we assign “blurry” as the final label. Intuitively, it seems like this procedure should significantly increase the accuracy of the data.

However, a simple calculation shows that the increase in the probability of the final label being correct is small, and that it is more important to work on the accuracy of the individual annotators.

Let’s assume each annotator gives a correct answer to a given question with probability p. Now let us calculate the probability q that at least two of the three raters give the correct answer. Then q is the sum of the probabilities of two events: all three raters answering correctly (p^3), and exactly two raters answering correctly (3·(1 − p)·p^2):

q = p^3 + 3·(1 − p)·p^2

The table below shows how q changes depending on p:

  p       q
  0.60    0.648
  0.70    0.784
  0.80    0.896
  0.90    0.972
  0.95    0.993

As we can see, applying the supermajority rule with a minimum of three responses gives only a modest lift over a single annotator’s accuracy. That is why it is very important to track the quality of each annotator.
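
These numbers are easy to reproduce; here is a short Python check of the formula above:

```python
def majority_correct(p: float) -> float:
    """Probability that at least two of three independent annotators,
    each correct with probability p, produce the correct majority label."""
    return p**3 + 3 * (1 - p) * p**2

for p in (0.6, 0.7, 0.8, 0.9, 0.95):
    print(f"p = {p:.2f} -> q = {majority_correct(p):.3f}")
```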

In our case, the following process helped: 

  1. Introducing a “golden set” created by well-trained annotators. We used this set to calculate accuracy for each annotator assigned to our project;
  2. Holding a biweekly AMA where annotators could ask questions about controversial images;
  3. Introducing a final exam with a minimum performance threshold.

An annotator could only start rating after passing the final exam with a score above the threshold.
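
Here is a minimal sketch of the golden-set accuracy check; the file names, column names, and threshold are hypothetical, for illustration only.

```python
import pandas as pd

golden = pd.read_csv("golden_set.csv")      # columns: item_id, true_label
responses = pd.read_csv("responses.csv")    # columns: item_id, annotator_id, label

# Join each response against the trusted "golden" label.
merged = responses.merge(golden, on="item_id")
merged["correct"] = merged["label"] == merged["true_label"]

# Accuracy per annotator; anyone below the threshold is flagged
# for retraining or removal from the project.
accuracy = merged.groupby("annotator_id")["correct"].mean()
flagged = accuracy[accuracy < 0.85]         # hypothetical threshold
print(flagged)
```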

Reducing human involvement 

Sometimes, a human-based approach may fail or become too impractical to employ. In many cases, there are more elegant solutions to bail you out. Let us have a look at some of them:


Proxy Values
Sometimes we can be creative and use a proxy value as a label instead of building a dataset with human annotators. For instance, I once developed automation to identify, at scale, family-friendly properties listed on an accommodation-booking website.

There are several proxy signals for this case. We can look at the review ratings left by travelers and label all the hotels with good ratings from families as “family-friendly”. Similarly, we can obtain a “non-family-friendly” dataset the same way. As an alternative, we can simply measure the share of family bookings: if the share of such bookings over a set period (say, a year) exceeds a certain threshold, we label the accommodation as family-friendly. This works because people who book hotels usually do extensive research, and we simply reuse the results of that research as expressed in their booking decisions.
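
As an illustration, the booking-share variant boils down to a simple aggregation. This sketch assumes a hypothetical booking log with property_id and is_family_booking columns; the threshold value is made up for the example.

```python
import pandas as pd

bookings = pd.read_csv("bookings.csv")  # columns: property_id, is_family_booking (0/1)

# Share of family bookings per property over the observation window.
family_share = bookings.groupby("property_id")["is_family_booking"].mean()

# Proxy label: mark properties above the chosen threshold as family-friendly.
THRESHOLD = 0.30  # hypothetical value, tuned per use case
labels = (family_share > THRESHOLD).rename("family_friendly")
print(labels.head())
```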

Data augmentation
In this case, data from an already processed dataset is reproduced in an altered form and fed back into the model. Working on my image quality project, I took the existing good photos and used basic graphic tools to make them bad. In particular, I blurred the good images and added these “bad” images to the training set. I did the same by randomly cropping images so that they displayed only part of the product. These simple transformations added +8.08% to ROC AUC.
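
For reference, here is a minimal sketch of both transformations using Pillow; the blur radius and crop fraction are hypothetical values, not the ones used in the project.

```python
import random
from PIL import Image, ImageFilter

def make_blurry(img: Image.Image, radius: float = 4.0) -> Image.Image:
    # Gaussian blur turns a good photo into a synthetic "blurry" negative.
    return img.filter(ImageFilter.GaussianBlur(radius))

def make_cropped(img: Image.Image, keep: float = 0.5) -> Image.Image:
    # Crop a random window covering `keep` of each dimension, so the
    # result shows only part of the product.
    w, h = img.size
    cw, ch = int(w * keep), int(h * keep)
    left, top = random.randint(0, w - cw), random.randint(0, h - ch)
    return img.crop((left, top, left + cw, top + ch))

good = Image.open("good_photo.jpg")  # hypothetical input file
bad_blurry = make_blurry(good)
bad_cropped = make_cropped(good)
```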

As you can see, human input in machine learning is essential. However, managing it may be a matter of survival for many tech projects. The rule of thumb is to opt for fewer but more qualified staff, set up reference datasets, and extract as much as possible from the actions of users, employing them essentially as free annotators. If you want to entertain yourself with a mind game, here is an interesting task. Imagine you manage a media platform that uses crowdsourcing, publishing written content produced by authors of different backgrounds, skills, and quality.

Initially, selecting, editing, and publishing good articles and rejecting bad ones was entrusted to human editors. But as your platform gains popularity, editors become overwhelmed by the workload. Boosting your staff is not an option, as your budget is limited.

Try to think of ways to reduce the number of the editors’ incoming tasks by means of ML:

  1. Rejecting AI-generated texts;
  2. Turning down texts that are mere compilations of older ones;
  3. Using historical data to predict articles’ readability and user engagement, rate them accordingly, and prioritize editors’ tasks according to this rating.

By: Kristina Fedorenko (ML eng at Meta, biking through London, mom of happy bunny Victor)
Originally published at Hackernoon

Source: cyberpogo.com

