Build A Fully Custom ML Model from Zero to Hero

Overview

Data scraping provides the foundation upon which powerful, custom models can be trained and fine-tuned. ChatGPT, for example, was trained largely on data scraped from the internet. All great AI algorithms start with great data.

To keep pace with the rapid growth of AI, companies must shift focus to becoming more data-oriented.

Challenge

An enterprise client in the Education industry wanted a content-classification model to automate the tagging of untagged text entries submitted by hundreds of thousands of unique users. The end goal was to feed each raw text entry to a custom ML model that would classify it into the correct education category, such as “Calculus”, “World History”, or “Psychology”. To properly source and clean the data and then train the ML model, we divided the project into several distinct phases.

Solution Overview

Understanding Client Needs

Meetings with stakeholders at the company informed our strategy for which external web data could be used to build the desired ML model. In particular, we homed in on a couple of large websites containing a sufficient volume of text-tag pairings that could be sourced, cleaned, and used to train an ML model fit to our client's needs.

Data Scraping Layer

After identifying a large website containing both textual entries and associated content tags, we wrote a bot to programmatically map out the entire site, extract its contents, and load them into AWS cloud database storage. Advanced data scraping techniques such as header setting, proxy rotation, and JavaScript fingerprinting were used to extract data at scale. We also applied multi-threading with throttling so that extraction proceeded at a measured rate and did not weigh heavily on the target web server. Once the raw data extraction was complete, we prepared the data for the data engineering phase.
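
As a rough illustration, the sketch below shows what a throttled, multi-threaded fetcher with rotating headers and proxies might look like; the URLs, proxy addresses, and user-agent strings are placeholders rather than the client's actual configuration, and this is not the production bot itself.

```python
# Minimal sketch of a polite, multi-threaded scraper with header setting and proxy rotation.
# PROXIES, USER_AGENTS, and the example URLs are hypothetical placeholders.
import itertools
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

PROXIES = ["http://proxy-1:8080", "http://proxy-2:8080"]            # hypothetical proxy pool
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]                                                                    # rotated request headers
proxy_cycle = itertools.cycle(PROXIES)

def fetch(url: str) -> str:
    """Fetch one page with a rotated proxy and user agent, pausing to stay polite."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxy = next(proxy_cycle)
    resp = requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=30)
    resp.raise_for_status()
    time.sleep(1.0)  # throttle each worker so the target server is not overloaded
    return resp.text

def scrape(urls):
    """Fetch pages concurrently with a small worker pool."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch, urls))

# pages = scrape(["https://example.com/entry/1", "https://example.com/entry/2"])
```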

Data Engineering Layer

As the scrape progressed, the volume of text-file data extracted grew from single-digit gigabytes to hundreds or even thousands of gigabytes by completion. To properly focus and digest this colossal amount of data, we parsed the files down into more consumable portions by cleaning out unneeded HTML and filtering extraneous text snippets. At the conclusion of this phase, we had a clearly defined data schema ready for model training and fine-tuning.
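
A simplified sketch of this cleaning-and-filtering step is shown below, assuming BeautifulSoup for HTML stripping; the record fields and filtering threshold are illustrative, not the exact rules used on the client's data.

```python
# Minimal sketch of cleaning scraped HTML into (text, tag) training pairs.
# The "html"/"tag" record fields and the minimum-length filter are illustrative.
from bs4 import BeautifulSoup

def clean_entry(raw_html: str) -> str:
    """Strip markup, scripts, and boilerplate whitespace from a scraped page."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for node in soup(["script", "style", "nav", "footer"]):  # drop non-content elements
        node.decompose()
    text = soup.get_text(separator=" ")
    return " ".join(text.split())                             # collapse extra whitespace

def keep_entry(text: str, tag: str) -> bool:
    """Filter out entries that are too short or missing a usable tag."""
    return bool(tag) and len(text.split()) >= 5

# Example usage: records = [{"html": page_html, "tag": "Calculus"}, ...]
# pairs = [(clean_entry(r["html"]), r["tag"]) for r in records
#          if keep_entry(clean_entry(r["html"]), r["tag"])]
```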

Machine Learning Layer

With the heavy lifting of data ETL complete, we spun up a distributed Dask cluster to allocate computing resources efficiently for model training. Different modeling approaches were tested and assessed for their performance before we settled on a decision-tree-based algorithm as the most effective text classifier. The completed ML model and weights were shared with the client for their continued internal use.
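
The sketch below illustrates this kind of setup, assuming a local dask.distributed cluster and a scikit-learn tree-based pipeline; the toy training pairs, model choice, and hyperparameters are stand-ins rather than the client's actual configuration.

```python
# Minimal sketch: train a tree-based text classifier with scikit-learn,
# routing its parallelism through a Dask cluster. Data and settings are illustrative.
import joblib
from dask.distributed import Client, LocalCluster
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

if __name__ == "__main__":
    # Stand-in for the cleaned (entry text, subject tag) pairs from the previous phase.
    texts = [
        "The integral of x squared is x cubed over three",
        "The Treaty of Versailles formally ended World War I",
        "Classical conditioning pairs a neutral stimulus with a response",
    ]
    labels = ["Calculus", "World History", "Psychology"]

    cluster = LocalCluster(n_workers=4)      # swap for a multi-node cluster in production
    client = Client(cluster)

    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),                  # text -> sparse features
        RandomForestClassifier(n_estimators=300, n_jobs=-1),  # tree-based classifier
    )

    # Route scikit-learn's internal parallelism through the Dask cluster while fitting.
    with joblib.parallel_backend("dask"):
        model.fit(texts, labels)

    joblib.dump(model, "subject_classifier.joblib")  # model and weights shared with the client
```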

Report Generation

This final ML model was then applied to generate classifications for all untagged text entries.

Afterward, we generated a report for our client detailing the distribution of user texts by subject type, answering their question about which subject categories were most popular and informing their strategy on which subject areas deserve the most attention and investment going forward.
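
A minimal sketch of that reporting step might look as follows, assuming pandas and the trained pipeline artifact from the training sketch above; file and column names are placeholders.

```python
# Minimal sketch: classify untagged entries and summarize the subject distribution.
# "subject_classifier.joblib" and the sample entries are illustrative placeholders.
import joblib
import pandas as pd

model = joblib.load("subject_classifier.joblib")

untagged = pd.DataFrame({"text": [
    "Explain the causes of the French Revolution",
    "Solve the limit of sin(x)/x as x approaches zero",
]})                                                   # stand-in for the untagged entries

untagged["predicted_subject"] = model.predict(untagged["text"])

# Distribution of user entries by predicted subject, as summarized for the client.
distribution = (untagged["predicted_subject"]
                .value_counts(normalize=True)
                .rename("share_of_entries"))
distribution.to_csv("subject_distribution_report.csv")
print(distribution)
```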

Conclusion & Next Steps

While the potential of artificial intelligence is exciting, our experience is that companies often jump over-eagerly into model building while overlooking the fundamental data engineering layers beneath the ML layer. Investing in a solid data engineering foundation greatly increases the quality of the output at the ML stage.