COMMENT
Artificial intelligence (AI) is rapidly altering nearly every aspect of our daily lives, from the way we work to the way we absorb information to the way we choose our leaders. Like any technology, AI is amoral: it can be used to advance society or to cause harm.
Data is the genetic material that powers AI applications. It is DNA and RNA rolled into one. As the old saying in software development goes: "garbage in, garbage out." AI technology is only as accurate, safe and functional as the data sources it relies on. The key to ensuring that AI delivers on its promises and avoids its nightmares lies in the ability to keep the garbage out and prevent it from proliferating and replicating across millions of AI applications.
This is called data provenance, and we can’t wait another day to implement controls that prevent the future of our AI from becoming a giant pile of garbage.
Bad data leads to AI models that can propagate cybersecurity vulnerabilities, disinformation and other attacks globally in seconds. Today's generative AI (GenAI) models are incredibly complex but, in essence, they simply predict the next best block of data to produce, given the data that came before.
A measure of precision
A ChatGPT-like model evaluates the set of words that make up the original question and all the words in the model's answer so far to calculate the next best word to return. It does this repeatedly until it decides it has given a sufficient answer. If we judge the model on its ability to string words together into sentences that are well formed, grammatically correct, on topic and generally relevant to the conversation, today's models are surprisingly good – a measure of precision.
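To make that loop concrete, here is a minimal Python sketch of greedy, word-by-word generation. The tiny bigram table and the "<end>" stop token are invented purely for illustration; a real GenAI model scores enormous vocabularies with billions of learned parameters, but the shape of the loop is the same.

    # Toy sketch of autoregressive generation: repeatedly pick the most likely
    # next word given the text so far, until a stop condition is reached.
    # The "model" here is a hand-written bigram table, purely illustrative.
    TOY_MODEL = {
        "what": {"is": 0.9, "a": 0.1},
        "is": {"data": 0.6, "ai": 0.4},
        "data": {"provenance": 0.7, "quality": 0.3},
        "provenance": {"<end>": 1.0},
    }

    def next_word(context: list[str]) -> str:
        """Score the candidate next words given the context and return the best one."""
        candidates = TOY_MODEL.get(context[-1], {"<end>": 1.0})
        return max(candidates, key=candidates.get)

    def generate(prompt: list[str], max_words: int = 10) -> list[str]:
        """Append one predicted word at a time until the model 'decides' it is done."""
        output = list(prompt)
        for _ in range(max_words):
            word = next_word(output)
            if word == "<end>":
                break
            output.append(word)
        return output

    print(" ".join(generate(["what", "is"])))  # -> "what is data provenance"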
Dig deeper into whether the text produced by AI always conveys "correct" information, and whether it appropriately indicates how confident it is in that information, and problems emerge. These arise from models that predict very well on average but not so well in edge cases – a robustness issue. The problem is exacerbated when poor output from AI models is stored online and used as future training data for these and other models.
Poor results can replicate at an unprecedented scale, creating a negative AI feedback loop.
If an attacker wanted to aid this process, they could intentionally encourage the production, storage and dissemination of additional malicious data, leading to more misinformation coming out of chatbots – or to something as nefarious and scary as a car's autopilot model deciding it must steer sharply to the right, despite objects in its path, whenever it "sees" a specially crafted image in front of it (hypothetically, of course).
After decades, the software development industry, led by the Cybersecurity and Infrastructure Security Agency (CISA), is finally implementing a secure-by-design framework. Secure by design requires that cybersecurity be foundational to the software development process, and one of its core principles is cataloging every software component – a software bill of materials (SBOM) – to strengthen security and resilience. Finally, security is replacing speed as the most critical go-to-market factor.
Protecting AI projects
Artificial intelligence needs something similar. The AI feedback loop renders ineffective the common cybersecurity defense techniques of the past, such as tracking malware signatures, building perimeters around network resources or scanning human-written code for vulnerabilities. We need to make secure-by-design AI a requirement while the technology is in its infancy, so that AI can be made safe long before Pandora's box is opened.
So how do we solve this problem? We should take a page from academia. We train students with highly curated training data, interpreted and passed on to them by a corps of teachers. We continue this approach to teach adults, but adults are expected to do more of the data curation themselves.
AI model training should take a two-stage, curated-data approach. To start, base AI models would be trained with current methodologies on massive amounts of less curated data. These base large language models (LLMs) would be more or less analogous to an infant. The base models would then be trained with highly curated datasets, similar to how children are educated and raised to become adults.
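As a sketch of what the second, curated stage could look like in practice, the Python snippet below filters a raw corpus down to records with verifiable provenance before fine-tuning. The field names, trusted-source list and checks are illustrative assumptions, not any standard – real curation pipelines would also involve human review.

    # Stage one trains on everything; stage two fine-tunes only on records whose
    # origin is known, trusted and unaltered. Field names are hypothetical.
    import hashlib

    TRUSTED_SOURCES = {"peer_reviewed_journal", "vetted_textbook", "curated_encyclopedia"}

    def has_valid_checksum(record: dict) -> bool:
        """Confirm the text still matches the checksum recorded when it was collected."""
        digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
        return digest == record.get("sha256")

    def is_curated(record: dict) -> bool:
        """Keep a record only if its provenance can be established."""
        return (
            record.get("source_type") in TRUSTED_SOURCES
            and record.get("license") is not None
            and not record.get("ai_generated", False)  # avoid feeding model output back in
            and has_valid_checksum(record)
        )

    def build_stage_two_dataset(raw_corpus: list[dict]) -> list[dict]:
        """Filter the raw corpus into the curated set used for second-stage training."""
        return [r for r in raw_corpus if is_curated(r)]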
The effort to create large, curated training datasets for all kinds of objectives will not be small. It is analogous to all the effort that parents, schools and society put into providing a quality environment and quality information to children as they (hopefully) become functioning, value-adding contributors to society. That is the level of effort required to create quality datasets to train quality, well-functioning, minimally corrupt AI models, and it could give rise to an entire industry of AI and humans working together to teach AI models to be good at their intended jobs.
The current state of AI training shows some signs of this two-stage process. But, given how early GenAI technology and the industry still are, too much training takes the less curated, first-stage approach.
When it comes to AI safety, we can't afford to wait an hour, let alone a decade. AI needs the equivalent of a 23andMe, enabling a complete review of its "algorithm genealogy" so developers can fully understand the "family history" of an AI system, prevent chronic problems from replicating, and keep them from infecting the critical systems we rely on every day and causing economic and social damage that may be irreversible.
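One way to picture such an "algorithm genealogy" is sketched below in Python, under an invented schema (no standard is implied): for every trained model, log which parent models and which exact datasets it descends from, so that problems found later can be traced back through the family tree.

    # Sketch of an "algorithm genealogy" record. The schema, model names and
    # file names are hypothetical, used only to illustrate lineage tracking.
    import hashlib
    import json
    from datetime import datetime, timezone

    def fingerprint(path: str) -> str:
        """Hash a training-data file so its exact contents can be verified later."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def genealogy_record(model_name: str, parent_models: list[str], dataset_paths: list[str]) -> dict:
        """Build a lineage record linking a model to its parents and training data."""
        return {
            "model": model_name,
            "trained_at": datetime.now(timezone.utc).isoformat(),
            "parents": parent_models,  # e.g. the base LLM this model was fine-tuned from
            "datasets": [{"path": p, "sha256": fingerprint(p)} for p in dataset_paths],
        }

    # Example with hypothetical names: write a tiny dataset, then record its ancestry.
    with open("curated_support_tickets.jsonl", "w") as f:
        f.write('{"text": "example curated record"}\n')
    record = genealogy_record("support-bot-v2", ["base-llm-v1"], ["curated_support_tickets.jsonl"])
    print(json.dumps(record, indent=2))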
Our national security depends on it.