Understanding Cyc, the AI database

Cyc is a project I have never personally used. It was created before my time, but it’s referenced often in AI history and it is still going. It is a knowledge base with the goal of “codifying, in machine-usable form, the millions of pieces of knowledge that compose human common sense”. According to their site: “Cyc is the most advanced machine reasoning platform for Enterprises”. Expert systems stored rules for specific domains such as healthcare or law. Cyc is essentially a “generalized expert system” with rules for most domains, so it can be used across all of them. Instead of starting from expert domains, they started from the ground up with basic rules like “babies cry”, “humans are organisms”, “all organisms die”, etc. Expert systems are generally seen as a dead end now compared to statistical machine learning.

Douglas Lenat created this project in 1984, so it’s almost 40 years old. Doug’s definition of intelligence: “Intelligence is ten million rules.”

They have supposedly raised $7-25 million in funding to build out their project. In terms of its impact on AI, I would say it has basically failed. I don’t think it advanced the state of the art in AI, and it’s certainly not bringing us AGI. Doug said around 2016 that enough knowledge had been collected for it to be ready for use by a wider audience, but I haven’t seen or heard of recent companies using it. As a company, I think they have done a good job; I haven’t seen other AI companies survive for almost 40 years. It looks like they have 50 full-time employees and generate revenue by powering AI engines for many other companies. Their business model appears to be essentially that of an IBM-like consulting company for AI projects that need systems with lots of rules: they build custom software for clients on top of their Cyc engine. Supposedly lots of government projects have used them.

Examples of its knowledge base:

(#$isa #$BillClinton #$UnitedStatesPresident)
  • “Bill Clinton belongs to the collection of U.S. presidents.”
(#$capitalCity #$France #$Paris)
  • “Paris is the capital of France.”

Other facts, stated in plain English:
  • “You can’t be in two places at the same time”
  • “You can’t pick something up unless you’re near it”
  • “Every tree is a plant”
  • “Plants die eventually”

Cyc can infer “Garcia is wet” from the statement “Garcia is finishing a marathon run” by employing its knowledge that running a marathon entails high exertion, that people sweat at high levels of exertion, and that when something sweats it is wet. All of these rules must be coded into a machine-readable format like the ones I shared above.
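To make the Garcia example concrete, here is a minimal sketch of that chain of reasoning as simple forward chaining in Python. This is not CycL and not how the Cyc engine is actually implemented; the predicate names (finishing_marathon, high_exertion, sweating, wet) and the forward_chain function are made up for illustration, and each rule here has just one premise and one conclusion.

```python
from typing import List, Set, Tuple

# A fact is (predicate, subject), e.g. ("wet", "Garcia").
# All predicate names below are hypothetical, not real Cyc constants.
Fact = Tuple[str, str]
# A rule is (premise predicate, conclusion predicate):
# "if X has the premise, then X has the conclusion".
Rule = Tuple[str, str]

RULES: List[Rule] = [
    ("finishing_marathon", "high_exertion"),  # marathons entail high exertion
    ("high_exertion", "sweating"),            # high exertion causes sweating
    ("sweating", "wet"),                      # something that sweats is wet
]


def forward_chain(facts: Set[Fact], rules: List[Rule]) -> Set[Fact]:
    """Apply the rules repeatedly until no new facts can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            # Iterate over a snapshot so we can add to `known` safely.
            for predicate, subject in list(known):
                if predicate == premise and (conclusion, subject) not in known:
                    known.add((conclusion, subject))
                    changed = True
    return known


derived = forward_chain({("finishing_marathon", "Garcia")}, RULES)
print(("wet", "Garcia") in derived)
```

Starting from the single fact that Garcia is finishing a marathon, the loop derives the intermediate facts one at a time until “Garcia is wet” appears. The hard part for Cyc is not this loop but writing down millions of such rules by hand.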

The database supposedly has 160,000+ concepts and 3-25 million facts. I don’t know what the difference between concepts and facts is in their system.

Not Grounded

Cyc’s concepts are not grounded or embodied in our world the way all human concepts are. I can’t blame them for this; currently no AI system is grounded. But in a database with millions of facts, how does a concept like “above” or “behind” work with just facts? It “knows” that water makes you wet and cold because its database of rules says water does this, not from any kind of grounding or knowledge of the real world.

Children vs Experts

The goal of expert systems is to build AIs with knowledge in specific domains so that they can do things faster than regular people. Cyc was building multiple expert systems in one system to contain all knowledge. The thing is, these systems can’t understand or do the things that a 3-year-old child can. I’d rather have a system that had the knowledge that a 3-year-old has.

Hidden from researchers and the public

It’s totally fine to be a commercial project that doesn’t share anything publicly; most companies do this. The most recent information about them that I could find is here: https://news.ycombinator.com/item?id=21781597 . Still, almost all large scientific endeavors these days are open, or at least partially open. It’s hard to trust and build on a secretive project. Open source and open science have shown us that projects can thrive by being open. Cyc did release OpenCyc, a free version of Cyc, for a while, but they killed it after a few years. If the system were more open, more experiments could be run on it to test the limits of the engine and to propose ways to improve it.

Not usable from day one

For the project to be useful, it needs to have millions of facts stored in its system, which means years of work to build it out. Doug initially estimated it would take hundreds of human years to build the system out. I don’t know what the atmosphere was like in the early 1980s, but it was a closed source project from the beginning, so no one could really test and use it. The information I found is all third-party accounts of it. Even today, almost 40 years later, we still can’t freely test it or use it. Other big AI projects like GPT-3 are not 100% freely usable, but at least certain scientists can test and review them.

Insights from Doug

Doug has an interview from 2021 you can watch, with lots of fascinating stuff: https://www.youtube.com/watch?v=3wMKoSRbGVs

According to Doug, they respect all of their clients’ proprietary information, but if common knowledge needs to be added to support a client’s system, then that knowledge becomes available to everyone. He said “Don’t get caught up in the academia game where you compromise your AI work and do a tiny project just so you can get a tiny paper published that maybe a few people will read”. If you are working on a big AI project and you know you are on to something, you need to keep pushing on it. I admire his tenacity. They have not been in the academic publishing game and have continued to build out their platform in the background. OpenCyc was built to get customers to move to their paid platform, but according to him, what happened instead was that people thought OpenCyc was good enough and just used that. That is supposedly the reason they killed it off. In his interview, Doug mentions he knows he doesn’t have a lot of time left (he was over 70 at the time of the interview), so he wants to accelerate Cyc’s contribution by getting it more business and allowing it to have metaphysical contributions. He hopes his legacy is to be remembered as “One of the pioneers or inventors of the AI that is ubiquity similar to the way we look at the pioneers of electricity and someone who wasn’t afraid to spent several decades on a project while other forces are incenting people to go for short term rewards”.

Single modality

Back in the 1980s we didn’t even have multimodal systems. Even today, most systems are still focused on a single modality, text processing. Many of the recent advancements in machine learning have been around image understanding and sound comprehension. I think Cyc would be more useful if it added more modalities so it could understand and apply its knowledge in them. For example, in vision, could it understand that someone in a store has a weapon and use that information to notify police that a violent crime is about to occur? How about sound: if you combined the Cyc engine with audio and it heard something breaking and a baby crying, could it make an inference about what to do next?

Not learning on its own

With all the millions of pieces of knowledge in the system, you would imagine it would be at a point where it could learn on its own. As far as I know, each fact still needs to be manually input into the system. Even if it’s not learning over time, automated ways to get new candidate data into the system would make it more useful.

Once again, without access to the system, it’s hard to see whether they have changed much, but given this recent interview, I don’t think its technology has improved that much. It would be interesting to use the Cyc knowledge base as a benchmark and ground truth for building other machine learning systems. In other words, build machine learning models that train on data, then compare what each model has learned against the Cyc engine and use that comparison as a guide to improve its learning.

Given all that has changed in the AI landscape in the past 40 years, would it be possible to redo it as a Cyc 2.0? Deep learning and statistical methods have really been pushing the forefront of what AI technologies can do, but none of them understand anything about the actual world yet. That said, I think it’s possible to make a better and more usable knowledge representation graph. I would try to create a database like the one discussed here. I would make it usable from nearly day one, or at least make most of the benchmarks public so we can compare its performance to other systems. If it must be a self-sustaining business, make the technology available as some sort of freemium or trial software. I would build the software similar to spaCy, where they let users build directly on top of their software for free and then charge a premium for other features. With today’s automated data collection, I would imagine most of the facts that Cyc has can be crawled from the internet.

Building it faster

Given all the advancements in machine learning these days and the power of the internet, I think a next generation Cyc could probably be built much faster than the original one. Could you have algorithms automatically figure out the rules that were manually input? You would need to have a very realistic physics simulator.

[ grounding questions]