TL;DR:

This is the subject of my undergraduate honors thesis at the University of Utah, and I’m doing the bulk of the work this summer. For a TL;DERTTL;DR,[1] see my poster (pdf).

My research is on plan-based natural language (NL) processing and generation, as an alternative to machine learning methods. Plan-based NL methods treat speech as a sequence of actions that, when strung together, have some effect on the hearer. I’m taking a method for plan-based NL generation (writing sentences) and modifying it to work for NL parsing. I believe this has the potential to make better use of known information and context, and it won’t need large datasets. It’s intended for longer interactions, and it’s situated in a TRPG.

NL Now

Natural language (NL) is human-understandable text, and computers are getting better at both processing (understanding) and generating (writing) it. The best natural language processing is done with machine learning, and a fair amount of generation uses learned models. You’ve seen these: Google Translate uses machine learning. Many news stories are at least partially automatically generated. Anything you post online is likely processed by a learned model or five, seeking better ways to advertise to you.

Natural language processing still struggles on many fronts, however. Learned models often produce realistic-seeming output, but they have little to no deeper understanding. Common-sense understanding remains an issue: a great model might recognize that “Colorless green ideas sleep furiously” is statistically unlikely, but a subtler problem like answering “Can a hippo fit in a barrel?” still evades us. Long self-referential texts are also difficult, as are multiple cross-referencing documents. Some progress is being made here, but it’s hard for a learned algorithm to identify even nearby textual relationships. It’s even harder to handle situated language, with pointing and references to nearby objects or mutually known out-of-sight objects.

Another large issue is the lack of labeled data. Machine learning requires data—and lots of it. Data scarcity isn’t as big a problem in NL as it is in some other fields, but it remains an issue. The internet has a lot of text, but little of it is labeled, and much of it is only in English. Soon we’ll have done all that can be done with the datasets that are already labeled, like translation, captioned images, and chat. Specialized domains require specialized datasets, and non-English applications need non-English datasets. For example, making a smart building that understands maintenance requests would require labeled data about that particular building, its occupants, and language specific to repairs. (“My lights are flickering. Could you replace them with the lightbulbs Uutoni has?”) And if the building were in Namibia, all the data would need to be in Khoekhoe, one of Namibia’s national languages.

Advantages of Plan-Based NL

I’m working on a non-machine-learning method for natural language processing/generation because I believe it offers some solutions to the above problems. Planning doesn’t use statistics over big data; it uses an authored domain. That’s not to say it couldn’t use big datasets to build the domain, just that it can rely on hand-authorship instead.

I think planning could also help with commonsense understanding, because it integrates right into a system for modeling the user’s goals. A big part of commonsense is a mutual understanding of each other’s goals: when I say “take a seat”, I don’t mean just any seat, I mean a seat at my table, because we’re at a lunch meeting. Another big part of commonsense is mutual context—understanding which waiter “our waiter” is. Plan-based NL also has an advantage here, because it integrates right into a model of the world. In general, plan-based NL will integrate better into any symbolic logic, while learned approaches don’t integrate well with anything.[2]

Disadvantages

The natural language field moved to machine learning for good reason. Logic-based approaches can be brittle, slow, and require expensive experts. The plan-based NL I’m proposing wouldn’t work in a Siri-like role, because it wouldn’t be integrated into the goal and world models that give it its advantages. I’m developing this system in part for an Automated Game Master, which has those models in place. This method also requires handwriting a domain, which is a laborious process.[3] One of my goals with this research is to make domain authorship easier, but it will still require a human to sit down and think real hard. This is the tradeoff for not relying on data, and it won’t work for many applications. Because I’m handwriting them, these domains won’t be fully fleshed out—they’ll work for describing a limited game, but they won’t generalize well. If a player uses words not defined in our domain, the system probably won’t handle them gracefully.

This is basic research, and it won’t be great. Hopefully, it will work well enough at some tasks to warrant more research.

The Approach:

I’m basing my work on the plan-based language generation systems SPUD, CRISP, and mSCRISP. The CRISP[4] system models speech as the creation of syntax trees, like this one.

CRISP creates syntax trees by piecing together a bunch of incomplete trees, called LTAGs.[5] We start with the “shoot” tree, attach the “I” and “monster” trees, then attach the “the” tree to the NP (noun phrase) containing “monster”. The LTAGs below contain all the info needed to attach each one correctly to other trees.
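
To make the tree-combining step concrete, here’s a minimal Python sketch of substitution into open sites. This is not CRISP’s or XTAG’s actual data structure; the node labels and the tiny “shoot”/“I”/“monster” grammar are made up for illustration, and adjunction (how “the” attaches) is left out.

    class Node:
        def __init__(self, label, children=None, subst=False):
            self.label = label             # e.g. "S", "NP0", "V"
            self.children = children or []
            self.subst = subst             # True = an open substitution site (a "down arrow")

        def __repr__(self):
            if self.subst:
                return self.label + "↓"
            if not self.children:
                return self.label
            return "(" + self.label + " " + " ".join(map(repr, self.children)) + ")"

    def substitute(tree, label, subtree):
        """Plug `subtree` into the first open substitution site carrying `label`."""
        for i, child in enumerate(tree.children):
            if child.subst and child.label == label:
                tree.children[i] = subtree
                return True
            if substitute(child, label, subtree):
                return True
        return False

    # Elementary trees, loosely analogous to the "shoot", "I", and "monster" trees.
    shoot = Node("S", [Node("NP0", subst=True),
                       Node("VP", [Node("V", [Node("shoot")]),
                                   Node("NP1", subst=True)])])
    i_tree = Node("NP0", [Node("I")])
    monster = Node("NP1", [Node("N", [Node("monster")])])

    substitute(shoot, "NP0", i_tree)
    substitute(shoot, "NP1", monster)
    print(shoot)   # (S (NP0 I) (VP (V shoot) (NP1 (N monster))))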

This isn’t CRISP’s idea; it’s the idea behind the XTAG parser (the XTAG project created the database of all these LTAGs). Rather than parse sentences into syntax trees, like XTAG, CRISP generates entirely new sentences that communicate something. CRISP takes the LTAGs and attaches some meaning to them. (Like that the shooter will be specified by NP0, and the victim by NP1.) Mostly, CRISP has been used for referential expressions. If CRISP wants to specify rabbit_03, it will generate an expression that differentiates rabbit_03 from rabbit_02, rabbit_01, and farmer_joe. So, if rabbit_03 is the only blue rabbit, CRISP makes “the blue one”. (Or, if farmer_joe is also blue, “the blue rabbit”.)
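
Here’s a hedged sketch of that referring-expression idea in Python: keep adding properties that are true of the target until every other object has been ruled out. CRISP actually gets this behavior out of a planner rather than a hand-written loop, and the objects and properties below are invented.

    # Toy world: object -> properties that hold of it.
    world = {
        "rabbit_01": {"rabbit"},
        "rabbit_02": {"rabbit"},
        "rabbit_03": {"rabbit", "blue"},
        "farmer_joe": {"farmer", "blue"},
    }

    def describe(target, world):
        """Pick properties of `target` until no other object fits the description."""
        candidates = set(world)               # everything the hearer might think we mean
        chosen = []
        for prop in sorted(world[target]):    # properties true of the target
            narrowed = {o for o in candidates if prop in world[o]}
            if narrowed < candidates:         # the property rules something out: keep it
                chosen.append(prop)
                candidates = narrowed
            if candidates == {target}:
                break
        return chosen

    print(describe("rabbit_03", world))   # ['blue', 'rabbit'] -> "the blue rabbit"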

CRISP does all this using a specialized domain description and off-the-shelf planners, which is very neato. The domain’s actions describe the creation of each syntax node in an LTAG, the grammatical requirements (like the down arrows on the NPs in “shoot”), and how the meaning changes with each addition. The “shoot” action initializes a group of all objects that might be the shooter (NP0). When “I” is attached at NP0, that group is narrowed down to just the speaker.
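
As a rough illustration of what such a domain action looks like, here is a STRIPS-flavoured sketch in Python. CRISP’s real encoding is a planning domain with its own predicates; the names here, like open(NP0) and candidate(...), are mine, not CRISP’s.

    from dataclasses import dataclass, field

    @dataclass
    class Action:
        name: str
        preconditions: set = field(default_factory=set)
        add_effects: set = field(default_factory=set)
        del_effects: set = field(default_factory=set)

    # Adding the "shoot" tree opens two substitution sites and leaves NP0 unconstrained.
    add_shoot = Action(
        name="add-shoot",
        preconditions={"need(S)"},                     # we still owe the hearer a sentence
        add_effects={"open(NP0)", "open(NP1)",         # the grammatical obligations (down arrows)
                     "candidate(NP0, any_object)"},    # the shooter could still be anyone
        del_effects={"need(S)"},
    )

    # Attaching "I" at NP0 closes that site and narrows its meaning to the speaker.
    add_I = Action(
        name="add-I-at-NP0",
        preconditions={"open(NP0)"},
        add_effects={"candidate(NP0, speaker_only)"},
        del_effects={"open(NP0)", "candidate(NP0, any_object)"},
    )

    def apply(state, action):
        assert action.preconditions <= state, action.name + " is not applicable"
        return (state - action.del_effects) | action.add_effects

    state = {"need(S)"}
    state = apply(state, add_shoot)
    state = apply(state, add_I)
    print(sorted(state))   # NP0 resolved to the speaker, NP1 still open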

My Additions: Plan Recognition

The primary thing I’m doing is using CRISP’s specialized domains to also process language, not just generate it. Given a statement like “the blue rabbit”, I translate it into the observed speech actions (those incomplete trees, one for each word) and run them through a standard plan recognition algorithm. The plan recognizer pieces the actions together in the correct order and tells us what the speaker’s intended meaning most likely was. (To denote rabbit_03.)
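
A toy sketch of that recognition direction, under the simplifying assumption that each content word maps to a single property: map observed words to the constraints they contribute and keep only the referents consistent with all of them. Real plan recognition over the CRISP domain is more involved than this filter.

    # Same toy world as in the generation sketch above.
    world = {
        "rabbit_01": {"rabbit"},
        "rabbit_02": {"rabbit"},
        "rabbit_03": {"rabbit", "blue"},
        "farmer_joe": {"farmer", "blue"},
    }

    # Each content word corresponds to a speech action that narrows the referent set.
    lexicon = {"blue": "blue", "rabbit": "rabbit", "farmer": "farmer"}

    def recognize(utterance, world):
        candidates = set(world)
        for word in utterance.split():
            if word in lexicon:               # "the" contributes no property in this toy version
                prop = lexicon[word]
                candidates = {o for o in candidates if prop in world[o]}
        return candidates

    print(recognize("the blue rabbit", world))   # {'rabbit_03'}
    print(recognize("the blue", world))          # {'rabbit_03', 'farmer_joe'}: still ambiguous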

This isn’t that useful alone, except perhaps to fill in missing words. (“the blue” becomes “the blue rabbit” or “the blue farmer” with equal likelihood.) With a complete statement, however, we could just figure out the effect of these actions in sequence. Plan recognition becomes useful when it has access to more observations about the speaker—what they’ve done in the past. With this history, the algorithm can better judge what the speaker’s goals are.
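
For a sense of why history helps, here is a deliberately simple sketch: score each candidate goal by how many of the speaker’s observed actions plausibly serve it. The goals, actions, and “supports” table are invented; a real recognizer would reason over the planning domain instead of a lookup table.

    # Which goals does each observed action plausibly serve? (Invented table.)
    supports = {
        "get_rabbit_meat": {"sharpen_knife", "track_rabbit", "say: the blue rabbit"},
        "befriend_farmer": {"wave", "say: hello", "say: the blue farmer"},
    }

    def rank_goals(observations):
        scores = {
            goal: sum(1 for obs in observations if obs in actions)
            for goal, actions in supports.items()
        }
        return sorted(scores.items(), key=lambda kv: -kv[1])

    history = ["sharpen_knife", "track_rabbit", "say: the blue rabbit"]
    print(rank_goals(history))   # get_rabbit_meat comes out on top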

Plan recognition for language, then, works best over long interactions where the human works towards goals (i.e., isn’t aimlessly exploring). That fits a lot of scenarios—cooperative human-robot teams, long-term digital assistants, tutoring AIs, etc. In particular, it fits a tabletop role-playing game.

My Additions: Game Incorporation

In the context of a game, if a player has rock_07 in their possession, “I throw the rock at the rabbit” probably indicates they’re throwing their rock, not rock_02, which the player doesn’t even know about. The game[6] I’m adopting (GME) already uses a planning domain to model all this—why not make that accessible to the plan recognition algorithm? Suddenly the plan recognition algorithm isn’t just trying to recognize the sentence’s meaning—it’s recognizing the action the player is attempting. It’s recognizing the player’s goal: injuring rabbit_03, or perhaps obtaining rabbit_03’s meat.
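
A small sketch of how game state could do that disambiguation: prefer referents the player possesses, then referents they at least know about. The predicates and object names below are illustrative, not GME’s actual representation.

    # A few facts from the game state (invented, not GME's actual representation).
    game_state = {
        ("has", "player_01", "rock_07"),
        ("knows_about", "player_01", "rock_07"),
        ("knows_about", "player_01", "rabbit_03"),
        ("at", "rock_02", "cave"),   # rock_02 exists, but the player has never seen it
    }

    def resolve(candidates, player, state):
        """Prefer objects the player holds, then objects the player knows about."""
        known = [c for c in candidates if ("knows_about", player, c) in state]
        held = [c for c in known if ("has", player, c) in state]
        return held or known or candidates

    rocks = ["rock_02", "rock_07"]
    print(resolve(rocks, "player_01", game_state))   # ['rock_07']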

I outline on another page why I think tabletop role-playing games (TRPGs) are specifically suited to this research. What I’m making is functionally an automatically generated text adventure with natural language support, but I like to think of it as a text-only, two-player TRPG.[7] It’s an easy way to get humans to interact with the language system, and an easy, simplified model of real-life interactions. It introduces the problem of reference resolution while framing it in a context with larger goals for the plan recognizer to work with. Also, creating an AI Dungeon Master is a Cool™ thing to say you research.

To modify CRISP’s approach to work with goal recognition, I need to integrate its language domain with the domain of a game, so that describing an in-game action has the same effect as performing that action. If the goal recognizer can’t see a way for the LTAGs in “shoot the goblin” to have any effect on goblins, it can’t differentiate between goblin_01 (who is a friend) and goblin_04 (who just attacked). So I’m working on how best to integrate the two domains. My idea, for now, is to include a ‘compile’ action, which translates the language-y effects of “I shoot the goblin” into the in-game effects of shoot(player_01, goblin_04).
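
Here’s a very rough sketch of that ‘compile’ idea: once the language-side plan has pinned down a verb and its arguments, emit the matching in-game action instance. The resolved description and the set of available game actions below are invented for illustration.

    # Invented structures: a fully-resolved description and the game's action names.
    described = {"verb": "shoot", "agent": "player_01", "patient": "goblin_04"}
    game_actions = {"shoot", "throw", "take"}

    def compile_utterance(described, game_actions):
        """Turn a resolved description into an in-game action instance, if one exists."""
        verb = described["verb"]
        if verb not in game_actions:
            return None                      # the game domain has no matching action
        return (verb, described["agent"], described["patient"])

    print(compile_utterance(described, game_actions))
    # ('shoot', 'player_01', 'goblin_04'), i.e. shoot(player_01, goblin_04)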

My Additions: Authoring A Domain

Authoring the domain I describe is… Hard. Very hard, I’m learning. If plan-based NL is to be a viable low-data alternative to learned models, authoring the domain must be cheap. Theoretically, the only prerequisite knowledge needed to make a language domain is a good grasp of English[8] and its parts of speech, plus familiarity with the application the language support is for. Practically, what’s needed is all that plus familiarity with XML and CRISP’s formatting, the technical skills to get XTAG even compiled,[9] and a lot of grit to copy XTAG’s LTAGs over into XML for CRISP. I hope that as I struggle with this myself, I can write a more efficient tool for writing domains. This isn’t the main goal, but I’d like to prove that plan-based language doesn’t need expensive experts to work.
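
As one example of the kind of tooling I have in mind, here is a hypothetical helper that stamps out skeleton domain actions from a small lexicon instead of hand-writing each one. Nothing here matches CRISP’s real XML format; it just shows where automation could take over the repetitive parts.

    def skeleton_action(word, category):
        """Stamp out a bare-bones domain action for one lexical item (hypothetical format)."""
        return {
            "name": "add-" + word,
            "tree_family": category,                      # which XTAG tree family to instantiate
            "preconditions": ["open(" + category + ")"],
            "effects": ["TODO: semantics of '" + word + "'"],
        }

    lexicon = [("rabbit", "NP"), ("blue", "Adj"), ("shoot", "S")]
    for word, cat in lexicon:
        print(skeleton_action(word, cat))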

Steps:

  • How does plan recognition work on CRISP language domains, without integration into a game?
    • How does it do recognizing the goals of fully specified language plans?
    • How does it do on full language plans that are out of order?
    • How does it do on language plans with missing words?
    • How does it do at differentiating between two possible meanings of words?
  • How do we integrate a CRISP-style language domain into a plan-based game’s domain?
  • How do these integrated domains perform for generation and parsing?
    • For any integrated domain, how expressive is it? Can a planner describe every action sufficiently?
    • How does a plan recognizer do on these integrated domains, compared to a CRISP-only domain, without gameplay?
    • How does extended gameplay affect results?
  • How do we define integrated domains easily?
    • Specialized tools?
    • Learned? Partially learned, partially authored?

Progress:

This research will form my undergraduate honors thesis at the University of Utah. It’s in progress; I’ll update this page as the work develops. See also my weekly posts.

6/3/19 — Initial model (as described above) theorized. Tools working to author an experimental domain.

Notes

1. too long; didn’t even read the TL;DR
2. That is, learned models are usually piles of linear algebra that don’t hook well into anything but what they’re trained for. Models trained together, or models trained to operate on symbols, work okay.
3. I’m hoping the domain might eventually be learnable, but that’s a long way off.
4. I use CRISP to refer to the whole progression of these systems, which operate off the same basic approach.
5. The parts of speech in the full tree don’t match those in the LTAGs because the full tree was generated with the Stanford Parser, not XTAG.
6. game engine, technically
7. TRPGs are usually played aloud with 4-8 people.
8. So far as I can tell, English is the only language with a database of LTAGs. This is unfortunate, but not unexpected.
9. I’m not bitter about how XTAG is decades old. Not at all.