978-1-4244-7903-0/10/$26.00 ©2010 IEEE
SLT 2010
WHAT IS LEFT TO BE UNDERSTOOD IN ATIS?

Gokhan Tur    Dilek Hakkani-Tür    Larry Heck

Speech at Microsoft | Microsoft Research
Mountain View, CA, 94041
gokhan.tur@ieee.org    dilek@ieee.org    larry.heck@microsoft.com

ABSTRACT

One of the main data resources used in many studies over the past two decades for spoken language understanding (SLU) research in spoken dialog systems is the airline travel information system (ATIS) corpus. Two primary tasks in SLU are intent determination (ID) and slot filling (SF). Recent studies reported error rates below 5% for both of these tasks employing discriminative machine learning techniques with the ATIS test set. While these low error rates may suggest that this task is close to being solved, further analysis reveals the continued utility of ATIS as a research corpus. In this paper, our goal is not experimenting with domain specific techniques or features which can help with the remaining SLU errors, but instead exploring methods to realize this utility via extensive error analysis. We conclude that even with such low error rates, the ATIS test set still includes many unseen example categories and sequences, and hence requires more data. Better yet, larger annotated data sets from more complex tasks with realistic utterances can avoid over-tuning in terms of modeling and feature design. We believe that advancements in SLU can be achieved by having more naturally spoken data sets and employing more linguistically motivated features, while preserving robustness to speech recognition noise and the variance of natural language.

Index Terms: spoken language understanding, ATIS, discriminative training

1. INTRODUCTION

Spoken language understanding (SLU) aims to extract the meaning of speech utterances. While understanding language is still considered an unsolved problem, in the last decade a variety of practical goal-oriented conversational understanding systems have been built for limited domains. These systems aim to automatically identify the intent of the user as expressed in natural language, extract associated arguments or slots, and take actions accordingly to satisfy the user's requests. In such systems, the speaker's utterance is typically recognized using an automatic speech recognizer (ASR). Then the intent of the speaker is identified from the recognized word sequence using an SLU component. Finally, a dialog or task manager (DM) interacts with the user (not necessarily in natural language) and helps the user achieve the task that the system is designed to support.

In the early 90s, DARPA (Defense Advanced Research Projects Agency) initiated the Airline Travel Information System (ATIS) project. The ATIS task consisted of spoken queries on flight-related information. An example utterance is "I want to fly to Boston from New York next week." Understanding was reduced to the problem of extracting task-specific arguments, such as Destination and Departure Date. Participating systems employed either a data-driven statistical approach [1, 2] or a knowledge-based approach [3, 4, 5].

Almost simultaneously with the semantic frame filling-based SLU approaches, a new task emerged, motivated by the success of the early commercial interactive voice response (IVR) applications used in call centers. SLU was framed as classifying users' utterances into predefined categories (called intents or call-types) [6]. The biggest difference between the call classification systems and semantic frame filling systems is that the former does not explicitly seek to determine the arguments provided by the user. The main goal is routing the call to an appropriate call center department; the arguments provided by the user are important only in the sense that they help make the right classification. While this has been a totally different perspective on the task of SLU, it was actually complementary to template filling in that each call-type can be viewed as a template to be filled. For example, in the case of the DARPA ATIS project, while the primary intent (or goal) was Flight, users also asked about many other things such as Ground transportation or Airplane specifications. The program also defined specialized templates for these less frequent intents. This led to a seamless integration of intent determination (ID) and slot filling (SF) based SLU approaches. This integrated approach actually yielded improved end-to-end automation rates as compared to the previous decoupled and sequential approaches. For example, Jeong et al. [7] proposed to model these two systems jointly using a triangular chain conditional random field (CRF).

In this paper, rather than focus on specific techniques or features to improve ID and SF accuracy, our goal is to assess the continued utility of the ATIS corpus given the two decades of research it has supported. In the next section, we briefly describe the ATIS corpus and then discuss the evaluation metrics for ID and SF. In Section 4, we present the state-of-the-art discriminative training efforts for both ID and SF for the task of ATIS. Finally, in Sections 5 and 6 we present our detailed analyses of the errors we have seen using ID and SF models, respectively, with performance comparable to that reported in the literature. We will show that, by categorizing the erroneous cases that remain after N-fold cross validation experiments, ATIS is still useful and suggests future research directions in SLU.

2. AIRLINE TRAVEL INFORMATION (ATIS) CORPUS

An important by-product of the DARPA ATIS project was the ATIS corpus. This corpus is the most commonly used data set for SLU research [8]. The corpus has seventeen different intents, such as Flight or Aircraft capacity. The prior distribution is, however, heavily skewed, and the most frequent intent, Flight, represents about 70% of the traffic. Table 1 shows the frequency of the intents in this corpus for the training and test sets.

Intent           Training Set   Test Set
Abbreviation     2.4%           3.6%
Aircraft         1.6%           0.9%
Airfare          9.0%           5.8%
Airline          3.4%           4.3%
Airport          0.5%           2.0%
Capacity         0.4%           2.4%
City             0.3%           0.6%
Day Name         0.1%           0.1%
Distance         0.4%           1.1%
Flight           73.1%          71.6%
Flight No        0.3%           1.0%
Flight Time      1.2%           0.1%
Ground Fare      0.4%           0.8%
Ground Service   5.5%           4.0%
Meal             0.1%           0.6%
Quantity         1.1%           0.9%
Restriction      0.3%           0.1%

Table 1. The frequency of intents for the training and test sets.

Utterance: How much is the cheapest flight from Boston to New York tomorrow morning?
Goal: Airfare
  Cost Relative: cheapest
  Depart City: Boston
  Arrival City: New York
  Depart Date.Relative: tomorrow
  Depart Time.Period: morning

Table 2. An example utterance from the ATIS data set.

In this paper, we use the ATIS corpus as used in He and Young [9] and Raymond and Riccardi [10]. The training set contains 4,978 utterances selected from the Class A (context independent) training data in the ATIS-2 and ATIS-3 corpora, while the test set contains 893 utterances from the ATIS-3 Nov93 and Dec94 data sets. Each utterance has its named entities marked via table lookup, including domain specific entities such as city, airline, and airport names, and dates. The ATIS utterances are represented using semantic frames, where each sentence has a goal or goals (a.k.a. intent) and slots filled with phrases. The values of the slots are not normalized or interpreted. An example utterance with annotations is shown in Table 2.

3. EVALUATION METRICS

The most commonly used metrics for ID and SF are class (or slot) error rate (ER) and F-Measure. The simpler metric, ER for ID, can be computed as:

    ER_ID = (# misclassified utterances) / (# utterances)

Note that one utterance can have more than one intent. A typical example is "Can you tell me my balance? I need to make a transfer." In most cases, where the second intent is generic (a greeting, small talk with the human agent) or vague, it is ignored. If none of the true classes is selected, the utterance is counted as a misclassification.

For SF, the error rate can be computed in two ways. The more common metric is the F-measure using the slots as units. This metric is similar to what is used for other sequence classification tasks in the natural language processing community, such as parsing and named entity extraction. In this technique, usually the IOB schema is adopted, where each word is tagged with its position in the slot: beginning (B), in (I), or other (O). Then, recall and precision values are computed over the slots. A slot is considered to be correct if both its range and type are correct. The F-Measure is defined as the harmonic mean of recall and precision:

    F-Measure = (2 x Recall x Precision) / (Recall + Precision)

where

    Recall = (# correct slots found) / (# true slots)
    Precision = (# correct slots found) / (# found slots)

4. BACKGROUND ON USING DISCRIMINATIVE CLASSIFIERS FOR SLU

With advances in machine learning over the last decade, especially in discriminative classification techniques, researchers have framed the ID problem as a sample classification task and SF as a sequence classification task. Typically, word n-grams are used as features after preprocessing with generic entities, such as dates, locations, or phone numbers. Because of the very large dimension of the input space, large margin classifiers such as SVMs [11] or AdaBoost [12] were found to be very good candidates for ID, and CRFs [13] for SF. To take context into account, the recent trend is to match n-grams (substrings of n words) rather than single words.

Data-driven approaches have proven very well suited for processing spontaneous spoken utterances: they are typically more robust to sentences that are not well-formed grammatically, which occur frequently in spontaneous speech. Even in broadcast conversations, where participants are very well trained and prepared, a large percentage of the utterances have disfluencies: repetitions, false starts, and filler words (e.g., uh) [14]. Furthermore, speech recognition introduces significant "noise" to the SLU component, caused by background noise, mismatched domains, incorrect recognition of proper names (such as city or person names), and reduced accuracy due to sub-realtime processing requirements. A typical call routing system operates at around 20%-30% word error rate; one out of every three to five words is wrong [15]. Given that the researchers in that study also determined that one third of the ID errors are due to speech recognition noise, robust methods for spontaneous speech recognition are critically important for successful ID and SF in SLU systems. To this end, researchers have proposed many methods, ranging from N-best rescoring to exploiting word confusion networks and leveraging dialog context as prior knowledge (e.g., [15]).

4.1. Intent Determination

For ID, early work with discriminative classification algorithms was completed on the AT&T HMIHY system [6] using the Boostexter tool, an implementation of the AdaBoost.MH multiclass, multilabel classification algorithm [12]. Hakkani-Tür et al. extended this work by using a lattice of syntactic and semantic features [16]. Discriminative call classification systems employing large margin classifiers (e.g., support vector machines) include work by Haffner et al. [17], who proposed a global optimization process based on an optimal channel communication model that allowed a combination of heterogeneous binary classifiers. This approach significantly decreased the call-type classification error rate for AT&T's HMIHY natural dialog system, especially the false rejection rates.

Table 3. The confusion matrix for intent determination (rows: correct intents, Abbreviation through Restriction; columns: estimated intents).

Other work by Kuo and Lee [18] at Bell Labs proposed the use of discriminative training on the routing matrix, significantly improving their vector-based call routing system [19] for low rejection rates. Their approach is based on using the minimum classification error (MCE) criterion. Later they extended this approach to include Boosting and automatic relevance feedback (ARF) [20]. Cox [21] proposed the use of generalized probabilistic descent (GPD), corrective training (CT), and linear discriminant analysis (LDA). Finally, Chelba et al. proposed using Maximum Entropy models for ID, and compared the performance with a Naive Bayes approach on the ATIS corpus. The discriminative method resulted in half the classification error rate compared to Naive Bayes on this highly skewed data set. They reported about 4.8% top class error rate, using slightly different training and test corpora than the ones used in this paper.

4.2. Slot Filling

For SF, the ATIS corpus has been extensively studied from the early days of the DARPA ATIS project. However, the use of discriminative classification algorithms is more recent. Some notable studies include the following:

Wang and Acero [22] compared the use of CRF, perceptron, large margin, and MCE using stochastic gradient descent (SGD) for SF in the ATIS domain. They obtained significantly reduced slot error rates, with the best performance achieved by CRF (though it was the slowest to train).

Almost simultaneously, Jeong and Lee [7] proposed the use of CRF extended by non-local features, which are important to disambiguate the type of a slot. For example, a day can be the arrival day, the departure day, or the return day. If the contextual cues disambiguating them are beyond the immediate context, it is not easy for the classifier to choose the correct class. Using non-local trigger features automatically extracted from the training data is shown to improve the performance significantly.

Finally, Raymond and Riccardi [10] compared SVM and CRF with generative models for the ATIS task. They concluded that discriminative methods perform significantly better, and furthermore, that it is possible to incorporate a-priori information or long distance features easily. For example, they added features such as "Does this utterance have the verb arrive?" This resulted in about 10% relative reduction in slot error rate. The design of such features usually requires domain knowledge.

5. ANALYSIS OF INTENT DETERMINATION IN ATIS

In this section, our goal is to analyze the errors of a state-of-the-art ID system for the ATIS domain, cluster the errors, and then categorize the error types. These categories of error types will suggest potential areas of research that could yield improved accuracy. All experiments and analyses are performed using manual transcriptions of the training and test sets to isolate the study from noise introduced by the speech recognizer.

5.1. Discriminative Training and Experiments

For the following experiments, we used the ATIS corpus as described previously in Section 2. Since the superior performance of discriminative training algorithms has been shown by earlier work, we employed the AdaBoost.MH algorithm in this study. We used only word n-grams as features. We have not optimized Boosting parameters on a tuning set, nor learned weak classifiers. The data is normalized to lowercase, but no stemming or stopword removal has been performed.

The ATIS test set was classified according to the classes defined in Table 1. The ID error rate we obtained was 4.5%, which is comparable to (and actually lower than) what has been reported in the literature.

5.2. Analysis of Intent Determination Errors

Next, we checked the ID errors with three training and test setups:

1. AllTrain: uses all ATIS training data to train the model, and errors are computed on the ATIS test set. In total, this model erroneously classified only 40 utterances (an error rate of 4.5%). The intent confusion matrix for these errors is provided in Table 3.
2. 25%Train: uses 25% of the training examples in the ATIS training set, and errors are computed on the ATIS test set. In total, this model erroneously classified 65 utterances (an error rate of 7.3%).

3. N-fold: uses all examples for both testing and training in 10-fold cross validation experiments. In total, this model erroneously classified 162 utterances (an error rate of 3.0%).

As seen in Table 3, the problem is mostly non-Flight utterances erroneously classified as Flight. While one cause of these errors is the unbalanced intent distribution, we have manually checked each error and clustered them into 6 categories:

1. Prepositional phrases embedded in noun phrases: These errors involve phrases such as "Capacity of the flight from Boston to Orlando", where the prepositional phrase suggests flight information, whereas the destination category is mainly determined by the head word of the noun phrase (capacity in this case). Since the classifier has no syntactic features, such sentences are usually classified erroneously. Using features from a syntactic parser can alleviate this problem.

2. Wrong functional arguments of utterances: This category is similar to the first, but instead of a prepositional phrase, the confused phrase is a semantic argument of the utterance. Consider the example utterance "What day of the week does the flight from Boston to Orlando fly?" These errors can be solved by using either a syntactic parser that identifies functions of phrases or a semantic role labeler.

3. Annotation errors: These are utterances that were assigned the wrong category during manual annotation.

4. Utterances with multiple sentences: These are utterances with more than one sentence. In such cases, the intent is usually in the last sentence, whereas the classification output is biased by the other sentence.

5. Other: These include several infrequent error types such as ambiguous utterances, ill-formulated queries, and preprocessing/tokenization issues:

   - Ambiguous utterances: These errors involve utterances where the destination category is not clear from the utterance. An example from the ATIS test set is "list Los Angeles". In this utterance, the speaker intent could either be to find cities that have flights from Los Angeles, or flights to Los Angeles.

   - Ill-formulated queries: These are utterances which include a phrase that may mislead the classification or understanding. An example from the ATIS test set is "What's the airfare for a taxi to the Denver airport?" In this case, the word airfare implies a destination category of Airfare, whereas what is meant is Ground transportation fare. These types of errors are easier for humans to handle, but it is not presently clear how they can be resolved in automatic processing.

   - Preprocessing/tokenization issues: These are errors that could be resolved by using a domain ontology or special pre-processing or tokenization related to the domain. Some domain specific abbreviations and restriction codes are examples of this category.

6. Difficult cases: These are utterances that include words or phrases that were previously unseen in the training data. For the example utterance "Are snacks served on Tower Air?", none of the content words and phrases appear with the Meal category in the training data.

Error Type   AllTrain   25%Train   10-Fold
1            42.5%      33.8%      24.5%
2            22.5%      13.8%      30.0%
3            2.5%       6.1%       18.4%
4            0%         0%         8.0%
5            17.5%      12.5%      7.2%
6            15.0%      33.8%      11.7%

Table 4. The distribution of error categories for ID using all and 25% of the training data, and using all the training and test data with 10-fold cross validation.

Table 4 presents the frequency of each of these errors for the three experiments. As seen, categories 1 and 2 constitute a majority of the errors. Both of these categories can be resolved using a syntactic parser with function tags. However, note that the ATIS corpus is highly artificial, and utterances are mostly grammatical and without disfluencies. Furthermore, when working with ASR output, utterances may include recognition errors. In a more realistic scenario, one might consider shallow parsing or syntactic and semantic graphs [16] for extracting richer and linguistically-motivated features that could resolve such cases.

Fig. 1. Learning curve for intent determination using the training data with the original order and the average of 8 shuffled orders.

Figure 1 shows the error rate on the ATIS test set when varying training set sizes are used. When manually examining the test set, we found clusters of similar utterances occurring one after the other (probably uttered by the same user). To eliminate the bias from the data collection order, we also estimated the error with a random ordering of the training set, and averaged the error rates over 8 such experiments. As can be seen from this plot, the error rate keeps shrinking as more data is added, suggesting that more training data would be beneficial.
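The shuffled learning-curve procedure described above can be sketched as follows. This is a minimal stdlib-only sketch, not the original experimental code: `train_fn` and `eval_fn` are hypothetical stand-ins for the actual AdaBoost.MH training and test-set scoring, and the toy majority-class classifier below only illustrates the plumbing.

```python
import random

def learning_curve(train, test, train_fn, eval_fn, sizes, n_shuffles=8, seed=0):
    """For each training-set size, average the test error rate over several
    random orderings of the training data, as in the ID experiments above."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        errors = []
        for _ in range(n_shuffles):
            shuffled = train[:]
            rng.shuffle(shuffled)
            model = train_fn(shuffled[:size])     # train on the first `size` utterances
            errors.append(eval_fn(model, test))   # error rate on the fixed test set
        curve[size] = sum(errors) / n_shuffles
    return curve

# Toy stand-ins over (utterance, intent) pairs: a majority-class "classifier".
def train_majority(data):
    intents = [intent for _, intent in data]
    return max(set(intents), key=intents.count)

def eval_error(model, test):
    return sum(1 for _, intent in test if intent != model) / len(test)
```

On a distribution as skewed as ATIS's (roughly 70% Flight in the test set), even the majority-class baseline above already reaches an error rate of just under 30%, which puts the reported 4.5% in perspective.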
6. ANALYSIS OF SLOT FILLING IN ATIS

In this section, our goal is similar to the ID analysis: analyze the results of a state-of-the-art SF system for the ATIS domain and cluster the errors into categories.

6.1. Discriminative Training and Experiments

Following methods described in the literature, we employed linear chain CRFs to model the slots in the ATIS domain. We used only word n-gram features and did not use a development set to tune parameters. The ATIS test set was then classified using the trained model. We converted the data sets into the IOB format so that we have only one word per sample to classify. Using the CoNLL evaluation script (http://www.cnts.ua.ac.be/conll2000/chunking/output.html), the SF F-Measure we obtained was 93.2% with the IOB representation (94.7% using the representation used by [10], who reported 95.0%), which is comparable to what has been reported in the literature.

6.2. Analysis of Slot Filling Errors

Analyzing the SF decisions, the model found 2,614 of the 2,837 slots in the 9,164-word input with the correct type and span. We manually checked each of the 223 erroneous cases and clustered them into 8 categories:

1. Long distance dependencies: These are slots where the disambiguating tokens are outside the current n-gram context. For example, in the utterance "Find flights to New York arriving in no later than next Saturday", a 6-gram context is required to resolve that Saturday is the arrival date. This category was previously addressed in the literature: for example, Raymond and Riccardi [10] extracted features using manually-designed patterns, and Jeong and Lee [7] used trigger patterns to cover these cases.

2. Partially correct slot value annotations: These are slots assigned a category that is partially correct; either the category or the sub-category matches the manual annotation. For example, the word tomorrow can be either a Depart Date.Relative or an Arrive Date.Relative for the utterance "flights arriving in Boston tomorrow". Note that these can overlap with other error types.

3. Previously unseen sequences: While this category requires further analysis, the most common reason is the mismatch between the training and test sets. For example, meal related slots are missed by the model (8.0% of all errors) because there are no similar cases in the training set. This is also the case for aircraft models (10.0%), traveling to states instead of cities (3.3%), etc.

4. Annotation errors: These are slots that were assigned the wrong category during manual annotation.

5. Other: These include several infrequent error types such as ambiguous utterances, ill-formulated queries, and preprocessing/tokenization issues:

   - Ill-formulated queries: These errors usually involve an ungrammatical phrase that may mislead the interpretation of the slot value, or there is insufficient context to disambiguate the value of the slot. For example, in the utterance "Find a flight from Memphis to Tacoma dinner", it is not clear whether the word dinner refers to the description of the flight meal.

   - Ambiguous utterances: These are utterances where the slot category is not explicit given the utterance. For example, in the utterance "I would like to have the airline that flies Toronto, Detroit and Orlando", it is not clear whether the speaker is searching for airlines that have flights from Toronto to Detroit and Orlando, or from some other location to Toronto, Detroit, and Orlando.

   - Preprocessing/tokenization issues: These are errors that could be resolved using a domain ontology or special pre-processing or tokenization related to the domain. For example, in the utterance "What airline is AS", it would be helpful to know that AS is a domain specific abbreviation.

   - Ambiguous part-of-speech tag-related errors: These are errors that could be resolved if the part-of-speech tags were resolved. For example, the word arriving can be a verb or an adjective, as in the utterance "I want to find the earliest arriving flight to Boston". In this case, the slot category for the words earliest arriving is Flight-Mod, but since the word arriving is very frequently seen as a verb in this corpus, it is assigned no slot category.

Error Type   Percentage
1            26.9%
2            42.4%
3            57.6%
4            8.4%
5            6.7%

Table 5. The distribution of the types of errors in the ATIS test set. Note that these do not sum to 100%, as some errors include multiple types.

Table 5 lists the frequency of each of these errors. Categories 1, 2, and 3 constitute the vast majority of the errors, and each can be attacked with a different strategy. Category 1 utterances are the easiest to resolve, using richer feature sets during discriminative training; using a-priori information may also help when available. Also, discovering linguistically motivated long distance patterns is a promising research direction. Category 2 errors happen mainly due to the nuance between the arrive and depart concepts (23.1% of all errors), which are very hard to distinguish in some cases, as in the example above. Category 3 utterances simply require a better training set, or human intervention in the form of manual patterns, as they are underrepresented or missing in the training data.

7. DISCUSSION AND CONCLUSIONS

Leveraging recent improvements in machine learning and spoken language processing, the performance of SLU systems for the ATIS domain has improved dramatically, and an error rate of around 5% for the SLU task might suggest a solved problem. It is clear, however, that the problem of SLU is far from being solved, especially for more realistic, naturally-spoken utterances from a variety of speakers performing tasks more complex than simple flight information requests. New data sets from such tasks can avoid over-tuning to one particular data set in terms of modeling and feature design.
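The slot F-measure underlying the results above (Section 3) counts a slot as correct only when both its span and its type match the reference. A minimal pure-Python scorer over IOB tag sequences makes the computation concrete; this is a simplified sketch of the metric, not the official CoNLL evaluation script (in particular, it does not handle every IOB edge case), and the tag names in the example are hypothetical:

```python
def extract_slots(tags):
    """Return the set of (start, end, type) slot spans in an IOB tag sequence."""
    slots, start, slot_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):         # sentinel flushes the last slot
        if tag.startswith("B-") or tag == "O":
            if start is not None:                  # close the open slot, if any
                slots.add((start, i - 1, slot_type))
                start, slot_type = None, None
            if tag.startswith("B-"):               # open a new slot
                start, slot_type = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, slot_type = i, tag[2:]          # tolerate I- without a B-
    return slots

def slot_f_measure(reference, hypothesis):
    """Harmonic mean of recall and precision over slots; a slot is correct
    only if both its range and its type match the reference annotation."""
    true_slots = extract_slots(reference)
    found_slots = extract_slots(hypothesis)
    correct = len(true_slots & found_slots)
    recall = correct / len(true_slots) if true_slots else 0.0
    precision = correct / len(found_slots) if found_slots else 0.0
    if recall + precision == 0.0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```

With the counts reported in Section 6.2 (2,614 correct of 2,837 true slots), recall alone is about 92.1%; combined with precision over the found slots, this is consistent with the 93.2% F-measure reported above.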
The recent French Media corpus [23] offers a step towards this goal: it has three times more data and a greater than 10% concept error rate for SF. However, the data was not collected from an operational system; instead, it was collected using a wizard of Oz setup with selected volunteers. Another effort is the Let's Go dialog system used by real users of the Pittsburgh bus transportation system [24]; however, SLU annotations for it are not yet available.

Even with such low error rates, the ATIS test set includes many example categories and sequences unseen in the training data, and the error rates have not converged yet. In that respect, more data from just the ATIS domain may be useful for SLU research.

The error analysis on the ATIS domain shows the primary weaknesses of the current n-gram-based modeling approaches: the local context overrides the global, the model has no domain knowledge with which to make inferences, and it tries to fit any utterance to some known sample, hence it is not really robust to out-of-domain utterances. This was also observed by Raymond and Riccardi [10], where the CRF model fits 100% of the training data. One possible research direction consists of employing longer distance, syntactically or semantically motivated features, while preserving the robustness of the system to the noise introduced by the speech recognizer and the variance due to natural language.

A lesser studied part of the ATIS corpus, the Class D utterances (contextual queries), is another significant portion of this corpus waiting to be understood. While most systems have treated understanding in context with handcrafted rules (e.g., [4]), to the best of our knowledge the only study towards building a statistical discourse model is that of Miller et al. [25].

8. ACKNOWLEDGMENTS

We would like to thank Christian Raymond and Giuseppe Riccardi for sharing the ATIS data, revised for annotation inconsistencies and mistakes.

9. REFERENCES

[1] R. Pieraccini, E. Tzoukermann, Z. Gorelov, J.-L. Gauvain, E. Levin, C.-H. Lee, and J. G. Wilpon, "A speech understanding system based on statistical representation of semantics," in Proceedings of the ICASSP, San Francisco, CA, March 1992.
[2] S. Miller, R. Bobrow, R. Ingria, and R. Schwartz, "Hidden understanding models of natural language," in Proceedings of the ACL, Las Cruces, NM, June 1994.
[3] W. Ward and S. Issar, "Recent improvements in the CMU spoken language understanding system," in Proceedings of the ARPA HLT Workshop, March 1994, pp. 213-216.
[4] S. Seneff, "TINA: A natural language system for spoken language applications," Computational Linguistics, vol. 18, no. 1, pp. 61-86, 1992.
[5] J. Dowding, J. M. Gawron, D. Appelt, J. Bear, L. Cherny, R. Moore, and D. Moran, "Gemini: A natural language system for spoken language understanding," in Proceedings of the ARPA Workshop on Human Language Technology, Princeton, NJ, March 1993.
[6] A. L. Gorin, G. Riccardi, and J. H. Wright, "How May I Help You?," Speech Communication, vol. 23, pp. 113-127, 1997.
[7] M. Jeong and G. G. Lee, "Exploiting non-local features for spoken language understanding," in Proceedings of the ACL/COLING, Sydney, Australia, July 2006.
[8] P. J. Price, "Evaluation of spoken language systems: The ATIS domain," in Proceedings of the DARPA Workshop on Speech and Natural Language, Hidden Valley, PA, June 1990.
[9] Y. He and S. Young, "A data-driven spoken language understanding system," in Proceedings of the IEEE ASRU Workshop, U.S. Virgin Islands, December 2003, pp. 583-588.
[10] C. Raymond and G. Riccardi, "Generative and discriminative algorithms for spoken language understanding," in Proceedings of the Interspeech, Antwerp, Belgium, 2007.
[11] V. N. Vapnik, Statistical Learning Theory, John Wiley and Sons, New York, NY, 1998.
[12] R. E. Schapire and Y. Singer, "Boostexter: A boosting-based system for text categorization," Machine Learning, vol. 39, no. 2/3, pp. 135-168, 2000.
[13] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of the ICML, Williamstown, MA, 2001.
[14] A. Stolcke and E. Shriberg, "Statistical language modeling for speech disfluencies," in Proceedings of the ICASSP, Atlanta, GA, May 1996.
[15] N. Gupta, G. Tur, D. Hakkani-Tür, S. Bangalore, G. Riccardi, and M. Rahim, "The AT&T spoken language understanding system," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 213-222, 2006.
[16] D. Hakkani-Tür, G. Tur, and A. Chotimongkol, "Using syntactic and semantic graphs for call classification," in Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing, Ann Arbor, MI, June 2005.
[17] P. Haffner, G. Tur, and J. Wright, "Optimizing SVMs for complex call classification," in Proceedings of the ICASSP, Hong Kong, April 2003.
[18] H.-K. J. Kuo and C.-H. Lee, "Discriminative training in natural language call-routing," in Proceedings of the ICSLP, Beijing, China, 2000.
[19] J. Chu-Carroll and B. Carpenter, "Vector-based natural language call routing," Computational Linguistics, vol. 25, no. 3, pp. 361-388, 1999.
[20] I. Zitouni, H.-K. J. Kuo, and C.-H. Lee, "Boosting and combination of classifiers for natural language call routing systems," Speech Communication, vol. 41, no. 4, pp. 647-661, 2003.
[21] S. Cox, "Discriminative techniques in call routing," in Proceedings of the ICASSP, Hong Kong, April 2003.
[22] Y.-Y. Wang and A. Acero, "Discriminative models for spoken language understanding," in Proceedings of the ICSLP, Pittsburgh, PA, September 2006.
[23] H. Bonneau-Maynard, S. Rosset, C. Ayache, A. Kuhn, and D. Mostefa, "Semantic annotation of the French MEDIA dialog corpus," in Proceedings of the Interspeech, Lisbon, Portugal, September 2005.
[24] A. Raux, B. Langner, D. Bohus, A. Black, and M. Eskenazi, "Let's go public! Taking a spoken dialog system to the real world," in Proceedings of the Interspeech, Lisbon, Portugal, September 2005.
[25] S. Miller, D. Stallard, R. Bobrow, and R. Schwartz, "A fully statistical approach to natural language interfaces," in Proceedings of the ACL, Morristown, NJ, 1996.