WHAT IS LEFT TO BE UNDERSTOOD IN ATIS?
978-1-4244-7903-0/10/$26.00 ©2010 IEEE — SLT 2010

Gokhan Tur, Dilek Hakkani-Tür, Larry Heck
Speech at Microsoft | Microsoft Research, Mountain View, CA, 94041
gokhan.tur@ieee.org, dilek@ieee.org, larry.heck@microsoft.com

ABSTRACT

One of the main data resources used in many studies over the past two decades for spoken language understanding (SLU) research in spoken dialog systems is the airline travel information system (ATIS) corpus. Two primary tasks in SLU are intent determination (ID) and slot filling (SF). Recent studies reported error rates below 5% for both of these tasks employing discriminative machine learning techniques with the ATIS test set. While these low error rates may suggest that this task is close to being solved, further analysis reveals the continued utility of ATIS as a research corpus. In this paper, our goal is not experimenting with domain specific techniques or features which can help with the remaining SLU errors, but instead exploring methods to realize this utility via extensive error analysis. We conclude that even with such low error rates, the ATIS test set still includes many unseen example categories and sequences, and hence requires more data. Better yet, new annotated larger datasets from more complex tasks with realistic utterances can avoid over-tuning in terms of modeling and feature design. We believe that advancements in SLU can be achieved by having more naturally spoken datasets and employing more linguistically motivated features, while preserving robustness to speech recognition noise and the variance of natural language.

Index Terms— spoken language understanding, ATIS, discriminative training

1. INTRODUCTION

Spoken language understanding (SLU) aims to extract the meaning of speech utterances. While understanding language is still considered an unsolved problem, in the last decade a variety of practical goal-oriented conversational understanding systems have been built for limited domains. These systems aim to automatically identify the intent of the user as expressed in natural language, extract associated arguments or slots, and take actions accordingly to satisfy the user's requests. In such systems, the speaker's utterance is typically recognized using an automatic speech recognizer (ASR). Then the intent of the speaker is identified from the recognized word sequence using an SLU component. Finally, a dialog or task manager (DM) interacts with the user (not necessarily in natural language) and helps the user achieve the task that the system is designed to support.

In the early 90s, DARPA (Defense Advanced Research Projects Agency) initiated the Airline Travel Information System (ATIS) project. The ATIS task consisted of spoken queries on flight-related information. An example utterance is "I want to fly to Boston from New York next week." Understanding was reduced to the problem of extracting task-specific arguments, such as Destination and Departure Date. Participating systems employed either a data-driven statistical approach or a knowledge-based approach.

Almost simultaneously with the semantic frame filling-based SLU approaches, a new task emerged, motivated by the success of the early commercial interactive voice response (IVR) applications used in call centers. SLU was framed as classifying users' utterances into predefined categories (called intents or call-types).

The biggest difference between the call classification systems and semantic frame filling systems is that the former does not explicitly seek to determine the arguments provided by the user. The main goal is routing the call to an appropriate call center department. The arguments provided by the user are important only in the sense that they help make the right classification. While this has been a totally different perspective on the task of SLU, it was actually complementary to template filling in that each call-type can be viewed as a template to be filled. For example, in the case of the DARPA ATIS project, while the primary intent (or goal) was Flight, users also asked about many other things such as Ground transportation or Airplane specifications. The program also defined specialized templates for these less frequent intents. This led to a seamless integration of intent determination (ID) and slot filling (SF) based SLU approaches. This integrated approach actually yielded improved end-to-end automation rates as compared to the previous decoupled and sequential approaches. For example, Jeong et al. proposed to model these two systems jointly using a triangular chain conditional random field (CRF).

In this paper, rather than focusing on specific techniques or features to improve ID and SF accuracy, our goal is to assess the continued utility of the ATIS corpus given the two decades of research it has supported. In the next section, we briefly describe the ATIS corpus and then discuss the evaluation metrics for ID and SF. In Section 4, we present the state-of-the-art discriminative training efforts for both ID and SF on the ATIS task. Finally, in Sections 5 and 6 we present our detailed analyses of the errors we have seen using ID and SF models, respectively, with performance comparable to that reported in the literature. We will show that, by categorizing the erroneous cases that remain after N-fold cross validation experiments, ATIS is still useful and suggests future research directions in SLU.

2. AIRLINE TRAVEL INFORMATION (ATIS) CORPUS

An important by-product of the DARPA ATIS project was the ATIS corpus. This corpus is the most commonly used dataset for SLU research. The corpus has seventeen different intents, such as Flight or Aircraft capacity. The prior distribution is, however, heavily skewed, and the most frequent intent, Flight, represents about 70% of the traffic. Table 1 shows the frequency of the intents in this corpus for the training and test sets.

Intent            Training Set   Test Set
Abbreviation          2.4%         3.6%
Aircraft              1.6%         0.9%
Airfare               9.0%         5.8%
Airline               3.4%         4.3%
Airport               0.5%         2.0%
Capacity              0.4%         2.4%
City                  0.3%         0.6%
Day Name              0.1%         0.1%
Distance              0.4%         1.1%
Flight               73.1%        71.6%
Flight No             0.3%         1.0%
Flight Time           1.2%         0.1%
Ground Fare           0.4%         0.8%
Ground Service        5.5%         4.0%
Meal                  0.1%         0.6%
Quantity              1.1%         0.9%
Restriction           0.3%         0.1%

Table 1. The frequency of intents for the training and test sets.

In this paper, we use the ATIS corpus as used in He and Young and in Raymond and Riccardi. The training set contains 4,978 utterances selected from the Class A (context independent) training data in the ATIS-2 and ATIS-3 corpora, while the test set contains 893 utterances from the ATIS-3 Nov93 and Dec94 datasets. Each utterance has its named entities marked via table lookup, including domain specific entities such as city, airline, and airport names, and dates. The ATIS utterances are represented using semantic frames, where each sentence has a goal or goals (a.k.a. intent) and slots filled with phrases. The values of the slots are not normalized or interpreted. An example utterance with annotations is shown in Table 2.

Utterance              How much is the cheapest flight from Boston to New York tomorrow morning?
Goal:                  Airfare
Cost Relative          cheapest
Depart City            Boston
Arrival City           New York
Depart Date.Relative   tomorrow
Depart Time.Period     morning

Table 2. An example utterance from the ATIS dataset.

3. EVALUATION METRICS

The most commonly used metrics for ID and SF are class (or slot) error rate (ER) and F-measure. The simpler metric, ER for ID, can be computed as:

ER_ID = (# misclassified utterances) / (# utterances)

Note that one utterance can have more than one intent. A typical example is "Can you tell me my balance? I need to make a transfer." In most cases, where the second intent is generic (a greeting, small talk with the human agent) or vague, it is ignored. If none of the true classes is selected, the utterance is counted as a misclassification.

For SF, the error rate can be computed in two ways. The more common metric is the F-measure using the slots as units. This metric is similar to what is used for other sequence classification tasks in the natural language processing community, such as parsing and named entity extraction. In this technique, usually the IOB schema is adopted, where each word is tagged with its position in the slot: beginning (B), in (I), or other (O). Then, recall and precision values are computed for each of the slots. A slot is considered to be correct if its range and type are correct. The F-measure is defined as the harmonic mean of recall and precision:

F-Measure = (2 × Recall × Precision) / (Recall + Precision)

where

Recall = (# correct slots found) / (# true slots)
Precision = (# correct slots found) / (# found slots)

4. BACKGROUND ON USING DISCRIMINATIVE CLASSIFIERS FOR SLU

With advances in machine learning over the last decade, especially in discriminative classification techniques, researchers have framed the ID problem as a sample classification task and SF as a sequence classification task. Typically, word n-grams are used as features after preprocessing with generic entities, such as dates, locations, or phone numbers. Because of the very large dimension of the input space, large margin classifiers such as SVMs or AdaBoost were found to be very good candidates for ID, and CRFs for SF. To take context into account, the recent trend is to match n-grams (a substring of n words) rather than words.

Data-driven approaches have proved very well-suited for processing spontaneous spoken utterances. They are typically more robust to sentences that are not well-formed grammatically, which occur frequently in spontaneous speech. Even in broadcast conversations, where participants are very well trained and prepared, a large percentage of the utterances have disfluencies: repetitions, false starts, and filler words (e.g., uh). Furthermore, speech recognition introduces significant "noise" to the SLU component, caused by background noise, mismatched domains, incorrect recognition of proper names (such as city or person names), and reduced accuracy due to sub-realtime processing requirements. A typical call routing system operates at around 20%-30% word error rate; one out of every three to five words is wrong. Given that the researchers in this study also determined that one third of the ID errors are due to speech recognition noise, robust methods for spontaneous speech recognition are critically important for successful ID and SF in SLU systems. To this end, researchers have proposed many methods, ranging from N-best rescoring to exploiting word confusion networks and leveraging dialog context as prior knowledge.

4.1. Intent Determination

For ID, early work with discriminative classification algorithms was completed on the AT&T HMIHY system using the BoosTexter tool, an implementation of the AdaBoost.MH multiclass, multilabel classification algorithm. Hakkani-Tür et al. extended this work by using a lattice of syntactic and semantic features. Discriminative call classification systems employing large margin classifiers (e.g., support vector machines) include work by Haffner et al., who proposed a global optimization process based on an optimal channel communication model that allowed a combination of heterogeneous binary classifiers. This approach decreased the call-type classification error rate for AT&T's HMIHY natural dialog system significantly, especially the false rejection rates.

Other work by Kuo and Lee at Bell Labs proposed the use of discriminative training on the routing matrix, significantly improving their vector-based call routing system for low rejection rates. Their approach is based on using the minimum classification error (MCE) criterion. Later they extended this approach to include Boosting and automatic relevance feedback (ARF). Cox proposed the use of generalized probabilistic descent (GPD), corrective training (CT), and linear discriminant analysis (LDA). Finally, Chelba et al. proposed using Maximum Entropy models for ID, and compared the performance with a Naive Bayes approach on the ATIS corpus. The discriminative method resulted in half the classification error rate of Naive Bayes on this highly skewed dataset. They reported a top class error rate of about 4.8% using slightly different training and test corpora than the ones used in this paper.

4.2. Slot Filling

For SF, the ATIS corpus has been extensively studied from the early days of the DARPA ATIS project. However, the use of discriminative classification algorithms is more recent. Some notable studies include the following:

Wang and Acero compared the use of CRF, perceptron, large margin, and MCE using stochastic gradient descent (SGD) for SF in the ATIS domain. They obtained significantly reduced slot error rates, with the best performance achieved by CRF (though it was the slowest to train).

Almost simultaneously, Jeong and Lee proposed the use of CRF extended by non-local features, which are important to disambiguate the type of a slot. For example, a day can be the arrival day, the departure day, or the return day. If the contextual cues disambiguating them are beyond the immediate context, it is not easy for the classifier to choose the correct class. Using non-local trigger features automatically extracted from the training data is shown to improve the performance significantly.

Finally, Raymond and Riccardi compared SVM and CRF with generative models on the ATIS task. They concluded that discriminative methods perform significantly better, and furthermore, that it is possible to incorporate a-priori information or long distance features easily. For example, they added features such as "Does this utterance have the verb arrive?" This resulted in about a 10% relative reduction in slot error rate. The design of such features usually requires domain knowledge.

5. ANALYSIS OF INTENT DETERMINATION IN ATIS

In this section, our goal is to analyze the errors of a state-of-the-art ID system for the ATIS domain, cluster the errors, and then categorize the error types. These categories of error types will suggest potential areas of research that could yield improved accuracy. All experiments and analyses are performed using manual transcriptions of the training and test sets to isolate the study from noise introduced by the speech recognizer.

5.1. Discriminative Training and Experiments

For the following experiments, we used the ATIS corpus as described previously in Section 2. Since the superior performance of discriminative training algorithms has been shown by the earlier work, we employed the AdaBoost.MH algorithm in this study. We used only word n-grams as features. We have not optimized Boosting parameters on a tuning set nor learned weak classifiers. The data is normalized to lowercase, but no stemming or stopword removal has been performed.

The ATIS test set was classified according to the classes defined in Table 1. The ID error rate we obtained was 4.5%, which is comparable to (and actually lower than) what has been reported in the literature.

5.2. Analysis of Intent Determination Errors

Next, we checked the ID errors with three training and test set-ups:

1. All Train: uses all ATIS training data to train the model, and errors are computed on the ATIS test set. In total, this model erroneously classified only 40 utterances (an error rate of 4.5%). The intent confusion matrix for these errors is provided in Table 3.

Table 3. The confusion matrix for intent determination. [Individual cell counts are not recoverable from the extracted text; rows and columns a-q correspond to the seventeen intents listed in Table 1.]

...
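The slot-level evaluation described in the evaluation-metrics section (slots read off IOB tags; a slot is correct only if both its range and its type match) can be sketched in a few lines of code. This is a minimal illustration: the slot names (FromCity, ToCity, DepartDate) and the toy tag sequences are invented for the example and are not the ATIS slot inventory.

```python
# Sketch of the slot-filling F-measure: slots are extracted from IOB tags,
# and a hypothesized slot counts as correct only when both its range and
# its type match a reference slot.

def extract_slots(tags):
    """Return a set of (start, end, type) spans from an IOB tag sequence."""
    slots, start, slot_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing slot
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                slots.add((start, i, slot_type))
                start, slot_type = None, None
            if tag.startswith("B-"):
                start, slot_type = i, tag[2:]
    return slots

def slot_f_measure(reference, hypothesis):
    """Harmonic mean of slot recall and precision, as defined in the text."""
    ref, hyp = extract_slots(reference), extract_slots(hypothesis)
    correct = len(ref & hyp)  # correct slots found: exact range + type match
    recall = correct / len(ref) if ref else 0.0
    precision = correct / len(hyp) if hyp else 0.0
    if recall + precision == 0.0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# "from boston to new york tomorrow": the hypothesis misses the date slot.
ref = ["O", "B-FromCity", "O", "B-ToCity", "I-ToCity", "B-DepartDate"]
hyp = ["O", "B-FromCity", "O", "B-ToCity", "I-ToCity", "O"]
print(round(slot_f_measure(ref, hyp), 3))  # recall 2/3, precision 2/2 -> 0.8
```

Note that a slot whose range is off by even one word (e.g., tagging only "new" instead of "new york") counts as fully wrong under this metric, which is why slot F-measure is stricter than per-word tagging accuracy.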
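The intent-determination setup used in the experiments (a discriminative classifier over lowercased word n-gram features, no stemming or stopword removal) can be sketched as follows. The paper trains AdaBoost.MH; since that configuration is not reproduced here, this sketch substitutes a simple multiclass perceptron as a stand-in discriminative learner, and the five training utterances are invented for illustration, not drawn from the ATIS corpus.

```python
# Sketch of discriminative intent determination over word n-gram features.
# Stand-in learner: a multiclass perceptron (the paper uses AdaBoost.MH).
from collections import defaultdict

def ngram_features(utterance):
    """Lowercased unigram and bigram features; no stemming or stopword removal."""
    words = utterance.lower().split()
    feats = set(words)
    feats.update(" ".join(words[i:i + 2]) for i in range(len(words) - 1))
    return feats

class PerceptronIntentClassifier:
    def __init__(self, intents):
        # one sparse weight vector per intent class
        self.weights = {c: defaultdict(float) for c in intents}

    def predict(self, feats):
        return max(self.weights,
                   key=lambda c: sum(self.weights[c][f] for f in feats))

    def train(self, data, epochs=10):
        for _ in range(epochs):
            for utterance, intent in data:
                feats = ngram_features(utterance)
                guess = self.predict(feats)
                if guess != intent:  # standard perceptron update on mistakes
                    for f in feats:
                        self.weights[intent][f] += 1.0
                        self.weights[guess][f] -= 1.0

# Invented training utterances, NOT from the ATIS corpus.
data = [
    ("i want to fly to boston next week", "Flight"),
    ("show me flights from dallas to denver", "Flight"),
    ("how much is the cheapest flight to new york", "Airfare"),
    ("what is the fare from atlanta to boston", "Airfare"),
    ("is there ground transportation in denver", "GroundService"),
]
clf = PerceptronIntentClassifier(["Flight", "Airfare", "GroundService"])
clf.train(data)
print(clf.predict(ngram_features("show me the cheapest fare to denver")))  # Airfare
```

The bigram features are what let the classifier weigh short contexts ("the cheapest", "ground transportation") rather than isolated words, mirroring the n-gram matching trend discussed in Section 4.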