WHAT IS LEFT TO BE UNDERSTOOD IN ATIS?
978-1-4244-7903-0/10/$26.00 ©2010 IEEE — SLT 2010

Gokhan Tur, Dilek Hakkani-Tür, Larry Heck
Speech at Microsoft | Microsoft Research, Mountain View, CA, 94041
gokhan.tur@ieee.org, dilek@ieee.org, larry.heck@microsoft.com

ABSTRACT

One of the main data resources used in many studies over the past two decades for spoken language understanding (SLU) research in spoken dialog systems is the airline travel information system (ATIS) corpus. Two primary tasks in SLU are intent determination (ID) and slot filling (SF). Recent studies reported error rates below 5% for both of these tasks employing discriminative machine learning techniques with the ATIS test set. While these low error rates may suggest that this task is close to being solved, further analysis reveals the continued utility of ATIS as a research corpus. In this paper, our goal is not experimenting with domain specific techniques or features which can help with the remaining SLU errors, but instead exploring methods to realize this utility via extensive error analysis. We conclude that even with such low error rates, the ATIS test set still includes many unseen example categories and sequences, and hence requires more data. Better yet, new annotated larger datasets from more complex tasks with realistic utterances can avoid over-tuning in terms of modeling and feature design. We believe that advancements in SLU can be achieved by having more naturally spoken datasets and employing more linguistically motivated features, while preserving robustness to speech recognition noise and the variance of natural language.

Index Terms— spoken language understanding, ATIS, discriminative training

1. INTRODUCTION

Spoken language understanding (SLU) aims to extract the meaning of speech utterances. While understanding language is still considered an unsolved problem, in the last decade a variety of practical goal-oriented conversational understanding systems have been built for limited domains. These systems aim to automatically identify the intent of the user as expressed in natural language, extract associated arguments or slots, and take actions accordingly to satisfy the user's requests. In such systems, the speaker's utterance is typically recognized using an automatic speech recognizer (ASR). Then the intent of the speaker is identified from the recognized word sequence using an SLU component. Finally, a dialog or task manager (DM) interacts with the user (not necessarily in natural language) and helps the user achieve the task that the system is designed to support.

In the early 90s, DARPA (Defense Advanced Research Projects Agency) initiated the Airline Travel Information System (ATIS) project. The ATIS task consisted of spoken queries on flight-related information. An example utterance is "I want to fly to Boston from New York next week." Understanding was reduced to the problem of extracting task-specific arguments, such as Destination and Departure Date. Participating systems employed either a data-driven statistical approach or a knowledge-based approach.

Almost simultaneously with the semantic frame filling-based SLU approaches, a new task emerged, motivated by the success of the early commercial interactive voice response (IVR) applications used in call centers. SLU was framed as classifying users' utterances into predefined categories (called intents or call-types).

The biggest difference between the call classification systems and semantic frame filling systems is that the former does not explicitly seek to determine the arguments provided by the user. The main goal is routing the call to an appropriate call center department. The arguments provided by the user are important only in the sense that they help make the right classification. While this has been a totally different perspective on the task of SLU, it was actually complementary to template filling in that each call-type can be viewed as a template to be filled. For example, in the case of the DARPA ATIS project, while the primary intent (or goal) was Flight, users also asked about many other things such as Ground transportation or Airplane specifications. The program also defined specialized templates for these less frequent intents. This led to a seamless integration of intent determination (ID) and slot filling (SF) based SLU approaches. This integrated approach actually yielded improved end-to-end automation rates as compared to the previous decoupled and sequential approaches. For example, Jeong et al. proposed to model these two systems jointly using a triangular chain conditional random field (CRF).

In this paper, rather than focusing on specific techniques or features to improve ID and SF accuracy, our goal is to assess the continued utility of the ATIS corpus given the two decades of research it has supported. In the next section, we briefly describe the ATIS corpus and then discuss the evaluation metrics for ID and SF. In Section 4, we present the state-of-the-art discriminative training efforts for both ID and SF on the ATIS task. Finally, in Sections 5 and 6 we present our detailed analyses of the errors we have seen using ID and SF models, respectively, with performance comparable to that reported in the literature. We will show that, by categorizing the erroneous cases that remain after N-fold cross validation experiments, ATIS is still useful and suggests future research directions in SLU.

2. AIRLINE TRAVEL INFORMATION (ATIS) CORPUS

An important by-product of the DARPA ATIS project was the ATIS corpus. This corpus is the most commonly used dataset for SLU research. The corpus has seventeen different intents, such as Flight or Aircraft capacity. The prior distribution is, however, heavily skewed, and the most frequent intent, Flight, represents about 70% of the traffic. Table 1 shows the frequency of the intents in this corpus for the training and test sets.

Intent            Training Set   Test Set
Abbreviation          2.4%         3.6%
Aircraft              1.6%         0.9%
Airfare               9.0%         5.8%
Airline               3.4%         4.3%
Airport               0.5%         2.0%
Capacity              0.4%         2.4%
City                  0.3%         0.6%
Day Name              0.1%         0.1%
Distance              0.4%         1.1%
Flight               73.1%        71.6%
Flight No             0.3%         1.0%
Flight Time           1.2%         0.1%
Ground Fare           0.4%         0.8%
Ground Service        5.5%         4.0%
Meal                  0.1%         0.6%
Quantity              1.1%         0.9%
Restriction           0.3%         0.1%

Table 1. The frequency of intents for the training and test sets.

In this paper, we use the ATIS corpus as used in He and Young and in Raymond and Riccardi. The training set contains 4,978 utterances selected from the Class A (context independent) training data in the ATIS-2 and ATIS-3 corpora, while the test set contains 893 utterances from the ATIS-3 Nov93 and Dec94 datasets. Each utterance has its named entities marked via table lookup, including domain specific entities such as city, airline, and airport names, and dates. The ATIS utterances are represented using semantic frames, where each sentence has a goal or goals (a.k.a. intent) and slots filled with phrases. The values of the slots are not normalized or interpreted. An example utterance with annotations is shown in Table 2.

Utterance              How much is the cheapest flight from Boston to New York tomorrow morning?
Goal:                  Airfare
Cost Relative          cheapest
Depart City            Boston
Arrival City           New York
Depart Date.Relative   tomorrow
Depart Time.Period     morning

Table 2. An example utterance from the ATIS dataset.

3. EVALUATION METRICS

The most commonly used metrics for ID and SF are class (or slot) error rate (ER) and F-measure. The simpler metric, ER for ID, can be computed as:

ER_ID = (# misclassified utterances) / (# utterances)

Note that one utterance can have more than one intent. A typical example is "Can you tell me my balance? I need to make a transfer." In most cases, where the second intent is generic (a greeting, small talk with the human agent) or vague, it is ignored. If none of the true classes is selected, the utterance is counted as a misclassification.

For SF, the error rate can be computed in two ways. The more common metric is the F-measure using the slots as units. This metric is similar to what is used for other sequence classification tasks in the natural language processing community, such as parsing and named entity extraction. In this technique, usually the IOB schema is adopted, where each word is tagged with its position in the slot: beginning (B), in (I), or other (O). Then, recall and precision values are computed for each of the slots. A slot is considered to be correct if its range and type are correct. The F-measure is defined as the harmonic mean of recall and precision:

F-Measure = (2 × Recall × Precision) / (Recall + Precision)

where

Recall = (# correct slots found) / (# true slots)
Precision = (# correct slots found) / (# found slots)

4. BACKGROUND ON USING DISCRIMINATIVE CLASSIFIERS FOR SLU

With advances in machine learning over the last decade, especially in discriminative classification techniques, researchers have framed the ID problem as a sample classification task and SF as a sequence classification task. Typically, word n-grams are used as features after preprocessing with generic entities, such as dates, locations, or phone numbers. Because of the very large dimension of the input space, large margin classifiers such as SVMs or AdaBoost were found to be very good candidates for ID, and CRFs for SF. To take context into account, the recent trend is to match n-grams (a substring of n words) rather than words.

Data-driven approaches have proved very well-suited for processing spontaneous spoken utterances. They are typically more robust to sentences that are not well-formed grammatically, which occur frequently in spontaneous speech. Even in broadcast conversations, where participants are very well trained and prepared, a large percentage of the utterances have disfluencies: repetitions, false starts, and filler words (e.g., uh). Furthermore, speech recognition introduces significant "noise" to the SLU component, caused by background noise, mismatched domains, incorrect recognition of proper names (such as city or person names), and reduced accuracy due to sub-realtime processing requirements. A typical call routing system operates at around 20%-30% word error rate; one out of every three to five words is wrong. Given that the researchers in this study also determined that one third of the ID errors are due to speech recognition noise, robust methods for spontaneous speech recognition are critically important for successful ID and SF in SLU systems. To this end, researchers have proposed many methods, ranging from N-best rescoring to exploiting word confusion networks and leveraging dialog context as prior knowledge.

4.1. Intent Determination

For ID, early work with discriminative classification algorithms was completed on the AT&T HMIHY system using the BoosTexter tool, an implementation of the AdaBoost.MH multiclass, multilabel classification algorithm. Hakkani-Tür et al. extended this work by using a lattice of syntactic and semantic features. Discriminative call classification systems employing large margin classifiers (e.g., support vector machines) include work by Haffner et al., who proposed a global optimization process based on an optimal channel communication model that allowed a combination of heterogeneous binary classifiers. This approach decreased the call-type classification error rate for AT&T's HMIHY natural dialog system significantly, especially the false rejection rates.

Other work by Kuo and Lee at Bell Labs proposed the use of discriminative training on the routing matrix, significantly improving their vector-based call routing system for low rejection rates. Their approach is based on using the minimum classification error (MCE) criterion. Later they extended this approach to include Boosting and automatic relevance feedback (ARF). Cox proposed the use of generalized probabilistic descent (GPD), corrective training (CT), and linear discriminant analysis (LDA). Finally, Chelba et al. proposed using Maximum Entropy models for ID, and compared the performance with a Naive Bayes approach on the ATIS corpus. The discriminative method resulted in half the classification error rate of Naive Bayes on this highly skewed dataset. They reported a top class error rate of about 4.8% using slightly different training and test corpora than the ones used in this paper.

4.2. Slot Filling

For SF, the ATIS corpus has been extensively studied from the early days of the DARPA ATIS project. However, the use of discriminative classification algorithms is more recent. Some notable studies include the following:

Wang and Acero compared the use of CRF, perceptron, large margin, and MCE using stochastic gradient descent (SGD) for SF in the ATIS domain. They obtained significantly reduced slot error rates, with the best performance achieved by CRF (though it was the slowest to train).

Almost simultaneously, Jeong and Lee proposed the use of CRF extended by non-local features, which are important to disambiguate the type of a slot. For example, a day can be the arrival day, the departure day, or the return day. If the contextual cues disambiguating them are beyond the immediate context, it is not easy for the classifier to choose the correct class. Using non-local trigger features automatically extracted from the training data is shown to improve the performance significantly.

Finally, Raymond and Riccardi compared SVM and CRF with generative models on the ATIS task. They concluded that discriminative methods perform significantly better, and furthermore, that it is possible to incorporate a-priori information or long distance features easily. For example, they added features such as "Does this utterance have the verb arrive?" This resulted in about a 10% relative reduction in slot error rate. The design of such features usually requires domain knowledge.

5. ANALYSIS OF INTENT DETERMINATION IN ATIS

In this section, our goal is to analyze the errors of a state-of-the-art ID system for the ATIS domain, cluster the errors, and then categorize the error types. These categories of error types will suggest potential areas of research that could yield improved accuracy. All experiments and analyses are performed using manual transcriptions of the training and test sets to isolate the study from noise introduced by the speech recognizer.

5.1. Discriminative Training and Experiments

For the following experiments, we used the ATIS corpus as described previously in Section 2. Since the superior performance of discriminative training algorithms has been shown by the earlier work, we employed the AdaBoost.MH algorithm in this study. We used only word n-grams as features. We have not optimized Boosting parameters on a tuning set nor learned weak classifiers. The data is normalized to lowercase, but no stemming or stopword removal has been performed.

The ATIS test set was classified according to the classes defined in Table 1. The ID error rate we obtained was 4.5%, which is comparable to (and actually lower than) what has been reported in the literature.

5.2. Analysis of Intent Determination Errors

Next, we checked the ID errors with three training and test set-ups:

1. All Train: uses all ATIS training data to train the model, and errors are computed on the ATIS test set. In total, this model erroneously classified only 40 utterances (an error rate of 4.5%). The intent confusion matrix for these errors is provided in Table 3.

Table 3. The confusion matrix for intent determination. [Individual cell counts are not recoverable from the extracted text; rows and columns a-q correspond to the seventeen intents listed in Table 1.]

...
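The slot-level evaluation described in the evaluation-metrics section (slots read off IOB tags; a slot is correct only if both its range and its type match) can be sketched in a few lines of code. This is a minimal illustration: the slot names (FromCity, ToCity, DepartDate) and the toy tag sequences are invented for the example and are not the ATIS slot inventory.

```python
# Sketch of the slot-filling F-measure: slots are extracted from IOB tags,
# and a hypothesized slot counts as correct only when both its range and
# its type match a reference slot.

def extract_slots(tags):
    """Return a set of (start, end, type) spans from an IOB tag sequence."""
    slots, start, slot_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing slot
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                slots.add((start, i, slot_type))
                start, slot_type = None, None
            if tag.startswith("B-"):
                start, slot_type = i, tag[2:]
    return slots

def slot_f_measure(reference, hypothesis):
    """Harmonic mean of slot recall and precision, as defined in the text."""
    ref, hyp = extract_slots(reference), extract_slots(hypothesis)
    correct = len(ref & hyp)  # correct slots found: exact range + type match
    recall = correct / len(ref) if ref else 0.0
    precision = correct / len(hyp) if hyp else 0.0
    if recall + precision == 0.0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# "from boston to new york tomorrow": the hypothesis misses the date slot.
ref = ["O", "B-FromCity", "O", "B-ToCity", "I-ToCity", "B-DepartDate"]
hyp = ["O", "B-FromCity", "O", "B-ToCity", "I-ToCity", "O"]
print(round(slot_f_measure(ref, hyp), 3))  # recall 2/3, precision 2/2 -> 0.8
```

Note that a slot whose range is off by even one word (e.g., tagging only "new" instead of "new york") counts as fully wrong under this metric, which is why slot F-measure is stricter than per-word tagging accuracy.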
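The intent-determination setup used in the experiments (a discriminative classifier over lowercased word n-gram features, no stemming or stopword removal) can be sketched as follows. The paper trains AdaBoost.MH; since that configuration is not reproduced here, this sketch substitutes a simple multiclass perceptron as a stand-in discriminative learner, and the five training utterances are invented for illustration, not drawn from the ATIS corpus.

```python
# Sketch of discriminative intent determination over word n-gram features.
# Stand-in learner: a multiclass perceptron (the paper uses AdaBoost.MH).
from collections import defaultdict

def ngram_features(utterance):
    """Lowercased unigram and bigram features; no stemming or stopword removal."""
    words = utterance.lower().split()
    feats = set(words)
    feats.update(" ".join(words[i:i + 2]) for i in range(len(words) - 1))
    return feats

class PerceptronIntentClassifier:
    def __init__(self, intents):
        # one sparse weight vector per intent class
        self.weights = {c: defaultdict(float) for c in intents}

    def predict(self, feats):
        return max(self.weights,
                   key=lambda c: sum(self.weights[c][f] for f in feats))

    def train(self, data, epochs=10):
        for _ in range(epochs):
            for utterance, intent in data:
                feats = ngram_features(utterance)
                guess = self.predict(feats)
                if guess != intent:  # standard perceptron update on mistakes
                    for f in feats:
                        self.weights[intent][f] += 1.0
                        self.weights[guess][f] -= 1.0

# Invented training utterances, NOT from the ATIS corpus.
data = [
    ("i want to fly to boston next week", "Flight"),
    ("show me flights from dallas to denver", "Flight"),
    ("how much is the cheapest flight to new york", "Airfare"),
    ("what is the fare from atlanta to boston", "Airfare"),
    ("is there ground transportation in denver", "GroundService"),
]
clf = PerceptronIntentClassifier(["Flight", "Airfare", "GroundService"])
clf.train(data)
print(clf.predict(ngram_features("show me the cheapest fare to denver")))  # Airfare
```

The bigram features are what let the classifier weigh short contexts ("the cheapest", "ground transportation") rather than isolated words, mirroring the n-gram matching trend discussed in Section 4.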