Heuristisk analys med Diderichsens satsschema. Tillämpningar för svensk text

Wilhelmsson, Kenneth

dc.contributor.author	Wilhelmsson, Kenneth
dc.date.accessioned	2010-03-26T14:01:51Z
dc.date.available	2010-03-26T14:01:51Z
dc.date.issued	2010-03-26T14:01:51Z
dc.identifier.uri	http://hdl.handle.net/2077/22028
dc.description	A heuristic method for parsing Swedish text, heuristic schema parsing, is described together with an implementation. Focusing on main clause analysis, a collection of licensing techniques is used to exclude non-primary candidates for the bounded key components that delimit fields and other slots in Diderichsen’s sentence schema. In this way, the functional constituents that are (potentially) unbounded in length, namely subjects, objects/predicatives and adverbials, can be identified by relying less on explicit matching of multi-word constituents and instead applying various heuristic rules. For determining the phrase type of these constituents and delimiting them when they are adjacent, a new segmentation method, rank-based chunking, is first presented. This segmentation is followed by a series of possible merges that aim to reach a number of nominal constituents compatible with the valency of the clause's main verb. The goal of the method is to identify complete nominal and adverbial constituents, including postmodifiers. The thesis project is based on Stockholm Umeå Corpus 2.0, which reflects different genres of published Swedish text. Its tagset is also used unmodified in a part-of-speech tagger, which makes it possible to handle arbitrary text input. The internal representation of a sentence during this functional syntactic analysis, which contains no explicit language-defining grammar component, is object-based. Although output formats and the conditions for accuracy evaluations vary considerably between Swedish parsing projects, it is claimed that the approach can yield high accuracy, which can be improved if more time is spent on manual rule writing. The thesis work also includes two prototype applications, both of which require high accuracy in the form of analysis produced here. The first is an implementation in the area of word processing, in which a user can automatically paraphrase written sentences when a syntactic analysis of them is displayed. 
The second application presented belongs to the area of natural language query systems and automatically generates questions for an arbitrary input text. This prototype incorporates the text database from Swedish Wikipedia and primarily investigates the generation of WH-questions formed through fronting and mapping to question words. Question generation takes place when a text is opened and allows questions from the user with a special focus on precision, that is, high accuracy of the answers given the questions.	en
dc.description.abstract	A heuristic method for parsing Swedish text, heuristic schema parsing, is described and implemented. Focusing on main clause (primary) analysis, a collection of licensing techniques is employed to remove non-primary verb candidates, leaving e.g. the primary verbs, particles and conjunctions (bounded key constituents) that delimit the content of the fields in Diderichsen’s sentence schema. The constituents that have no upper bound on their length (subjects, objects/predicatives and adverbials) can then be identified relying to a lesser extent on explicit pattern matching and more on different heuristic rules. For phrase type identification and delimitation of these constituents when they are adjacent to each other, a novel chunking technique, rank-based chunking, is applied. Following this, a series of further rules merge chunks into larger ones, aiming at a final number of nominal chunks compatible with the valency information of the main verb. The aim is to identify full nominal and adverbial constituents, including post-modifiers. The implementation uses the Stockholm Umeå Corpus 2.0, a corpus which is balanced for different genres in published Swedish text. SUC’s tagset is also used unmodified in part-of-speech tagging, which enables the program to deal with arbitrary input text. The functional parsing, which includes no explicit language-defining grammar component, is carried out using an object-based representation of clause structure. Although output formats and types of correctness evaluation differ greatly among parsers for Swedish text, it is claimed that the manual approach presented can provide high accuracy, which can be improved given more time for development. The thesis work also includes two prototype applications, both requiring high accuracy of the sort of functional syntactic analysis described here. 
The first is an implementation of automatic syntactic fronting in the area of text editing for Swedish, where the user is presented with a syntactically analyzed copy of her writing, from which paraphrases can easily be generated. The second application is in the field of natural language query systems and produces questions with answers from an arbitrary declarative input text. This prototype incorporates a text database from Swedish Wikipedia and primarily investigates the generation of WH-questions formed via fronting of unbounded primary constituents. The questions are generated as a text is opened, so users can ask only the available ones, which aims at a high precision value.	en
dc.language.iso	swe	en
dc.relation.ispartofseries	Gothenburg Monographs in Linguistics	en
dc.relation.ispartofseries	40	en
dc.subject	Diderichsens nordiska satsschema	en
dc.subject	Diderichsen’s sentence schema	en
dc.subject	positionsgrammatik	en
dc.subject	positional grammar	en
dc.subject	fältgrammatik	en
dc.subject	field grammar	en
dc.subject	licensieringstekniker	en
dc.subject	licensing techniques	en
dc.subject	Stockholm Umeå Corpus	en
dc.subject	schemaparsning	en
dc.subject	schema parsing	en
dc.subject	rangbaserad chunkning	en
dc.subject	rank-based chunking	en
dc.subject	syntactic fronting	en
dc.subject	spetsställning	en
dc.subject	parafrasgenerering	en
dc.subject	paraphrasing	en
dc.subject	frågegenerering	en
dc.subject	question generation	en
dc.subject	naturligt språk-frågesystem	en
dc.subject	natural language query systems	en
dc.subject	svenska WordNet	en
dc.subject	Swedish WordNet	en
dc.title	Heuristisk analys med Diderichsens satsschema. Tillämpningar för svensk text	en
dc.title.alternative	Heuristic Analysis with Diderichsen’s Sentence Schema – Applications for Swedish Text	en
dc.type	Text
dc.type.svep	Doctoral thesis	eng
dc.gup.mail	kw@ling.gu.se	en
dc.type.degree	Doctor of Philosophy	en
dc.gup.origin	Göteborgs universitet. Humanistiska fakulteten	swe
dc.gup.origin	University of Gothenburg. Faculty of Arts	eng
dc.gup.defenceplace	Saturday 24 April 2010, 10:00, T307, Olof Wijksgatan 6, Göteborg	en
dc.gup.defencedate	2010-04-24
dc.gup.dissdb-fakultet	HF

Files under this title

Name:: gupea_2077_22028_1.pdf
Size:: 72.93Kb
Format:: PDF
Description:: Cover

Name:: gupea_2077_22028_2.pdf
Size:: 5.516Mb
Format:: PDF
Description:: Thesis

Name:: gupea_2077_22028_3.pdf
Size:: 175.6Kb
Format:: PDF
Description:: Spikblad (thesis defence notice)

This item belongs to the following collection(s)

Doctoral Theses / Doktorsavhandlingar Institutionen för filosofi, lingvistik och vetenskapsteori
Doctoral Theses from University of Gothenburg / Doktorsavhandlingar från Göteborgs universitet
