THE LINGUISTIC STRUCTURE OF WIKIPEDIA: A multilingual analysis and comparison of the language used in Wikipedia articles
Abstract
Wikipedia is a great source of knowledge, but its open-collaboration nature brings some limitations,
namely the uneven distribution of content, the low overlap in topic coverage, the differences in
the comprehensiveness of articles, and the low number of editors. For this reason, the Abstract Wikipedia
project has been created; its objective is to construct language-independent (abstract) articles that can be
rendered in any language. In this thesis, we have computationally analysed the language used in Wikipedia
in order to find similarities between the language used in different articles. To do so, we have syntactically
parsed articles of Wikipedia in different languages using UDPipe 2.0 and gathered the languages’ recurrent
syntactic patterns using Grammatical Framework’s GF-UD. Then, we have compared the analyses with cosine
similarity in two ways: based on dependency relations and based on linguistic patterns. We have seen
that there is a basis for the Abstract Wikipedia project: there are syntactic similarities not only within one
language, but also across multiple languages. In addition, we have found that semantically related topics
have a higher similarity than those which are not. Finally, we have gathered syntactic patterns of every
language and compared them, which can serve as a basis for creating the Renderers for Abstract
Wikipedia.
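As an illustration of the dependency-relation comparison described in the abstract, the following is a minimal sketch, not the thesis's actual implementation. It assumes that each article has already been parsed into CoNLL-U format (for example by UDPipe 2.0), that each article is represented by raw counts of its DEPREL labels, and that two articles are compared with plain cosine similarity. The helper names (`row`, `deprel_counts`, `cosine_similarity`) and the toy sentences are hypothetical; the abstract does not specify the feature weighting or preprocessing that was actually used.

```python
from collections import Counter
from math import sqrt


def row(idx, form, head, deprel):
    """Build one 10-column CoNLL-U token line (unused columns set to '_')."""
    return "\t".join([str(idx), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


def deprel_counts(conllu_text):
    """Count dependency relation labels (8th column, DEPREL) in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        # Skip malformed lines, multiword token ranges ("1-2") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty article is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Toy "articles" (assumed data, one short sentence each).
article_a = "\n".join([row(1, "Dogs", 2, "nsubj"), row(2, "bark", 0, "root")])
article_b = "\n".join([row(1, "Cats", 2, "nsubj"), row(2, "sleep", 0, "root")])
article_c = "\n".join([row(1, "Big", 2, "amod"), row(2, "dogs", 0, "root")])

sim_ab = cosine_similarity(deprel_counts(article_a), deprel_counts(article_b))
sim_ac = cosine_similarity(deprel_counts(article_a), deprel_counts(article_c))
```

In this toy setting, articles A and B share exactly the same relation counts, so their similarity is close to 1, while A and C share only the `root` relation and score about 0.5. The same function could be applied to vectors of syntactic pattern counts instead of relation counts, which corresponds to the second comparison mentioned in the abstract.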
Degree
Student essay
Date
2022-06-20
Author
Grau Francitorra, Patricia
Keywords
Abstract Wikipedia, Syntactic Analysis, Universal Dependencies, Grammatical Framework, UDPipe 2.0, Syntactic Patterns
Language
eng