Efficient Distributed In-Memory Processing of RDF Datasets

Sejdiu, Gezim

dc.contributor.advisor	Lehmann, Jens
dc.contributor.author	Sejdiu, Gezim
dc.date.accessioned	2020-10-27T13:27:52Z
dc.date.available	2020-10-27T13:27:52Z
dc.date.issued	27.10.2020
dc.identifier.uri	https://hdl.handle.net/20.500.11811/8735
dc.description.abstract	Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies. Today, we count more than 10,000 datasets made available online following Semantic Web standards. A major and as yet unsolved challenge facing research today is to perform scalable analysis of large-scale knowledge graphs in order to facilitate applications in various domains, including life sciences, publishing, and the Internet of Things. The main objective of this thesis is to lay foundations for efficient algorithms performing analytics, i.e., exploration, quality assessment, and querying over semantic knowledge graphs, at a scale that has not been possible before. First, we propose a novel approach for statistical calculations over large RDF datasets, which scales out to clusters of machines. In particular, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark. Many applications, such as data integration, search, and interlinking, can take full advantage of the data when they have a priori statistical information about its internal structure and coverage. However, such applications may suffer from low data quality and may be unable to take full advantage of the data when its size exceeds the capacity of the available resources. Thus, we introduce a distributed approach to quality assessment of large RDF datasets. It is the first distributed, in-memory approach for computing different quality metrics for large RDF datasets using Apache Spark. We also provide a quality assessment pattern that can be used to generate new scalable metrics that can be applied to big data. Based on the knowledge of the internal statistics of a dataset and its quality, users typically want to query and retrieve large amounts of information. 
As a result, it has become difficult to efficiently process these large RDF datasets. Indeed, these processes require both efficient storage strategies and query-processing engines that scale with data size. Therefore, we propose a scalable approach to evaluate SPARQL queries over distributed RDF datasets by translating SPARQL queries into Spark executable code. We conducted several empirical evaluations to assess the scalability, effectiveness, and efficiency of our proposed approaches. More importantly, various use cases, i.e., Ethereum analysis, Mining Big Data Logs, and Scalable Integration of POIs, have been developed that leverage our approach. The empirical evaluations and concrete applications provide evidence that the methodology and techniques proposed in this thesis help to effectively analyze and process large-scale RDF datasets. All approaches proposed in this thesis are integrated into the larger SANSA framework.	en
dc.language.iso	eng
dc.rights	In Copyright
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/
dc.subject	Semantic Web
dc.subject	RDF
dc.subject	Big Data
dc.subject	RDF Statistics
dc.subject	RDF Quality Assessment
dc.subject	SPARQL engine
dc.subject.ddc	004 Computer science
dc.title	Efficient Distributed In-Memory Processing of RDF Datasets
dc.type	Dissertation or Habilitation
dc.publisher.name	Universitäts- und Landesbibliothek Bonn
dc.publisher.location	Bonn
dc.rights.accessRights	openAccess
dc.identifier.urn	https://nbn-resolving.org/urn:nbn:de:hbz:5-59860
ulbbn.pubtype	First publication
ulbbnediss.affiliation.name	Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location	Bonn
ulbbnediss.thesis.level	Dissertation
ulbbnediss.dissID	5986
ulbbnediss.date.accepted	29.09.2020
ulbbnediss.institute	Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaet	Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee	Auer, Sören

Files in this item

Name: 5986.pdf
Size: 3.8MB
Format: PDF
