Show simple item record

Strategies for a Semantified Uniform Access to Large and Heterogeneous Data Sources

dc.contributor.advisorAuer, Sören
dc.contributor.authorMami, Mohamed Nadjib
dc.date.accessioned2021-02-18T10:49:48Z
dc.date.available2021-02-18T10:49:48Z
dc.date.issued18.02.2021
dc.identifier.urihttps://hdl.handle.net/20.500.11811/8925
dc.description.abstractThe remarkable advances achieved in both research and development of Data Management as well as the prevalence of high-speed Internet and technology in the last few decades have caused unprecedented data avalanche. Large volumes of data manifested in a multitude of types and formats are being generated and becoming the new norm. In this context, it is crucial to both leverage existing approaches and propose novel ones to overcome this data size and complexity, and thus facilitate data exploitation. In this thesis, we investigate two major approaches to addressing this challenge: Physical Data Integration and Logical Data Integration. The specific problem tackled is to enable querying large and heterogeneous data sources in an ad hoc manner.
In the Physical Data Integration, data is physically and wholly transformed into a canonical unique format, which can then be directly and uniformly queried. In the Logical Data Integration, data remains in its original format and form and a middleware is posed above the data allowing to map various schemata elements to a high-level unifying formal model. The latter enables the querying of the underlying original data in an ad hoc and uniform way, a framework which we call Semantic Data Lake, SDL. Both approaches have their advantages and disadvantages. For example, in the former, a significant effort and cost are devoted to pre-processing and transforming the data to the unified canonical format. In the latter, the cost is shifted to the query processing phases, e.g., query analysis, relevant source detection and results reconciliation.
In this thesis we investigate both directions and study their strengths and weaknesses. For each direction, we propose a set of approaches and demonstrate their feasibility via a proposed implementation. In both directions, we appeal to Semantic Web technologies, which provide a set of time-proven techniques and standards that are dedicated to Data Integration. In the Physical Integration, we suggest an end-to-end blueprint for the semantification of large and heterogeneous data sources, i.e., physically transforming the data to the Semantic Web data standard RDF (Resource Description Framework). A unified data representation, storage and query interface over the data are suggested. In the Logical Integration, we provide a description of the SDL architecture, which allows querying data sources right on their original form and format without requiring a prior transformation and centralization. For a number of reasons that we detail, we put more emphasis on the virtual approach. We present the effort behind an extensible implementation of the SDL, called Squerall, which leverages state-of-the-art Semantic and Big Data technologies, e.g., RML (RDF Mapping Language) mappings, FnO (Function Ontology) ontology, and Apache Spark. A series of evaluation is conducted to evaluate the implementation along with various metrics and input data scales. In particular, we describe an industrial real-world use case using our SDL implementation. In a preparation phase, we conduct a survey for the Query Translation methods in order to back some of our design choices.
en
dc.language.isoeng
dc.rightsIn Copyright
dc.rights.urihttp://rightsstatements.org/vocab/InC/1.0/
dc.subjectSemantische Informationsintegration
dc.subjectDatenheterogenität
dc.subjectSemantic Web
dc.subjectSPARQL
dc.subjectOntologie
dc.subjectDatenbank
dc.subjectBig Data
dc.subjectNoSQL
dc.subjectSemantic Data Integration
dc.subjectKnowledge Graph
dc.subjectData Heterogeneity
dc.subjectDatabases
dc.subjectSemantic Data Lake
dc.subjectData Lake
dc.subject.ddc004 Informatik
dc.titleStrategies for a Semantified Uniform Access to Large and Heterogeneous Data Sources
dc.typeDissertation oder Habilitation
dc.publisher.nameUniversitäts- und Landesbibliothek Bonn
dc.publisher.locationBonn
dc.rights.accessRightsopenAccess
dc.identifier.urnhttps://nbn-resolving.org/urn:nbn:de:hbz:5-61180
ulbbn.pubtypeErstveröffentlichung
ulbbnediss.affiliation.nameRheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.locationBonn
ulbbnediss.thesis.levelDissertation
ulbbnediss.dissID6118
ulbbnediss.date.accepted28.01.2021
ulbbnediss.instituteMathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaetMathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coRefereeLehmann, Jens


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record

The following license files are associated with this item:

InCopyright