<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel rdf:about="https://hdl.handle.net/20.500.11811/701">
<title>Publications</title>
<link>https://hdl.handle.net/20.500.11811/701</link>
<description/>
<items>
<rdf:Seq>
<rdf:li rdf:resource="https://hdl.handle.net/20.500.11811/13972"/>
</rdf:Seq>
</items>
<dc:date>2026-04-11T01:00:42Z</dc:date>
</channel>
<item rdf:about="https://hdl.handle.net/20.500.11811/13972">
<title>Leveraging Synthetically Generated Data for Real Estate Document Classification</title>
<link>https://hdl.handle.net/20.500.11811/13972</link>
<description>Leveraging Synthetically Generated Data for Real Estate Document Classification
Deußer, Tobias; Ramien, Gregor; Weber, Nico; Meidinger, Maximilian; Hahnbück, Max; Bauckhage, Christian; Sifa, Rafet
Document classification in regulated domains such as law, finance, and real estate is hindered by the scarcity of labeled data and by strict privacy constraints. This paper presents a pipeline for synthetically generating training data for document classifiers using a combination of domain-specific templates, large language models, and data augmentation techniques. Focusing on two document types relevant to real estate workflows, &lt;em&gt;Child Support Certificate&lt;/em&gt; and &lt;em&gt;Refurbishment Roadmap&lt;/em&gt;, we construct realistic multi-page documents and build negative classes from LLM-generated distractors. We train a BERT-based classifier on this synthetic dataset and evaluate it on real-world OCR-extracted documents, achieving strong performance even though no real documents are used in training. Our findings highlight the feasibility of using synthetic data to overcome annotation bottlenecks and pave the way for broader applications in privacy-sensitive industries.
</description>
<dc:date>2025-12-01T00:00:00Z</dc:date>
</item>
</rdf:RDF>
