Leveraging Synthetically Generated Data for Real Estate Document Classification

| Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Deußer, Tobias | |
| dc.contributor.author | Ramien, Gregor | |
| dc.contributor.author | Weber, Nico | |
| dc.contributor.author | Meidinger, Maximilian | |
| dc.contributor.author | Hahnbück, Max | |
| dc.contributor.author | Bauckhage, Christian | |
| dc.contributor.author | Sifa, Rafet | |
| dc.date.accessioned | 2026-03-17T06:12:42Z | |
| dc.date.available | 2026-03-17T06:12:42Z | |
| dc.date.issued | 12.2025 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.11811/13972 | |
| dc.description.abstract | Document classification in regulated domains like law, finance, or real estate is hindered by the scarcity of labeled data and strict privacy constraints. This paper presents a pipeline for synthetically generating training data for document classifiers using a combination of domain-specific templates, large language models, and data augmentation techniques. Focusing on two key document types relevant to real estate workflows, Child Support Certificate and Refurbishment Roadmap, we construct realistic multi-page documents and generate negative classes using LLM-generated distractors. We train a BERT-based classifier on this synthetic dataset and evaluate it on real-world OCR-extracted documents, achieving strong performance despite the absence of real documents in training. Our findings highlight the feasibility of using synthetic data to overcome annotation bottlenecks and pave the way for broader applications in privacy-sensitive industries. | en |
| dc.format.extent | 6 | |
| dc.language.iso | eng | |
| dc.rights | In Copyright | |
| dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | |
| dc.subject | Document Classification | |
| dc.subject | Synthetic Data | |
| dc.subject | Large Language Models | |
| dc.subject | Natural Language Processing | |
| dc.subject | Finance | |
| dc.subject | Machine Learning | |
| dc.subject.ddc | 004 Computer science | |
| dc.title | Leveraging Synthetically Generated Data for Real Estate Document Classification | |
| dc.type | Conference publication | |
| dc.identifier.doi | https://doi.org/10.48565/bonndoc-809 | |
| dc.publisher.name | IEEE, Institute of Electrical and Electronics Engineers | |
| dc.publisher.location | New York, NY | |
| dc.rights.accessRights | openAccess | |
| dc.relation.doi | https://doi.org/10.1109/BigData66926.2025.11400789 | |
| ulbbn.pubtype | Secondary publication | |
| ulbbnediss.dissNotes.extern | © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. | |
| ulbbn.relation.conference | 2025 IEEE International Conference on Big Data (BigData) | |




