Leveraging Synthetically Generated Data for Real Estate Document Classification

dc.contributor.author: Deußer, Tobias
dc.contributor.author: Ramien, Gregor
dc.contributor.author: Weber, Nico
dc.contributor.author: Meidinger, Maximilian
dc.contributor.author: Hahnbück, Max
dc.contributor.author: Bauckhage, Christian
dc.contributor.author: Sifa, Rafet
dc.date.accessioned: 2026-03-17T06:12:42Z
dc.date.available: 2026-03-17T06:12:42Z
dc.date.issued: 12.2025
dc.identifier.uri: https://hdl.handle.net/20.500.11811/13972
dc.description.abstract (en): Document classification in regulated domains like law, finance, or real estate is hindered by the scarcity of labeled data and strict privacy constraints. This paper presents a pipeline for synthetically generating training data for document classifiers using a combination of domain-specific templates, large language models, and data augmentation techniques. Focusing on two key document types relevant to real estate workflows, Child Support Certificate and Refurbishment Roadmap, we construct realistic multi-page documents and generate negative classes using LLM-generated distractors. We train a BERT-based classifier on this synthetic dataset and evaluate it on real-world OCR-extracted documents, achieving strong performance despite the absence of real documents in training. Our findings highlight the feasibility of using synthetic data to overcome annotation bottlenecks and pave the way for broader applications in privacy-sensitive industries.
dc.format.extent: 6
dc.language.iso: eng
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: Document Classification
dc.subject: Synthetic Data
dc.subject: Large Language Models
dc.subject: Natural Language Processing
dc.subject: Finance
dc.subject: Machine Learning
dc.subject.ddc: 004 Informatik
dc.title: Leveraging Synthetically Generated Data for Real Estate Document Classification
dc.type: Konferenzveröffentlichung
dc.identifier.doi: https://doi.org/10.48565/bonndoc-809
dc.publisher.name: IEEE, Institute of Electrical and Electronics Engineers
dc.publisher.location: New York, NY
dc.rights.accessRights: openAccess
dc.relation.doi: https://doi.org/10.1109/BigData66926.2025.11400789
ulbbn.pubtype: Zweitveröffentlichung
ulbbnediss.dissNotes.extern: © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ulbbn.relation.conference: 2025 IEEE International Conference on Big Data (BigData)

