NLP-Based Complex Word Identification in German

Ohsten, Moritz

DC Field	Value	Language
dc.contributor.advisor	Tropmann-Frick, Marina	-
dc.contributor.author	Ohsten, Moritz	-
dc.date.accessioned	2025-08-08T07:03:27Z	-
dc.date.available	2025-08-08T07:03:27Z	-
dc.date.created	2024-10-04	-
dc.date.issued	2025-08-08	-
dc.identifier.uri	https://hdl.handle.net/20.500.12738/17983	-
dc.description.abstract	Das Ziel der vorliegenden Arbeit ist es, zu untersuchen, wie gut sich transformerbasierte Modelle, insbesondere BERT, bei der Erkennung komplexer Wörter in der deutschen Sprache im Vergleich zu traditionellen Verfahren des maschinellen Lernens bewähren und wie sich diese Modelle optimieren lassen. Die Arbeit stützt sich auf die Ergebnisse und den Datensatz des CWI 2018 Shared Task, der im Rahmen dieser Arbeit zunächst um zusätzliche linguistische Merkmale erweitert wurde. Zur Bestimmung der Relevanz dieser Merkmale sowie als Benchmark diente ein Random Forest Classifier (RFC). Anschließend wurden verschiedene BERT-Modelle gegenübergestellt und Optimierungstechniken wie Hyperparameter-Optimierung und Datenaugmentation auf das beste Modell angewendet. Dieses Modell sowie die optimierten Varianten dieses Modells wurden den Top-Systemen des CWI 2018 Shared Task gegenübergestellt, um die Leistung zu vergleichen. Die Ergebnisse zeigen, dass die BERT-Modelle besser abschnitten als die traditionellen Machine-Learning-Verfahren aus dem CWI 2018 Shared Task. Besonders die Hyperparameter-Optimierung und die Einbeziehung linguistischer Merkmale trugen zu einer signifikanten Leistungssteigerung bei. Dies verdeutlicht das Potenzial transformerbasierter Modelle für die Aufgabe der CWI im Vergleich zu traditionellen Verfahren.	de
dc.description.abstract	This thesis investigates how well transformer-based models, in particular BERT, identify complex words in German compared with traditional machine learning methods, and how these models can be optimized. The work builds on the results and the dataset of the CWI 2018 Shared Task; the dataset was first extended with additional linguistic features. A Random Forest Classifier (RFC) served both to determine the relevance of these features and as a benchmark. Several BERT models were then compared, and optimization techniques such as hyperparameter optimization and data augmentation were applied to the best-performing one. This model and its optimized variants were then compared against the top systems of the CWI 2018 Shared Task. The results show that the BERT models outperformed the traditional machine learning methods from the CWI 2018 Shared Task. Hyperparameter optimization and the inclusion of linguistic features in particular contributed to a significant increase in performance. This illustrates the potential of transformer-based models for the complex word identification (CWI) task compared with traditional methods.	en
dc.language.iso	de	en_US
dc.subject	Natural language processing	en_US
dc.subject	Complex	en_US
dc.subject	Word	en_US
dc.subject	Transformer	en_US
dc.subject	BERT	en_US
dc.subject.ddc	004: Informatik	en_US
dc.title	NLP-Based Complex Word Identification in German	de
dc.type	Thesis	en_US
openaire.rights	info:eu-repo/semantics/openAccess	en_US
thesis.grantor.department	Fakultät Technik und Informatik	en_US
thesis.grantor.department	Department Informatik	en_US
thesis.grantor.universityOrInstitution	Hochschule für Angewandte Wissenschaften Hamburg	en_US
tuhh.contributor.referee	Sarstedt, Stefan	-
tuhh.identifier.urn	urn:nbn:de:gbv:18302-reposit-217359	-
tuhh.oai.show	true	en_US
tuhh.publication.institute	Fakultät Technik und Informatik	en_US
tuhh.publication.institute	Department Informatik	en_US
tuhh.type.opus	Bachelor Thesis	-
dc.type.casrai	Supervised Student Publication	-
dc.type.dini	bachelorThesis	-
dc.type.driver	bachelorThesis	-
dc.type.status	info:eu-repo/semantics/publishedVersion	en_US
dc.type.thesis	bachelorThesis	en_US
dcterms.DCMIType	Text	-
tuhh.dnb.status	domain	en_US
item.advisorGND	Tropmann-Frick, Marina	-
item.creatorGND	Ohsten, Moritz	-
item.creatorOrcid	Ohsten, Moritz	-
item.grantfulltext	open	-
item.languageiso639-1	de	-
item.openairecristype	http://purl.org/coar/resource_type/c_46ec	-
item.openairetype	Thesis	-
item.fulltext	With Fulltext	-
item.cerifentitytype	Publications	-
Appears in Collections:	Theses