16.10.2025. Hamza Salem. Seminar INTELLIGENT SYSTEMS and SYSTEM PROGRAMMING

ИСИ СО РАН (A.P. Ershov Institute of Informatics Systems, SB RAS)
Seminar: INTELLIGENT SYSTEMS and SYSTEM PROGRAMMING
Date: 16.10.2025
Speaker: Hamza Salem (Innopolis University)
Topic: An Algorithmic Framework for Precise Main Content Extraction from News Websites

Abstract

This thesis presents the design, implementation, and evaluation of a novel, open-source algorithm for Main Content Extraction (MCE) from web pages. The proposed algorithm operates on the Document Object Model (DOM) tree of an HTML document and employs a multi-criteria heuristic approach to identify the primary content node. It combines three key metrics: the node with the highest number of direct text-containing children, the node with the most text content that lacks text-bearing children, and the node closest to the middle depth of the DOM tree. This methodology is intentionally language-agnostic: it relies on structural features rather than linguistic cues, which makes it particularly effective for multilingual content and for languages with complex tokenization.

The algorithm's performance was rigorously evaluated against two established content extraction tools, Readability and Boilerpipe, using precision, recall, F1-score, and accuracy. Results demonstrate that the proposed MCE algorithm significantly outperforms these existing solutions, achieving near-perfect scores (e.g., Precision: 99.96%, Recall: 99.69%, F1-Score: 99.80%, Accuracy: 99.65% on the...
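The three structural metrics named in the abstract can be illustrated with a minimal sketch. This is not the speaker's implementation: the abstract does not say how the three candidates are weighted or merged, so the code below only computes one candidate node per metric. The names `Node`, `TreeBuilder`, and `candidates` are hypothetical, "text-bearing child" is assumed to mean a direct child element with its own non-empty text, and "middle depth" is assumed to mean half of the maximum depth in the tree.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class Node:
    """Hypothetical minimal DOM node: tag, parent, children, own text, depth."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element, not in descendants
        self.depth = 0 if parent is None else parent.depth + 1


class TreeBuilder(HTMLParser):
    """Builds a simplified tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP:
            self.current.text += data.strip()


def walk(node):
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def candidates(html):
    """Return one candidate node per metric (assumed interpretation)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = [n for n in walk(builder.root) if n.parent is not None]

    # Metric 1: most direct children that carry text of their own.
    by_children = max(nodes, key=lambda n: sum(bool(c.text) for c in n.children))

    # Metric 2: most own text among nodes with no text-bearing children.
    leaves = [n for n in nodes if not any(c.text for c in n.children)]
    by_text = max(leaves, key=lambda n: len(n.text))

    # Metric 3: depth closest to half of the maximum depth (ties: first in order).
    middle = max(n.depth for n in nodes) / 2
    by_depth = min(nodes, key=lambda n: abs(n.depth - middle))

    return by_children, by_text, by_depth
```

On a page with a short navigation bar, an article made of several paragraphs, and a footer, the first metric tends to pick the element wrapping the paragraphs and the second picks the longest paragraph. None of the three metrics inspects the words themselves, which is consistent with the language-agnostic claim in the abstract.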
