EAFormer: Scene Text Segmentation with Edge-Aware Transformers

Haiyang Yu, Teng Fu, Bin Li, Xiangyang Xue
Fudan University

Abstract

Scene text segmentation aims to extract text regions from scene images and is typically used to help generative models edit or remove text. Existing text segmentation methods tend to introduce various text-related supervisions for better performance. However, most of them ignore the importance of text edges, which are significant for downstream applications.

In this paper, we propose Edge-Aware Transformers, termed EAFormer, to segment text more accurately, especially at text edges. Specifically, we first design a text edge extractor to detect edges and filter out those of non-text areas. Then, we propose an edge-guided encoder that makes the model focus more on text edges. Finally, an MLP-based decoder is employed to predict text masks. We have conducted extensive experiments on commonly used benchmarks to verify the effectiveness of EAFormer. The experimental results demonstrate that the proposed method outperforms previous methods, especially on the segmentation of text edges.

Considering that the annotations of several benchmarks (e.g., COCO_TS and MLT_S) are not accurate enough to fairly evaluate our method, we have relabeled these datasets. Through experiments, we observe that our method achieves a larger performance improvement when more accurate annotations are used for training.

Motivation

Existing scene text segmentation methods ignore the significance of text edges in practical applications. For instance, accurate text masks, especially in text-edge areas, can provide more background information to inpaint text areas in the text erasing task.

In experiments, we observe that traditional edge detection algorithms, such as Canny, can distinguish text edges well. To fully exploit the merits of traditional edge detection and improve segmentation performance at text edges, we propose Edge-Aware Transformers (EAFormer) for scene text segmentation.
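
To make this concrete, the snippet below is a minimal sketch (not the paper's implementation) of the kind of traditional edge extraction referred to above, using OpenCV's Canny detector. The text_region_mask argument is a hypothetical input standing in for the non-text-edge filtering that EAFormer performs.

import cv2
import numpy as np

def extract_text_edges(image_bgr, text_region_mask, low=100, high=200):
    # Detect all edges with Canny, then keep only those that fall inside
    # the (hypothetical) binary text-region mask, filtering out edges of
    # non-text areas.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)  # (H, W) uint8 edge map
    return cv2.bitwise_and(edges, edges, mask=text_region_mask)

# Example usage with a synthetic text image and an all-ones mask.
img = np.zeros((64, 64, 3), dtype=np.uint8)
cv2.putText(img, "Hi", (5, 40), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
mask = np.ones((64, 64), dtype=np.uint8)
edge_map = extract_text_edges(img, mask)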

Architecture

The proposed EAFormer consists of three modules: text edge extractor, edge-guided encoder, and text segmentation decoder. Given the input scene text image, the text edge extractor is used to obtain the edges of text areas. Then, the text image and detected text edges are input into the edge-guided encoder to extract edge-aware features. Finally, the text segmentation decoder takes the features generated by the encoder as input to produce the corresponding text mask.
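
The following PyTorch skeleton sketches this three-module data flow. All module internals, names, and dimensions here are placeholders of our own, not the authors' architecture; it only illustrates how the edge map produced by the extractor is fed alongside the image into the encoder, whose features the decoder maps to a text mask.

import torch
import torch.nn as nn

class EAFormerSketch(nn.Module):
    # Schematic pipeline: text edge extractor -> edge-guided encoder
    # -> MLP-based decoder. Internals are illustrative placeholders.

    def __init__(self, dim=64, num_classes=2):
        super().__init__()
        # Text edge extractor: stands in for edge detection plus
        # filtering of non-text edges.
        self.edge_extractor = nn.Sequential(
            nn.Conv2d(3, 1, kernel_size=3, padding=1), nn.Sigmoid()
        )
        # Edge-guided encoder: consumes image + edge map (4 channels)
        # so features can attend to text edges.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        # MLP-style decoder: per-pixel classification via 1x1 convolutions.
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1), nn.ReLU(),
            nn.Conv2d(dim, num_classes, kernel_size=1),
        )

    def forward(self, image):                # image: (B, 3, H, W)
        edges = self.edge_extractor(image)   # (B, 1, H, W) text-edge map
        feats = self.encoder(torch.cat([image, edges], dim=1))
        return self.decoder(feats)           # (B, num_classes, H, W) logits

# Example usage:
model = EAFormerSketch()
mask_logits = model(torch.randn(1, 3, 128, 128))  # -> (1, 2, 128, 128)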

Qualitative and Quantitative Analysis

EAFormer performs better than previous methods at text edges, benefiting from the introduced edge information. In addition, for COCO_TS and MLT_S, we compare segmentation results based on both the original and the modified annotations.

Dataset Reannotation

The original annotations of COCO_TS and MLT_S are too coarse to train a text segmentation model with satisfactory performance. Even though the proposed method achieves better performance on these datasets, that alone is not sufficient to demonstrate its effectiveness. To make the experimental results more convincing, we have re-annotated all samples of these two datasets and used the newly annotated datasets to conduct experiments.

BibTeX

@inproceedings{yu2024eaformer,
  author    = {Haiyang Yu and Teng Fu and Bin Li and Xiangyang Xue},
  title     = {EAFormer: Scene Text Segmentation with Edge-Aware Transformers},
  booktitle = {ECCV},
  year      = {2024},
}