Illegal Character Compilation Error

Last modified: April 27, 2022

1. Overview


The illegal character compilation error is a file encoding error. It occurs when a source file is saved with an encoding, or an encoding artifact, that the compiler doesn’t expect. As a result, in languages like Java, we can get this error when we try to compile our project. In this tutorial, we’ll describe the problem in detail along with some scenarios where we may encounter it, and then we’ll present some examples of how to resolve it.


2. Illegal Character Compilation Error


2.1. Byte Order Mark (BOM)


Before we go into the byte order mark, we need to take a quick look at the UCS (Unicode) Transformation Format (UTF). UTF is a character encoding format that can encode all of the possible character code points in Unicode. There are several kinds of UTF encodings. Among all these, UTF-8 has been the most used.


UTF-8 is a variable-width encoding built on 8-bit code units, which keeps it backward compatible with ASCII. A file encoded this way may begin with the three bytes EF BB BF, which are the UTF-8 encoding of the code point U+FEFF, known as the byte order mark (BOM). This mark, correctly handled, is invisible. However, in some cases, it could lead to data errors.
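To make the byte sequence concrete, here is a minimal sketch that encodes U+FEFF as UTF-8 and checks whether a byte array starts with the resulting bytes. The class name BomBytes and its helper methods are hypothetical names chosen for illustration:

```java
import java.nio.charset.StandardCharsets;

class BomBytes {
    // The UTF-8 encoding of the code point U+FEFF
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns true if the given bytes start with the UTF-8 BOM
    static boolean startsWithUtf8Bom(byte[] data) {
        if (data.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (data[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    // Formats bytes as upper-case hex values separated by spaces
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] encoded = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(encoded));             // EF BB BF
        System.out.println(startsWithUtf8Bom(encoded)); // true
    }
}
```

The same check could be applied to the first three bytes of any file we suspect of carrying a BOM.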


In the UTF-8 encoding, the BOM isn’t required, since UTF-8 has no byte-order ambiguity to resolve. Even so, it may still appear in UTF-8 encoded text. It can be added either by an encoding conversion or by a text editor that uses it to flag the content as UTF-8.


Text editors like Notepad on Windows could produce this kind of addition. As a consequence, when we use a Notepad-like text editor to create a code example and try to run it, we could get a compilation error. In contrast, modern IDEs encode created files as UTF-8 without the BOM. The next sections will show some examples of this problem.


2.2. Class with Illegal Character Compilation Error


Typically, we work with advanced IDEs, but sometimes, we use a text editor instead. Unfortunately, as we’ve learned, some text editors could create more problems than solutions because saving a file with a BOM could lead to a compilation error in Java. The “illegal character” error occurs in the compilation phase, so it’s quite easy to detect. The next example shows us how it works.


First, let’s write a simple class in our text editor, such as Notepad. This class is just a representation – we could write any code to test. Next, we save our file with the BOM to test:


public class TestBOM {
    public static void main(String ...args){
        System.out.println("BOM Test");
    }
}

Now, let’s try to compile this file using the javac command:


$ javac ./

Consequently, we get the error message:


public class TestBOM {
.\ error: illegal character: '\u00bf'
public class TestBOM {
2 errors
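One plausible explanation for the '\u00bf' in this message, assuming javac decoded the file with a single-byte Western encoding such as ISO-8859-1 or windows-1252 rather than UTF-8, is that each of the three BOM bytes became its own character. The sketch below reproduces that decoding; the class name BomMisdecoded is hypothetical, and the encoding javac actually used depends on the platform default or the -encoding option:

```java
import java.nio.charset.StandardCharsets;

class BomMisdecoded {
    // Decodes the UTF-8 BOM bytes as ISO-8859-1 (an assumed stand-in for the
    // compiler's default encoding) and renders each char as a Java-style escape
    static String misdecodedBom() {
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        String decoded = new String(bom, StandardCharsets.ISO_8859_1);
        StringBuilder sb = new StringBuilder();
        for (char c : decoded.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the escapes for the code points 00ef, 00bb and 00bf
        System.out.println(misdecodedBom());

        // Only the first of the three is allowed inside a Java identifier
        System.out.println(Character.isJavaIdentifierPart((char) 0xEF)); // true
        System.out.println(Character.isJavaIdentifierPart((char) 0xBB)); // false
        System.out.println(Character.isJavaIdentifierPart((char) 0xBF)); // false
    }
}
```

Under this assumption, two of the three characters are illegal in Java source, which is consistent with the two errors reported above.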

To fix this problem, we just need to save the file as UTF-8 without a BOM. After that, the file compiles. We should always check that our source files are saved without a BOM.
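If re-saving from an editor isn’t convenient, the same fix can be sketched in a few lines of Java. This is only an illustration, assuming a UTF-8 file that is small enough to read into memory; the class BomStripper and its methods are hypothetical names:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class BomStripper {
    // Returns the data without a leading UTF-8 BOM (EF BB BF), if one is present
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
          && (data[0] & 0xFF) == 0xEF
          && (data[1] & 0xFF) == 0xBB
          && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    // Rewrites the file in place without the BOM; returns true if it changed
    static boolean removeBom(Path file) throws IOException {
        byte[] original = Files.readAllBytes(file);
        byte[] stripped = stripUtf8Bom(original);
        if (stripped.length == original.length) {
            return false;
        }
        Files.write(file, stripped);
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Expects the path of the file to clean as the first argument
        Path source = Path.of(args[0]);
        System.out.println(removeBom(source) ? "BOM removed" : "No BOM found");
    }
}
```

Running it twice on the same file is harmless, since the second run finds no BOM and leaves the file untouched.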


Another way to fix this issue is with a tool like dos2unix. This tool will remove the BOM and also take care of other idiosyncrasies of Windows text files, such as CRLF line endings.


3. Reading Files


Additionally, let’s analyze some examples of reading files encoded with BOM.


First, we need to create a file with a BOM to use for our test. This file contains our sample text, “Hello world with BOM.”, which will be our expected string. Next, let’s start testing.


3.1. Reading Files Using BufferedReader


First, we’ll test the file using the BufferedReader class:


public void whenInputFileHasBOM_thenUseInputStream() throws IOException {
    String line;
    String actual = "";
    // file is the InputStream of the BOM-prefixed test file
    try (BufferedReader br = new BufferedReader(new InputStreamReader(file))) {
        while ((line = br.readLine()) != null) {
            actual += line;
        }
    }
    // fails: actual still starts with the U+FEFF character
    assertEquals(expected, actual);
}

In this case, when we try to assert that the strings are equal, we get an error:


org.opentest4j.AssertionFailedError: expected: <Hello world with BOM.> but was: <Hello world with BOM.>
Expected :Hello world with BOM.
Actual   :Hello world with BOM.

At first glance, both strings in the test output look identical. Even so, the actual value contains the invisible BOM character at its start. As a result, the strings aren’t equal.
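Since the difference can’t be seen in the assertion message, it can help to print the string with its non-printable characters spelled out. Here is a small sketch of that idea; the class InvisibleBom and the reveal helper are hypothetical names, and the strings are built in memory rather than read from the test file:

```java
class InvisibleBom {
    // Keeps printable ASCII as-is and renders every other char as <U+XXXX>
    static String reveal(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);
            } else {
                sb.append(String.format("<U+%04X>", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String expected = "Hello world with BOM.";
        String actual = "\uFEFF" + expected; // what the reader hands back

        System.out.println(expected.equals(actual));                      // false
        System.out.println(expected.length() + " vs " + actual.length()); // 21 vs 22
        System.out.println(reveal(actual)); // <U+FEFF>Hello world with BOM.
    }
}
```

The extra character shows up both in the length and in the revealed output, even though the two strings print identically.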


A quick fix is to remove the BOM character:


public void whenInputFileHasBOM_thenUseInputStreamWithReplace() throws IOException {
    String line;
    String actual = "";
    try (BufferedReader br = new BufferedReader(new InputStreamReader(file))) {
        while ((line = br.readLine()) != null) {
            // drop every U+FEFF character found in the line
            actual += line.replace("\uFEFF", "");
        }
    }
    assertEquals(expected, actual);
}

The replace method clears the BOM from our string, so our test passes. We need to use the replace method carefully, though: it scans every line even though a BOM only appears at the very start of a file, so processing a huge number of files this way can lead to performance issues.
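One way to avoid scanning every line is to check only the first character of the stream. The following sketch uses the mark and reset methods of BufferedReader for this, assuming the input is UTF-8; the class LeadingBomSkipper and its methods are hypothetical names, and the input is an in-memory byte array instead of the test file:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class LeadingBomSkipper {
    private static final int BOM = 0xFEFF;

    // Opens a UTF-8 reader positioned after a leading BOM, if there is one
    static BufferedReader openSkippingBom(InputStream in) throws IOException {
        BufferedReader br = new BufferedReader(
          new InputStreamReader(in, StandardCharsets.UTF_8));
        br.mark(1);           // remember the start of the stream
        if (br.read() != BOM) {
            br.reset();       // first char is real content, so rewind
        }
        return br;
    }

    // Reads all lines, concatenated, from UTF-8 bytes that may start with a BOM
    static String readAll(byte[] data) {
        try (BufferedReader br = openSkippingBom(new ByteArrayInputStream(data))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "\uFEFFHello world with BOM.".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(data)); // Hello world with BOM.
    }
}
```

With this approach, the per-line work stays the same as in the plain BufferedReader example, and the BOM check happens once per file.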


3.2. Reading Files Using Apache Commons IO


In addition, the Apache Commons IO library provides the BOMInputStream class. This class wraps an input stream whose first bytes may be an encoded ByteOrderMark, detects the mark, and can exclude it from the data we read. Let’s see how it works:


public void whenInputFileHasBOM_thenUseBOMInputStream() throws IOException {
    String line;
    String actual = "";
    ByteOrderMark[] byteOrderMarks = new ByteOrderMark[] { 
      ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE
    };
    // include = false: a detected BOM is skipped instead of being returned
    InputStream inputStream = new BOMInputStream(ioStream, false, byteOrderMarks);
    Reader reader = new InputStreamReader(inputStream);
    BufferedReader br = new BufferedReader(reader);
    while ((line = br.readLine()) != null) {
        actual += line;
    }
    assertEquals(expected, actual);
}

The code is similar to previous examples, but we pass the BOMInputStream as a parameter into the InputStreamReader.


3.3. Reading Files Using Google Data (GData)


Another helpful library for handling the BOM is Google Data (GData). This is an older library, but it helps manage the BOM inside the files. It uses XML as its underlying format. Let’s see it in action:


public void whenInputFileHasBOM_thenUseGoogleGdata() throws IOException {
    // 21 is the length of the expected string without the BOM
    char[] actual = new char[21];
    try (Reader r = new UnicodeReader(ioStream, null)) {
        r.read(actual);
    }
    assertEquals(expected, String.valueOf(actual));
}

Finally, as we observed in the previous examples, removing the BOM from the files is important. If we don’t handle it properly, we may get unexpected results when the data is read. That’s why we need to be aware of the existence of this mark in our files.


4. Conclusion


In this article, we covered several topics regarding the illegal character compilation error in Java. First, we learned what UTF is and how the BOM is integrated into it. Second, we showed a sample class created using a text editor – Windows Notepad, in this case. The generated class threw the compilation error for the illegal character. Finally, we presented some code examples on how to read files with a BOM.


As usual, all the code used for this example can be found over on GitHub.