Strip non-ASCII characters from a string using RegEx (regular expressions)

If you have string data that contains non-ASCII characters and you want to strip all of them out, the following regular expression will help.

[^\u0000-\u007F]+

Explanation

[^\u0000-\u007F]+ matches a single character not present in the list below
Quantifier: + Between one and unlimited times, as many times as possible
\u0000-\u007F a single character in the range between the following two characters
- \u0000 the character with code point U+0000 (NUL)
- \u007F the character with code point U+007F (DEL)

^ is the not operator when it appears first inside the square brackets. It tells the regex to match every character that is not in the listed range, instead of every character that is.

The \u####-\u#### part says which characters are in the range. \u0000-\u007F is the equivalent of the first 128 characters (code points 0 to 127) in UTF-8 or Unicode, which are always the ASCII characters. Because of the not, you match every non-ASCII character.
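As a quick sanity check of the range described above, the following minimal sketch counts how many of the first 256 code points fall inside the class without the negation. The class name AsciiRangeCheck and the method countAscii are made up for this illustration and are not part of the original post.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate the range.
class AsciiRangeCheck {
    // Same range as in the article, but without the leading ^ (so it matches ASCII).
    static final Pattern ASCII = Pattern.compile("[\\u0000-\\u007F]");

    // Counts how many code points in 0..limit-1 are matched by the ASCII class.
    static int countAscii(int limit) {
        int count = 0;
        for (int cp = 0; cp < limit; cp++) {
            String s = new String(Character.toChars(cp));
            if (ASCII.matcher(s).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countAscii(256)); // prints 128
    }
}
```

Only code points 0 to 127 are counted, so everything from 128 upward is what the negated class removes.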

I had a string like the one below, containing many non-standard characters:

name 1= Chanel 51������������������������������������������������������������

Applying the replaceAll method in Java as below

String s = "name 1= Chanel 51������������������������������������������������������������";
// The doubled backslashes pass \u0000 and \u007F through to the regex engine.
s = s.replaceAll("[^\\u0000-\\u007F]+", "");
System.out.println(s);

would print the following to the console:

name 1= Chanel 51
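If the same cleanup runs on many strings, one option is to compile the pattern once and wrap the call in a helper. This is only a sketch under that assumption: the class name AsciiStripper and the method stripNonAscii are hypothetical, and the sample string uses three U+FFFD replacement characters to stand in for the garbled tail shown above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not from the original post.
class AsciiStripper {
    // Compiled once and reused; matches runs of characters outside U+0000..U+007F.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\u0000-\\u007F]+");

    // Returns the input with every non-ASCII character removed.
    static String stripNonAscii(String input) {
        return NON_ASCII.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "name 1= Chanel 51\uFFFD\uFFFD\uFFFD";
        System.out.println(stripNonAscii(s)); // prints: name 1= Chanel 51
    }
}
```

Note that the characters are deleted, not transliterated, so accented letters such as é disappear instead of becoming e.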

Test it here: https://regex101.com

ASCII tables for reference: ASCII table, extended ASCII characters, EBCDIC and IBM scan codes.

dmSherazi

Embedded Electronics HW/SW Engineer with expertise in hardware and software, including PCB design and analysis, firmware development and Android development, apart from casual web development (WordPress & Laravel) and, to some extent, photography.
