Strip non-ASCII characters from a string using regex (regular expressions)

If you have string data that contains non-ASCII characters and want to strip them all out, the following regular expression will help.

[^\u0000-\u007F]+

Explanation

  • [^\u0000-\u007F]+ matches a single character not present in the list below
    Quantifier: + Between one and unlimited times, as many times as possible
  • \u0000-\u007F a single character in the range between the following two characters
    • \u0000 the character with code point U+0000 (NUL)
    • \u007F the character with code point U+007F (DEL)

^ is the negation operator when it is the first character inside a character class ([...]). It tells the regex to match every character that is not in the listed range, instead of the characters that are.

The \u####-\u#### part says which characters are in the range. \u0000-\u007F covers the first 128 code points of Unicode (0 through 127), which are exactly the ASCII characters; UTF-8 encodes each of them as a single byte. Because of the negation, the expression matches every non-ASCII character.
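To make the boundaries of the range concrete, here is a minimal sketch that tests single characters against the pattern. The class name AsciiRangeDemo and the helper isNonAscii are hypothetical names chosen for illustration; the only library call used is String.matches, which requires the whole string to match.

```java
public class AsciiRangeDemo {
    // Hypothetical helper: true when c lies outside the ASCII range U+0000..U+007F.
    // The doubled backslashes pass the escape through to the regex engine.
    public static boolean isNonAscii(char c) {
        return String.valueOf(c).matches("[^\\u0000-\\u007F]+");
    }

    public static void main(String[] args) {
        System.out.println(isNonAscii('A'));       // false: plain ASCII letter
        System.out.println(isNonAscii('\u007F'));  // false: DEL, the last ASCII code point
        System.out.println(isNonAscii('\u0080'));  // true: first code point past ASCII
        System.out.println(isNonAscii('\u00E9'));  // true: e with acute accent
    }
}
```

The last ASCII code point is 127 (U+007F), so anything from 128 upward counts as non-ASCII for this pattern.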

I had a string like the one below, which contains many non-standard characters

name 1= Chanel 51������������������������������������������������������������

Applying the replaceAll method in Java as below

String s = "name 1= Chanel 51������������������������������������������������������������";
// Double the backslashes so the regex engine, not the Java compiler, reads the escapes
s = s.replaceAll("[^\\u0000-\\u007F]+", "");
System.out.println(s);

would output the following to the console

name 1= Chanel 51
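If the same cleanup runs on many strings, one possible arrangement is to compile the pattern once and reuse it. The sketch below assumes a hypothetical class AsciiStripper with a hypothetical method stripNonAscii; neither appears in the original snippet, and it is only one way to package the same regex.

```java
import java.util.regex.Pattern;

public class AsciiStripper {
    // Compiled once and reused; matches runs of characters outside U+0000..U+007F.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\u0000-\\u007F]+");

    // Hypothetical helper: returns the input with every non-ASCII character removed.
    public static String stripNonAscii(String input) {
        return NON_ASCII.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The escapes below stand for the replacement character U+FFFD.
        String s = "name 1= Chanel 51\uFFFD\uFFFD\uFFFD";
        System.out.println(stripNonAscii(s)); // name 1= Chanel 51
    }
}
```

Note that this removes accented letters as well, so "café" becomes "caf"; whether that is acceptable depends on the data.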

Test it here: https://regex101.com

ASCII Table for reference.

[Figures: ASCII table; extended ASCII characters; EBCDIC and IBM scan codes]

Dost Muhammad Shah

Dost Muhammad specializes in embedded design, firmware development, PCB design, testing, and prototyping. He enjoys sharing his experience with others.

 
