Removing punctuation in JavaScript is relatively easy, but removing accents while keeping the underlying letters is a bit more challenging. Either way, below are some minimal functions that cover both cases.
How to remove accents in JavaScript
To remove accents and cedillas from a string and get back the same string without them, we can use ES6's String.prototype.normalize method, followed by String.prototype.replace:
const str = 'ÁÉÍÓÚáéíóúâêîôûàèìòùÇç';
const parsed = str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
console.log(parsed); // AEIOUaeiouaeiouaeiouCc
Explanation
The normalize method was introduced in ES6 (ECMAScript 2015). It converts a string to a Unicode normalization form. Here we pass NFD (canonical decomposition), which splits each accented letter into its base letter followed by one or more separate combining marks.
To get a better idea of how this decomposition works, see below:
// 'Á' as a single precomposed character is 1 UTF-16 code unit
'Á'.length; // 1
// After NFD, 'Á' is 2 code units: \u0041\u0301
'Á'.normalize('NFD').length; // 2
// Printing the decomposed sequence still renders as a single character
console.log('\u0041\u0301'); // Á
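Then the replace call removes every character in the Combining Diacritical Marks block, U+0300 to U+036F, which is where the accents split off by NFD end up. The \uXXXX escapes used for this range in the regular expression are not an ES6 feature; they worked in earlier versions of JavaScript as well.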
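As a minimal sketch of the two steps described above, the snippet below lists the code points produced by NFD and wraps the accent removal in a small helper. The names codePoints and removeAccents are illustrative only and do not come from any library.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
const codePoints = (str) =>
  [...str].map((ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

// Hypothetical helper: decompose with NFD, then drop the combining marks.
const removeAccents = (str) =>
  str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

console.log(codePoints('\u00C1'));                          // [ 'U+00C1' ]
console.log(codePoints('\u00C1'.normalize('NFD')));         // [ 'U+0041', 'U+0301' ]
console.log(removeAccents('\u00C1gua e a\u00E7\u00FAcar')); // Agua e acucar
```

The escapes \u00C1, \u00E7 and \u00FA stand for Á, ç and ú; they are written as escapes here only so the example does not depend on how the source file is encoded.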
Removing all special characters in JavaScript
To remove accents along with other special characters such as /?!(), use the same approach, but replace everything that is not a letter or a digit.
const str = 'ÁÉÍÓÚáéíóúâêîôûàèìòùÇç/.,~!@#$%&_-12345';
const parsed = str.normalize('NFD').replace(/([\u0300-\u036f]|[^0-9a-zA-Z])/g, '');
console.log(parsed); // AEIOUaeiouaeiouaeiouCc12345
Explanation
To understand the code above, I suggest reading the previous section, where I cover Unicode and the normalize method.
The only addition here is a second alternative in the regex, in the form ([ group 1 ]|[ group 2 ]), where group 2 is the character class [^0-9a-zA-Z], meaning: anything that is not (^) 0-9, a-z or A-Z is also removed.
If you don't want to remove spaces, just add \s to the second character class:
str.normalize('NFD').replace(/([\u0300-\u036f]|[^0-9a-zA-Z\s])/g, '')
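The two variants above can be folded into one function. This is only a sketch: stripSpecialChars and its keepSpaces option are hypothetical names chosen for illustration, and the combining marks are removed in a separate step purely for readability.

```javascript
// Hypothetical helper: strip accents, then drop everything except ASCII letters and digits.
// Pass { keepSpaces: true } to preserve whitespace (spaces, tabs, newlines).
const stripSpecialChars = (str, { keepSpaces = false } = {}) => {
  const unwanted = keepSpaces ? /[^0-9a-zA-Z\s]/g : /[^0-9a-zA-Z]/g;
  return str
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '') // remove combining marks
    .replace(unwanted, '');          // remove the remaining special characters
};

console.log(stripSpecialChars('Ol\u00E1, mundo!'));                       // Olamundo
console.log(stripSpecialChars('Ol\u00E1, mundo!', { keepSpaces: true })); // Ola mundo
```

Note that the underscore is removed too, because it is neither a digit nor an ASCII letter.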
Replacing special characters
Another common use case is stripping the accents and then replacing special characters with some other character, e.g. "Any phrase" -> "Any-phrase".
There is a handy regular expression for replacing characters that are not common letters or numbers, but on its own it also replaces the accented letters themselves, not just their accents:
'Here\'s à sentence'.replace(/[^\w\-]+/g, '-'); // Here-s-sentence
If we want to strip only the accents and then replace the remaining special characters, we first need to apply the technique from the first example:
'Here\'s à sentence'.normalize('NFD').replace(/[\u0300-\u036f]/g, '').replace(/[^\w\-]+/g, '-'); // Here-s-a-sentence
But you may also need to clean up leftover hyphens: "This is a sentence!!!" turns into "This-is-a-sentence-", with a trailing hyphen, and input that already contains hyphens, such as "a - b", turns into "a---b".
Here's a complete function that removes accents, replaces special characters with hyphens, and removes extra hyphens:
const replaceSpecialChars = (str) => {
  return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '') // Remove accents
    .replace(/([^\w]+|\s+)/g, '-') // Replace runs of spaces and other non-word characters with a hyphen
    .replace(/\-\-+/g, '-') // Collapse multiple hyphens into one
    .replace(/(^-+|-+$)/g, ''); // Remove hyphens from the beginning and end of the string
};
console.log(replaceSpecialChars('This is a sentence!!!')); // This-is-a-sentence
If you want to use this same function to generate a URL slug ("slugify"), just add toLowerCase() at the end and it's done!
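A minimal sketch of that slug variant could look like the following. The name slugify is hypothetical, and the function is written out in full so it runs on its own.

```javascript
// Hypothetical helper built from the steps described in this section.
const slugify = (str) =>
  str
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '') // remove accents
    .replace(/[^\w]+/g, '-')         // runs of spaces, punctuation and hyphens become one hyphen
    .replace(/(^-+|-+$)/g, '')       // trim hyphens at both ends
    .toLowerCase();

console.log(slugify('This is a sentence!!!'));             // this-is-a-sentence
console.log(slugify('A\u00E7\u00E3o & Rea\u00E7\u00E3o')); // acao-reacao
```

Because \w matches the underscore, underscores survive in the slug; add `_` to the replaced characters if that is not what you want.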
I think I've covered the most common cases when working with accents and special characters in JavaScript. I know it's an extra challenge, for those of us writing in languages other than English, that there are no built-in methods for dealing with special characters.