Arabic Case Insensitive In Database Systems: How To Solve Alef With and Without Hamza Problem

  • Create a custom collation in your database system
  • Add a normalized field to your table
  • Use regular expressions in queries
+----+--------------+
| id | name |
+----+--------------+
| 1 | احمد |
| 2 | أحمد |
| 3 | أسامه |
| 4 | أسامة |
| 5 | اسامه |
| 6 | اسَامه |
+----+--------------+
6 rows in set

1. Create a custom collation

This is the recommended solution for most cases. You won’t change any data in your database. All you are going to do is simply telling the DBS to treat these characters as one. The steps we will use here are based on MySQL documentation Adding a UCA Collation to a Unicode Character Set.

SHOW VARIABLES LIKE 'character_sets_dir';
+--------------------+----------------------------+
| Variable_name | Value |
+--------------------+----------------------------+
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------+----------------------------+
1 row in set (0.00 sec)
<collation name="utf8_arabic_ci" id="1029">
<rules>
<reset>\u0627</reset> <!-- Alef 'ا' -->
<i>\u0623</i> <!-- Alef With Hamza Above 'أ' -->
<i>\u0625</i> <!-- Alef With Hamza Below 'إ' -->
<i>\u0622</i> <!-- Alef With Madda Above 'آ' -->
</rules>
<rules>
<reset>\u0629</reset> <!-- Teh Marbuta 'ة' -->
<i>\u0647</i> <!-- Heh 'ه' -->
</rules>
<rules>
<reset>\u0000</reset> <!-- Unicode value of NULL -->
<i>\u064E</i> <!-- Fatha 'َ' -->
<i>\u064F</i> <!-- Damma 'ُ' -->
<i>\u0650</i> <!-- Kasra 'ِ' -->
<i>\u0651</i> <!-- Shadda 'ّ' -->
<i>\u064F</i> <!-- Sukun 'ْ' -->
<i>\u064B</i> <!-- Fathatan 'ً' -->
<i>\u064C</i> <!-- Dammatan 'ٌ' -->
<i>\u064D</i> <!-- Kasratan 'ٍ' -->
</rules>
</collation>
  • 'أ','إ','آ' is the same character as 'ا' (All the Alef forms are one character)
  • 'ه' is the same character as 'ة' (so "نسمة" is the same as "نسمه")
  • Tashkil characters are ignored completely as if they aren’t there
sudo service mysql restart
ALTER TABLE persons
MODIFY name VARCHAR(50)
CHARACTER SET 'utf8'
COLLATE 'utf8_arabic_ci';
SELECT * FROM persons WHERE name = "اسامة";+----+--------------+
| id | name |
+----+--------------+
| 3 | أسامه |
| 4 | أسامة |
| 5 | اسامه |
| 6 | اسَامه |
+----+--------------+
4 rows in set (0.00 sec)

2. Add a normalized field

This solution doesn’t require editing configuration files, and it is independent of the database system. It should work even if you changed the DBS for any reason. However, it will require adding an additional column to our table and some data processing. The idea is simple, add a new column and fill it with the Arabic text in a “normalized form”, then use the normalized column in your queries. Let’s see how we can do this.

function normalize_name($name) {
$patterns = array( "/إ|أ|آ/" ,"/ة/", "/َ|ً|ُ|ِ|ٍ|ٌ|ّ/" );
$replacements = array( "ا" , "ه" , "" );
return preg_replace($patterns, $replacements, $name);
}
normalize_name("أحمد");  // return: احمد
normalize_name("آمنة"); // return: امنه
normalize_name("أسامه"); // return: اسامه
normalize_name("مٌحَمَّد"); // return: محمد
1  احمد   احمد
2 أحمد احمد
3 أسامه اسامه
4 أسامة اسامه
5 اسامه اسامه
6 اسَامه اسامه
SELECT id, name FROM persons WHERE normalized_name = "اسامه";+----+--------------+
| id | name |
+----+--------------+
| 3 | أسامه |
| 4 | أسامة |
| 5 | اسامه |
| 6 | اسَامه |
+----+--------------+
4 rows in set (0.00 sec)

3. Use Regular Expressions in queries

In this solution, we won’t change any configuration files, nor add extra data to our database. However, Regexp search is slower than regular search, and you will lose the advantage of using indices which could affect the performance badly. I don’t know a way in regular expressions to ignore Tashkil in the database field. I don’t recommend using Regular Expressions for search functions except in tiny databases, but you might find it useful for special kind of queries.

"[ا|أ|إ|آ]سام[ه|ة]"
SELECT id, name FROM persons WHERE name REGEXP "[ا|أ|إ|آ]سام[ه|ة]";+----+------------+
| id | name |
+----+------------+
| 3 | أسامه |
| 4 | أسامة |
| 5 | اسامه |
+----+------------+
3 rows in set (0.01 sec)
function generate_pattern($search_string) {
$patterns = array( "/(ا|إ|أ|آ)/", "/(ه|ة)/" );
$replacements = array( "[ا|إ|أ|آ]", "[ه|ة]" );
return preg_replace($patterns, $replacements, $search_string);
}
generate_pattern("أسامة"); // return '[ا|إ|أ|آ]س[ا|إ|أ|آ]م[ة|ه]'
generate_pattern("أسامه"); // return '[ا|إ|أ|آ]س[ا|إ|أ|آ]م[ة|ه]'
generate_pattern("احمد"); // return '[ا|إ|أ|آ]حمد'

Conclusion

That will be it for all the methods I know to accomplish Arabic insensitive search. Arabic is a great and a very rich language, this requires special handling in layouts, programming and databases, creating challenges we have to deal with as Arab programmers. Sharing our experiences is the key to create a strong and helpful community. And to add our contribution to the worldwide community of developers. Hope you found this article useful.

 by the author.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Ahmed Essam

Ahmed Essam

Web developer, Arabic Native, open-source enthusiastic, trying to do things the right way.