In the previous part of the series ‘The NLP Project’, we learned all the basic lexical processing techniques, such as removing stop words, tokenization, stemming, and lemmatization. Even after going through all those preprocessing steps, a lot of noise is still present in the textual data. For example, there are spelling mistakes that happen by accident as well as by choice (informal words such as ‘lol’, ‘u’, ‘gud’, etc.), and spelling variations of a word that occur due to different pronunciations (e.g. ‘Srivastava’ vs ‘Shrivastava’).

The two methods that we talked about in the last part, stemming and lemmatization, are both parts of a technique called canonicalization. Basically, canonicalization means reducing a word to its base form. However, there are scenarios where we cannot canonicalize a word just by using stemming and lemmatization. For example, if the word ‘allowing’ is misspelled as ‘alowing’, then after stemming the misspelled word we are left with the redundant token ‘alow’, and lemmatization wouldn’t even work, as it only works on correct dictionary spellings. The same problem arises with the pronunciation of the same words in different dialects. For example, the word ‘humour’ used in British English is spelled ‘humor’ in American English. Both are correct spellings, but they will give two different base forms after stemming when used in the forms ‘humouring’ and ‘humoring’. Similarly, a common Indian surname like ‘Srivastava’ is also spelled as ‘Shrivastava’, ‘Shrivastav’, or ‘Srivastav’; the same name ends up written in several different ways.

Thus, we need another technique to canonicalize such words correctly, that is, to reduce all forms of a particular word to one single, common word. This technique is called phonetic hashing. Phonetic hashing canonicalizes words that have different variants but the same phonetic characteristics, that is, the same pronunciation. It combines words with the same phonemes (the smallest units of sound) into one bucket and gives the same hash code to all the variations, so the words ‘color’ and ‘colour’ get the same code. Soundex is an algorithm that can be used to calculate this phonetic hash code for a given word.
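To make the idea concrete, here is a minimal sketch of the classic American Soundex algorithm in plain Python (the function name `soundex` and the exact rule set are assumptions for illustration; libraries such as `jellyfish` offer production-ready implementations). It keeps the first letter, maps the remaining consonants to digits, drops vowels and ‘h’/‘w’/‘y’, collapses adjacent duplicate digits, and pads or truncates the result to four characters:

```python
def soundex(word: str) -> str:
    """Compute a classic American Soundex code (a sketch, not a full spec)."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}
    # Flatten the groups into a letter -> digit lookup table.
    to_digit = {ch: d for letters, d in groups.items() for ch in letters}

    word = word.lower()
    code = word[0].upper()              # always retain the first letter
    prev = to_digit.get(word[0], "")    # digit of the previous coded letter

    for ch in word[1:]:
        digit = to_digit.get(ch, "")    # vowels, h, w, y map to ""
        if digit and digit != prev:     # skip adjacent duplicate codes
            code += digit
        if ch not in "hw":              # h/w do not separate duplicates
            prev = digit

    return (code + "000")[:4]           # pad with zeros, cut to length 4


print(soundex("Srivastava"), soundex("Shrivastava"))  # both map to S612
print(soundex("color"), soundex("colour"))            # both map to C460
```

With this hash, all four surname variants (‘Srivastava’, ‘Shrivastava’, ‘Shrivastav’, ‘Srivastav’) land in the same bucket, which is exactly the canonicalization that stemming and lemmatization could not provide.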