1 Introduction
Let’s now move on to another large but very interesting topic area from the field of Data Science: Natural Language Processing.
I already covered the topic of String Manipulation once at the beginning of my blog series on Data Science with Python. That was more about handling text columns with functions like:
In the following, we will delve deeper into the topic of text processing in order to be able to extract valuable insights from text variables using machine learning.
1.1 What is NLP?
Natural Language Processing (NLP) is a branch of artificial intelligence and is generally defined as the automatic manipulation of natural language, such as speech and text, by software. It draws on many disciplines, including computer science and computational linguistics, to bridge the gap between human communication and computer understanding.
NLP is not a new discipline and has been around for a very long time. In recent years, however, the need for and interest in human-machine communication has increased dramatically. Together with the availability of big data and increasingly powerful computers, this has led to rapid development of NLP technology, which is now accessible to a wide range of interested parties.
1.2 Future Perspectives for NLP
The global Natural Language Processing (NLP) market size is expected to grow from USD 11.6 billion in 2020 to USD 35.1 billion by 2026, according to statistics from MarketsandMarkets. The increasing adoption of NLP-based applications in various industries is expected to provide tremendous opportunities for NLP providers.
1.3 Application Areas of NLP
Let’s take a look at 11 of the most interesting applications of natural language processing in business:
- Sentiment Analysis
- Text Classification
- Chatbots & Virtual Assistants
- Text Extraction
- Machine Translation
- Text Summarization
- Market Intelligence
- Auto-Correct
- Intent Classification
- Urgency Detection
- Speech Recognition
I will start here with the basics and then, from post to post, cover more advanced topics such as the applications listed above.
2 Text Manipulation
Before we can take the first steps towards NLP, we should know the basics of text manipulation. They are simple, but they are essential for the later steps towards model training.
2.1 String Variables
First of all, we assign a sample string to a variable.
word = "Hello World!"
print(word)
2.2 Use of Quotation Marks
If you want to use quotation marks within a string, you should choose one of the following two options and stick to the chosen variant for the sake of consistency to avoid problems later on.
quotation_marks_var1 = 'Hi, my name is "Alice"'
print(quotation_marks_var1)
quotation_marks_var2 = "Hi, my name is 'Alice'"
print(quotation_marks_var2)
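If a string has to contain both kinds of quotation marks, the two variants above are not enough. As a small sketch that goes beyond the two options shown (the variable names `quotation_marks_var3` and `quotation_marks_var4` are made up for illustration), you can escape the inner quotes with a backslash or use a triple-quoted string:

```python
# Additional option (not one of the two variants above):
# escape the inner quotation marks with a backslash.
quotation_marks_var3 = "Hi, my name is \"Alice\""
print(quotation_marks_var3)

# Additional option: triple quotes allow both kinds of
# quotation marks inside the string without escaping.
quotation_marks_var4 = '''She said: "Hi, I'm Alice"'''
print(quotation_marks_var4)
```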
2.3 Obtaining specific Information from a String
Access the first character of a string:
word[0]
Access specific characters of a string via slicing:
word[6:12]
word[:5]
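To make the indexing rules explicit, here is a short sketch using the same sample string. Indices start at 0, the end index of a slice is excluded, and negative indices count from the end of the string. The variable names are only illustrative.

```python
word = "Hello World!"

first = word[0]      # 'H'
last = word[-1]      # '!' (negative indices count from the end)
hello = word[:5]     # 'Hello' (the character at index 5 is excluded)
world = word[6:11]   # 'World'
rest = word[6:]      # 'World!' (same result as word[6:12])

print(first, last, hello, world, rest)
```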
Obtaining the length of a string:
len(word)
Counting the number of specific letters (here ‘l’) within a string:
word.count('l')
Find the index of a specific letter:
word.find('W')
The letter W of the word World is at index position 6 of our string. If a letter occurs several times in a string, the index of its first occurrence is returned. We are not limited to single letters, though: we can also output the index at which a certain word starts:
word.index('World')
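`find` and `index` return the same position when the substring exists, but they behave differently when it does not. The following sketch illustrates this with a substring ('Python') chosen purely for demonstration:

```python
word = "Hello World!"

# Both methods agree when the substring is present.
print(word.find('W'))
print(word.index('World'))

# find() signals "not found" with -1 ...
missing_find = word.find('Python')
print(missing_find)

# ... whereas index() raises a ValueError.
index_raised = False
try:
    word.index('Python')
except ValueError as error:
    index_raised = True
    print('index() raised a ValueError:', error)
```

If a missing substring is an expected case, `find` is usually the more convenient choice. If it indicates a real error, the exception raised by `index` is harder to overlook.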
2.4 String Manipulation
For the following examples, let’s take a look at this kind of example string:
word2 = "tHiS Is aN uNstRucTured sEnTencE"
print(word2)
Convert all characters to uppercase:
word2.upper()
Convert all characters to lowercase:
word2.lower()
Capitalize the first letter of each word:
word2.title()
Capitalize only the first letter of a sentence:
word2.capitalize()
You can also swap upper and lower case within a string. Let’s use this example sentence:
word3 = "another FUNNY sentence"
print(word3)
word3.swapcase()
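For reference, the following sketch collects the case methods from this section and prints their results side by side. The dictionary `results` is only a helper for this illustration. Note that none of the methods changes the original string; each one returns a new string.

```python
word2 = "tHiS Is aN uNstRucTured sEnTencE"
word3 = "another FUNNY sentence"

# Collect the result of each method under its name.
results = {
    "upper": word2.upper(),
    "lower": word2.lower(),
    "title": word2.title(),
    "capitalize": word2.capitalize(),
    "swapcase": word3.swapcase(),
}

for name, value in results.items():
    print(name + ': ' + value)

# The original string is unchanged.
print(word2)
```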
2.5 Arithmetic Operations
Some arithmetic operators also work with strings: + concatenates two strings and * repeats a string. See the following examples:
Appending another string:
print(word)
print()
print(word + ' What a sunny day!')
Repeat a string multiple times:
print(word * 5)
Or, a little prettier, with a space after each repetition:
print((word + ' ')* 5)
With join we can insert a space between the individual characters:
print(' '.join(word))
or reverse the order of the string:
print(''.join(reversed(word)))
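Two details are worth a quick sketch here. First, + only works between strings, so a number has to be converted with `str` before it can be appended. Second, slicing with a step of -1 is a common alternative way to reverse a string. The variable names below are only illustrative.

```python
word = "Hello World!"

# Adding a number to a string directly raises a TypeError.
try:
    broken = word + 5
except TypeError as error:
    print('TypeError:', error)

# Converting the number to a string first works.
combined = word + ' ' + str(5)
print(combined)

# Two equivalent ways to reverse a string.
reversed_join = ''.join(reversed(word))
reversed_slice = word[::-1]
print(reversed_join)
print(reversed_slice)
```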
2.6 Check String Properties
In the following I will check some properties of the string.
Here again the string:
print(word)
Check if all characters of the string are alphanumeric:
word.isalnum()
Check if all characters of the string are alphabetic:
word.isalpha()
Check if all characters of the string are digits:
word.isdigit()
Check if the string is title-cased, i.e. every word starts with an uppercase letter followed by lowercase letters:
word.istitle()
Check if the complete string is in upper case:
word.isupper()
Check if the complete string is in lower case:
word.islower()
Check if the string consists only of whitespace:
word.isspace()
Check whether the string ends with a !:
word.endswith('!')
Check whether the string starts with an ‘H’:
word.startswith('H')
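Because our sample string contains a space and an exclamation mark, most of the character checks return False for it. The following sketch contrasts it with a few additional strings, which are made up for this illustration, to show when the checks return True:

```python
# Sample strings; all except the first are made up for illustration.
samples = ["Hello World!", "Hello", "Hello123", "2024", "   "]

checks = {}
for text in samples:
    checks[text] = {
        "isalnum": text.isalnum(),
        "isalpha": text.isalpha(),
        "isdigit": text.isdigit(),
        "isspace": text.isspace(),
    }
    print(repr(text), checks[text])

# The space and the '!' make these checks fail for the original string,
# while it still counts as title-cased.
print("Hello World!".istitle())
```

The methods test every character of the string, so a single space or punctuation mark is enough to make `isalnum`, `isalpha` and `isdigit` return False.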
2.7 Replace certain Characters in Strings
Replacing parts of a string is very common in practice:
print(word)
word.replace('World', 'Germany')
word.replace(word[:5], 'Good Morning')
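Keep in mind that `replace` does not modify the string in place. It returns a new string, so the result has to be assigned to a variable if you want to keep it. The sketch below also shows the optional third argument, which limits the number of replacements. The variable names are only illustrative.

```python
word = "Hello World!"

# replace() returns a new string; the original stays unchanged.
greeting = word.replace('World', 'Germany')
print(greeting)
print(word)

# Slices can be used as the search text as well.
morning = word.replace(word[:5], 'Good Morning')
print(morning)

# The optional third argument limits the number of replacements.
first_l_only = word.replace('l', 'L', 1)
print(first_l_only)
```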
2.8 For Loops with Strings
Finally, two examples of how to use for loops in combination with strings:
for char in word:
    print(char)
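# enumerate yields the actual position of each character; word.index(char)
# would return the first match again for letters that occur more than once
for index, char in enumerate(word):
    print(str(index) + ': ' + char)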
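As one more illustration of looping over a string, the following sketch counts how often each letter occurs, ignoring case, spaces and punctuation. The dictionary `counts` is a helper introduced only for this example.

```python
word = "Hello World!"

# Count how often each letter occurs, ignoring case and non-letters.
counts = {}
for char in word:
    if char.isalpha():
        key = char.lower()
        counts[key] = counts.get(key, 0) + 1

print(counts)
```

The count for 'l' matches the result of `word.count('l')` from section 2.3.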
3 Conclusion
In this post I introduced the topic of NLP and showed the basics of text manipulation. In the next post I will start with the topic of text pre-processing.