How to remove duplicate words from a given string using a regular expression in Python?

Santhosh Sudhaan
2 min read · Jun 8, 2021

What we’ll do:

Welcome to this blog. Here we will write a Python program that removes duplicate (consecutively repeated) words from a given string using a regular expression.

Intro to Regex:

A regular expression, or regex or regexp for short, is a powerful tool for searching and manipulating text strings, particularly when processing text files. One line of regex can often replace several dozen lines of ordinary code.

Regex is supported in most scripting languages (such as Perl, Python, PHP, and JavaScript), in general-purpose programming languages such as Java, and even in word processors such as Word for searching text. Getting started with regex may not be easy because of its complicated syntax, but you will get used to it if you keep practising.

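regex = r"\b(\w+)(?:\W+\1\b)+"

In a Python raw string (r"..."), each backslash is written once. Without the r prefix, every backslash would have to be doubled, as in "\\b(\\w+)(?:\\W+\\1\\b)+".

The parts of this regular expression are:

  • “\b” matches a word boundary. The boundaries make sure that only whole words are compared. For example, in “My thesis is great”, the “is” at the end of “thesis” and the word “is” that follows are not treated as a repeated word.
  • “\w” matches a single word character, i.e. [a-zA-Z0-9_] for ASCII text; “\w+” matches one or more of them, which is one whole word.
  • “\W+” matches one or more non-word characters ([^\w]), such as the spaces or punctuation between two words.
  • “\1” matches whatever was matched by the 1st group of parentheses, which in our case is (\w+).
  • “(?:...)” is a non-capturing group: it groups “\W+\1\b” without creating a new numbered group.
  • “+” matches whatever it is placed after 1 or more times.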
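To see the word boundary and the backreference at work, here is a minimal sketch that only searches with the pattern and does not replace anything yet. The variable names and sample sentences are illustrative assumptions, not part of the program further down.

```python
import re

# The same pattern as above, compiled once so it can be reused.
pattern = re.compile(r"\b(\w+)(?:\W+\1\b)+")

# The "is" inside "thesis" does not start at a word boundary,
# so "thesis is" is not reported as a repeated word.
no_match = pattern.search("My thesis is great")
print(no_match)  # None

# Group 1 captures the first "the"; the non-capturing group then
# consumes every repetition that follows it.
match = pattern.search("This is the the the end")
print(match.group(0))  # the the the
print(match.group(1))  # the
```

group(0) is the whole run of repeated words, while group(1) is the single word that the replacement step keeps.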

What’s happening:

First of all, we import re, the regular expression module. Then we write a function named “removeDuplicates()” that takes the input string as an argument.

Inside the function, we define the regex that matches repeated words and pass it to the sub() function, which is also provided by the “re” module. Along with the regex, we pass the replacement “\1”, which keeps only the first occurrence of the repeated word, and the input string. The re.IGNORECASE flag makes the match case-insensitive, so “The the” also counts as a repeat.

Then we return the modified sentence, from which the duplicate words have been removed.

Next, we pass the input to the function “removeDuplicates()” as a parameter.

Finally, we print the sentence, which no longer has duplicated words in it.
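The substitution step on its own can be sketched as follows. This is only an illustration of how re.sub() behaves with and without the re.IGNORECASE flag; the sample sentence and variable names are assumptions made for the example.

```python
import re

regex = r"\b(\w+)(?:\W+\1\b)+"

# Without a flag, "The" and "the" differ in case, so nothing is replaced.
case_sensitive = re.sub(regex, r"\1", "The the cat")
print(case_sensitive)  # The the cat

# With re.IGNORECASE the backreference also matches case-insensitively;
# the first spelling ("The") is the one that is kept.
case_insensitive = re.sub(regex, r"\1", "The the cat", flags=re.IGNORECASE)
print(case_insensitive)  # The cat
```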

Source Code:

import re  # Regular expression (regex) module

def removeDuplicates(text):
    # Regex that matches a word followed by one or more repeats of itself
    regex = r'\b(\w+)(?:\W+\1\b)+'

    # Replace each run of repeated words with its first occurrence
    return re.sub(regex, r'\1', text, flags=re.IGNORECASE)

# Test Case: 1
str1 = "How are are you"
print(removeDuplicates(str1))

# Test Case: 2
str2 = "Guvi is the the best platform to learn"
print(removeDuplicates(str2))

# Test Case: 3
str3 = "Programming is fun fun"
print(removeDuplicates(str3))

Output:

How are you
Guvi is the best platform to learn
Programming is fun
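The three test cases above only cover a word repeated once with a space in between. The sketch below tries a few more inputs to show where the same regex does and does not apply. The function name remove_adjacent_duplicates and the sample strings are hypothetical, chosen only for this illustration.

```python
import re

def remove_adjacent_duplicates(text):
    # Hypothetical helper using the same regex and flag as the program above.
    return re.sub(r"\b(\w+)(?:\W+\1\b)+", r"\1", text, flags=re.IGNORECASE)

# Punctuation between the repeats is matched by \W+ and removed with them.
print(remove_adjacent_duplicates("fun, fun times"))  # fun times

# A word repeated more than once collapses to a single occurrence.
print(remove_adjacent_duplicates("no no no way"))    # no way

# Repeats that are not next to each other are left untouched.
print(remove_adjacent_duplicates("fun is fun"))      # fun is fun
```

So this regex removes consecutive repeats only. Removing every later occurrence of a word would need a different approach.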

What we learnt:

Through this blog, you learnt what regular expressions (regex) are and how to remove duplicate words from a given string using a regular expression in Python.

Thank you.
