
How to remove repeated characters in pandas?

Updated: August 22, 2023

To collapse consecutive repeated characters in a pandas Series of strings, you can use the Series.str.replace() method with a regular expression that uses a backreference. For example:

import pandas as pd

# Create a sample series
s = pd.Series(['aabbcc', 'ddddd', 'effffee', 'gggghhhh'])

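# Remove repeated characters; regex=True is needed because pandas 2.0+
# treats the pattern as a literal string by default (regex=False)
s = s.str.replace(r'(\w)\1+', r'\1', regex=True)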

print(s)

The regular expression pattern (\w)\1+ matches any word character (a letter, digit, or underscore) that is immediately followed by one or more copies of itself. The parentheses create a capturing group, and the backreference \1 inside the pattern matches the same text that the group captured. In the replacement string, \1 stands for that single captured character, so each run of repeats is replaced by one copy of it.
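To see what the pattern does and does not touch, here is a minimal sketch that applies the same regular expression to plain strings with Python's built-in re module. The helper name collapse_repeats is hypothetical, and the (.) variant is an assumption about what you might want if repeated punctuation or spaces should be collapsed as well.

```python
import re

def collapse_repeats(text, pattern=r'(\w)\1+'):
    """Replace each run of a repeated character with a single copy."""
    return re.sub(pattern, r'\1', text)

# \w only covers letters, digits, and underscore
print(collapse_repeats('gggghhhh'))      # gh
print(collapse_repeats('wow!!!  sooo'))  # wow!!!  so

# (.) matches any character except a newline, so punctuation and spaces collapse too
print(collapse_repeats('wow!!!  sooo', r'(.)\1+'))  # wow! so
```

The match is case-sensitive, so a string such as 'Aa' is left unchanged by either pattern.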

After running this code, the Series s will contain the following values:

0    abc
1      d
2    efe
3     gh
dtype: object

Notice that the repeated characters have been removed from each string in the Series.

Note: This method only removes consecutive duplicates, so it will not remove duplicates that are separated by other characters.
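The limitation is easy to check, and if you do need to drop every later occurrence of a character, one possible approach (not part of the method above) is to keep the first occurrence of each character in order. The sketch below uses plain Python; the function name drop_all_duplicates is hypothetical.

```python
import re

# Non-consecutive duplicates are left alone by the backreference pattern
print(re.sub(r'(\w)\1+', r'\1', 'abab'))  # abab

def drop_all_duplicates(text):
    """Keep only the first occurrence of each character, preserving order."""
    # dict.fromkeys keeps insertion order, so the first occurrence wins
    return ''.join(dict.fromkeys(text))

print(drop_all_duplicates('abab'))     # ab
print(drop_all_duplicates('effffee'))  # ef
```

On a Series of strings, something like s.map(drop_all_duplicates) should apply the function to each element, assuming the Series has no missing values.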
