Using Regular Expressions in C/C++

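Contents

1. Which regular expression library should C/C++ use?
2. Using regex.h
3. Writing your own regex matching function without pcre
4. Epilogue
   4.1. To use only one thing without understanding the principle behind it is really not clever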

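In the preface to 《精通正则表达式》 (Mastering Regular Expressions), Meng Yan writes: "Regular expressions have every hallmark of a great technical invention: they are simple, elegant, powerful, and endlessly useful. For a great deal of practical work they are close to a miracle cure, able to improve development efficiency and program quality a hundredfold."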

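A regular expression is a formal notation for describing the structural pattern of a string. In its early days the notation was limited to describing regular text, hence the name "regular expression".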

Which regular expression library should C/C++ use?

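Standard C has no regular expression support, but Philip Hazel's Perl-Compatible Regular Expressions library (pcre for short) can provide it.
C++11 added regular expression support to the standard library. Before C++11, the usual choice in C++ was regex from Boost. Boost is very large, though, and using boost regex noticeably slows compilation: in a quick test of mine, the same program took about 3 seconds to compile with boost::regex and under 1 second with pcre. For that reason I recommend pcre, or its C++ wrapper pcre++.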

pcre home page: http://www.pcre.org/
pcre++ home page: http://www.daemon.de/PCRE
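For comparison, here is a minimal sketch of the C++11 standard-library route mentioned above. The helper name find_addresses and the simplified address pattern are assumptions made for illustration; the pattern is not a full e-mail validator.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every substring of the form name@host.com;
// found in `text`, using std::regex with its default ECMAScript syntax.
std::vector<std::string> find_addresses(const std::string& text) {
    static const std::regex re(R"((\w+(?:[-+.]\w+)*)@(\w+(?:[-+.]\w+)*)\.com;)");
    std::vector<std::string> found;
    // sregex_iterator walks over all non-overlapping matches in order.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        found.push_back(it->str(0));  // group 0 is the whole match
    }
    return found;
}
```

Calling find_addresses on a string of semicolon-terminated addresses returns one entry per address, with no manual offset bookkeeping.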

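If you would rather not use a third-party library, you can use the regex.h header that Linux provides (the POSIX regex interface) and build the functionality yourself. That is the main subject of this article.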

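Using regex.h

Compiling a regular expression

Before a string can be compared against a regular expression, the expression must first be compiled with regcomp(), which converts it into a regex_t structure.

int regcomp(regex_t *compiled, const char *pattern, int cflags);

The function returns 0 on success and a nonzero error code on failure.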

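① compiled points to a regex_t, a structure type that holds the compiled regular expression.
② pattern is the regular expression, written as a string.
③ cflags is one of the following four values, or several of them combined with bitwise OR (|):

REG_EXTENDED Use the more powerful POSIX extended regular expression syntax instead of the basic syntax.
REG_ICASE Ignore case when matching letters.
REG_NOSUB Do not store match positions; regexec() then only reports whether there was a match.
REG_NEWLINE Treat newlines as line separators: '$' matches at the end of each line, '^' matches at the beginning of each line, and match-any-character operators such as '.' do not match a newline.
In most cases I use only REG_EXTENDED.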
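The return code of regcomp() can be turned into a readable message with regerror(), which is part of the same POSIX interface. The sketch below shows one way to combine flags and report compile errors; the helper name compile_regex is an assumption for illustration.

```cpp
#include <regex.h>
#include <string>

// Hypothetical helper: compiles `pattern` into `*out`. On failure it stores
// the message produced by regerror() in `err` and returns false.
bool compile_regex(regex_t* out, const char* pattern, int cflags, std::string& err) {
    int rc = regcomp(out, pattern, cflags);
    if (rc == 0) {
        return true;  // the caller must eventually call regfree(out)
    }
    char buf[256];
    regerror(rc, out, buf, sizeof(buf));  // translate the error code to text
    err = buf;
    return false;
}
```

A call such as compile_regex(&re, "^[a-z]+$", REG_EXTENDED | REG_ICASE | REG_NOSUB, err) compiles a case-insensitive pattern that only answers yes or no.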

Matching a regular expression

Once the regular expression is compiled, regexec() matches it against the target text:

int regexec(const regex_t *compiled, const char *string, size_t nmatch, regmatch_t matchptr[], int eflags);

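The function returns 0 when a match is found and REG_NOMATCH when there is none.
Parameters:
① compiled is the regular expression already compiled by regcomp().
② string is the target text.
③ nmatch is the length of the regmatch_t array.
④ matchptr is an array of regmatch_t structures that receives the positions of the matched text. When regexec() returns successfully, string+matchptr[0].rm_so to string+matchptr[0].rm_eo is the text matched by the whole expression, string+matchptr[1].rm_so to string+matchptr[1].rm_eo is the text matched by the first parenthesized subexpression, and so on. rm_eo is the offset just past the last matched character.
⑤ eflags is usually set to 0.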

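regexec() returns as soon as it finds the first match in the string, so to find every match you need to wrap the call in a while loop.
Do not confuse this with the regmatch_t array: the array does not hold successive matches. It holds the text matched by each subrule of a single match, where a subrule is a part of the pattern enclosed in capturing parentheses.
For example:

matchptr[0] is the string matched by the main rule (the whole pattern)
matchptr[1] is the string matched by the first subrule
and so on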
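The slot layout can be seen in a small sketch with two capture groups. The helper name split_key_value and the "key=value" format (lowercase key, numeric value) are assumptions chosen only to keep the pattern short.

```cpp
#include <regex.h>
#include <string>

// Hypothetical helper: finds the first "key=value" pair in `text` and copies
// the two captured parts into `key` and `value`.
bool split_key_value(const char* text, std::string& key, std::string& value) {
    regex_t re;
    if (regcomp(&re, "([a-z]+)=([0-9]+)", REG_EXTENDED) != 0) {
        return false;
    }
    regmatch_t m[3];  // m[0]: whole match, m[1]: key, m[2]: value
    bool found = (regexec(&re, text, 3, m, 0) == 0);
    if (found) {
        key.assign(text + m[1].rm_so, m[1].rm_eo - m[1].rm_so);
        value.assign(text + m[2].rm_so, m[2].rm_eo - m[2].rm_so);
    }
    regfree(&re);
    return found;
}
```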

test.cpp

#include <iostream>
#include <string>
#include <stdio.h>
#include <sys/types.h>
#include <regex.h>

using namespace std;

int main(int argc, char** argv) {
  int status;
  int cflags = REG_EXTENDED;
  const size_t nmatch = 5;
  regmatch_t pmatch[nmatch];
  regex_t reg;

  // \w is a GNU extension to the POSIX syntax; the dot before "com" is
  // escaped so that it matches a literal '.'
  const char* pattern = "\\w+([-+.]\\w+)*@\\w+([-+.]\\w+)*\\.com;";
  if (regcomp(&reg, pattern, cflags) != 0) {
    printf("regcomp failed\n");
    return 1;
  }
  const char* buf = "hello@gmail.com;world@gmail.com;test@gmail.com;";
  status = regexec(&reg, buf, nmatch, pmatch, 0);
  if (status == REG_NOMATCH) {
    printf("no match\n");
  } else if (status == 0) {
    cout << "match: " << endl;
    for (size_t i = 0; i < nmatch; ++i) {
      // Skip slots for groups that did not take part in the match
      if (pmatch[i].rm_so == -1 || pmatch[i].rm_eo == -1) {
        continue;
      }
      string str(buf + pmatch[i].rm_so, pmatch[i].rm_eo - pmatch[i].rm_so);
      // pmatch[0] is the main rule; the remaining slots are the subrules
      if (i == 0) {
        cout << "main rule str---------" << str << endl;
      } else {
        cout << "str---------" << str << endl;
      }
      printf("\n");
    }
  }
  regfree(&reg);
  return 0;
}

Compile it and look at the result:

[xiongjun@ubuntu ~/Desktop]% ./test
match: 
main rule str---------hello@gmail.com;

The input is clearly hello@gmail.com;world@gmail.com;test@gmail.com;
but only the first address, hello@gmail.com;, is matched.

Now let's change the expression to use capturing parentheses around the two parts of the address:

const char* pattern = "(\\w+([-+.]\\w+)*)@(\\w+([-+.]\\w+)*)\\.com;";

Here is the output:

[xiongjun@ubuntu ~/Desktop]% ./test
match: 
main rule str---------hello@gmail.com;

str---------hello

str---------gmail

That should make it clear: regexec() matches only the first substring of the whole string that the pattern can match. If the regular expression uses capturing parentheses, the captured substrings are stored in the regmatch_t array.
Remember not to use more capturing groups than the array can hold (nmatch - 1, because slot 0 is taken by the whole match); groups beyond that are simply not recorded (though I think 10 slots are more than enough).
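Instead of guessing an array length, the number of capture groups can be read from the re_nsub member that regcomp() fills in on the regex_t. The sketch below sizes the array from it; the helper name match_groups is an assumption, and the test pattern uses [[:alnum:]_] in place of \w only to stay within standard POSIX syntax.

```cpp
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper: matches `pattern` once against `text` and returns one
// string per slot. Slot 0 is the whole match, slot i is capture group i.
// Groups that took no part in the match come back as empty strings.
// Returns an empty vector on a compile error or when nothing matches.
std::vector<std::string> match_groups(const char* pattern, const char* text) {
    std::vector<std::string> out;
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return out;
    }
    // re_nsub is the number of parenthesized groups; add 1 for slot 0.
    std::vector<regmatch_t> slots(re.re_nsub + 1);
    if (regexec(&re, text, slots.size(), slots.data(), 0) == 0) {
        for (const regmatch_t& s : slots) {
            if (s.rm_so == -1) {
                out.push_back(std::string());
            } else {
                out.push_back(std::string(text + s.rm_so, s.rm_eo - s.rm_so));
            }
        }
    }
    regfree(&re);
    return out;
}
```

With this approach the array always has exactly one slot per group, so no capture is silently dropped when the pattern grows.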

Next, I'll modify the code and add an outer loop that reads and prints every match.

#include <iostream>
#include <string>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>

using namespace std;

int main(int argc, char** argv) {
  int status;
  int cflags = REG_EXTENDED;
  const size_t nmatch = 5;
  regmatch_t pmatch[nmatch];
  regex_t reg;
  // iOffset is the current offset into buf
  size_t iOffset = 0;

  const char* pattern = "(\\w+([-+.]\\w+)*)@(\\w+([-+.]\\w+)*)\\.com;";
  //const char* pattern = "\\w+([-+.]\\w+)*@\\w+([-+.]\\w+)*\\.com;";
  if (regcomp(&reg, pattern, cflags) != 0) {
    printf("regcomp failed\n");
    return 1;
  }
  const char* buf = "hello@gmail.com;world@gmail.com;test@gmail.com;";
  size_t iLen = strlen(buf);

  while (iOffset < iLen) {
    // Search from buf + iOffset; the offsets in pmatch are relative to it
    status = regexec(&reg, buf + iOffset, nmatch, pmatch, 0);
    if (status == REG_NOMATCH) {
      printf("no match\n");
      break;
    } else if (status != 0) {
      printf("regexec error\n");
      break;
    }
    cout << "match: " << endl;
    for (size_t i = 0; i < nmatch; ++i) {
      // Skip slots for groups that did not take part in the match
      if (pmatch[i].rm_so == -1 || pmatch[i].rm_eo == -1) {
        continue;
      }
      string str(buf + iOffset + pmatch[i].rm_so, pmatch[i].rm_eo - pmatch[i].rm_so);
      // pmatch[0] is the main rule; the remaining slots are the subrules
      if (i == 0) {
        cout << "main rule str---------" << str << endl;
      } else {
        cout << "str---------" << str << endl;
      }
      printf("\n");
    }
    // Advance past the main-rule match; step at least one character so
    // that an empty match cannot loop forever
    iOffset += (pmatch[0].rm_eo > 0) ? pmatch[0].rm_eo : 1;
  }
  regfree(&reg);
  return 0;
}

Compile and run it, and you will see:

[xiongjun@ubuntu ~/Desktop]% ./test
match: 
main rule str---------hello@gmail.com;

str---------hello

str---------gmail

match: 
main rule str---------world@gmail.com;

str---------world

str---------gmail

match: 
main rule str---------test@gmail.com;

str---------test

str---------gmail

All three addresses are printed.
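Two details of such a loop are easy to miss. A pattern that can match the empty string may leave the offset unchanged, and a '^' anchor would match again at every new offset unless the REG_NOTBOL flag is passed in eflags. The following is a sketch that handles both; the helper name find_all is an assumption.

```cpp
#include <regex.h>
#include <string>
#include <vector>

// Hypothetical helper: returns every non-overlapping match of `pattern` in
// `text`, in order. Returns an empty vector if the pattern does not compile.
std::vector<std::string> find_all(const char* pattern, const std::string& text) {
    std::vector<std::string> matches;
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return matches;
    }
    size_t offset = 0;
    regmatch_t m;
    while (offset <= text.size()) {
        // After the first call the search no longer starts at the true
        // beginning of the text, so REG_NOTBOL stops '^' from matching there.
        int eflags = (offset == 0) ? 0 : REG_NOTBOL;
        if (regexec(&re, text.c_str() + offset, 1, &m, eflags) != 0) {
            break;
        }
        matches.push_back(text.substr(offset + m.rm_so, m.rm_eo - m.rm_so));
        // An empty match would not move the offset; step one extra
        // character so the loop always makes progress.
        offset += (m.rm_eo > m.rm_so) ? m.rm_eo : m.rm_eo + 1;
    }
    regfree(&re);
    return matches;
}
```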

Freeing a regular expression

void regfree(regex_t *compiled);

When a compiled regular expression is no longer needed, call regfree() to release it and avoid a memory leak.
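In C++ the regfree() call can be tied to an object's lifetime so that it runs on every exit path. The class below is a minimal sketch of that idea; the name PosixRegex and its interface are assumptions, not part of regex.h.

```cpp
#include <regex.h>
#include <stdexcept>
#include <string>

// Hypothetical RAII wrapper: compiles in the constructor and frees in the
// destructor, so an early return or an exception cannot leak the regex.
class PosixRegex {
public:
    explicit PosixRegex(const char* pattern, int cflags = REG_EXTENDED) {
        if (regcomp(&re_, pattern, cflags) != 0) {
            throw std::invalid_argument("regcomp failed");
        }
    }
    ~PosixRegex() { regfree(&re_); }

    // A compiled regex_t owns memory, so copying is disabled.
    PosixRegex(const PosixRegex&) = delete;
    PosixRegex& operator=(const PosixRegex&) = delete;

    // True if the pattern matches anywhere in `text`.
    bool matches(const std::string& text) const {
        return regexec(&re_, text.c_str(), 0, nullptr, 0) == 0;
    }

private:
    regex_t re_;
};
```

If the constructor throws, the destructor does not run, so regfree() is never called on a regex_t that failed to compile.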

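Writing your own regex matching function without pcre

With the experience above, whenever your own C/C++ code needs regular expressions you can write a function yourself instead of relying on a third-party library:
myregex.h: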

#ifndef MYREGEX_H
#define MYREGEX_H

#include <vector>
#include <string>

using std::vector;
using std::string;

int MatchString(const char *pRegex, const char *pString, vector<string> &mainResult, vector<string> &subResult); 

#endif

myregex.cpp

#include "myregex.h"
#include <string.h>
#include <sys/types.h>
#include <regex.h>
#include <iostream>

using std::cout;
using std::endl;

/*
 * pRegex is the regular expression and pString is the string to search.
 * mainResult receives the text of each whole match; subResult receives the
 * text of each capture group. Returns 0 on success and -1 on error.
 */
int MatchString(const char *pRegex, const char *pString, vector<string>& mainResult, vector<string> &subResult) {
  // Compile the pattern string pRegex into a regex_t
  regex_t sReg;
  if ( 0 != regcomp(&sReg, pRegex, REG_EXTENDED) ) {
    return -1;
  }
  int iRet = 0;
  // regexec() stops at the first match it finds rather than matching the
  // whole input, so it is called repeatedly in a while loop;
  // iOffset tracks how far into pString the search has advanced
  size_t iOffset = 0;
  const size_t iLen = strlen(pString);
  const size_t nmatch = 10;
  regmatch_t szMatch[nmatch];
  while (iOffset < iLen) {
    int iStatus = regexec(&sReg, pString + iOffset, nmatch, szMatch, 0);
    if (iStatus == REG_NOMATCH) {
      // No further matches: this is the normal way for the loop to end
      break;
    } else if (iStatus != 0) {
      cout << "other error" << endl;
      iRet = -1;
      break;
    }
    for (size_t i = 0; i < nmatch; ++i) {
      // Skip groups that did not take part in this match
      if ( szMatch[i].rm_so == -1 || szMatch[i].rm_eo == -1 ) {
        continue;
      }
      // string s(cp, n) copies the first n characters starting at cp
      string str(pString + iOffset + szMatch[i].rm_so, szMatch[i].rm_eo - szMatch[i].rm_so);
      if (i == 0) {
        // The main-rule match goes into its own vector
        mainResult.push_back(str);
      } else {
        subResult.push_back(str);
      }
    }
    // Advance past this match; step at least one character so that an
    // empty match cannot loop forever
    iOffset += (szMatch[0].rm_eo > 0) ? szMatch[0].rm_eo : 1;
  }
  // Free the compiled regular expression
  regfree(&sReg);
  return iRet;
}

Write a program to test it:
main.cpp

#include <iostream>
#include "myregex.h"

using namespace std;

int main(int argc, char** argv) {
  vector<string> mainResult, subResult;
  const char* pattern = "(\\w+([-+.]\\w+)*)@(\\w+([-+.]\\w+)*)\\.com;";
  const char* buf = "hello@gmail.com;world@gmail.com;test@gmail.com;";
  MatchString(pattern, buf, mainResult, subResult);
  for (auto it=mainResult.begin(); it!= mainResult.end(); ++it) {
    cout << *it << endl;
  }
  for (auto it=subResult.begin(); it!= subResult.end(); ++it) {
    cout << *it << endl;
  }
  return 0;
}

Compile it:

g++ -std=c++11 -o main main.cpp myregex.cpp

Run it and check the result:

[xiongjun@ubuntu ~/Desktop]% ./main
hello@gmail.com;
world@gmail.com;
test@gmail.com;
hello
gmail
world
gmail
test
gmail

You can also build myregex.o into a shared library, libmyregex.so, for later reuse:

g++ -fPIC -c -o myregex.o myregex.cpp
g++ -shared -fPIC -o libmyregex.so myregex.o

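Epilogue

On reinventing the wheel: I don't consider writing my own regex function to be reinventing the wheel. A function this lightweight is easy to implement yourself, and there is no need to pull in a third-party library and grow the code base for it. Moreover, if you rely on ready-made libraries for everything, then when a library has a problem you can only wait for its maintainers to fix it. And studying these fine details carefully is a pleasure in itself. To borrow a saying of Lin Yutang's, one that Hou Jie has also quoted:

To use only one thing without understanding the principle behind it is really not clever.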
