inexact match

B.The problem

The problem is like following and pay attention to the last line of explanation. They want to solve this trivial

problem and it is one of the simplest DFS.

In China, many people use their mobile phone as their address book. They may store contacts with UNICODE or ASCII

code. So, the searching problem is a bit complicated.

We assume following conditions:

1. The index for ASCII code is using its own ASCII string.

2. The UNICODE string is using their "pinyin" string and especially in embedded system the storage resource is

very tight and the developer may not want to store index as "pinyin" along with UNICODE string. Since there are

only limited "pinyin" combinations, they are willing to consider that conversion from UNICODE to "pinyin" is fast.

3. The order between a ASCII code string and a UNICODE string with same letter is that ASCII goes before UNICODE.

i.e. The "pinyin" for my name of UNICODE is "huang qing zhe". If I also input an ASCII code string "huang qing zhe",

I will expect to find ASCII code before UNICODE. And the UNICODE is before ASCII string of "goo" since ASCII has

a letter "g" which is after "pinyin" of "h".

4. An inexact match is performed by accepting user's input string and should return all matched string.

i.e. UNICODE string with "pinyin" of "huang qing zhe" will be considered to be match with "hqz", "huqizh",

"huangqz", or "hqzhe". And the following is NOT matched: "haqz", "hugqingzhe" etc.

5. However, this kind of grammar has one ambiguity: i.e. Suppose we have an ASCII string "shang hai shi". And the

searching string is "shs". There can be two explanation: one is that it should be matched since "shs" are all

starting letter for all words. The other is that it is not matched since "sh" is matched with first word "shang".

And letter "s" will not be matched in second word "hai" any more.

This kind of ambiguity is usually solved by pre-defined priority of searching pattern.

C.The idea of program

Originally Mr. Li suggested me to draw the possible DFA. But I feel it is a kind of more difficult than coding.

The dictionary is input by a file named "nick.txt" and it should be the same directory along with executable file.

The input file format is like following:

[number of words in name] [flag for UNICODE, 1==UNICODE] [string1, string2 ...]

In this second edition, I want to express the algorithms in some pseudo-code for explanation purpose. Therefore you will find some useless pseudo-code at top of code.

#include <iostream> #include <fstream> #include <string.h> using namespace std; const int LongestWordLength=10; const int MaxWordNumber=10; const int MaxNameNumber=100; char* fileName="nick.txt"; char* unico2pinyin(char* str); //let's define our language as following: enum LetterType { AlphaType, DelimiterType, EndOfStringType, ErrorType }; LetterType letter2Type(char ch) { if (ch>='a'&&ch<='z') { return AlphaType; } if (ch==' ') { return DelimiterType; } if (ch=='\0') { return EndOfStringType; } return ErrorType; } //let's define the states for our state machine //and it is a kind high-level abstract machine enum StateType { InitiateState, ProcessState, AcceptState, RejectState }; //isEOF stands for the sourcestring EOF StateType stateMachine(StateType currentState, bool isMatch, bool isEOF) { if (isEOF) { return AcceptState; } if (!isMatch) { return RejectState; } //so from here, we can assume all state below is "isMatch==true" //and both AcceptState and RejectState is our final States, so //the automata will halt at both states. switch (currentState) { case InitiateState: return ProcessState; break; case ProcessState: return ProcessState; break; } } //then our job is to define what is a match //to me it is almost the simplest Depth-First-Search problem //it is the "real" state machine we are using. Since it is a //special case of DFS---with exactly two choices in each state. /* struct State { int wordIndex1; //this is for the word int letterIndex1; //this is within word, the index of letter bool match1; int wordIndex2; int letterIndex2; bool match2; int stringIndex; //this is the source letter index }; */ struct State { int wordIndex; int letterIndex; bool match; }; struct MyWord { bool isUNICODE; int letterCount; char myword[LongestWordLength+1]; bool isSmaller(MyWord& other); void display(); char* toString(); }; struct MyName { int wordCount; MyWord mywords[MaxWordNumber]; void display(); bool match(char* str); bool doMatch(State& state, char* str, int index); }; class MyBook { private: MyName mynames[MaxNameNumber]; int nameCount; public: void initialize(); MyBook(); //void addNewName(MyName& newName); void addNewName(bool isUnico, int wordCount, char newWords[][LongestWordLength]); int findFirst(bool isUnico, char* str); void display(); void findMatch(char* str); }; int main() { char buf[LongestWordLength]; MyBook book; book.initialize(); book.display(); printf("now let's find the match\n"); book.findMatch("shs"); do { printf("input your searching string>"); scanf("%s", buf); if (strcmp(buf, "exit")==0) { break; } book.findMatch(buf); } while (true); return 0; } void MyBook::findMatch(char* str) { char* ptr; for (int i=0; i<nameCount; i++) { ptr=mynames[i].mywords[0].toString(); if (mynames[i].match(str)) { printf("find a match:"); mynames[i].display(); } } } bool MyName::doMatch(State& state, char* str, int index) { State subState; bool result1, result2; if (!state.match) { return false; } if (str[index]=='\0') { return true; } //first case subState.wordIndex=state.wordIndex; subState.letterIndex=state.letterIndex+1; if (subState.letterIndex==mywords[subState.wordIndex].letterCount) { //subState.match=false; //this choice is dead and we only need to test next jump method result1=false; } else { subState.match=mywords[subState.wordIndex].myword[subState.letterIndex]==str[index]; if (subState.match) { result1=doMatch(subState, str, index+1); } else { result1=false; } } //now the jumping if (state.wordIndex==wordCount) { result2=false; } else { subState.wordIndex=state.wordIndex+1; subState.letterIndex=0; subState.match=mywords[subState.wordIndex].myword[subState.letterIndex]==str[index]; if (subState.match) { result2=doMatch(subState, str, index+1); } else { result2=false; } } return result1||result2; } //Let's simulate it with a state machine bool MyName::match(char* str) { State state; int index=0; //this is the very first letter, and it must be a match!!! //here we can define it as "state 0" state.letterIndex=0; state.wordIndex=0; if (mywords[0].myword[0]!=str[0]&&str[0]!='\0') { return false; } else { state.match=true; } index=1; return doMatch(state, str, index); } MyBook::MyBook() { nameCount=0; } void MyBook::initialize() { ifstream in; int count, unicoFlag; char names[MaxWordNumber][LongestWordLength]; in.open(fileName); do { in>>count>>unicoFlag; for (int i=0; i<count; i++) { in>>names[i]; } addNewName(unicoFlag==1, count, names); } while (!in.eof()); } void MyName::display() { if (mywords[0].isUNICODE) { printf("unicode and corresponding pinyin is:"); } for (int i=0; i<wordCount; i++) { mywords[i].display(); printf(","); } printf("\n"); } void MyWord::display() { if (isUNICODE) { printf("%s", unico2pinyin(myword)); } else { printf("%s", myword); } } void MyBook::display() { for (int i=0; i<nameCount; i++) { mynames[i].display(); } } void MyBook::addNewName(bool isUnico, int wordCount, char newWords[][LongestWordLength]) { char* ptr; int index; if (isUnico) { ptr=unico2pinyin(newWords[0]); } else { ptr=newWords[0]; } index=findFirst(isUnico, ptr); //shift array to insert for (int i=nameCount; i>index; i--) { mynames[i]=mynames[i-1]; } mynames[i].wordCount=wordCount; for(i=0; i<wordCount; i++) { mynames[index].mywords[i].isUNICODE=isUnico; strcpy(mynames[index].mywords[i].myword, newWords[i]); mynames[index].mywords[i].letterCount=strlen(mynames[index].mywords[i].toString()); } nameCount++; } //this is the simplest conversion function, I just want to simplify my //operation. char* unico2pinyin(char* str) { return str; } char* MyWord::toString() { if (isUNICODE) { return unico2pinyin(myword); } else { return myword; } } bool MyWord::isSmaller(MyWord& other) { char* myself, *theOther; myself=toString(); theOther=other.toString(); //same type, simple comparison if (isUNICODE&&other.isUNICODE||!isUNICODE&&!other.isUNICODE) { return strcmp(myself, theOther)<=0; } else { //here is different type //different type and different "class" if (myself[0]!=theOther[0]) { return strcmp(myself, theOther)<0; } else { //same "class", the unicode is bigger in index return !isUNICODE; } } } int MyBook::findFirst(bool isUnico, char* str) { //you can use binary search too //char* ptr; //bool afterUnico=false; MyWord temp; temp.isUNICODE=isUnico; strcpy(temp.myword, str); for (int i=0; i<nameCount; i++) { if (temp.isSmaller(mynames[i].mywords[0])) { return i; } /* //just skip until first non-unico or end of address book if (afterUnico) { if (!mynames[i].mywords[0].isUNICODE) { return i; } else { ptr=unico2pinyin(mynames[i].mywords[0]); if (strcmp(str, ptr)>0) { return i; } } } else { //in case we have only unico in address book if (mynames[i].mywords[0].isUNICODE) { ptr=unico2pinyin(mynames[i].mywords[0].myword); if (ptr[0]==ch) { afterUnico=true; continue; } } else { if (mynames[i].mywords[0].myword[0]==ch) { return i; } } } */ } //this is the case of end of address book return i; }

running result: (This is the running results of 10 seconds. The position of each quantum is

3 1 huang qing zhe
2 0 xiang dong
3 1 zheng shu zhao
2 1 yong jun
4 0 xiao xiao yu er
5 1 this is a unicode test
1 0 ascii
2 1 many test
3 1 yang zi wen
3 0 hong yang yang
3 0 tian an men
2 0 zhang qing
3 1 li yi hong
3 1 xiao yong jun
3 0 xiao jian dong
3 1 li qing qing
2 0 xu jin
2 1 huang yong
3 1 huang chang zheng
3 1 zhu chun ming
3 0 zhu chun min
3 0 zhu jian min
3 1 ni xiang li
3 0 wu hui min
3 1 li teng long
3 0 chen zhi xiong
3 1 hong xiang yang
3 0 su cong wei
3 1 su dong po
2 1 yang yang
2 1 yang tao
3 1 yang jian zhong
3 1 huang hong dao
3 1 wu qi wei
3 1 jiang jie shi
3 0 mao ze dong
2 1 zhu de
3 1 ren bi shi
3 0 du yu min
2 0 xiao sun
2 0 zhang yan
3 0 li ke xin
3 0 wo shi shui
3 0 huang shi ren
3 0 xu jia hui
3 0 bei jin shi
3 0 shang hai shi
3 1 yang zhen ning
3 1 li zheng dao
3 1 wu de w
4 0 a pei a wang
2 0 cheng long
3 1 ma jun ren
2 0 liang zhan
3 1 li deng hui
3 0 su jin qiang
3 1 huang xi jin
3 0 huang tian ci
2 0 tian ci
3 0 zhang yun lei
3 0 ma jia jun

The following is the running result:

a,pei,a,wang,
ascii,
bei,jin,shi,
chen,zhi,xiong,
cheng,long,
du,yu,min,
hong,yang,yang,
huang,tian,ci,
huang,shi,ren,
unicode and corresponding pinyin is:hong,xiang,yang,
unicode and corresponding pinyin is:huang,xi,jin,
unicode and corresponding pinyin is:huang,hong,dao,
unicode and corresponding pinyin is:huang,chang,zheng,
unicode and corresponding pinyin is:huang,yong,
unicode and corresponding pinyin is:huang,qing,zhe,
unicode and corresponding pinyin is:jiang,jie,shi,
li,ke,xin,
liang,zhan,
unicode and corresponding pinyin is:li,deng,hui,
unicode and corresponding pinyin is:li,zheng,dao,
unicode and corresponding pinyin is:li,teng,long,
unicode and corresponding pinyin is:li,qing,qing,
unicode and corresponding pinyin is:li,yi,hong,
ma,jia,jun,
mao,ze,dong,
unicode and corresponding pinyin is:ma,jun,ren,
unicode and corresponding pinyin is:many,test,
unicode and corresponding pinyin is:ni,xiang,li,
unicode and corresponding pinyin is:ren,bi,shi,
shang,hai,shi,
su,jin,qiang,
su,cong,wei,
unicode and corresponding pinyin is:su,dong,po,
tian,ci,
unicode and corresponding pinyin is:this,is,a,unicode,test,
wo,shi,shui,
wu,hui,min,
unicode and corresponding pinyin is:wu,de,w,
unicode and corresponding pinyin is:wu,qi,wei,
xiang,dong,
xiao,sun,
xiao,jian,dong,
xiao,xiao,yu,er,
xu,jia,hui,
xu,jin,
unicode and corresponding pinyin is:xiao,yong,jun,
unicode and corresponding pinyin is:yang,zhen,ning,
unicode and corresponding pinyin is:yang,jian,zhong,
unicode and corresponding pinyin is:yang,tao,
unicode and corresponding pinyin is:yang,yang,
unicode and corresponding pinyin is:yang,zi,wen,
unicode and corresponding pinyin is:yong,jun,
zhang,yun,lei,
zhang,yan,
zhang,qing,
zhu,jian,min,
zhu,chun,min,
unicode and corresponding pinyin is:zheng,shu,zhao,
unicode and corresponding pinyin is:zhu,de,
unicode and corresponding pinyin is:zhu,chun,ming,
now let's find the match
find a match:shang,hai,shi,
input your searching string>shs
find a match:shang,hai,shi,
input your searching string>zcm
find a match:zhu,chun,min,
find a match:unicode and corresponding pinyin is:zhu,chun,ming,
input your searching string>zhch
find a match:zhu,chun,min,
find a match:unicode and corresponding pinyin is:zhu,chun,ming,
input your searching string>