Well, duh, you say.
Tagging is not only far from trivial, it is the most important part of preparing the text. How you tag largely determines what you can get out of your corpus. I put the corpus work aside for a while to get the data to visualize, but now I need to get things into the indexer. The first step is to tag for parts of speech. I would like to use the original XML files because they are already tagged for some of the metadata I need (author, comments, etc.). Getting parts of speech, however, is more complicated. I have looked at lots of different taggers (Stanford POS Tagger, CLAWS, XGTagger, GATE, etc.) and they all have their pros and cons. The simpler ones are too hard to customize. The ones I can really customize need a lot of pre-tagging before they will tag anything. What I am going to work on today are configuration files for XGTagger. This one seems optimal for my needs, but it will take a bit of work to get it going. Considering my deadline, I had better get working!! More later 🙂
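To make the "keep the XML metadata, add parts of speech" idea concrete, here is a minimal sketch in Python. It is not the XGTagger setup and not a real tagger: the element names (`post`, `body`, `comment`), the `author` attribute, and the tiny lookup table are all assumptions made up for illustration, and the real corpus files will look different.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; the real corpus files will have their own element names.
SAMPLE = """<corpus>
  <post author="alice">
    <body>The tagger reads the text</body>
    <comment author="bob">Tagging is hard</comment>
  </post>
</corpus>"""

# Toy lexicon standing in for a real POS tagger; unknown words get "UNK".
TOY_LEXICON = {
    "the": "DET", "tagger": "NOUN", "reads": "VERB", "text": "NOUN",
    "tagging": "NOUN", "is": "VERB", "hard": "ADJ",
}


def toy_tag(text):
    """Split text into word tokens and look each one up in the toy lexicon."""
    tokens = re.findall(r"\w+", text)
    return [(tok, TOY_LEXICON.get(tok.lower(), "UNK")) for tok in tokens]


def tag_corpus(xml_string):
    """Return one row per text-bearing child of each post, keeping the author metadata."""
    root = ET.fromstring(xml_string)
    rows = []
    for post in root.iter("post"):
        post_author = post.get("author")
        for child in post:
            text = (child.text or "").strip()
            if not text:
                continue
            rows.append({
                "element": child.tag,
                # A comment carries its own author; the body inherits the post's.
                "author": child.get("author", post_author),
                "tokens": toy_tag(text),
            })
    return rows


if __name__ == "__main__":
    for row in tag_corpus(SAMPLE):
        tagged = " ".join(f"{word}/{tag}" for word, tag in row["tokens"])
        print(row["element"], row["author"], tagged)
```

The point of the sketch is only the shape of the pipeline: the metadata comes straight from the existing XML, and the tagging step is a separate function that could be swapped for whichever tagger ends up being used.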
speaking of tags 😛