Hello Crazy World, It Is I…

and you thought I was gone…left the building…escaped out the back door? Well, think again, mighty brain trust…. I am back like a bad dream, just waiting to twist your little brain into a pretzel. What do you think about Wislawa? Great, huh… too bad we all must die. You know, I think sleeping is overrated. Why did God make such a wonderful creation only to have us stop working so we could rest… hate sleeping!

You know, the woman who originally created this blog was pretty damn incredible. We didn’t have the same political views and could very well have had a wonderful argument over a volume of topics. I did, however, bring her back to life through her original blog and historical posts…. no, she’s not dead, well, at least I don’t think she is (not old enough). I wonder what she will think if she ever ventures onto the site? Will she like it, will she hate it, will she laugh or will she cry…. I have no idea, but either way… I am impressed with her.

“It’s hard watching people change, but it’s even harder remembering who they used to be” – have no idea

How do we answer the ‘how’?

‘We will leave here today and our language will have changed by the interaction that has taken place.’ –Nev Shrimpton

This was the closing thought one of our corpus linguists left us with after his very interesting seminar on Friday. And, while it is an exaggeration for emphasis, it is also true. Communication is a constant state of negotiation, and language is in continuous flux. Those with whom we come into contact modify our language. We speak a certain way with a certain group, and even with ourselves. And while that thought in itself is interesting, even more so is the ‘how’. How does our language change (variation)? How do we use language to communicate in different situations and with different people? How do we do this when taking into consideration the ‘invisible readers’ of blogs and people outside our real-world sociolects (often limited geographically to a select group of speakers)? I believe that blogs are an exceptional object of research for answering this ‘how’. Blogs are social. We have established that they form social networks. The clustering/small-world effects allow us to look for variation with regard to the perceived general audience as well as the perceived social network. So again, how do we answer the ‘how’? Several ways, I would say.

Social network analysis:

    Where are people positioned in their network? How fluid are those positions? How often (if at all) do they interact with members of other networks?
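To make the first and last of these questions concrete, here is a minimal sketch in Python. It assumes the network is available as a list of (source blog, target blog) links plus a mapping from each blog to the network it belongs to; the blog names, the network labels and the function name `network_positions` are all made up for illustration. It counts how connected each blog is and what share of its links cross into another network.

```python
from collections import defaultdict


def network_positions(links, membership):
    """Summarise each blog's position from a list of links.

    links:      iterable of (source_blog, target_blog) pairs
    membership: dict mapping blog -> name of the network it belongs to
    """
    degree = defaultdict(int)  # total links touching each blog
    cross = defaultdict(int)   # links that reach into a different network
    for src, dst in links:
        degree[src] += 1
        degree[dst] += 1
        if membership[src] != membership[dst]:
            cross[src] += 1
            cross[dst] += 1
    return {
        blog: {
            "degree": degree[blog],
            "cross_network": cross[blog],
            # Share of a blog's links that leave its own network.
            "cross_share": cross[blog] / degree[blog] if degree[blog] else 0.0,
        }
        for blog in membership
    }


# Hypothetical toy data: three blogs in one network, one in another.
membership = {"a": "academic", "b": "academic", "c": "academic", "d": "knitting"}
links = [("a", "b"), ("b", "c"), ("c", "d")]
positions = network_positions(links, membership)
```

Fluidity of position would need the same measurement repeated over several time slices, which this sketch does not attempt.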

Corpus Linguistics:

    1. Are different networks using their blogs in different ways? To begin to find this out, I want to identify the registers of different networks. I believe this is key. Are some more speech-like than others? Are some more matter-of-fact, some more questioning? Where do they fall on the continuum of speech and writing? Does this differ between the different types of weblog networks? To find this out, I must tag for parts of speech. I will use grammatical patterns, rather than semantic, to determine register.
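As a rough sketch of what a grammatical profile could look like once the text is tagged: the Python below assumes the tokens already arrive as (word, tag) pairs with Penn Treebank style tags, collapses them into a few coarse classes and reports each class as a proportion of the words. The coarse classes and the names `coarse_class` and `pos_profile` are my own illustration, and which proportions actually separate speech-like from writing-like registers is something the corpus would have to show.

```python
from collections import Counter


def coarse_class(tag):
    """Collapse a Penn Treebank style tag into a coarse class (illustrative grouping)."""
    if tag.startswith("PRP"):  # personal and possessive pronouns
        return "pronoun"
    if tag.startswith("NN"):   # common and proper nouns
        return "noun"
    if tag.startswith("VB"):   # all verb forms
        return "verb"
    if tag == "UH":            # interjections
        return "interjection"
    return "other"


def pos_profile(tagged_tokens):
    """Return the proportion of each coarse class, ignoring punctuation tags."""
    words = [(w, t) for w, t in tagged_tokens if t[:1].isalpha()]
    counts = Counter(coarse_class(t) for _, t in words)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}


# Hand-tagged example sentence: "Well, duh, you say."
sentence = [("Well", "UH"), (",", ","), ("duh", "UH"), (",", ","),
            ("you", "PRP"), ("say", "VBP"), (".", ".")]
profile = pos_profile(sentence)
```

Comparing such profiles across networks, rather than reading any single one in isolation, is the point of the exercise.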

There are other important and interesting things to look at when using corpus methods. For example, you can look at pronouns and nouns to measure referring expressions. I think this can be quite interesting, especially when considering that following discourse over different weblogs is not an easy task. This, of course, cannot be done purely from the corpus. You need to take into account whether the noun is new or given information. I think whether it is also a link will be significant too.
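One way to picture the bookkeeping this would need, as a small Python sketch: each referring expression is recorded with its tag, a hand-annotated given/new flag and a flag for whether it is a link, and the combinations are simply counted. The `Mention` record, its field names and the example mentions are invented for illustration; the given/new judgement is assumed to come from manual annotation, not from the corpus itself.

```python
from collections import Counter
from typing import NamedTuple


class Mention(NamedTuple):
    text: str
    tag: str       # Penn Treebank style tag of the head word
    given: bool    # hand-annotated: has the referent been introduced already?
    is_link: bool  # is the expression also a hyperlink?


def tabulate(mentions):
    """Count mentions by (form, information status, link or not)."""
    table = Counter()
    for m in mentions:
        form = "pronoun" if m.tag.startswith("PRP") else "noun"
        status = "given" if m.given else "new"
        table[(form, status, m.is_link)] += 1
    return table


# Invented example annotations, not real corpus data.
mentions = [
    Mention("the tagger", "NN", given=False, is_link=True),
    Mention("it", "PRP", given=True, is_link=False),
    Mention("it", "PRP", given=True, is_link=False),
    Mention("the corpus", "NN", given=True, is_link=False),
]
table = tabulate(mentions)
```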

Semantic patterns are also very interesting and will play an important role in determining the register of a group. While this can be done with keyword lists, I think a much better and more useful way *is* with tOKo. You not only get the unique patterns, but their social relations as well. This makes intuitive guesses much less about intuition and more about measurement.

Sociolinguistic:

    How do their positions relate to language maintenance and variation? Is there a relationship between the fluidity of placement and variation? What about other social variables? Does ‘real-world position’ (e.g. professor rather than grad student in an academic network) make a difference? Gender? Geography?

About using XML files: The XML files I have at the moment are already tagged for author and URL, which will make exploring social and linguistic relationships easier. I want to add tags which will allow me to explore on different levels; not least, grammatical and syntactic.
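A minimal sketch of how author and URL tagging can be put to work, using Python's standard `xml.etree.ElementTree`. The element name `post` and the attribute names `author` and `url` are assumptions for the sake of the example; the real files may be laid out quite differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical layout: one <post> element per entry, metadata as attributes.
SAMPLE = """<posts>
  <post author="alice" url="http://example.org/alice/1">Tagging is not trivial.</post>
  <post author="bob" url="http://example.org/bob/1">Well, duh, you say.</post>
  <post author="alice" url="http://example.org/alice/2">More later.</post>
</posts>"""


def posts_by_author(xml_text):
    """Group (url, text) pairs under each author found in the XML."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for post in root.iter("post"):
        text = (post.text or "").strip()
        grouped[post.get("author")].append((post.get("url"), text))
    return dict(grouped)


grouped = posts_by_author(SAMPLE)
```

Once posts are grouped like this, any linguistic measure computed per post can be rolled up per author or per site.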

Building the corpus!

Well, I made it this far! Now, how do I get the program to read the tags right? Slow progress (better than no progress, I guess!)

(ROOT
(S
(INTJ (UH ell) (, ,) (UH duh))
(, ,)
(NP (PRP you))
(VP (VBP say))
(. .)))

(ROOT
(S
(NP (NNP Tagging))
(VP (VBZ is) (RB not)
(ADVP (RB only))
(ADJP (RB not) (JJ trivial))
(, ,)
(PP (CC but)
(NP
(NP (DT the)
(ADJP (RBS most) (JJ important))
(NN part))
(PP (IN of)
(NP (DT the) (NN text))))))
(. .)))

(ROOT
(S
(NP
(NP (DT The) (NN way))
(SBAR (IN that)
(S
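Bracketed output like the trees above can be read back into plain (word, tag) pairs with a short regular expression. The sketch below is in Python and assumes every leaf has the shape `(TAG word)` with no spaces inside the word; the sample string is the first tree above, copied exactly as it appears. The function name `leaves` is made up.

```python
import re

# A leaf is "(TAG word)": two tokens with no whitespace or brackets inside them.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves(bracketed):
    """Return the (word, tag) pairs of a bracketed parse, in sentence order."""
    return [(word, tag) for tag, word in LEAF.findall(bracketed)]


# First tree from the parser output, reproduced as-is.
TREE = """(ROOT
(S
(INTJ (UH ell) (, ,) (UH duh))
(, ,)
(NP (PRP you))
(VP (VBP say))
(. .)))"""

pairs = leaves(TREE)
```

Phrase-level nodes such as `(NP` or `(VP` are skipped because they are followed by another bracket rather than a word.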

Tagging is not trivial!

Well, duh, you say.

Tagging is not only not trivial, but the most important part of the text. The way that you tag it greatly determines what you get out of your corpus. I put away the corpus work a bit while trying to get the data to visualize, but now need to get things into the indexer. The first step is to tag for parts of speech. I would like to use the original XML files because they are already tagged for some of the metadata I need (author, comments, etc.). Getting parts of speech, however, is more complicated. I have looked at lots of different taggers (Stanford POS tagger, CLAWS, XGTagger, GATE, etc.) and they all have their pros and cons. The ones that are not so complicated are too difficult to manipulate. The ones that I can really manipulate take a lot of pre-tagging to tag. What I am going to work on today are configuration files for the XGTagger. This one seems optimal for my needs, but will take a bit of work to get going. Considering my deadline, I better get working!! More later 🙂
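To show what keeping the metadata while adding parts of speech could look like, here is a small Python sketch using the standard `xml.etree.ElementTree`. It is not how XGTagger or any of the other taggers work: the `post` element, its attributes, the `w` element with a `pos` attribute and the function `add_pos_layer` are all assumptions for illustration, and the (word, tag) pairs are written by hand where a real tagger's output would go.

```python
import xml.etree.ElementTree as ET


def add_pos_layer(post, tagged_tokens):
    """Replace a post's plain text with <w pos="..."> children.

    The post element's own attributes (author, url, ...) are left untouched.
    """
    post.text = None
    for word, tag in tagged_tokens:
        w = ET.SubElement(post, "w", pos=tag)
        w.text = word
    return post


# Hypothetical post element carrying the existing metadata.
post = ET.fromstring(
    '<post author="alice" url="http://example.org/alice/1">Well, duh, you say.</post>'
)

# Hand-written tags standing in for real tagger output.
tagged = [("Well", "UH"), (",", ","), ("duh", "UH"), (",", ","),
          ("you", "PRP"), ("say", "VBP"), (".", ".")]

add_pos_layer(post, tagged)
xml_out = ET.tostring(post, encoding="unicode")
```

Keeping the tags inline means one file carries both the social metadata and the grammatical layer, at the cost of losing the original spacing between words.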

speaking of tags 😛