Finished the paper!!!

Yes, it is official! I finished the paper and submitted it to the ICWSM conference. It is the first methodological paper I have written, and it feels a bit strange to write something that is more descriptive and has little to no results. It does, however, describe the why and how of my current pilot study, which I *hope* to have finished by the end of January. (Finishing this paper gave me quite a lot of energy!) I can admit, though, that I will take time off for Christmas and decorating our tree!!

This is the abstract from the paper:

Following Conversational Traces:
Part I: Creating a corpus with the ICWSM dataset

This paper will present the methodology behind the creation of a linguistic corpus based on a subset of the 2007 International Conference on Weblogs and Social Media dataset. Social network analysis methods were used to identify traces of conversation in a small group of political bloggers. Posts from these bloggers were tagged for parts of speech and indexed into a corpus using Xaira. From this corpus, the political blogger subset will be investigated for register and referential information. Referential information, especially with regard to new and given information, will be compared against network placement both to identify network innovators and to compare network placement as a catalyst for innovation. The final section, Further Research, will describe the pilot study generated from this subset, as well as outline the modifications necessary for the creation of a corpus from the entire ICWSM 2006 dataset (currently in progress).

How do we answer the ‘how’?

‘We will leave here today and our language will have changed by the interaction that has taken place.’ –Nev Shrimpton

This was the closing thought one of our corpus linguists left us with after his very interesting seminar on Friday. And, while it is an exaggeration for emphasis, it is also true. Communication is a constant state of negotiation, and language is in continuous flux. Those with whom we come into contact modify our language. We speak a certain way with a certain group, and even with ourselves. And while that thought in itself is interesting, even more so is the ‘how’. How does our language change (variation)? How do we use language to communicate in different situations and with different people? How do we do this when taking into consideration the ‘invisible readers’ of blogs and people outside our real-world sociolects (often limited geographically to a select group of speakers)? I believe that blogs are an exceptional object of research to answer this ‘how’. Blogs are social. We have established that they form social networks. The clustering/small world effects allow us to look for variation with regard to perceived general audience as well as to perceived social network. So again, how do we answer the ‘how’? Several ways, I would say.

Social network analysis:

    Where are people positioned in their network? How fluid are those positions? How often (if at all) do they interact with members of other networks?
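
A minimal sketch of how these questions might be counted, assuming the links between blogs are available as (source, target) pairs and each blogger has already been assigned to a network. The function name, the bloggers, and the network labels are all made up for illustration; this is not the method used in the paper.

```python
from collections import Counter

def position_summary(links, network_of):
    """Count, per blogger, outgoing links, incoming links, and cross-network links.

    links: iterable of (source, target) pairs, one per link between blogs.
    network_of: dict mapping each blogger to a network label (assumed known).
    """
    out_links = Counter()
    in_links = Counter()
    cross_links = Counter()
    for source, target in links:
        out_links[source] += 1
        in_links[target] += 1
        # A link "crosses" when the two bloggers sit in different networks.
        if network_of[source] != network_of[target]:
            cross_links[source] += 1
    return {
        blogger: {
            "out": out_links[blogger],
            "in": in_links[blogger],
            "cross": cross_links[blogger],
        }
        for blogger in network_of
    }

# Hypothetical toy data: three bloggers in two networks.
links = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("cat", "ann")]
network_of = {"ann": "left", "bob": "left", "cat": "right"}
summary = position_summary(links, network_of)
```

Running the same count over links from different time periods would give a rough feel for how fluid a position is, although a proper analysis would probably use a dedicated network library.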

Corpus Linguistics:

    1. Are different networks using their blogs in different ways? To begin to find this out, I want to identify the registers of different networks. I believe this is key. Are some more speech-like than others? Are some more matter-of-fact, some more questioning? Where do they fall on the continuum of speech and writing? Does this differ between the different types of weblog networks? To find this out, I must tag for parts of speech. I will use grammatical patterns, rather than semantic, to determine register.
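
To make the idea of grammatical patterns a bit more concrete, here is a rough sketch that turns part-of-speech tagged text into rates per 1,000 tokens. It assumes Penn Treebank style tags (as in the tagger output further down) arriving as (word, tag) pairs; the grouping of tags and the name `register_profile` are my own illustrative choices, not an established register model.

```python
from collections import Counter

# Coarse groupings of Penn Treebank tag prefixes. The grouping is an
# illustrative assumption, not a validated set of register features.
GROUPS = {
    "pronoun": ("PRP", "WP"),
    "noun": ("NN",),
    "verb": ("VB", "MD"),
    "adjective": ("JJ",),
    "adverb": ("RB", "WRB"),
    "interjection": ("UH",),
}

def register_profile(tagged_tokens):
    """Return the rate per 1,000 non-punctuation tokens for each group."""
    counts = Counter()
    total = 0
    for _word, tag in tagged_tokens:
        # Skip punctuation tags such as "," and "." so they do not dilute rates.
        if not tag[:1].isalpha():
            continue
        total += 1
        for group, prefixes in GROUPS.items():
            if tag.startswith(prefixes):
                counts[group] += 1
                break
    if total == 0:
        return {group: 0.0 for group in GROUPS}
    return {group: 1000 * counts[group] / total for group in GROUPS}

tagged = [("Well", "UH"), (",", ","), ("duh", "UH"), (",", ","),
          ("you", "PRP"), ("say", "VBP"), (".", ".")]
profile = register_profile(tagged)
```

Comparing such profiles between networks (for example, a higher interjection and pronoun rate in one of them) would be one simple way to start placing them on the speech and writing continuum.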

There are other important and interesting things to look at when using corpus methods. For example, you can look at pronouns and nouns to measure referring expressions. I think this can be quite interesting, especially when considering that following discourse over different weblogs is not an easy task. This, of course, cannot be done purely from the corpus. You need to take into account whether or not the noun is new or given information. I think whether or not it is also a link will be significant.

Semantic patterns are also very interesting and will play an important role in determining the register of a group. While this can be done with keyword lists, I think a much better and more useful way *is* with tOKo. You not only get the unique patterns, but their social relations as well. This makes intuitive guesses much less about intuition and more about measurement.


    How do their positions relate to language maintenance and variation? Is there a relationship between the fluidity of placement and variation? What about other social variables? Does ‘real-world position’ (e.g. being a professor rather than a grad student in an academic network) make a difference? Gender? Geography?

About using XML files: The XML files I have at the moment are already tagged for author and URL, which will make exploring social and linguistic relationships easier. I want to add tags which will allow me to explore on different levels; not least, grammatical and syntactic.
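
As a small sketch of why the author and URL tags help, the snippet below groups posts by author using Python's standard XML parser. The element and attribute names (`posts`, `post`, `author`, `url`) and the sample content are hypothetical; the real files may well be structured differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; the actual dataset layout is an assumption here.
SAMPLE = """<posts>
  <post author="ann" url="http://example.org/ann/1">Tagging is not trivial.</post>
  <post author="bob" url="http://example.org/bob/1">Well, duh, you say.</post>
  <post author="ann" url="http://example.org/ann/2">More later.</post>
</posts>"""

def posts_by_author(xml_text):
    """Map each author to a list of (url, text) pairs, in document order."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for post in root.iter("post"):
        text = (post.text or "").strip()
        grouped[post.get("author")].append((post.get("url"), text))
    return dict(grouped)

grouped = posts_by_author(SAMPLE)
```

Once posts are grouped like this, each author's text can be sent to a tagger separately, which keeps the social information (who wrote it, where) attached to the linguistic information.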

Building the corpus!

Well, I made it this far! Now, how do I get the program to read the tags right? Slow progress (I guess better than no progress!)

(INTJ (UH Well) (, ,) (UH duh))
(, ,)
(NP (PRP you))
(VP (VBP say))
(. .)))

(NP (NNP Tagging))
(VP (VBZ is) (RB not)
(ADVP (RB only))
(ADJP (RB not) (JJ trivial))
(, ,)
(PP (CC but)
(NP (DT the)
(ADJP (RBS most) (JJ important))
(NN part))
(PP (IN of)
(NP (DT the) (NN text))))))
(. .)))

(NP (DT The) (NN way))
(SBAR (IN that)
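
One possible way to get a program to read output like this is to ignore the nesting and pull out only the leaves, that is, each (TAG word) pair. The sketch below is my own illustration using a regular expression, not the configuration of any of the taggers mentioned here; the names `LEAF` and `leaves` are made up.

```python
import re

# Matches a leaf such as "(UH Well)" or "(, ,)": a tag followed by a word,
# with no nested brackets inside.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def leaves(bracketed):
    """Return (word, tag) pairs in reading order from bracketed parser output."""
    return [(word, tag) for tag, word in LEAF.findall(bracketed)]

sample = "(INTJ (UH Well) (, ,) (UH duh)) (, ,) (NP (PRP you)) (VP (VBP say)) (. .)"
pairs = leaves(sample)
```

Because it only looks at leaves, this also copes with output that is cut off or has unbalanced brackets, at the cost of throwing away the phrase structure (NP, VP, and so on).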

Tagging is not trivia!

Well, duh, you say.

Tagging is not only not trivial, but the most important part of the text. The way that you tag it greatly determines what you get out of your corpus. I put away the corpus work for a bit while trying to get the data to visualize, but now I need to get things into the indexer. The first step is to tag for parts of speech. I would like to use the original XML files because they are already tagged for some of the metadata I need (author, comments, etc.). Getting parts of speech, however, is more complicated. I have looked at lots of different taggers (Stanford POS tagger, CLAWS, XGTagger, GATE, etc.) and they all have their pros and cons. The ones that are not so complicated are too difficult to manipulate. The ones that I can really manipulate take a lot of pre-tagging to tag. What I am going to work on today are configuration files for the XGTagger. This one seems optimal for my needs, but will take a bit of work to get going. Considering my deadline, I had better get working!! More later 🙂

speaking of tags 😛

Seminar streams and writing processes

Sometimes being a working, single mom can be difficult, especially when kids get sick and deadlines are looming. Today, however, I was able to merge my two worlds by being home with my sick daughter (who is going crazy with my stapler at the moment) and watching Therese's seminar through the live stream at the same time. I think this is the first live stream I have watched and, to be honest, I was quite pleasantly surprised by the quality. There were no lags, no jumps and bumps. It was like I was there! We do, however, need to establish a better way of interacting with the seminar speakers. It is difficult to ask questions. There is a chat, but I can never remember the URL. Skype is also an option that we have tried, but we need to work out the sound aspects a bit better.

I am also putting the finishing touches on my rough draft to be distributed in the morning. I have really enjoyed writing again! I have enjoyed working my way through the process. In the last year I have done a lot of presentations in which you do the research, write outlines and cue cards about what to say, and create pretty little PowerPoints, but then the full synthesis gets lost in the planning of the next project. This time I am completing the process and writing a paper. It is the first chapter of my thesis and will look at the methodology of creating a linguistic corpus from a weblog dataset. This paper is written for a specific conference, and will have to be lengthened and the focus slightly changed for my thesis, but recording the creation process in such a way is a great start!!

The seminar at which I was supposed to present the findings this paper is based on was going to be last Friday. Due to the flu, however, it has been postponed to this Friday at 13.30.

reaching out

i really admire the amount that my university reaches out to students in Umeå. there are different days all through the year when students are invited in to listen or experience or experiment. this friday is one such day and i have the honor of representing the Department of Modern Languages to 4 groups of 12 and 13 year olds. we will begin by watching a movie (gotta love imovie) which shows what we are working on now, shows interesting projects combining the humanities and IT world wide, and finishes by proposing future research. in keeping with the theme of the students being the agents of future research, they will move through 5 stations: machinima, blogging, hypertexts, gaming, and a poetry project. very much looking forward to interacting with the students!

Amateur lovers

(Oh! I hope this title does not earn me spam out the wazoo!!)

After reading Cathy's comment, I started thinking about the amateur versus the professional: what makes the amateur write in a blog over time, and what makes the professional? Do they have the same dedication? Does earning money with something make you less passionate about it? (Good question for the PhD student.) And then I read Jennie's HUMlab post about amateurs as lovers:

He talked about innovation and how innovation can be found in the interplay between art and media. He also mentioned the dichotomy professional and amateur and how these terms are being challenged by a new group, pro-amateurs, a hybridisation where innovation easily can be found. Further, he explained the meaning of the term amateur deriving from the Latin amator translated as lover. Amateurs, Brown claimed, are searching new ways, experimenting and challenging simply because they love doing it. – jennie of humlab

searching new ways, experimenting and challenging simply because they love doing it. to me, this describes a blogger, someone who is passionate (usually about a particular topic) and shares his or her passion with an audience. the size of the audience or the size of the reward is inconsequential; the act of blogging itself, not least the benefits of working through a problem, receiving feedback, and connecting with other amateurs (lovers), is a reward in itself (to get a bit cliche).

To Cathy: the person who responded was referring to someone who was not paid to write about something she loved doing. Blogging, to her, was a work of heart. To her, journalism was professional writing, and everything else amateur.

Somehow, being an amateur all of a sudden feels great.

The Internet is destroying the apostrophe!!

(I guess that sarcasm is difficult to express through a blog).

Down here in HUMlab, we are made up of quite a few linguists (or budding linguists) who are interested in CMC. This article from the Toronto Star about how the Internet is killing off punctuation is just one more in a line of scare pieces about what the Internet is doing to language. I could sum it up for you, but a weblog I have recently found and really like, mikes web log, does such a great job that I thought I would just point you there. Enjoy.