Lemmatizing English Words in XML-Formatted Sentences with Python
Python 2.7, NLTK 3.0
The input XML file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<sentences version="1.0">
<item id="1" asks-for="cause" most-plausible-alternative="1">
<p>my body cast a shadow over the grass . </p>
<a1>the sun be rise . </a1>
<a2>the grass be cut . </a2>
</item>
<item id="2" asks-for="cause" most-plausible-alternative="1">
<p>the woman tolerate the woman friend 's difficult behavior . </p>
<a1>the woman know the woman friend be go through a hard time . </a1>
<a2>the woman felt that the woman friend take advantage of her kindness . </a2>
</item>
...
</sentences>
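As a quick check of this layout, the sketch below reads the same structure from an inline string and collects each item's attributes and sentences. The names SAMPLE and read_items are made up for illustration, and the sample is shortened to a single item.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the input above (assumption: one item is enough to show the layout)
SAMPLE = """<sentences version="1.0">
<item id="1" asks-for="cause" most-plausible-alternative="1">
<p>my body cast a shadow over the grass . </p>
<a1>the sun be rise . </a1>
<a2>the grass be cut . </a2>
</item>
</sentences>"""


def read_items(xml_text):
    """Return one dict per <item>, holding its attributes and sentence texts."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall('item'):
        record = dict(item.attrib)      # id, asks-for, most-plausible-alternative
        for sentence in item:           # <p>, <a1>, <a2>
            record[sentence.tag] = (sentence.text or '').strip()
        items.append(record)
    return items


for record in read_items(SAMPLE):
    print('%s %s %s' % (record['id'], record['asks-for'], record['p']))
```

Each item carries one premise (p) and two alternatives (a1, a2), so the lemmatization code only has to walk two levels: items, then the sentences inside each item.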
Python Code
# This workaround is only needed if you get an error about 'utf-8' encoding (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

import xml.etree.cElementTree as ET      # library for XML processing
from nltk.tokenize import word_tokenize  # word tokenizer
from nltk.stem import WordNetLemmatizer  # WordNet-based lemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

tree = ET.parse('input.xml')  # parse the XML tree from input.xml
root = tree.getroot()         # get the root element of the tree

for item_of_root in root:            # for each item
    for sentence in item_of_root:    # for each sentence in the item
        words = word_tokenize(sentence.text)  # split the sentence into words
        sentenceNew = ""                      # container for the lemmatized sentence
        for word in words:                    # for each word in the sentence
            # lemmatize the word, treating it as a verb (pos='v')
            lamWord = wordnet_lemmatizer.lemmatize(word, pos='v')
            sentenceNew += lamWord + ' '      # append the lemmatized word to the container
        sentence.text = sentenceNew           # store the new sentence in the tree

tree.write('output.xml')  # write the lemmatized tree to output.xml
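The same loop can also be wrapped in a function that takes the lemmatizer as an argument, which makes it easy to try on a small string before running it on a whole file. This is only a sketch: lemmatize_tree, toy_lemmatize, and TOY_LEMMAS are hypothetical names, and the tiny lookup table stands in for WordNetLemmatizer so the example runs without the WordNet data. Because the sample input is already tokenized with spaces, it uses str.split() instead of word_tokenize.

```python
import xml.etree.ElementTree as ET


def lemmatize_tree(root, lemmatize):
    """Replace the text of every sentence element with its lemmatized form.

    `lemmatize` is any callable that maps one word to its lemma.
    """
    for item in root:
        for sentence in item:
            # The input is space-tokenized, so a plain split is enough here
            words = (sentence.text or '').split()
            # Joining with spaces leaves no trailing space in the output
            sentence.text = ' '.join(lemmatize(w) for w in words)
    return root


# Stand-in lemmatizer (assumption): a lookup table instead of WordNet
TOY_LEMMAS = {'felt': 'feel', 'was': 'be', 'rising': 'rise'}


def toy_lemmatize(word):
    """Return the lemma from the toy table, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)


root = ET.fromstring(
    '<sentences><item id="1"><p>the sun was rising . </p></item></sentences>')
lemmatize_tree(root, toy_lemmatize)
print(root.find('item/p').text)  # prints: the sun be rise .
```

To use it with NLTK, pass a wrapper around the real lemmatizer instead, such as lambda w: wordnet_lemmatizer.lemmatize(w, pos='v').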