Lemmatization of English words in sentences in XML format by Python

Lemmatization of English words in sentences in XML format by Python

Python 2.7, NLTK 3.0
The input XML file look likes this:


<?xml version="1.0" encoding="UTF-8"?>
<sentences version="1.0">
<item id="1" asks-for="cause" most-plausible-alternative="1">
<p>my body cast a shadow over the grass . </p>
<a1>the sun be rise . </a1>
<a2>the grass be cut . </a2>
</item>

<item id="2" asks-for="cause" most-plausible-alternative="1">
<p>the woman tolerate the woman friend 's difficult behavior . </p>
<a1>the woman know the woman friend be go through a hard time . </a1>
<a2>the woman felt that the woman friend take advantage of her kindness . </a2>
</item>
...

</sentences>

Python Code


#This setting is only necessary for error about 'encoding utf-8'
import sys
reload(sys)
sys.setdefaultencoding(&quot;utf-8&quot;)

import xml.etree.cElementTree as ET #library for XML processing

from nltk.tokenize import word_tokenize #library for word tokenize

from nltk.stem import WordNetLemmatizer #library for word lemmatize
wordnet_lemmatizer = WordNetLemmatizer()

tree = ET.parse('input.xml') #parse the XML tree from input.xml
root = tree.getroot() #get root element of the tree

for item_of_root in root: #for each item
for sentence in item_of_root: #for each sentence in the item
words = word_tokenize(sentence.text) #divide sentence to words
sentenceNew = &quot;&quot; #contatiner for new lemmatized sentence
for word in words: #for each word in the sentence
lamWord = wordnet_lemmatizer.lemmatize(word, pos='v') #lemmatize the words
sentenceNew += lamWord + ' ' #put the lemmatized word to the contatiner
sentence.text = sentenceNew #store the new sentence to the tree

tree.write('output.xml') #ouput the lemmatized tree to file

 

Reference

The ElementTree XML API – Python 2.7.12 Documentation

Installing NLTK Data

Dive Into NLTK, Part I: Getting Started with NLTK

Dive Into NLTK, Part II: Sentence Tokenize and Word Tokenize

Dive Into NLTK, Part IV: Stemming and Lemmatization

 

 

Leave a Reply