By Maurizio Farina | Posted on October 2017 | DRAFT
This tutorial gives an overview of NLP and NLU.
ListFeeds.com is a feed aggregator. Our goal is to build an application able to query feeds by location.
For this reason the ListFeeds engine is built on crawler and NLP libraries that grab feeds from different data sources and extract their text in order to localize each feed.
This post describes the libraries and toolkits we analyzed to achieve this goal.
"DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data." (from the DBpedia web site)
Ontology: currently there are 685 classes described by 2,795 different properties. For the complete list, refer to the DBpedia Ontology Classes web page; here is just an example:
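As a sketch of how such an ontology class can be queried, the snippet below builds a SPARQL query listing instances of a DBpedia ontology class and sends it to the public SPARQL endpoint. The helper names (`build_class_query`, `query_dbpedia`) are our own, and the endpoint URL and `format` parameter reflect the public DBpedia setup at the time of writing; adjust as needed.

```python
import json
import urllib.parse
import urllib.request

def build_class_query(dbo_class: str, lang: str = "en", limit: int = 10) -> str:
    # Build a SPARQL query selecting instances of a dbo: class with labels
    # in the requested language.
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/> "
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
        "SELECT ?item ?label WHERE { "
        f"?item a dbo:{dbo_class} ; rdfs:label ?label . "
        f"FILTER (lang(?label) = '{lang}') }} LIMIT {limit}"
    )

def query_dbpedia(query: str, endpoint: str = "https://dbpedia.org/sparql"):
    # The public endpoint accepts GET requests and can return JSON results.
    url = endpoint + "?" + urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["results"]["bindings"]

# Usage (requires network access):
#   for row in query_dbpedia(build_class_query("Museum", limit=5)):
#       print(row["item"]["value"], "-", row["label"]["value"])
```

The query construction is kept separate from the HTTP call so it can be reused against other SPARQL endpoints (for example, the Italian DBpedia one).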
It is important to highlight that DBpedia uses infobox mappings to enhance the information extracted from Wikipedia. Thanks to these mappings it is possible to attach properties to Wikipedia items.
Starting from here it is possible to find all mappings for the Italian language. Selecting Museo, you can see all the properties and their ontology classes; see the Museo template page for a complete explanation. What matters for us is that we can search Wikipedia for the Museo mapping, retrieve all "Museo" records, and access all the properties described in the mapping template.
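A minimal sketch of that retrieval, assuming the Museo template maps to the `dbo:Museum` class and that the Italian DBpedia endpoint exposes WGS84 coordinates via the `geo:` vocabulary (both are assumptions to verify against the mapping page):

```python
import json
import urllib.parse
import urllib.request

# Query for museums with coordinates: exactly the kind of data the
# ListFeeds engine needs in order to localize a feed.
MUSEO_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?museo ?lat ?long WHERE {
  ?museo a dbo:Museum ;
         geo:lat ?lat ;
         geo:long ?long .
} LIMIT 20
"""

def fetch_museums(endpoint: str = "https://it.dbpedia.org/sparql"):
    # Returns a list of (resource URI, latitude, longitude) tuples.
    url = endpoint + "?" + urllib.parse.urlencode(
        {"query": MUSEO_QUERY, "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(url) as resp:
        return [
            (b["museo"]["value"],
             float(b["lat"]["value"]),
             float(b["long"]["value"]))
            for b in json.load(resp)["results"]["bindings"]
        ]

# Usage (requires network access):
#   for uri, lat, long_ in fetch_museums():
#       print(uri, lat, long_)
```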
A bit of Theory
NLP is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof.
Natural language understanding (NLU) is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension.
Stanford CoreNLP provides a set of human language technology tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and syntactic dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get the quotes people said, etc.
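CoreNLP also ships a server that accepts raw text over HTTP, with the pipeline configured through a JSON `properties` parameter. The sketch below assumes a server already running locally on port 9000 (the default); the annotator list and the `named_entities` helper are our own choices for illustration.

```python
import json
import urllib.parse
import urllib.request

def annotate(text: str, annotators: str = "tokenize,ssplit,pos,ner",
             host: str = "http://localhost:9000") -> dict:
    # POST the raw text; the pipeline configuration travels in the
    # URL-encoded "properties" parameter as a JSON object.
    props = json.dumps({"annotators": annotators, "outputFormat": "json"})
    url = host + "/?" + urllib.parse.urlencode({"properties": props})
    req = urllib.request.Request(url, data=text.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def named_entities(doc: dict):
    # Collect (word, NER tag) pairs for tokens tagged with anything
    # other than "O" (outside any entity).
    return [
        (tok["word"], tok["ner"])
        for sentence in doc.get("sentences", [])
        for tok in sentence.get("tokens", [])
        if tok.get("ner", "O") != "O"
    ]

# Usage (requires a running CoreNLP server):
#   doc = annotate("The Uffizi Gallery is in Florence.")
#   print(named_entities(doc))
```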
Unfortunately, Stanford CoreNLP doesn't support Italian, but the Tint project provides models for the Italian language.
The following link explains why NLP and NLU are so hard for software.
| Resource | Description |
| --- | --- |
| NER resources | A curated list of resources dedicated to Natural Language Processing |
| Italian DBpedia Group | 1.5M entities, of which 500,000 are classified using the ontology: 263,000 persons, 144,000 locations, 29,000 movies, and so on |
| Italian DBpedia download site | RDF dumps in Turtle serialization format, ready to use |