October 2021. This is used for personal purposes only - it is an exercise in taking some years of experience and applying it to extracting data from the web. Why? Covid and retirement occurred practically simultaneously. It started out as a distraction from boredom and some isolation, and turned into more than that. I try to turn the tables on the web model by taking what is available in as many ways as it is offered, but saving it to my own domain, subject to my design and use.

Retrieving and Posting Content

The user interface supports drag-and-drop and copy-and-paste. Content items can come from three types of sources:
1. Platform content
2. User content
3. Web content

Type 1 - Platform content.
  • Look for a 'Share' or 'Embed' option. Drop or paste the URL or the 'code' this process provides. These are the domains/platforms which are accepted...
                                                                                                                                   
    All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.
  • RSS feeds are found under the RSS icon. Drop or paste the URL of the XML file this process provides.
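
As a rough illustration of what reading such an XML file involves, here is a minimal sketch using only the Python standard library. It assumes a plain RSS 2.0 layout (channel/item with title, link and description). The function name parse_rss_items and the sample feed are made up for this example and are not the app's actual code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>Summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item> element found in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


for entry in parse_rss_items(SAMPLE_RSS):
    print(entry["title"], entry["link"])
```

Real feeds vary (Atom, namespaced extensions, missing fields), so a production reader would need more care than this.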

Type 2 - User content from desktop/phone
  • Drag/drop or copy/paste the following:
    • pdfs
    • images
    • text
    • mp3
    • mp4
    • mov

Type 3 - Web content (images/text/web page url/web search)
  • Images. Any images from the clipboard or images dragged from a web site should be accompanied by source attribution.
    (Google image search provides an excellent share option for this.)
  • Text. Any text dragged from a web site should be accompanied by an attribution to its source.
  • Web page URL. Drop or paste a URL. The app has an array of generally accepted schemas which are referenced during the scraping process. Keywords extracted during the scraping process are incorporated into the tag database.
  • Web search for content. Boolean web searches can be used to find web content. The app uses the Bing search API. These searches can be saved as pre-defined queries for re-use.

    An example of a pre-defined query ('National parks hiking') and its search results. Multiple 'cards' (see below) can be created by selecting a category from the pulldown for any one of the items.


  • Tag searches. The UI allows one or more tags to be combined to search the card database or to run a Google search.

Storing and Viewing the content

Content items are stored in a 'content record', which can contain multiple content items stacked together in card-like objects.
Cards must be assigned to a user-defined category.
Any user can create a category and specify it as open to all users (the default) or private.

Card properties:
  • Each card can have tags (from a tag database) assigned to it. Keywords extracted from a web site during the scraping process are added to the card's tag list. Tags can also be manually added to, or removed from, the tag database and a card.

    Example of slicing and dicing tags for a category 'Real Estate':

  • Each card and any items within a card can have threaded comments from application users.

    Example of a threaded conversation within a card:

  • Each card and item within a card can be open to all app users or restricted to specific app friends.
  • All text content within cards is indexed and immediately searchable. Results are returned only if the user doing the search has rights to the given item.
  • A card can be marked as a 'favorite'
  • By default all card content is viewed by users via the latest posts from everyone. Views for a specific category, posts from your friends, from your 'favorites', or from your private cards only are also available.
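
The combination of indexed text and per-user rights described above can be sketched roughly as follows. This is only a toy model under stated assumptions: the Card and CardIndex classes, their fields, and the rule "an empty allowed_users set means open to all" are hypothetical and do not describe the app's real storage.

```python
from dataclasses import dataclass, field


@dataclass
class Card:
    card_id: int
    owner: str
    text: str
    # Hypothetical rule: an empty set means the card is open to all users.
    allowed_users: set = field(default_factory=set)

    def visible_to(self, user):
        return (not self.allowed_users
                or user == self.owner
                or user in self.allowed_users)


class CardIndex:
    """Tiny inverted index: word -> ids of the cards containing that word."""

    def __init__(self):
        self._cards = {}
        self._postings = {}

    def add(self, card):
        self._cards[card.card_id] = card
        for word in set(card.text.lower().split()):
            self._postings.setdefault(word, set()).add(card.card_id)

    def search(self, query, user):
        words = query.lower().split()
        if not words:
            return []
        # A card must contain every query word (boolean AND).
        ids = set.intersection(*(self._postings.get(w, set()) for w in words))
        # Filter by the searching user's rights before returning anything.
        return sorted(i for i in ids if self._cards[i].visible_to(user))


index = CardIndex()
index.add(Card(1, "ann", "Hiking in national parks"))
index.add(Card(2, "bob", "Private hiking notes", {"cara"}))
print(index.search("hiking", "ann"))
print(index.search("hiking", "cara"))
```

Because a card is indexed as soon as it is added, it is searchable immediately, and the rights check happens at query time so a change in sharing takes effect on the next search.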


Examples of content

When you drag and drop a URL into the app, it extracts all the relevant textual elements and semantic data. See below for examples.

Some rules:

If the web site is behind a paywall, the app can extract the data and save it (if the user posting it is signed in). This data can only be used for personal retrieval purposes and must not be used in a way that would break any paywall contract.

All 'article body' data is very likely copyrighted material -- any sharing/printing must be for personal reasons only and must show the copyright owners. See below for examples of likely copyrighted material.


An example of cards that are using embedded content from Flickr, YouTube, Facebook, and page content from easyreadernews.com, sfgate.com, nytimes.com and opb.org...these are not showing any copyrighted material


An example of several cards showing data extracted by the app from news sites...these are showing copyrighted material (shown here for demonstration purposes only)


More examples showing data extracted by the app from news sites...these are also showing copyrighted material (shown for demonstration purposes only)


Semantic items extracted by a custom process for Vrbo offerings


An example of items containing semantic content extracted from various web sites


An example of content extracted by the app from several RSS feeds. Note that you can remove individual items from a feed if you posted that item. That's because RSS feeds can go on and on.


Embedded video content from YouTube, uploaded mp3 files, page content and tags.


Embedded videos from TV news, page content and tweets.


Web pages, uploaded text and images, and an embedded Google Book


Uploaded content, 2 embedded items from 'reddit.com', and embedded tweets from abcnews.com and fox8 in New Orleans


An example of an alltrails.com embedded map


Sharing content


The app's architectural model allows for individual use or multi-user use.


Getting meaningful semantic information by parsing the structures of web pages

Semantic data is stored within HTML pages using several different methods. Typically more than one method is used within a page. These methods are:
  • ld+json
  • microdata
  • Open Graph
  • HTML header meta tags
  • HTML5 Semantic tags
Semantic data is defined using a framework stored at Schema.org. Schema.org is a collaborative community project which defines and maintains a hierarchy of semantically named items and structures.

All widely used browsers support semantic data.

Why use semantic data?
It is in the interest of web sites to include this semantic data if they wish to be found and used by their customers through search engines such as Google and Bing. It is in the interest of search engines to have this data so they can produce meaningful results for the user.

In addition, other companies offer services (i.e., not search engines) that use this data. Some examples that directly use the semantic structures at schema.org:
  • an audio internet appliance that speaks recipe ingredients and instructions to you, and steps you through the process
  • tourism and hospitality providers that return restaurant opening hours, a menu, location, access to reservations, customer ratings, price range, etc.
  • retail middle-men whose customers are buying a product and want to compare similar products and prices, and to see user ratings, user reviews, product attributes, photos, etc.
  • book club applications

Working with semantic coding methods

ld+json

  • "ld" stands for "Linked Data". Within the HTML code of a web page this linked data, using named definitions from the Schema.org framework, is stored in JSON format. It can be found within a "script" tag:
    <script type="application/ld+json">
    json data
    </script>
  • JSON is a lightweight text format for storing structured data.
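
A minimal sketch of pulling these blocks out of a page, using only the Python standard library. The class name LdJsonExtractor and the sample page are invented for illustration and are not the app's own code; the sketch assumes each script tag holds one well-formed JSON value.

```python
import json
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_ldjson = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_ldjson = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_ldjson:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ldjson:
            self._in_ldjson = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than failing the whole page


# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Pancakes",
 "recipeIngredient": ["flour", "milk", "eggs"]}
</script>
</head><body></body></html>"""

extractor = LdJsonExtractor()
extractor.feed(SAMPLE_HTML)
for block in extractor.blocks:
    print(block["@type"], block["name"])
```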

Microdata

Microdata is a bit older than ld+json. It is also defined using a framework found at Schema.org. Microdata items are stored within HTML elements (e.g., <div>, <span>, <meta>) by means of an attribute named "itemprop". That attribute:
  • stores the name of the schema item, and
  • takes its value from the element's text or, on elements such as <meta>, from an accompanying "content" attribute.
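
The idea can be sketched with the standard library as below. This is a deliberately simplified reading of microdata, not a full implementation: it flattens nested items into one list, takes a value from the first of a few common attributes (content, href, src, datetime) and otherwise from the element's text. The class name MicrodataExtractor and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collect (itemprop name, value) pairs in document order (simplified)."""

    VALUE_ATTRS = ("content", "href", "src", "datetime")
    VOID_TAGS = {"meta", "link", "img", "br", "hr", "input", "source"}

    def __init__(self):
        super().__init__()
        self.props = []
        self._name = None   # name of the text-valued property being read
        self._tag = None
        self._depth = 0
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if self._name is not None:
            # Inside a text-valued property: only track same-tag nesting.
            if tag == self._tag:
                self._depth += 1
            return
        name = a.get("itemprop")
        # An element with itemscope is a nested item; read its children instead.
        if not name or "itemscope" in a:
            return
        for attr in self.VALUE_ATTRS:
            if a.get(attr):
                self.props.append((name, a[attr]))
                return
        if tag in self.VOID_TAGS:
            return  # void element with no usable value attribute
        self._name, self._tag, self._depth, self._text = name, tag, 1, []

    def handle_data(self, data):
        if self._name is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._name is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self.props.append((self._name, "".join(self._text).strip()))
                self._name = None


# Hypothetical sample markup, used only for illustration.
SAMPLE_HTML = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Trail Map</span>
  <meta itemprop="sku" content="TM-100">
  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
    <meta itemprop="price" content="9.99">
    <link itemprop="availability" href="https://schema.org/InStock">
  </div>
</div>
"""

extractor = MicrodataExtractor()
extractor.feed(SAMPLE_HTML)
for name, value in extractor.props:
    print(name, "=", value)
```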

Open Graph

Facebook's Open Graph stores its items within meta tags. When someone posts a web link to Facebook, Facebook will get at a minimum three of those meta tags from that link.
When Facebook initially released Open Graph, 50,000 web sites added those meta tags within one week. This has become the de facto standard for most web sites. Almost all web sites now have at least those three meta tags, which shows the economic and social power of semantics in linking items.
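
Reading Open Graph tags out of a page can be sketched as follows, again with the standard library only. The class name OpenGraphExtractor and the sample page are made up. The property names in the sample (og:title, og:image, og:url) come from the Open Graph protocol and are used here as typical examples, not as a statement of which three tags the text above refers to.

```python
from html.parser import HTMLParser


class OpenGraphExtractor(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        prop = a.get("property") or ""
        if prop.startswith("og:") and a.get("content") is not None:
            # Keep the first value if a property is repeated.
            self.tags.setdefault(prop, a["content"])


# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """<html><head>
<meta charset="utf-8">
<meta property="og:title" content="Crater Lake Hiking Guide">
<meta property="og:image" content="https://example.com/lake.jpg">
<meta property="og:url" content="https://example.com/guide">
<meta name="keywords" content="hiking, national parks">
</head><body></body></html>"""

extractor = OpenGraphExtractor()
extractor.feed(SAMPLE_HTML)
for prop, content in extractor.tags.items():
    print(prop, "=", content)
```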

Meta tags

The oldest HTML tags for storing and exposing metadata within a web page are 'meta' tags. These are found in the HTML head. These <meta> elements are typically used to store information such as the character set used on the page, the keywords of the document, the viewport, the author, and many provider-specific implementations (such as Facebook Open Graph tags).

HTML Semantic tags


Other sources:

Some websites also provide 'keyword' tags embedded within their semantic data. These keywords can be single words or phrases of words. The app will take these and turn them into referential tags.
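
One plausible way to turn such keywords into tags is sketched below. The normalization rules (split on commas, collapse whitespace, lower-case, drop duplicates) and the function name keywords_to_tags are assumptions made for this example; the app's actual rules may differ.

```python
def keywords_to_tags(raw_keywords):
    """Normalize keywords into unique tags, keeping first-seen order.

    Accepts either a comma-separated string or a list of strings.
    """
    if isinstance(raw_keywords, str):
        parts = raw_keywords.split(",")
    else:
        parts = list(raw_keywords)

    tags = []
    seen = set()
    for part in parts:
        # Collapse runs of whitespace and lower-case so variants match.
        tag = " ".join(str(part).split()).lower()
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags


print(keywords_to_tags("National Parks, hiking,  national parks ,, Trail  Maps"))
```

Keeping phrases such as "national parks" whole, rather than splitting them into single words, matches the note above that keywords can be single words or phrases.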


** A note about semantic data for product price and availability:
If you extract the price and availability for a product from a web page (both the ld+json and microdata frameworks provide this information in a structured way) and intend to use the data for your own purposes, please note that many businesses do not allow you to retrieve it once and let it remain static. You must be able to dynamically refresh that price and availability any time you 'offer' it. Businesses such as Target, Amazon and Best Buy require this, and they provide a web service to get that updated information. An example is Target's "Redsky" service.

These three things from a given web site:
  • Links
  • Images
  • Text
along with the embedded semantic information as explained above, are pretty much all of the worthwhile information there is on a web page that is not domain specific.
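
Collecting those three things from a page can be sketched with the standard library as follows. The class name PageAssetExtractor, the base URL and the sample page are invented for this example. The sketch resolves relative links against the page URL and ignores text inside script and style elements.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageAssetExtractor(HTMLParser):
    """Collect links, image URLs and visible text from one HTML page."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a" and a.get("href"):
            self.links.append(urljoin(self.base_url, a["href"]))
        elif tag == "img" and a.get("src"):
            self.images.append(urljoin(self.base_url, a["src"]))

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            text = " ".join(data.split())
            if text:
                self.text_parts.append(text)


# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """<html><head><style>p {color: red}</style></head>
<body><p>Hello <a href="story.html">world</a></p>
<img src="/img/a.png"><script>var x = 1;</script></body></html>"""

extractor = PageAssetExtractor("https://example.com/news/")
extractor.feed(SAMPLE_HTML)
print(extractor.links)
print(extractor.images)
print(" ".join(extractor.text_parts))
```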


Important:
If an image, text or link on a web page is retrieved via the above frameworks or via RSS you can assume the owner of that webpage has decided to allow those assets to be freely distributed. Otherwise you must assume all other assets are copyrighted by the owner of that webpage, and their rights should be attributed.



I use a partial data download from the Microsoft Research web site for use in a 'tag' database, and that requires these two attributions....
Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao, An Inference Approach to Basic Level of Categorization, in ACM International Conference on Information and Knowledge Management (CIKM), ACM – Association for Computing Machinery, October 2015.

Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Zhu, Probase: A Probabilistic Taxonomy for Text Understanding, in ACM International Conference on Management of Data (SIGMOD), May 2012.
