These CIOs are jumping on a fast-moving trend — mining small data sets to deliver valuable insights quickly to the right people at the right time.
A lot of security processes failed during the breach of Target’s systems over last year’s holiday season, but one surprising revelation was that the retailer actually did receive security alerts about the malware in its system. Yet because the security team was bombarded with alerts, estimated at hundreds per day, it couldn’t adequately prioritize them.
Both within the security field and in other areas, that’s a problem that a lot of companies face in the big data era: The top priority is to dig out useful insights from the wealth of data that’s coming at them from multiple sources. They’re trying to get the right data to the right person at the right time; in fact, the goal is to deliver insights in near real time. The result is a discipline that many in IT are calling “fast data.”
“We don’t have a formal definition of it,” says Chirag Dekate, an analyst at research firm IDC. “But it relates to any kind of big data workload that requires massive brainpower and complex algorithms to extract data from it. The need has arisen because of the urgency of analyzing data as the data is coming in.”
Another IDC analyst, Steve Conway, explains the challenge this way: “You have to get rid of everything extraneous and do it quickly. Some of it needs to be in real time, like credit card fraud detection. Because if you don’t do it in real time, you don’t catch it.” There’s also an element of predictive analysis involved in fast data, because you can identify patterns with more recent — and thus more accurate — information.
Indeed, the need is so widespread that different approaches to fast data are emerging in different industries, from financial services to e-commerce. But the goal in every case is to take advantage of better insight sooner. Here’s a look at the phenomenon known as fast data, the technology behind it and some of the companies that are putting it to work.
Symbiotic technology unleashed
New data-management technologies, including Hadoop and NoSQL databases such as HBase, are helping the fast data effort come to fruition by enabling greater flexibility in reading data. Oliver Downs, chief scientist and CTO at Globys, a Seattle-based developer of software that helps service providers analyze customer experience, says what those technologies bring is “the ability to shape data in usable form when you read it rather than when you write it.” The result: It’s possible to analyze data sooner than it once was.
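To make that schema-on-read idea concrete, here’s a minimal Python sketch; the event fields are hypothetical, and the example is only an illustration of shaping data at read time rather than at write time, not Globys’ software.

```python
import json

# Raw events are stored as-is; no schema is enforced at write time.
raw_events = [
    '{"user": "a123", "action": "topup", "amount": 20}',
    '{"user": "b456", "action": "call", "minutes": 12, "roaming": true}',
]

def shape(event_json, fields):
    """Apply a schema at read time: keep only the fields this analysis needs."""
    record = json.loads(event_json)
    return {f: record.get(f) for f in fields}

# Two analyses can read the same raw data with different "schemas."
topups = [shape(e, ["user", "amount"]) for e in raw_events]
usage = [shape(e, ["user", "minutes", "roaming"]) for e in raw_events]
print(topups)
print(usage)
```

Because nothing is forced into a fixed table at ingest, a new question can be asked of old data without reloading it.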
However, it’s not just advancements in database technologies that are making fast data possible. “This is not a new type of application,” says Tony Baer, principal analyst at Ovum.
“We used to call it ‘complex event processing,’” he adds. But that approach required proprietary software and expensive servers, which limited usage. In contrast, one of today’s technologies, Hadoop, “is linearly scalable, and you can throw lots of hardware at it and use memory very effectively,” he says. Roll into that the lower cost of flash memory, adds Baer, and “now we can process data very fast, and do more sophisticated processing than when you were bound by I/O.”
That said, fast data involves some infrastructure considerations, according to analysts. Companies are more likely to use Linux than Windows for fast data projects, and they’re more likely to use InfiniBand networking than traditional Ethernet because the former is faster. And those who want faster analysis might even use dedicated FPGA chips and network processors to watch network traffic, Dekate says.
On the back end, the machines used to analyze these workloads will need denser memory because they focus on streaming data rather than stored data. Increasingly, organizations are using high-performance server clusters to accommodate fast data.

As is to be expected, given their need for security, risk management and pricing accuracy, companies in the financial services industry have been among the first to embrace fast data. But fast data is taking hold in many other sectors of the economy.
Real-time risk assessment
Western Union is best known for something that most people under 50 have never seen: the telegram. Yet the company is now a $5.5 billion global payment services provider, and in 2013 moved some $82 billion in 242 million consumer-to-consumer transactions and 459 million business transactions. According to David Thompson, executive vice president and CIO at the Englewood, Colo.-based company, the data that Western Union collects from 174 kinds of data sources is increasing at the rate of 100TB per year.
Those data sources are both internal and external. Western Union has relationships with more than 1 million “front-line associates,” which range from large companies to cash-dispensing bodegas. It collects data from approximately 500,000 retail agent locations, and also cross-checks information with government agencies that distribute lists of sanctions, politically exposed individuals (who might be subject to extortion threats) and other information.
All of that data feeds real-time decisions about both service and risk. For example, there’s an element of business risk involved when someone in the U.S. wires a large amount of money to, say, India. Western Union has to determine whether there’s enough cash in that location to complete the transaction, or whether the recipient should be sent to a different location.
That’s a customer service issue, but there are also risks to watch out for, such as the potential for fraud. “If we become aware of potential fraud, such as elder abuse, or we suspect that a recipient is engaged in a scam, we can ask the sender in real time if they know the recipient personally,” Thompson says.
“We collect information about transactions, about senders, about receivers, and enhance that data with transactional history,” he says. “We have a lot of moving parts, and I have to enable real-time information around transactions, credit limits and other policies such as international anti-money-laundering regulations.” Thompson explains that in October 2012, he started putting together a system “that could respond to the needs of the business in an economical way.”
It went live in May 2013. Thompson uses Informatica and Tibco software to deliver the data sources into a single Hadoop repository and Tableau for visualization. Hadoop vendor Cloudera hosts the system. Western Union started out with a 64-node cluster but upgraded to a 128-node cluster nine months after going live because “we kept finding more and more business uses and bringing in live transactional data,” he says.
High-Performance Computing Meets Fast Data
As part of its research on high-performance computing, IDC periodically highlights enterprises that are doing innovative work in that area, and those undertakings occasionally overlap with fast data projects.
In a recent example of high-performance computing being put to use for a fast data initiative, PayPal uses “real-time stream analytics for behavioral and situational modeling related to online fraud detection,” says IDC analyst Chirag Dekate. Rather than using Hadoop, as some enterprises might for such an undertaking, PayPal developed a custom algorithm that runs on hardware from Silicon Graphics International and Hewlett-Packard and is capable of analyzing more than 4 billion records. “They saved more than $710 million in a single year” by identifying fraudulent transactions, says Dekate.
In another example, IDC analyst Steve Conway cites a company in the real estate business that needed to replace a system that recalculated the price of its inventory monthly. Hoping to gain a competitive advantage with more accurate pricing, the company turned to high-performance computing and adopted a more complicated pricing algorithm that looks at parameters such as location and turnover. It now assesses the value of 2 million properties three times a day and sets its rents and sale prices accordingly.
Similarly, notes Dekate, insurance provider Geico aggregates a variety of data sources to calculate insurance rates across the U.S. “They can rely on real-time information feeds to improve their coding systems. If accidents have increased in a certain neighborhood, it can detect that and change the pricing,” he says. That benefits Geico in two ways: The company can react to market conditions more efficiently, and its agents can quote an accurate price in real time when a prospective customer calls.
If the company had employed a data warehouse for the same effort, Thompson says, “we would have needed massive amounts of storage, as well as analysts and technicians to code SQL against the database.” But because it’s folly to run live queries against production systems, the traditional method would have been to take snapshots of data every few days and run queries against those snapshots. “But it would have been expensive to keep replicating the data,” he says. “With Hadoop, we have one large instance of all our data in real time, and we can run instantaneous queries without impacting the production system.”
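As a rough illustration of what Thompson describes, the sketch below runs an ad hoc query against a consolidated Hadoop repository rather than against a production database. Spark SQL is used here purely as a stand-in query engine, and the file path and column names are assumptions; the article doesn’t say which query tools Western Union uses beyond the stack named above.

```python
# A minimal sketch of querying one large Hadoop repository instead of
# periodic snapshots of production data. Spark SQL is a stand-in engine;
# the path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("risk-queries").getOrCreate()

# One consolidated instance of transaction data, kept apart from the
# production systems that process live transfers.
txns = spark.read.parquet("hdfs:///warehouse/transactions")

# An ad hoc risk query: count unusually large transfers per corridor.
flagged = (txns
           .filter(F.col("amount_usd") > 10000)
           .groupBy("corridor")
           .agg(F.count("*").alias("large_transfers")))

flagged.show()
```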
Western Union benefits in multiple ways. “We get better consumer protection, because we’re ensuring that transactions are evaluated for risk. We now have a better view into adjusting fraud-prevention models, so we have fewer losses,” Thompson says. “Best of all, I’m finally delivering real-time information into the hands of people who need it.”
Faster disaster response
An insurance claims center after a disaster is like an isolated restaurant when a tour bus rolls in — a setting that was calm a moment earlier turns chaotic quickly. And these days, the chaos is compounded by the fact that customers can contact their insurers through so many different channels.
“During a typical month, we process approximately 700 claims,” says Werner E. Kruck, chief operating officer at Ormond Beach, Fla.-based Security First Insurance. “In the aftermath of a hurricane, that can swell to tens of thousands within days.”
In recent years, as the number of customers contacting agents or claims representatives via social media started to increase, Kruck says he grew concerned that his company “might have a difficult time [responding to] an influx of social media messages received outside of our traditional business systems” following a natural disaster. His concern was confirmed after tornadoes struck Tuscaloosa, Ala., and Joplin, Mo., in the spring of 2011, and again when Hurricane Sandy devastated the Eastern Seaboard in October 2012.
Like many fast data pioneers, Security First Insurance realized that it needed a customized solution. So the company co-developed software it calls SMC4, which stands for “social media capture, control, communication and compliance.” According to Kruck, the application runs on a public cloud from business partner Integritie and incorporates IBM Enterprise Content Management and IBM FileNet.
“It integrates social media technology and email communication into a single interface, making it easier and faster for our company to assist our customers following a storm,” says Kruck.
“In the past, if all our emails went into public folders, they weren’t visible to everyone,” he says. “With this platform, they’re both visible and searchable. We can effectively control the flow of communication by designating people or roles in the organization that are authorized to create and submit responses.”

“When you’re answering email traditionally, you’re handling them in order — first in, first out,” says Kruck. “But with SMC4, we can prioritize messages automatically by reading sentiment and getting a sense of urgency. Is it a routine question? Is there frustration? Profanity? We use multiple levels of customized dictionaries, and they continue to evolve.”
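Kruck’s description of dictionary-driven triage can be boiled down to a short Python sketch. The keyword lists, weights and scoring below are invented for illustration; they aren’t SMC4’s actual dictionaries or logic.

```python
# Hypothetical triage dictionaries; a real system would maintain several
# levels of customized, evolving dictionaries as Kruck describes.
URGENT_TERMS = {"roof": 3, "flood": 4, "injured": 5, "no power": 2}
FRUSTRATION_TERMS = {"still waiting": 2, "unacceptable": 3}
PROFANITY = {"damn": 1}

def priority(message: str) -> int:
    """Score a message so the most urgent ones are answered first,
    rather than strictly first in, first out."""
    text = message.lower()
    score = 0
    for terms in (URGENT_TERMS, FRUSTRATION_TERMS, PROFANITY):
        for term, weight in terms.items():
            if term in text:
                score += weight
    return score

inbox = [
    "Routine question about my policy documents",
    "Our roof is gone and the house is flooded; still waiting for a call back",
]
for msg in sorted(inbox, key=priority, reverse=True):
    print(priority(msg), msg)
```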
A recent American Red Cross survey indicated that 75% of citizens affected by a disaster would expect a response to a social media message within three hours. “Those became our client expectations,” says Kruck. “Our objective is to respond as closely to real time as possible. We want people to be glad they insured with us and not our competitors.”
Faster verification
Heritage Auctions hosts both online and in-person auctions for a wide variety of memorabilia, and it uses fast data to verify the identity of each online auction participant. “We want the brand-new customer who’s wealthy enough to pay $2.2 million for Francis Crick’s Nobel Prize to have a friction-free signup and bidding experience,” says Brian Shipman, CIO at the Dallas-based company. “At the same time, we don’t want a 9-year-old in another country to be able to bid anonymously.”
To address what Shipman calls a real threat, Heritage Auctions developed a threat matrix that relies on data and analysis from a variety of internal and external sources. As for external sources, Heritage currently uses credit reporting service TransUnion and has plans to work with TeleSign, a provider of mobile identity services.
TransUnion provides credit reports in real time; TeleSign can determine whether a bidder’s phone number is from a land line, a mobile phone or even a “burner” phone set up for temporary use. Bidders have to register their name, address and phone number on the Heritage system. “We can send a text message containing a personally identifiable number to the phone number on file to confirm that the phone actually belongs to the person registered,” Shipman says.
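The text-message check Shipman mentions amounts to a one-time code tied to the phone number on file. Here’s a small Python sketch of that flow; the SMS delivery is stubbed out (in production a service such as TeleSign would send it), and the code length and expiry window are assumptions.

```python
import secrets
from datetime import datetime, timedelta

_pending = {}  # phone number -> (code, expiry time)

def send_verification_code(phone: str) -> None:
    """Generate a one-time code and (here, pretend to) text it to the bidder."""
    code = f"{secrets.randbelow(1_000_000):06d}"
    _pending[phone] = (code, datetime.now() + timedelta(minutes=10))
    print(f"[stub SMS to {phone}] Your verification code is {code}")

def verify(phone: str, code: str) -> bool:
    """Confirm the phone on file belongs to the person who registered it."""
    stored = _pending.get(phone)
    if stored is None:
        return False
    expected, expiry = stored
    return code == expected and datetime.now() < expiry

send_verification_code("+1-555-0100")
```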
Internally, Heritage analyzes data to further verify not only the identity, but also the financial worthiness of bidders. “We want to know the potential exposure at any given time,” says Shipman. “How recently did they sign up with us? How much have they spent with us previously? Do they owe us money for previous transactions? Has their address recently changed?” All that data is ranked in real time and sent to an analyst to determine on a 30-point scale whether a bidder is suspicious. If there are questions about a bidder’s background, the individual could be banned from the auction until his identity can be verified.
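To show how such signals might roll up into a 30-point scale, here’s a hypothetical Python sketch; the individual weights and the cap are invented and don’t reflect Heritage’s actual threat matrix.

```python
from datetime import date

def bidder_risk_score(signup_date, lifetime_spend, owes_balance,
                      address_changed_recently, phone_verified,
                      today=None) -> int:
    """Return a 0-30 risk score; higher means more suspicious."""
    today = today or date.today()
    score = 0
    if (today - signup_date).days < 30:
        score += 8   # brand-new account, no track record
    if lifetime_spend == 0:
        score += 6   # has never spent anything with us
    if owes_balance:
        score += 8   # unpaid previous transactions
    if address_changed_recently:
        score += 4
    if not phone_verified:
        score += 4   # phone check failed or was skipped
    return min(score, 30)

# A new, unverified bidder with a recent address change scores high
# and would be routed to an analyst for review.
print(bidder_risk_score(date(2014, 11, 1), 0, False, True, False,
                        today=date(2014, 11, 20)))
```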
Heritage’s infrastructure includes auction software from Iasi, Romania-based MigratoryData, whose system is capable of handling 12 million concurrent users and dealing with latency issues that may stem from concurrent bidders using different client devices. The MigratoryData tool also uses hierarchical storage management, Shipman says, so that current data is moved to more expensive flash memory while data that isn’t accessed as much resides in slower storage, like hard drives. “This is a far less expensive solution than using all-flash storage, while giving us the benefit of data that must always be as fast as possible,” Shipman says.
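The tiering idea behind hierarchical storage management reduces to a simple rule: keep hot data on flash and let cold data settle onto cheaper disks. The threshold in this Python sketch is a made-up example, not MigratoryData’s policy.

```python
def choose_tier(days_since_last_access: int) -> str:
    """Place recently accessed data on flash; migrate cold data to disk."""
    return "flash" if days_since_last_access <= 7 else "hdd"

for item, days_idle in [("current-auction-lots", 0), ("2011-auction-archive", 900)]:
    print(item, "->", choose_tier(days_idle))
```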
Early days
It’s still early days for fast data, as indicated by the number of users building fast data systems themselves. But Ovum’s Baer says he has seen a lot of progress. For example, “the introduction of YARN in Hadoop 2.0 allows you to schedule different workloads, so different parts of a hardware cluster can handle workloads logically rather than physically,” he says.
Baer estimates that big data and fast data are in approximately the same place as data warehouses and business intelligence were in 1996. “In the next two years, you’ll see a lot of new [analysis] tools coming out,” he says. Before long, he adds, fast data won’t just be a nice thing for companies to have — it will be commonplace.