Lingsoft has been using AI in its business for 40 years. Lingsoft Talks AI is a series of articles in which our own experts give their views on AI and its use and potential in the language sector.

Data is at the heart of language technology. Millions of words, sounds and phenomena teach AI to understand and produce language. However, not all data is equal: reliable language intelligence requires high-quality, responsibly processed data.

How data guides language technology development at Lingsoft

According to Lingsoft’s Data Lead Tiina Lindh-Knuutila, language data is like fuel for AI. The development of language technology is based on artificial intelligence having access to a massive number of linguistic examples, such as text and speech, that reflect real language use.

“Speech recognition needs data in both sound and text format, while machine translation requires texts that are as precisely targeted as possible. However, not all text or audio is usable data – the language material must be accurate, consistent and modern,” explains Lindh-Knuutila.

Lingsoft uses open data sets and also purchases data if necessary. Models that use material provided by customers are also being developed in cooperation with those customers. A single dataset can include millions of rows, and individual words must appear several times in order for the model to really learn.
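
To illustrate why word frequency matters, here is a minimal Python sketch that lists the words occurring too rarely in a dataset for a model to learn them well. It is not Lingsoft's tooling: the function name `rare_words`, the `min_count` threshold and the sample rows are hypothetical, and splitting on whitespace is a deliberate simplification.

```python
from collections import Counter


def rare_words(rows, min_count=3):
    """Return the words that occur fewer than min_count times across all rows.

    Hypothetical helper for illustration only: words are lowercased and
    split on whitespace, which is far cruder than real tokenisation.
    """
    counts = Counter(word for row in rows for word in row.lower().split())
    return sorted(word for word, n in counts.items() if n < min_count)


# Made-up sample rows; a real dataset could hold millions of them.
rows = [
    "the model learns from examples",
    "the model needs many examples",
    "the data must be consistent",
]

# Words seen only once are candidates for collecting more examples.
print(rare_words(rows, min_count=2))
```

A check like this could point to the words for which more examples are needed before training.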

High-quality and secure data makes language intelligence more reliable

AI cannot be more intelligent than the data fed into it. Incorrect or incomplete material leads to incorrect learning and, according to Lindh-Knuutila, this is immediately visible in the results.

“Poor data is incorrect: it may contain spelling errors, missing punctuation or inconsistencies. Good data, on the other hand, mirrors reality and accurately describes phenomena.”

Lingsoft always keeps part of the data separate for testing purposes, which makes it possible to ensure that the model works in practice. Lingsoft’s strengths include long-standing experience and functional practices in language technology. Reliable data management and ethical processing are also part of daily routines.
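
Keeping part of the data separate for testing can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of Lingsoft's actual process: the function name `split_held_out`, the 10% test fraction and the fixed random seed are all hypothetical choices.

```python
import random


def split_held_out(rows, test_fraction=0.1, seed=0):
    """Shuffle the rows and set a fraction aside as a held-out test set.

    Hypothetical helper: the fraction and the seed are illustrative defaults.
    Returns a (train, test) pair of lists with no rows in common.
    """
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatability
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Made-up rows standing in for real language data.
rows = [f"sentence {i}" for i in range(100)]
train, test = split_held_out(rows)

# The model is trained on `train` only; `test` is used to check the results.
print(len(train), len(test))
```

Because the model never sees the held-out rows during training, its results on them give a fairer picture of how it will work in practice.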

Finnish data is valuable

Independent data management in small language areas is both a competitive advantage and a shared responsibility. Finnish data remains in Finnish hands at Lingsoft, and this is a significant difference compared to global actors.

“Small languages need common rules. They support users and the language. This also ensures that technology is developing in the right direction. The risks increase when we work with generative artificial intelligence, so responsible cooperation and regulation are really important,” emphasises Lindh-Knuutila.

“Even good AI cannot be trusted blindly. That's why a person is always involved in the process.”

– Tiina Lindh-Knuutila, Data Lead, Lingsoft

Human in the loop

Although AI handles huge amounts of data, a person is always in control at Lingsoft. Lindh-Knuutila reminds readers that AI can make mistakes even if the background data is perfect.

“We never use fully automated solutions. Humans are responsible for validating and curating the material, and AI should support people, not replace them. We wouldn’t even want to turn all creative or thought-intensive tasks over to machines.”

Responsible development work using synthetic data

The amount of data will grow exponentially in the future, which will open up new opportunities. Lingsoft is also studying the use of synthetic data, which is data that resembles real data but is produced artificially. Synthetic data can be used to develop language models efficiently without any data protection risks.

“A human can only process a limited amount of information, and using artificial intelligence and synthetic language data represents a huge opportunity in that sense – as long as the focus remains on responsibility,” says Lindh-Knuutila.
